Fine-tune an LLM on a dataset using instructions#
Overview#
Instruction fine-tuning refers to the process of fine-tuning language models on datasets of instructions (sometimes called prompts) and corresponding responses (sometimes called completions). This method helps models better follow human instructions, stay on topic (rather than merely performing next-word completion), and has been shown to help mitigate some foundation models’ bias and harmfulness.
At inference time, users can specify desired behaviors or provide explicit instructions to the instruction fine-tuned model through carefully crafted prompts to elicit more accurate and useful responses.
Prerequisites#
Users need to have access to an instruction fine-tuning dataset.
Instruction fine-tuning dataset#
Each document in an instruction fine-tuning dataset typically contains two parts: a prompt (also known as the instruction) and a completion (also known as the response). For example:
{
"prompt": "Identify which instrument is string or percussion: Samphor, Viola toeria",
"completion": "Viola toeria is string, Samphor is percussion."
}
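If your raw data is not already in this shape, a minimal Python sketch along these lines can write it as JSON Lines with matching keys (the file name and the records here are purely illustrative):
import json

# Example records using the "prompt"/"completion" keys expected by the
# preprocessing step below; replace with your own data.
records = [
    {
        "prompt": "Identify which instrument is string or percussion: Samphor, Viola toeria",
        "completion": "Viola toeria is string, Samphor is percussion.",
    },
]

# "instruction_data.jsonl" is a hypothetical output path.
with open("instruction_data.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")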
Procedure#
Use the Summarization mode when calling the script create_hdf5_dataset.py.
For example, for a sequence length of 2048 tokens:
python create_hdf5_dataset.py Summarization \
--input_dir /path/to/data \
--tokenizer_type NeoXTokenizer \
--encoder_file /path/to/encoder \
--max_seq_length 2048 \
--ftfy True \
--sep_token null \
--prompt_key "prompt" \
--completion_key "completion" \
--pack_sequences False
Pay attention to the last four arguments (illustrated in the sketch after this list):
--sep_token null
when you do not want to add any special token between the prompt and the completion
--prompt_key "prompt"
when the key for the prompt (instruction) part is “prompt” in the raw data
--completion_key "completion"
when the key for the completion (response) part is “completion” in the raw data
--pack_sequences False
when you do not want to pack multiple docs into a single training sequence
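The following is a conceptual sketch of how a single document is assembled under these settings; tokenize is a stand-in for the real tokenizer, and the script’s actual internals may differ:
# Conceptual sketch only; not the script's actual implementation.
def build_sequence(doc, tokenize, max_seq_length=2048, pad_id=0):
    prompt_ids = tokenize(doc["prompt"])          # --prompt_key "prompt"
    completion_ids = tokenize(doc["completion"])  # --completion_key "completion"

    # --sep_token null: the prompt and completion are concatenated directly,
    # with no special token in between.
    token_ids = prompt_ids + completion_ids

    # --pack_sequences False: the document gets its own training sequence;
    # the remainder is padded instead of being filled with another document.
    token_ids = token_ids[:max_seq_length]
    token_ids += [pad_id] * (max_seq_length - len(token_ids))
    return token_ids, len(prompt_ids)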
Implementation Notes#
Two important aspects of data preprocessing for fine-tuning datasets are:
Packing is not used
Prompt tokens are masked out when calculating the loss, so only the completion tokens contribute to model weight updates during training (see the sketch below)
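The sketch below illustrates prompt-token loss masking, assuming the common convention of an ignore index of -100 for cross-entropy loss; the real pipeline may instead emit an explicit loss-mask tensor, and label shifting for next-token prediction is omitted for brevity:
import torch

# Minimal sketch, assuming -100 is treated as the ignore index by the loss.
def masked_labels(token_ids, prompt_len, actual_len, ignore_index=-100):
    labels = torch.tensor(token_ids)
    labels[:prompt_len] = ignore_index   # prompt tokens do not contribute to the loss
    labels[actual_len:] = ignore_index   # padded positions do not contribute either
    return labels

# Positions whose label equals the ignore index produce no gradient, so only
# the completion tokens drive weight updates, e.g.:
#   loss = torch.nn.functional.cross_entropy(
#       logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)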