Train an LLM with a large or small context window#

Overview#

One of the advantages of the CS-2 is the ability to train a Large Language Model (LLM) with a large context window (also known as sequence length). Larger context windows enable LLMs to handle longer inputs. Sometimes it is advantageous to train with a shorter context window (for example, in instruction tuning) for efficiency.

Follow the steps in this guide to train a language model with a large (> 2048) or a small (< 2048) context window.

Procedure#

For large context window#

1. Data processing: when creating your dataset with the create_hdf5_dataset.py script, set the --max_seq_length argument to the desired value.

For example, for a sequence length of 4096 tokens:

python create_hdf5_dataset.py LMData \
     --input_dir /path/to/data \
     --tokenizer_type NeoXTokenizer \
     --encoder_file /path/to/encoder \
     --max_seq_length 4096 \
     --ftfy True \
     --pack_sequences False

2. Model training: after processing your data, set max_sequence_length and max_position_embeddings to the desired values in the model's YAML configuration file.

For example:

train_input:
    data_processor: "GptHDF5DataProcessor"
    data_dir:
         - ./path/to/train/dataset/with_msl4096/
    ...
    max_sequence_length: 4096
    ...

eval_input:
    data_processor: "GptHDF5DataProcessor"
    data_dir: "./path/to/eval/datast/with_msl4096"
    max_sequence_length: 4096
    ...

model:
    ...
    max_position_embeddings: 4096
    ...

For small context window#

For example, a model that was pretrained with a sequence length of 2048 tokens may be further instruction fine-tuned on a dataset whose longest sequence is 256 tokens. Because the sequences in this dataset are padded rather than packed, training with a shorter sequence length is more efficient than padding every sample all the way to 2048 tokens, as the rough estimate below illustrates.
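The back-of-the-envelope calculation below is only an illustration and is not part of the Cerebras tooling; the average sample length is an assumed value. It shows what fraction of tokens would be padding at each sequence length:

# Rough padding-overhead estimate for an unpacked dataset.
# The average sample length is a hypothetical assumption.
max_position_embeddings = 2048   # pretrained context window
longest_sample = 256             # longest sequence in the fine-tuning dataset
avg_sample_len = 180             # assumed average sample length

pad_at_2048 = 1 - avg_sample_len / max_position_embeddings
pad_at_256 = 1 - avg_sample_len / longest_sample
print(f"padding at MSL 2048: {pad_at_2048:.0%}")   # ~91% of tokens are padding
print(f"padding at MSL 256:  {pad_at_256:.0%}")    # ~30% of tokens are padding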

1. Data processing: when calling the create_hdf5_dataset.py script, set the --max_seq_length argument to the desired value.

python create_hdf5_dataset.py Summarization \
     --input_dir /path/to/data \
     --tokenizer_type NeoXTokenizer \
     --encoder_file /path/to/encoder \
     --max_seq_length 256 \
     --ftfy True \
     --sep_token null \
     --prompt_key "prompt" \
     --completion_key "completion" \
     --pack_sequences False

2. Model training: after processing your data, set max_sequence_length to the desired value in the model's YAML configuration file.

For example:

train_input:
    data_processor: "GptHDF5DataProcessor"
    data_dir:
         - ./path/to/train/dataset/with_msl256/
    ...
    max_sequence_length: 256
    ...

eval_input:
    data_processor: "GptHDF5DataProcessor"
    data_dir: "./path/to/eval/datast/with_msl4096"
    max_sequence_length: 256
    ...

model:
    ...
    max_position_embeddings: 2048
    ...

Implementation Notes#

Note that when training a pretrained model with a smaller context window, the max_position_embeddings parameter stays the same as in the pretrained model; only the max_sequence_length parameter needs to change.

On the other hand, when training with a context window larger than the pretrained model's, both the max_position_embeddings and max_sequence_length parameters need to be changed.
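As a quick sanity check before launching a run, you can verify this relationship in your configuration file. The snippet below is a minimal sketch, not part of the Cerebras tooling; the params.yaml path is illustrative, and the section names are assumed to match the example configuration above:

# Minimal config sanity check: the training sequence length must not exceed
# the model's positional-embedding capacity.
import yaml  # PyYAML

with open("params.yaml") as f:   # illustrative path to the model config
    params = yaml.safe_load(f)

msl = params["train_input"]["max_sequence_length"]
mpe = params["model"]["max_position_embeddings"]

assert msl <= mpe, (
    f"max_sequence_length ({msl}) exceeds max_position_embeddings ({mpe}); "
    "increase max_position_embeddings when training with a larger context window."
)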