Running autoregressive inference#

Overview#

The Cerebras Wafer-Scale Engine cluster enables autoregressive inference for these large language models from the Model Zoo: GPT2, GPT3, BTLM, BLOOM, LaMDA, LLaMA, Mistral, MPT, OPT, and Santacoder. This allows you to generate text continuations for multiple prompt inputs in a batch, facilitating model evaluation downstream. Autoregressive generation is performed greedily, picking the highest probability token at each step.
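
Conceptually, greedy decoding repeatedly appends the highest-probability next token to the sequence. The following Python sketch illustrates the idea only (it is not the on-wafer implementation); the model here is assumed to return logits of shape (batch, sequence, vocab):

import torch

def greedy_generate(model, prompt_ids, end_token, max_new_tokens):
    # Illustrative greedy loop: pick the argmax token at each step until
    # end_token is produced or the token budget is exhausted.
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([tokens]))    # (1, seq_len, vocab_size)
        next_token = int(logits[0, -1].argmax())  # highest-probability token
        tokens.append(next_token)
        if next_token == end_token:
            break
    return tokens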

Preparing input data for autoregressive inference#

To enable batched autoregressive inference, prepare your input prompts as follows (a short preparation sketch follows this list):

  • Tokenize each prompt into token IDs using the model vocabulary

  • Select a start_token ID that does not occur in any prompt text (the token ID may be outside the model’s vocabulary size)

  • Append start_token after each prompt sequence

  • Pad each sequence after start_token to max_seq_len; the padding tokens can be arbitrary, but using additional start_token IDs is recommended

  • Save all prompts in a single .h5 file containing one data tensor called data, with shape (num_samples, max_seq_len)
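
As a concrete illustration, the sketch below tokenizes a few prompts and writes them to an .h5 file using h5py. The Hugging Face tokenizer name, the start_token value, and max_seq_len are placeholders for illustration only; substitute values appropriate to your model.

import h5py
import numpy as np
from transformers import AutoTokenizer  # assumption: any tokenizer matching your model works

prompts = ["Hello, my name is", "The capital of France is"]

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer
max_seq_len = 128                                  # placeholder sequence length
start_token = 50257                                # placeholder ID outside the GPT-2 vocabulary

# Fill the whole tensor with start_token so everything after each prompt
# (the generation start marker and the padding) is already in place.
data = np.full((len(prompts), max_seq_len), start_token, dtype=np.int32)
for i, prompt in enumerate(prompts):
    ids = tokenizer(prompt)["input_ids"]
    data[i, : len(ids)] = ids

with h5py.File("inference_data.h5", "w") as f:
    f.create_dataset("data", data=data)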

Setting Inference Parameters#

Data Location and Batch Size#

In addition to preparing the input data, you must specify the following fields in a section called inference_input:

  • data_processor: Must be set to GptHDF5MapDataProcessor

  • data_dir: Path to the directory containing the .h5 input file

  • batch_size: Number of samples to process simultaneously

To run autoregressive inference, add a section like the following to your model’s params.yaml file:

inference_input:
    data_processor: "GptHDF5MapDataProcessor"
    data_dir: "./path/to/your/data/directory"
    batch_size: 60

The batch size does not need to evenly divide the total number of samples; the last batch will be padded with dummy samples so that every real input sample is processed. For example, if the batch size is 100 and you are inferring for 597 samples, the last batch will contain 97 real samples and 3 dummy padding samples.
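
As a quick sanity check on those numbers, the padding arithmetic is simply:

num_samples = 597
batch_size = 100

num_batches = -(-num_samples // batch_size)          # ceiling division -> 6 batches
num_dummy = num_batches * batch_size - num_samples   # 3 dummy samples in the last batch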

You should generally select the same batch size for autoregressive inference as you would for evaluation. However, unlike in evaluation, automatic splitting into sub-batches (“gradient accumulation”) is not currently supported for inference. Therefore, if your model runs with sub-batches in evaluation mode, you should select the sub-batch size for inference. For example, if in evaluation mode the batch size is set to 120 but the model actually runs as 3 sub-batches of 40 each, then for inference you should set a batch size of 40. In general, some experimentation may be required to find a batch size that fits on the device and yields the best performance.

Model Parameters for Inference#

Additional keys must be added to the model section of params.yaml for autoregressive inference (an example model section follows this list):

  • start_token - ID of the special token that indicates where to start inferring for each sample, as described above. You may specify a list of token IDs instead of a single ID. If you do, the model will start inference at the first token that matches any one of the provided IDs. The model will pad inferred predictions with the first ID in the list.

  • end_token - ID of a token that, if emitted by the model, indicates that inference should stop for that sample. For example, to stop inferring after a newline character, set this to the token ID of that character. To disable this feature, set end_token to a value beyond your model’s vocabulary size. You may specify a list of token IDs instead of a single ID. If you do, the model will stop inference for that sample at a token matching any one of the provided IDs.

Additionally, the following optional parameters may be set:

  • max_tokens - Maximum number of tokens to infer for each sample

  • loop_dim - Indicates the sequence dimension in the input and output data. The default value is 1. If set to 0, both input and output data are transposed (i.e., sequence × samples instead of samples × sequence)
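
For illustration, the model section might gain entries like the following; the token IDs and limits shown are placeholders, so use values consistent with your tokenizer:

model:
    # ... existing model configuration ...
    start_token: 50257    # placeholder: an ID not used in any prompt
    end_token: 198        # placeholder: e.g. the token ID of a newline
    max_tokens: 256       # optional: cap on tokens inferred per sample
    loop_dim: 1           # optional: sequence dimension (default 1)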

Running Autoregressive Inference#

To launch an autoregressive inference run, use the run_gpt_inference.py script. It is very similar to the run.py script normally used to launch training or evaluation, and supports similar parameters, for example:

python modelzoo/common/pytorch/run_gpt_inference.py CSX \
       --params params.yaml \
       --model_dir model_dir \
       --mount_dirs {paths to modelzoo and data} \
       --python_paths {paths to modelzoo and other python code if used} \
       --checkpoint_path {path to checkpoint of trained weights}

A few important points to note:

  • The mode parameter should not be provided

  • The num_csx parameter, if provided, must be 1; autoregressive inference on multiple CS-2 systems is not yet supported

  • The optional inference_steps parameter is supported. If provided, inference will run only for the number of batches specified by inference_steps, rather than for your entire dataset. This is useful for validating your flow before starting a large inference job.

  • The compile_only parameter is supported as usual, and may be useful when picking the best batch size for the job, to validate that compilation with a given batch size fits on the device.

Accessing Inference Predictions#

Model predictions (i.e. the output of inference) will be available in the artifacts directory under your model directory. They will appear in a directory called predictions, in NumPy files named predictions_X.npz, where X is the 1-based batch number (i.e. the global step counter).

These files appear gradually during the run, so there is no need to wait until inference for the entire dataset has completed before accessing predictions for batches that have already been inferred.

Each .npz file contains:

  • global_step: Scalar batch counter

  • predictions: Batch outputs of shape (batch_size, max_seq_len)

Each sample in the batch will contain the prompt, immediately followed by the results of inference (without start_token). Note that for each sample, inference will stop when any of the following occurs:

  • The maximum sequence length is reached

  • The end_token is emitted by the model (in which case it will be present in the output data)

  • The max_tokens limit is reached for the given sample

The model will pad any tokens after the end of the inferred text with the start_token ID (or with the first start_token ID if you provide multiple IDs).

For dummy samples in a partial final batch (e.g. the last three samples when inferring 597 samples with batch size 100), the output will consist solely of start_token.
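
As a sketch of how these files might be consumed (using the same placeholder tokenizer and token IDs as the earlier sketches, and a hypothetical predictions path), you could load each batch, strip the start_token padding, truncate at end_token, and decode the text:

import glob
import numpy as np
from transformers import AutoTokenizer  # assumption: tokenizer matching your model

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer
start_token, end_token = 50257, 198                # placeholders matching the config sketch above

# Hypothetical location: search under your model directory for the predictions files.
paths = glob.glob("model_dir/**/predictions_*.npz", recursive=True)
paths.sort(key=lambda p: int(p.rsplit("_", 1)[-1].split(".")[0]))  # numeric batch order

for path in paths:
    batch = np.load(path)
    print("global step:", batch["global_step"])
    for sample in batch["predictions"]:                      # shape (batch_size, max_seq_len)
        ids = [int(t) for t in sample if t != start_token]   # drop start_token padding
        if end_token in ids:
            ids = ids[: ids.index(end_token) + 1]            # truncate at the first end_token
        print(tokenizer.decode(ids))                         # prompt + generated continuation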