Running autoregressive inference#
Overview#
The Cerebras Wafer-Scale Engine cluster enables autoregressive inference for these large language models from the Model Zoo: GPT2, GPT3, BTLM, BLOOM, LaMDA, LLaMA, Mistral, MPT, OPT, and Santacoder. This allows you to generate text continuations for multiple prompt inputs in a batch, facilitating model evaluation downstream. Autoregressive generation is performed greedily, picking the highest probability token at each step.
Preparing input data for autoregressive inference#
To enable batched autoregressive inference, input prompts must be prepared as follows (a minimal preparation sketch is shown after this list):
- Tokenize the prompts into IDs based on the model vocabulary.
- Save them in a single .h5 file containing one data tensor called data. The data tensor should have shape (num_samples, max_seq_len).
- Select a start_token ID not used in any prompt text (the token ID may be outside the model’s vocabulary size).
- Append start_token after each prompt sequence.
- Tokens after start_token can be arbitrary padding (we recommend using additional start_token IDs).
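The following is a minimal sketch of this preparation in Python. It assumes the h5py, NumPy, and Hugging Face transformers packages and a GPT-2 tokenizer; the max_seq_len, start_token, and file path values are placeholders that you should adjust to match your model and tokenizer.

import h5py
import numpy as np
from transformers import AutoTokenizer  # assumed tokenizer; use the one matching your model

prompts = ["Hello, my name is", "Autoregressive inference on the wafer is"]
max_seq_len = 128     # placeholder; must match the model's maximum sequence length
start_token = 50257   # placeholder; first ID outside GPT-2's 50257-token vocabulary

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Fill the whole tensor with start_token, then write each prompt at the start of its
# row. The first start_token after the prompt marks where inference begins, and the
# remaining start_token entries serve as padding.
data = np.full((len(prompts), max_seq_len), start_token, dtype=np.int32)
for i, prompt in enumerate(prompts):
    ids = tokenizer(prompt)["input_ids"][: max_seq_len - 1]  # leave room for start_token
    data[i, : len(ids)] = ids

with h5py.File("./path/to/your/data/directory/data.h5", "w") as f:
    f.create_dataset("data", data=data)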
Setting Inference Parameters#
Data Location and Batch Size#
In addition to preparing the input data, you must specify the following in a section called inference_input:
- data_processor: Must be set to GptHDF5MapDataProcessor
- data_dir: Path to the directory containing the .h5 input file
- batch_size: Number of samples to process simultaneously
To run autoregressive inference, add an additional section to your model’s params.yaml
file:
inference_input:
    data_processor: "GptHDF5MapDataProcessor"
    data_dir: "./path/to/your/data/directory"
    batch_size: 60
The batch size does not need to evenly divide the total number of samples. The last batch will be padded with dummy samples so that every real sample is processed. For example, if the batch size is 100 and you are inferring for 597 samples, the last batch will contain 97 real samples and 3 dummy padding samples.
You should generally select the same batch size for autoregressive inference as you would for evaluation. However, unlike in evaluation, automatic splitting into sub-batches (“gradient accumulation”) is not currently supported for inference. Therefore, if your model runs with sub-batches in evaluation mode, select the sub-batch size for inference. For example, if the evaluation batch size is set to 120 but the model actually runs as 3 sub-batches of 40 each, set a batch size of 40 for inference. In general, some experimentation may be required to find a batch size that both fits on the device and yields good performance.
Model Parameters for Inference#
Additional keys must be added to the model section of params.yaml for autoregressive inference (an example is shown after this list):
- start_token - ID of the special token that indicates where to start inferring for each sample, as described above. You may specify a list of token IDs instead of a single ID; if you do, the model will start inference at the first token that matches any one of the provided IDs and will pad inferred predictions with the first ID in the list.
- end_token - ID of a token that, if emitted by the model, indicates that inference should stop for that sample. For example, to stop inferring after a newline character, set this to the token ID of that character. To disable this feature, set end_token to a value beyond your model’s vocabulary size. You may specify a list of token IDs instead of a single ID; if you do, the model will stop inference for that sample at a token matching any one of the provided IDs.
Additionally, the following optional parameters may be set:
- max_tokens - Maximum number of tokens to infer for each sample
- loop_dim - The sequence dimension in the input and output data. The default value is 1. If set to 0, both input and output data are transposed (i.e. sequence X samples instead of samples X sequence)
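For illustration, the model section additions might look like the following sketch. The values shown are placeholders: start_token must match the ID used when preparing the input data, and 198 is used here only as an example newline token ID for a GPT-2 tokenizer.

model:
    # ... existing model parameters ...
    start_token: 50257   # placeholder; must match the start_token used in the input data
    end_token: 198       # placeholder; e.g. a newline token ID to stop at line breaks
    max_tokens: 256      # optional: infer at most 256 tokens per sample
    loop_dim: 1          # optional: sequence dimension (default is 1)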
Running Autoregressive Inference#
To launch an autoregressive inference run, use the run_gpt_inference.py script. It is very similar to the run.py script normally used to launch training or evaluation and supports similar parameters, for example:
python modelzoo/common/pytorch/run_gpt_inference.py CSX \
--params params.yaml \
--model_dir model_dir \
--mount_dirs {paths to modelzoo and data} \
--python_paths {paths to modelzoo and other python code if used} \
--checkpoint_path {path to checkpoint of trained weights}
A few important points to note:
- The mode parameter should not be provided.
- The num_csx parameter, if provided, must be 1. Autoregressive inference on multiple CS-2s is not yet supported.
- The optional inference_steps parameter is supported. If it is provided, inference will only run for the number of batches specified by inference_steps, not for your entire dataset. This is useful for validating your flow before starting a large inference job (see the example after this list).
- The compile_only parameter is supported as usual, and may be useful in the process of picking the best batch size for the job, to validate that compilation with a given batch size fits on the device.
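For instance, a short trial run over just two batches might look like the following. This is a sketch that assumes inference_steps is accepted as a command-line flag in the same way as the parameters shown above; the bracketed paths are placeholders as before.

python modelzoo/common/pytorch/run_gpt_inference.py CSX \
    --params params.yaml \
    --model_dir model_dir \
    --mount_dirs {paths to modelzoo and data} \
    --python_paths {paths to modelzoo and other python code if used} \
    --checkpoint_path {path to checkpoint of trained weights} \
    --inference_steps 2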
Accessing Inference Predictions#
Model predictions (i.e. the output of inference) will be available in the artifacts directory under your model directory. They appear in a directory called predictions, in NumPy files named predictions_X.npz, where X is the 1-based batch number (i.e. the global step counter).
These files appear gradually during the run, so there is no need to wait until inference for the entire dataset has completed before accessing predictions for batches that have already been inferred.
Each .npz file contains:
- global_step: Scalar batch counter
- predictions: Batch outputs of shape (batch_size, max_seq_len)
Each sample in the batch will contain the prompt, immediately followed by the results of inference (without start_token). Note that for each sample, inference will stop when any one of the following occurs:
- The maximum sequence length is reached
- The end_token is emitted by the model (in which case it will be present in the output data)
- The max_tokens limit is reached for the given sample
The model will pad any tokens after the end of the inferred text with the start_token ID
(or with the first start_token ID
if you provide multiple IDs).
If dummy samples were added to complete a partial batch (e.g. the last three samples when inferring 597 samples with a batch size of 100), those dummy samples will consist solely of start_token in the output.
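As an illustration, the sketch below loads one predictions file and decodes the generated text. It assumes a GPT-2 tokenizer and the same placeholder start_token value used earlier; adjust the tokenizer, token IDs, and file path to match your configuration and artifacts directory.

import numpy as np
from transformers import AutoTokenizer  # assumed; use the tokenizer matching your model

start_token = 50257  # placeholder; must match the model and data-preparation configuration
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Placeholder path; the file lives in the predictions directory under your model
# directory's artifacts.
batch = np.load("predictions/predictions_1.npz")
print("global step:", batch["global_step"])

for row in batch["predictions"]:
    tokens = row.tolist()
    if tokens and tokens[0] == start_token:
        continue  # dummy sample added to complete a partial batch
    if start_token in tokens:
        tokens = tokens[: tokens.index(start_token)]  # strip trailing start_token padding
    print(tokenizer.decode(tokens))  # prompt followed by the generated continuation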