Running autoregressive inference


The Cerebras Wafer-Scale Engine cluster enables autoregressive inference for these large language models from the Model Zoo: GPT2, GPT3, BTLM, BLOOM, LaMDA, LLaMA, Mistral, MPT, OPT, and Santacoder. You can generate text continuations for multiple prompt inputs in a batch, which facilitates downstream model evaluation. Generation is greedy: at each step, the model picks the highest-probability token.
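As a rough illustration of what greedy generation means (this is a toy sketch, not the Cerebras implementation), the code below appends the arg-max token at each step. The names greedy_step, greedy_generate, and toy_logits are hypothetical, and toy_logits stands in for a real model's forward pass.

```python
def greedy_step(logits):
    """Return the index of the largest score (ties go to the lowest index)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def greedy_generate(next_logits, prompt, num_new_tokens):
    """Extend `prompt` by repeatedly appending the arg-max token.

    `next_logits` is a hypothetical callable that maps the tokens so far
    to a list of per-token scores; it stands in for a real model.
    """
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(greedy_step(next_logits(tokens)))
    return tokens


def toy_logits(tokens):
    """Toy model over a 5-token vocabulary: favors (last token + 1) mod 5."""
    scores = [0.0] * 5
    scores[(tokens[-1] + 1) % 5] = 1.0
    return scores
```

Because no sampling is involved, the same prompt always produces the same continuation for a given set of weights.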

Preparing input data for autoregressive inference

To enable batched autoregressive inference, prepare the input prompts as follows:

  • Tokenize each prompt into IDs based on the model vocabulary

  • Select a start_token ID that is not used in any prompt text (the token ID may be outside the model’s vocabulary size)

  • Append start_token after each prompt sequence

  • Fill the positions after start_token with padding; the padding tokens can be arbitrary, but using additional start_token IDs is recommended

  • Save all prompts in a single .h5 file containing one data tensor called data, with shape (num_samples, max_seq_len)
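The preparation steps can be sketched as follows. This is a minimal example under stated assumptions, not an official tool: build_rows and write_h5 are hypothetical helpers, the int32 dtype is an assumption, and the prompts are expected to be tokenized already. It relies on the third-party h5py and NumPy packages only when writing the file.

```python
def build_rows(prompts, max_seq_len, start_token):
    """Append start_token to each tokenized prompt and pad to max_seq_len.

    Padding reuses start_token, as recommended above.
    """
    rows = []
    for ids in prompts:
        if start_token in ids:
            raise ValueError("start_token must not occur inside a prompt")
        if len(ids) + 1 > max_seq_len:
            raise ValueError("prompt plus start_token exceeds max_seq_len")
        pad = max_seq_len - len(ids)  # includes the start_token itself
        rows.append(list(ids) + [start_token] * pad)
    return rows


def write_h5(path, rows):
    """Write the rows as one dataset named 'data' of shape (num_samples, max_seq_len)."""
    # Third-party dependencies, imported lazily so build_rows works without them.
    import h5py
    import numpy as np

    with h5py.File(path, "w") as f:
        # dtype is an assumption; use whatever your data pipeline expects.
        f.create_dataset("data", data=np.asarray(rows, dtype=np.int32))
```

For example, build_rows([[5, 6, 7], [8]], max_seq_len=6, start_token=50257) pads both prompts to six tokens. The value 50257 is only illustrative: it is one past the largest ID of the 50257-entry GPT-2 vocabulary, so it cannot occur in a prompt.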

Setting Inference Parameters

Data Location and Batch Size

In addition to preparing the input data, you must specify the following keys in a configuration section called inference_input:

  • data_processor: Must be set to GptHDF5MapDataProcessor

  • data_dir: Path to the directory containing the .h5 input file

  • batch_size: Number of samples to process simultaneously

To run autoregressive inference, add this section to your model’s params.yaml file:

    inference_input:
        data_processor: "GptHDF5MapDataProcessor"
        data_dir: "./path/to/your/data/directory"
        batch_size: 60

The batch size does not need to evenly divide the total number of samples. If samples are left over, the last batch is padded with dummy samples so that every input sample is processed. For example, if the batch size is 100 and you are inferring for 597 samples, the last batch will contain 97 real samples and three dummy samples.
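The arithmetic behind this padding is simple enough to check ahead of a run. The helper below is a hypothetical illustration, not part of any Cerebras API.

```python
def batch_plan(num_samples, batch_size):
    """Return (number of batches, dummy samples padded into the last batch)."""
    full, leftover = divmod(num_samples, batch_size)
    if leftover == 0:
        return full, 0
    return full + 1, batch_size - leftover
```

For 597 samples and a batch size of 100, batch_plan(597, 100) gives six batches with three dummy samples in the last one.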

When configuring autoregressive inference, start with the same batch size and, if applicable, micro-batch size that you used for evaluation. This provides a baseline for performance and resource utilization. Because the optimal batch size varies with the model and hardware configuration, you may need to experiment to find the batch size and micro-batch size that maximize performance while fitting within your device’s memory constraints. For a more systematic approach, refer to the automatic batch exploration guide.

Model Parameters for Inference

Additional keys must be added to the model section of params.yaml for autoregressive inference:

  • start_token - ID of the special token that indicates where to start inferring for each sample, as described above. You may specify a list of token IDs instead of a single ID. If you do, the model will start inference at the first token that matches any one of the provided IDs. The model will pad inferred predictions with the first ID in the list.

  • stop_sequences - List of sequences (each one being a list of token IDs). If the model emits any one of these sequences, inference stops for that sample. For example, suppose you would like to stop inferring after either a newline character (e.g. token ID 1), or a period (e.g. token ID 2) followed by a space (e.g. token ID 3). In this case, set stop_sequences to [[1], [2, 3]]. To stop inferring after a newline character only, set stop_sequences to [[1]]. To disable this feature, set stop_sequences to an empty list [].

Additionally, the following optional parameters may be set:

  • max_tokens - Maximum number of tokens to infer for each sample

  • loop_dim - Indicates the sequence dimension in the input and output data. The default value is 1. If set to 0, both input and output data are transposed (i.e. sequence X samples instead of samples X sequence)
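To make the semantics of start_token, stop_sequences, and max_tokens concrete, here is a small reference sketch of the checks described above. It only models the documented behavior on plain Python lists; find_start and should_stop are hypothetical names and do not reflect how the system implements these checks.

```python
def find_start(sample, start_tokens):
    """Index of the first token matching any start token ID, or -1 if none."""
    wanted = set(start_tokens)
    for i, tok in enumerate(sample):
        if tok in wanted:
            return i
    return -1


def should_stop(generated, stop_sequences, max_tokens=None):
    """True once the generated tokens hit max_tokens or end with a stop sequence."""
    if max_tokens is not None and len(generated) >= max_tokens:
        return True
    for seq in stop_sequences:
        # An empty sequence never matches; an empty list disables the check.
        if seq and generated[-len(seq):] == list(seq):
            return True
    return False
```

With stop_sequences set to [[1], [2, 3]], the generated tokens [7, 2] do not trigger a stop, while [7, 2, 3] and [7, 1] both do.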

Running Autoregressive Inference

To launch an autoregressive inference run, the script must be used. It is very similar to the script normally used to launch evaluation or training, and supports similar parameters, for example:

    python modelzoo/common/ CSX \
           --params params.yaml \
           --model_dir model_dir \
           --mount_dirs {paths to modelzoo and data} \
           --python_paths {paths to modelzoo and other python code if used} \
           --checkpoint_path {path to checkpoint of trained weights}

A few important points to note:

  • The mode parameter should not be provided.

  • The optional inference_steps parameter is supported. If it is provided, inference runs only for the number of batches specified by inference_steps, not for your entire dataset. This is useful for validating your flow before starting a large inference job.

  • The compile_only parameter is supported as usual. It can help when picking a batch size for the job, because it lets you validate that compilation with a given batch size fits on the device.

Accessing Inference Predictions

Model predictions (i.e. the output of inference) will be available in the artifacts directory under your model directory. They appear in a directory called predictions, in NumPy files named predictions_X.npz, where X is the 1-based batch number (i.e. the global step counter).

These files appear incrementally during the run, so you can access predictions for batches already inferred without waiting for inference on the entire dataset to complete.

Each .npz file contains:

  • global_step: Scalar batch counter

  • predictions: Batch outputs of shape (batch_size, max_seq_len)

Each sample in the batch will contain the prompt, immediately followed by the results of inference (without start_token). Note that for each sample, inference will stop when any of the following occurs:

  • Maximum sequence length is reached

  • Any one of the specified stop_sequences is emitted by the model (in which case it will be present in the output data)

  • The max_tokens limit is reached for the given sample

The model will pad any tokens after the end of the inferred text with the start_token ID (or with the first start_token ID if you provide multiple IDs).

In case of dummy samples for partial batches (e.g. for the last three samples when inferring 597 samples with batch size 100), the dummy samples will consist solely of start_token in the output.
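The output layout described in this section can be read back with a few lines of Python. The sketch below is an illustration under assumptions: batch_number, strip_padding, and load_predictions are hypothetical helpers, NumPy is required only for loading the files, and trimming at the first start_token assumes that ID never appears in prompts or generated text (which holds when it lies outside the model vocabulary). If you configured several start_token IDs, pass the first one, since that is the ID used for padding.

```python
import re


def batch_number(filename):
    """Extract X from 'predictions_X.npz', or None if the name does not match."""
    m = re.fullmatch(r"predictions_(\d+)\.npz", filename)
    return int(m.group(1)) if m else None


def strip_padding(row, start_token):
    """Drop everything from the first start_token onward."""
    row = list(row)
    return row[:row.index(start_token)] if start_token in row else row


def load_predictions(pred_dir, start_token):
    """Yield (batch number, trimmed samples) for each file, in batch order."""
    import os
    import numpy as np  # third-party; imported lazily

    names = [n for n in os.listdir(pred_dir) if batch_number(n) is not None]
    for name in sorted(names, key=batch_number):
        with np.load(os.path.join(pred_dir, name)) as npz:
            rows = npz["predictions"].tolist()
        yield batch_number(name), [strip_padding(r, start_token) for r in rows]
```

Dummy samples come back as empty lists, because they consist solely of start_token. To separate the generated text from the prompt, slice each trimmed sample at the length of the corresponding input prompt.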


In summary, the Cerebras Wafer-Scale Engine cluster supports batched, greedy autoregressive inference for a selection of large language models, enabling text generation and model evaluation. A run requires input data prepared in the expected .h5 layout, so that the input workers can process the batched prompts, and inference parameters set in the model’s params.yaml file. Together with the appliance_environ object, these settings let you tailor a run to your requirements, including the batch size, start_token, and stop conditions. Because prediction files are written as each batch completes, you can inspect results while the run is still in progress and iterate quickly.