# Cerebras Model Zoo YAML parameters

## Model parameters

### Common

| Parameter Name | Description |
|---|---|
| `mixed_precision` | Whether to use mixed precision training or not. (bool, optional) Default: None |
| `use_bfloat16` | Whether to use bfloat16 data type instead of float32. (bool, optional) Default: False. See more. |
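
In a Model Zoo YAML config these common flags typically live under the top-level `model` section. A minimal sketch, assuming a config that trains with bfloat16 mixed precision; the section name and values are illustrative, not prescriptive:

```yaml
model:
  mixed_precision: True   # run the model in mixed precision
  use_bfloat16: True      # use bfloat16 instead of float32 for 16-bit tensors
```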

### Transformer based models

| Parameter Name | Description | Supported Models |
|---|---|---|
| `attention_dropout_rate` | Dropout rate for the attention layer. (float, optional) Default: same as `dropout` | All |
| `attention_kernel` | Attention kernel to use. Accepted values: `None` (compiler selects the kernel), `"default"` (default implementation), `"optimized_beta"` (optimized implementation; beta feature, support is limited). (str/None, optional) Default: None | All |
| `attention_softmax_fp32` | Whether to use fp32 precision for attention softmax. (bool, optional) Default: True | All |
| `attention_type` | Type of attention. Accepted values: `"dot_product"`, `"scaled_dot_product"`. (str, optional) Default: `"scaled_dot_product"` | All |
| `d_ff` | Size of the intermediate feed-forward layer in each T5Block. (int, optional) Default: 2048 | T5, Transformer |
| `d_kv` | Size of the query/key/value projections per attention head. `d_kv` does not have to be equal to `d_model // num_heads`. (int, optional) Default: 64 | T5, Transformer |
| `d_model` | The number of expected features in the encoder/decoder inputs. (int, optional) Default: 512 | All |
| `decoder_nonlinearity` | Type of nonlinearity to be used in the decoder. (str, optional) Default: `"relu"` | T5, Transformer |
| `decoder_num_hidden_layers` | Number of hidden layers in the Transformer decoder. Uses the same value as `num_layers` if not set. (int, optional) | T5, Transformer |
| `disable_nsp` | Whether to disable the next sentence prediction task. (bool, optional) Default: False | BERT (pre-training, fine-tuning) |
| `dropout_rate` | The dropout probability for all fully connected layers. (float, optional) Default: 0.1 | All |
| `embedding_dropout_rate` | Dropout rate for embeddings. (float, optional) Default: 0.1 | All |
| `embedding_initializer` | Initializer to use for embeddings. See supported initializers. (str, optional) Default: `"normal"` | GPT2, GPT3, GPTJ |
| `encoder_nonlinearity` | Type of nonlinearity to be used in the encoder. (str, optional) Default: varies per model | BERT (pre-training, fine-tuning), T5, Transformer |
| `encoder_num_hidden_layers` | Number of hidden layers in the encoder. (int, optional) Default: 6 | T5, Transformer |
| `extra_ids` | The number of extra ids used for additional vocabulary items. (int, optional) Default: 0 | T5, Transformer |
| `filter_size` | Dimensionality of the feed-forward layer in the Transformer block. (int, optional) Default: 3072 | BERT (pre-training, fine-tuning), GPT2, GPT3, GPTJ |
| `hidden_size` | The size of the transformer hidden layers. (int, optional) Default: 768 | BERT (pre-training, fine-tuning), GPT2, GPT3, GPTJ |
| `initializer` | The initializer to be used for all the initializers used in the model. See supported initializers. (str, optional) Default: varies based on model | BERT (pre-training, fine-tuning), GPT2, GPT3, GPTJ |
| `initializer_range` | The standard deviation of the `truncated_normal_initializer` used as the default initializer. (float, optional) Default: 0.02 | BERT (pre-training), GPT2, GPT3, GPTJ |
| `layer_norm_epsilon` | The epsilon value used in layer normalization layers. (float, optional) Default: 1e-5 | All |
| `lm_loss_weight` | Value that scales the loss by the mean number of predictions per sequence in the dataset. This number varies per dataset and can be calculated as the reciprocal of the average number of tokens per sequence in the training dataset. Only needed when setting loss scaling to `"batch_size"`. (float, optional) Default: 1.0 | T5, Transformer |
| `loss_scaling` | The scaling type used to calculate the loss. Accepts: `batch_size`, `num_tokens`. See more. Note: it is recommended to set this to `batch_size` when `use_cs_grad_accum: True` for training stability. (str, optional) Default: `num_tokens` | GPT2, GPT3, GPTJ |
| `loss_weight` | The weight for the loss scaling when `loss_scaling: "batch_size"`, generally set to `1/max_sequence_length`. (float, optional) Default: 1.0 | GPT2, GPT3, GPTJ |
| `max_position_embeddings` | The maximum sequence length that the model can handle. (int, optional) Default: 1024 | All |
| `mlm_loss_scaling` | A string specifying the scaling factor type used for the language modeling loss. Accepts one of: `"num_masked"` (uses the off-the-shelf loss scaling by the number of valid, non-padding tokens in the cross-entropy loss function), `"precomputed_num_masked"` (uses loss scaling from the number of valid masks computed in the data loader, when `dynamic_loss_weight` is enabled in the data loader params), `"batch_size"` (uses loss scaling by `batch_size`; `lm_loss_weight` should be provided when using `"batch_size"`). (str, optional) Default: `"batch_size"` | T5, Transformer |
| `mlm_loss_weight` | The weight for the masked language modeling loss used when scaling the loss with `"batch_size"`. This number varies per dataset and can be calculated as the reciprocal of the average number of masked tokens per sequence in the training dataset. (float, optional) Default: 1.0 | BERT (pre-training) |
| `nonlinearity` | The non-linear activation function used in the feed-forward network of each transformer block. See the list of non-linearity functions here. Some may require `autogen_policy: "medium"`. (str, optional) Default: varies per model | BERT (pre-training, fine-tuning), GPT2, GPT3, GPTJ |
| `num_heads` | The number of attention heads in the multi-head attention layer. (int, optional) Default: varies per model | All |
| `num_hidden_layers` | Number of hidden layers in the Transformer encoder/decoder. (int, optional) Default: 12 | All |
| `output_layer_initializer` | The name of the initializer for the weights of the output layer. See supported initializers. (str, optional) Default: varies based on model | GPT2, GPT3, GPTJ |
| `position_embedding_type` | The type of position embedding to use in the model. Can be one of: `"fixed"` (sinusoidal, from the original Transformer), `"relative"` (relative position embedding, to exploit pairwise relative positional information), `"rotary"` (a.k.a. RoPE), `"learned"` (learned embedding matrix), `None`. (str, optional) Default: varies per model | All |
| `relu_dropout_rate` | The dropout rate for the ReLU activation function. (float, optional) Default: varies per model | T5, Transformer |
| `residual_dropout_rate` | The dropout rate for residual connections. (float, optional) Default: 0.1 | GPTJ |
| `rotary_dim` | The number of dimensions used for the rotary position encoding. Must be an even number. (int, optional) Default: None | GPTJ |
| `share_embedding_weights` | Whether to share the embedding weights between the input and output embedding. (bool, optional) Default: True | All |
| `share_encoder_decoder_embedding` | Whether to share the embedding weights between the encoder and decoder. (bool, optional) Default: True | T5, Transformer |
| `src_vocab_size` | The size of the source vocabulary. Max supported value: 512000. (int, optional) Default: 32128 | T5, Transformer |
| `tgt_vocab_size` | The size of the target vocabulary. Max supported value: 512000. (int, optional) Default: 32128 | T5, Transformer |
| `use_bias_in_output` | Whether to use bias in the final output layer. (bool, optional) Default: False | GPT2, GPT3, GPTJ |
| `use_dropout_outside_residual_path` | Whether to set dropout calculations outside of the residual path. (bool, optional) Default: True for T5, False for Transformer | T5, Transformer |
| `use_ffn_bias` | Whether to use bias in the feed-forward network (FFN). (bool, optional) Default: varies per model | All |
| `use_ffn_bias_in_attention` | Whether to include bias in the attention layer for the feed-forward network (FFN). (bool, optional) Default: varies per model | All |
| `use_position_embedding` | Whether to use position embedding in the model. (bool, optional) Default: True | GPT2, GPT3 |
| `use_pre_encoder_decoder_dropout` | Whether to use a dropout layer after the positional embedding layer and encoder/decoder. (bool, optional) Default: False | T5, Transformer |
| `use_pre_encoder_decoder_layer_norm` | Whether to use layer norm before passing input tensors into the encoder/decoder. (bool, optional) Default: True | T5, Transformer |
| `use_projection_bias_in_attention` | Whether to include bias in the attention layer for projection. (bool, optional) Default: varies per model | All |
| `norm_type` | Whether to use T5 layer norm (a.k.a. rmsnorm, with no mean subtraction and bias correction) or the regular `nn.LayerNorm` module. (str, optional) Default: `layernorm` | T5, Transformer |
| `use_transformer_initialization` | The Transformer model tends to converge best with a scaled variant of Xavier uniform initialization used for linear layers. This contrasts with the initialization used in the original T5 paper, which uses He normal initialization for linear layers. Setting this flag to True switches the initialization to the Transformer-specific scaled Xavier initialization. (bool, optional) Default: False | T5, Transformer |
| `use_untied_layer_norm` | Whether to use untied layer normalization. (bool, optional) Default: False | GPTJ |
| `vocab_size` | The size of the vocabulary used in the model. Max supported value: 512000. (int, optional) Default: varies per model | All |
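
As a rough illustration of how the transformer parameters above compose, here is a sketch of a GPT-2-small-sized `model` section. The specific values (layer counts, vocabulary size, activation) are placeholders chosen to resemble a typical small GPT-style setup, not taken from any particular Model Zoo release:

```yaml
model:
  hidden_size: 768                     # transformer hidden size
  num_hidden_layers: 12
  num_heads: 12
  filter_size: 3072                    # feed-forward dimensionality
  nonlinearity: "gelu"
  position_embedding_type: "learned"
  max_position_embeddings: 1024
  vocab_size: 50257                    # placeholder vocabulary size
  dropout_rate: 0.1
  attention_dropout_rate: 0.1
  layer_norm_epsilon: 1.0e-5
  share_embedding_weights: True
  loss_scaling: "num_tokens"
  mixed_precision: True
```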

### Computer Vision models

| Parameter | Description | Supported Models |
|---|---|---|
| `bias_initializer` | Initializer for the bias. (str, optional) Default: `"zeros"` | UNet |
| `convs_per_block` | List of conv specifications for each conv in the block. (List[str], required) | UNet |
| `decoder_filters` | List of filter sizes for each block in the decoder. (List[str], required) | UNet |
| `downscale_bottleneck` | Whether to downsample the spatial dimensions in the UNet bottleneck block. (bool, optional) Default: False | UNet |
| `downscale_encoder_blocks` | Determines whether each block in the encoder includes downsampling. The length of the list must correspond to the number of UNetBlocks in the encoder. If a single bool is provided, all blocks use this value. (bool/List[bool], optional) Default: True | UNet |
| `downscale_first_conv` | If True, the first convolution operation in each UNetBlock is downscaled. If False, the last convolution in each UNetBlock is downscaled. (bool, optional) Default: False | UNet |
| `downscale_method` | Downscaling method at the end of each block. One of `"max_pool"` or `"strided_conv"`. (str, optional) Default: `"max_pool"` | UNet |
| `enable_bias` | Whether to include a bias operation following convolution layers. By default, bias is only included when no normalization is used after the convolution layers. | UNet |
| `encoder_filters` | List of filter sizes for each block in the encoder. (List[str], required) | UNet |
| `eval_ignore_classes` | List of classes to ignore during evaluation of the model. (List[int], optional) | UNet |
| `eval_metrics` | List of evaluation metrics to use during training and validation. Available options are accuracy (Acc), mean IOU (mIOU), or Dice (DSC). (List[str], optional) | UNet |
| `initializer` | Initializer for the convolution weights. See supported initializers. (str, required) | UNet |
| `input_channels` | Number of channels in the input images to the model. (int, required) | UNet |
| `loss` | Loss type. Supported values: `"bce"`, `"multilabel_bce"`, `"ssce"`. (str, required) | UNet |
| `nonlinearity` | Activation function used in the model following convolutions in the encoder and decoder. (str, required) | UNet |
| `norm_kwargs` | Args to be passed to norm layers during initialization. For `norm_type = group`, `norm_kwargs` must include a `num_groups` key-value pair; for `norm_type = layer`, it must include a `normalized_shape` key-value pair. (dict, optional) Default: None | UNet |
| `norm_layer` | Type of normalization to be used. See [supported norm layers](./common/pytorch/model_utils/norms.py). (str, optional) Default: `"batchnorm2d"` | UNet |
| `residual_blocks` | Flag for using residual connections at the end of each block. (bool, optional) Default: False | UNet |
| `skip_connect` | Flag for whether the model concatenates encoder outputs to decoder inputs. (bool, optional) Default: True | UNet |
| `use_conv3d` | Whether to use 3D convolutions in the model. (bool, optional) Default: False | UNet |
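
For the UNet parameters above, a hedged sketch of a `model` section follows. The filter sizes, conv specifications, initializer name, and loss choice are placeholders for illustration only; consult the reference UNet configs in the Model Zoo for working combinations:

```yaml
model:
  input_channels: 1
  encoder_filters: [32, 64, 128, 256]        # placeholder filter sizes per encoder block
  decoder_filters: [128, 64, 32]             # placeholder filter sizes per decoder block
  convs_per_block: ["3x3_conv", "3x3_conv"]  # placeholder conv specifications
  downscale_method: "max_pool"
  skip_connect: True
  residual_blocks: False
  nonlinearity: "ReLU"
  norm_layer: "batchnorm2d"
  initializer: "glorot_uniform"              # placeholder; see supported initializers
  bias_initializer: "zeros"
  loss: "bce"
  eval_metrics: ["Acc", "mIOU", "DSC"]
  use_conv3d: False
```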

## Data loader parameters

### Common

| Parameter Name | Description |
|---|---|
| `batch_size` | Effective batch size of the input data. (int, required) |
| `data_dir` | Path(s) to the data files to use. (str/List[str], required) |
| `data_processor` | Name of the data processor to be used. (str, required) |
| `mixed_precision` | Flag to cast input to fp16. (bool, optional) Default: None |
| `micro_batch_size` | Micro batch size to force for gradient accumulation. Only applies to CSX runs. Set `num_csx` and `batch_size` such that `batch_size // num_csx` is a multiple of `micro_batch_size`. (int, optional) Default: None |
| `num_workers` | Number of workers to use in the dataloader. See more. (int, optional) Default: 0 |
| `persistent_workers` | For multi-worker dataloaders, controls whether the workers are recreated at the end of each epoch (see PyTorch docs). (bool, optional) Default: True |
| `prefetch_factor` | Number of samples loaded in advance by each worker. (int, optional) Default: 10 |
| `shuffle` | Flag to enable data shuffling. (bool, optional) Default: True |
| `shuffle_buffer` | Size of the shuffle buffer in samples. (int, optional) Default: 10 * batch_size |
| `shuffle_seed` | Shuffle seed. (int, optional) Default: None |
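
These data loader parameters are usually set inside the `train_input` (and `eval_input`) sections of the YAML. A minimal sketch; the data processor name and paths are hypothetical and should be replaced with the ones matching your model and dataset:

```yaml
train_input:
  data_processor: "MyDataProcessor"   # hypothetical name; use the processor for your model
  data_dir: "./path/to/train_data"    # placeholder path
  batch_size: 128
  shuffle: True
  shuffle_seed: 1
  num_workers: 8
  prefetch_factor: 10
  persistent_workers: True
```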

### Transformers

| Parameter Name | Description | Supported Models |
|---|---|---|
| `do_lower` | Flag to lowercase the texts. (bool, optional) Default: False | BERT (pre-training, fine-tuning), T5, Transformer |
| `dynamic_loss_weight` | Flag to dynamically scale the loss. If set, divides the loss for a token by the length of the sequence that the token comes from. Use with `"precomputed_num_tokens"` loss scaling. (bool, optional) Default: False | T5, Transformer |
| `dynamic_mlm_scale` | Flag to dynamically scale the loss. If set, the MLM loss is scaled by the number of masked tokens in the current batch using the `masked_lm_weights` from the input data features. (bool, optional) Default: False | BERT (pre-training) |
| `extra_ids` | Number of sentinel tokens for the T5 objective. (int, optional) Default: 0 | T5, Transformer |
| `masked_lm_prob` | Ratio of masked tokens over the sequence length. (float, optional) Default: 0.15 | BERT (pre-training) |
| `max_predictions_per_seq` | Maximum number of masked tokens per sequence. (int, required) | BERT (pre-training) |
| `max_sequence_length` | Maximum sequence length of the input data. (int, optional) Default: varies per model | All |
| `src_data_dir` | Path to the directory containing all the files of tokenized data for the source sequence. (str, required) | T5, Transformer |
| `src_max_sequence_length` | Largest possible sequence length for the input source sequence. Longer sequences are truncated; all other sequences are padded to this length. (int, required) | T5, Transformer |
| `src_vocab_file` | Path to the vocab file for the source input. (str, required) | T5, Transformer |
| `tgt_data_dir` | Path to the directory containing all the files of tokenized data for the target sequence. (str, required) | T5, Transformer |
| `tgt_max_sequence_length` | Largest possible sequence length for the input target sequence. Longer sequences are truncated; all other sequences are padded to this length. (int, required) | T5, Transformer |
| `tgt_vocab_file` | Path to the vocab file for the target input. (str, required) | T5, Transformer |
| `vocab_file` | Path to the vocab file. (str, required) | BERT (pre-training, fine-tuning) |
| `vocab_size` | The size of the vocabulary used in the model. (int, required) | BERT (pre-training, fine-tuning) |
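
For a T5/Transformer-style data loader, the source/target parameters above might be combined as in the sketch below. Paths, sequence lengths, and the sentinel-token count are illustrative only:

```yaml
train_input:
  src_data_dir: "./data/train/source"   # placeholder paths
  tgt_data_dir: "./data/train/target"
  src_vocab_file: "./data/vocab.src"
  tgt_vocab_file: "./data/vocab.tgt"
  src_max_sequence_length: 512
  tgt_max_sequence_length: 256
  extra_ids: 100                        # illustrative sentinel-token count
  do_lower: False
  shuffle: True
  batch_size: 128
```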

### Computer Vision

| Parameter Name | Description | Supported Models |
|---|---|---|
| `aggregate_cartilage` | For the SKM-TEA dataset only. Combines medial and lateral classes into a single class. (bool, optional) Default: True | UNet |
| `augment_data` | Apply data augmentation to the data. (bool, optional) Default: True | UNet |
| `class_id` | For the Severstal dataset, sets which class id is considered the positive class. All other classes are considered negative examples. (int, optional) | UNet |
| `echo_type` | For the SKM-TEA dataset only. Specifies the training data configuration. Allowed options are: `echo1`, `echo2`, or `root_sum_of_squares`. (str, required) Default: `echo1` | UNet |
| `image_shape` | Expected shape of output images in the format (H, W, C). (List[int], required) | UNet |
| `normalize_data_method` | Specifies the strategy to normalize the input data. One of: `"zero_centered"`, `"zero_one"`, `"standard_score"`. (str, required) | UNet |
| `num_classes` | Number of classes in the training dataset. (int, required) | UNet |
| `train_test_split` | Percentage of data to be used in the training dataset. | UNet |
| `use_fast_dataloader` | If set to True, map-style datasets that use the UNetDataProcessor perform faster data processing. (bool, optional) Default: False | UNet |
| `use_worker_cache` | If set to True, data is read from local SSD memory on the individual worker nodes during training. If the data does not exist on the worker nodes, it is automatically copied from the host node, which causes a slowdown the first time the copy takes place. (bool, optional) Default: True | UNet |
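
A corresponding sketch for a UNet data loader section; the image shape, split ratio, and normalization strategy are placeholders:

```yaml
train_input:
  image_shape: [256, 256, 1]             # (H, W, C), placeholder
  num_classes: 2
  normalize_data_method: "zero_centered"
  train_test_split: 0.9
  augment_data: True
  use_worker_cache: True
  shuffle: True
  batch_size: 32
```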

## Optimizer parameters

| Parameter Name | Description |
|---|---|
| `initial_loss_scale` | Initial loss scale to be used in the grad scale. (int, optional) Default: 2 ** 15 |
| `learning_rate` | Learning rate scheduler to be used. See supported LR schedulers. (dict, required) |
| `log_summaries` | Flag to log per-layer gradient norms in TensorBoard. (bool, optional) Default: False |
| `loss_scaling_factor` | Loss scaling factor for gradient calculation in the learning step. (float/str, optional) Default: 1.0 |
| `max_gradient_norm` | Max norm of the gradients for learnable parameters. Used for gradient clipping. (float, optional) Default: None |
| `min_loss_scale` | The minimum loss scale value that can be chosen by dynamic loss scaling. (float, optional) Default: None |
| `max_loss_scale` | The maximum loss scale value that can be chosen by dynamic loss scaling. (float, optional) Default: None |
| `optimizer_type` | Optimizer to be used. See supported optimizers. (str, required) |
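
A hedged sketch of an `optimizer` section. The table above only states that `learning_rate` is a dict handled by the LR scheduler machinery, so the sub-keys shown here (`scheduler`, `initial_learning_rate`, and so on) are assumptions meant to convey the shape of a typical config rather than an exact schema:

```yaml
optimizer:
  optimizer_type: "AdamW"          # any supported optimizer
  learning_rate:                   # sub-keys are illustrative; see supported LR schedulers
    scheduler: "Linear"
    initial_learning_rate: 0.0001
    end_learning_rate: 0.0
    total_iters: 100000
  loss_scaling_factor: "dynamic"
  initial_loss_scale: 32768        # 2 ** 15
  max_gradient_norm: 1.0
  log_summaries: False
```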

## Runconfig parameters

| Key | Description | Supported mode |
|---|---|---|
| `autogen_policy` | The autogen policy to be used for the given run. Can be one of: `"default"`, `"disabled"`, `"mild"`, `"medium"`, `"aggressive"`. See more. (str, optional) Default: None | CSX |
| `autoload_last_checkpoint` | Flag to automatically load the last checkpoint in the `model_dir`. (bool, optional) Default: True | All |
| `check_loss_values` | Flag to check whether the loss value is NaN/inf. (bool, optional) Default: True | All |
| `checkpoint_path` | The path to load checkpoints from during training. (str, optional) Default: None | All |
| `checkpoint_steps` | The number of steps between saving model checkpoints during training. 0 means no checkpoints are saved. (int, optional) Default: 0 | All |
| `compile_dir` | Compile directory where compile artifacts will be written. (str, optional) Default: None | All |
| `compile_only` | Enables the compile-only workflow. (bool, optional) Default: False | All |
| `credentials_path` | Credentials for cluster access. If None, the value from a pre-configured location is used if available. (str, optional) Default: None | CSX |
| `debug_args_path` | Path to the debug args file. (str, optional) Default: None | CSX |
| `disable_strict_checkpoint_loading` | Flag used in conjunction with `checkpoint_path` to avoid enforcing strict model state loading. (bool, optional) Default: False | All |
| `dist_addr` | Used to initialize `master_addr` and `master_port` for distributed training. (str, optional) Default: localhost:8888 | GPU |
| `dist_backend` | Distributed backend engine. (str, optional) Default: `"nccl"` | GPU |
| `enable_distributed` | Flag to enable distributed training on GPU. (bool, optional) Default: False | GPU |
| `enable_summaries` | Enable summaries when running on CS-X hardware. (bool, optional) Default: False | CSX |
| `eval_frequency` | Specifies the evaluation frequency during training. Only used in train_and_eval mode. (int, optional) Default: None | All |
| `eval_steps` | Specifies the number of steps to run the model evaluation. (int, optional) Default: None | All |
| `init_method` | URL specifying how to initialize the process group. (str, optional) Default: `"env://"` | GPU |
| `job_labels` | A list of equals-sign-separated key-value pairs that serve as job labels. (str, optional) Default: None | CSX |
| `load_checkpoint_states` | Comma-separated string of keys used in conjunction with `checkpoint_path` to explicitly specify which components' state should be loaded if present in a checkpoint. If this flag is used, any component whose key isn't specified will not load state from the checkpoint. For example, if `load_checkpoint_states` is `"model"`, only the model state is loaded and the optimizer states and training steps are reset after loading the checkpoint; i.e., matching weights are initialized from the checkpoint provided by `checkpoint_path`, training starts from step 0, and optimizer states present in the checkpoint are ignored. This is useful for fine-tuning runs on different tasks (e.g., classification, Q&A, etc.) where weights from a model pre-trained on language modeling (LM) tasks are loaded, or for fine-tuning on a different dataset for the same LM task. If dataloader state exists in the checkpoint, it is also ignored, and the dataloaders yield samples from the beginning. If `load_checkpoint_states` is `"model,dataloader"`, then only the model and dataloader states are loaded. By default this config is None, meaning that state is loaded for every component found in the checkpoint. (str, optional) Default: None | All |
| `steps_per_epoch` | The number of steps per epoch. (int, optional) Default: None | All |
| `log_steps` | Specifies the number of steps between logging during training. The same number controls the summary steps in TensorBoard. (int, optional) Default: None | All |
| `logging` | Specifies the logging level during training. (str, optional) Default: `"INFO"` | All |
| `max_steps` | Specifies the maximum number of steps for training. `max_steps` is optional unless neither `num_epochs` nor `num_steps` is provided, in which case `max_steps` must be provided. (int, required) | All |
| `mgmt_address` | The address of the management service used for coordinating the training job, in the form `<host>:<port>`. (str, optional) | CSX |
| `mode` | The mode of the training job: one of `"train"`, `"eval"`, `"eval_all"`, or `"train_and_eval"`. (str, required) | All |
| `model_dir` | The directory where the model checkpoints and other metadata are saved during training. (str, optional) Default: ./model_dir | All |
| `mount_dirs` | A list of paths to be mounted to the appliance containers. It should generally contain the path to the directory containing the Cerebras Model Zoo and the data dir. (List[str], optional) Default: None | CSX |
| `num_act_servers` | Number of activation servers per CS-X dedicated to streaming samples to the WSE. Input workers stream data to these activation servers, and the activation servers hold and further stream the data to the WSE. For LLMs, we generally choose 1 because they are compute-bound. For CV models we choose a higher number; a crude rule of thumb is one activation server for every 4 workers (i.e., `num_workers_per_csx // 4` if `num_workers_per_csx > 4`, else 1). It is suggested to keep the default value for this param when possible. (int, optional) Default: 1 | CSX |
| `num_csx` | The number of CSX systems to use in the Cerebras WSE cluster. (int, optional) Default: 1 | CSX |
| `num_epochs` | The number of epochs to train for. (int, optional) Default: None | All |
| `num_steps` | The number of steps to train for. (int, optional) Default: None | All |
| `num_wgt_servers` | Upper bound on the number of MemoryX servers used for storing the model weights. Compilation may choose a smaller number depending on the model topology. A sensible upper bound (currently 24) is selected if a value is not provided. (int, optional) Default: None | CSX |
| `num_workers_per_csx` | Number of input workers, per CSX, to use for streaming samples. This setting depends on whether the model is compute-bound or input-bound and how efficient the dataloader implementation is. For compute-bound models (e.g., LLMs), even 1 input worker per CSX is enough to saturate the input buffers on CSX systems, but for smaller models a larger number may be used. We currently default to 1 worker per CSX. (int, optional) Default: 0 | CSX |
| `precision_opt_level` | Setting to control the level of numerical precision used for training runs of large NLP models. See more. (int, optional) Default: 1 | CSX |
| `python_paths` | A list of paths to be exported into PYTHONPATH for worker containers. It should generally contain the path to the directory containing the Cerebras Model Zoo. (List[str], optional) Default: None | CSX |
| `save_initial_checkpoint` | Whether to save an initial checkpoint before training starts. (bool, optional) Default: False | All |
| `save_losses` | Whether to save the loss values during training. (bool, optional) Default: True | All |
| `seed` | The seed to use for random number generation for reproducibility. (int, optional) Default: None | All |
| `sync_batchnorm` | Whether to use synchronized batch normalization on multi-GPU setups. (bool, optional) Default: False | GPU |
| `target_device` | The target device to run the training on. One of: CPU, GPU, CSX. Required on the command line. (str, optional) Default: command line value | All |
| `use_cs_grad_accum` | Whether to use gradient accumulation to support larger batch sizes. (bool, optional) Default: False | CSX |
| `validate_only` | Enables the validate-only workflow, which stops compilation at the kernel matching stage. (bool, optional) Default: False | CSX |
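
Finally, a sketch of a `runconfig` section for a CSX training run. Paths, step counts, and cluster sizes are placeholders:

```yaml
runconfig:
  mode: "train"
  target_device: "CSX"
  model_dir: "./model_dir"
  max_steps: 100000                 # placeholder training length
  log_steps: 100
  checkpoint_steps: 5000
  seed: 1
  num_csx: 1
  num_workers_per_csx: 1
  precision_opt_level: 1
  mount_dirs: ["/path/to/modelzoo", "/path/to/data"]   # placeholder mount points
  python_paths: ["/path/to/modelzoo"]                  # placeholder PYTHONPATH entry
```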