# YAML params for TensorFlow models

## Model params

### Transformer based models
| Parameter Name | Description | Supported Models |
|---|---|---|
| `attention_dropout_rate` | Dropout rate for the attention layer. | All |
| `attention_type` | Type of attention to use. | All |
| `boundary_casting` | Flag to cast output values to half precision and cast input values up to full precision. | All |
| `decoder_nonlinearity` | Type of nonlinearity to be used in the decoder. | T5, Transformer |
| `decoder_num_hidden_layers` | Number of hidden layers in the Transformer decoder. Uses the same value as `encoder_num_hidden_layers` if not set. | T5, Transformer |
| `disable_nsp` | Whether to disable the next sentence prediction task. | BERT |
| `dropout_rate` | The dropout probability for all fully connected layers. | All |
| `dropout_seed` | Seed used to initialize the dropout layers. | All |
| `embedding_initializer` | Initializer to use for embeddings. See supported initializers. | All |
| `encoder_nonlinearity` | Type of nonlinearity to be used in the encoder. | BERT, Linformer, T5, Transformer |
| `encoder_num_hidden_layers` | Number of hidden layers in the encoder. | T5, Transformer |
| `filter_size` | Dimensionality of the feed-forward layer in the Transformer block. | All |
| `hidden_size` | The size of the transformer hidden layers. | All |
| `initializer` | The default initializer to be used for all weights in the model. See supported initializers. | All |
| `layer_norm_epsilon` | The epsilon value used in layer normalization layers. | All |
| `loss_scaling` | The scaling type used to calculate the loss. | GPT2, GPT3, GPTJ |
| `loss_weight` | The weight used for loss scaling when `loss_scaling` is set. | GPT2, GPT3, GPTJ |
| `max_position_embeddings` | The maximum sequence length that the model can handle. | All |
| `mixed_precision` | Whether to use mixed precision training. | All |
| `nonlinearity` | The non-linear activation function used in the feed-forward network in each transformer block. | All |
| `num_heads` | The number of attention heads in the multi-head attention layer. | All |
| `num_hidden_layers` | Number of hidden layers in the Transformer encoder/decoder. | All |
| `output_layer_initializer` | The initializer for the weights of the output layer. See supported initializers. (str, optional) Default: varies based on model. | GPT2, GPT3, GPTJ, T5 |
| `position_embedding_type` | The type of position embedding to use in the model. | All |
| `precision_opt_level` | Setting to control the level of numerical precision used for training runs of large NLP models in weight streaming. | GPT2, GPT3, GPTJ |
| `rotary_dim` | The number of dimensions used for the rotary position encoding. | GPTJ |
| `share_embedding_weights` | Whether to share the embedding weights between the input and output embeddings. | BERT, GPT2, GPT3, GPTJ, Linformer |
| `share_encoder_decoder_embedding` | Whether to share the embedding weights between the encoder and decoder. | T5, Transformer |
| `tf_summary` | Flag to save the activations as summaries. | All |
| `use_bias_in_output` | Whether to use bias in the final output layer. | GPT2, GPT3, GPTJ |
| `use_ffn_bias` | Whether to use bias in the feed-forward network (FFN). | All |
| `use_ffn_bias_in_attention` | Whether to include bias in the attention layer's feed-forward network (FFN). | All |
| `use_position_embedding` | Whether to use position embeddings in the model. | BERT, GPT2, GPT3, Linformer |
| `use_projection_bias_in_attention` | Whether to include bias in the attention layer's projection. | All |
| `use_segment_embedding` | Whether to use segment embeddings in the model. | BERT, Linformer |
| `use_untied_layer_norm` | Flag to use untied layer norm in addition to the default layer norm. | GPTJ |
| `use_vsl` | Whether to enable variable sequence length (VSL). | BERT (pre-training), T5, Transformer |
| `vocab_size` | The size of the vocabulary used in the model. | All |
| `weight_initialization_seed` | Seed applied for weight initialization. | All |
### Supported Initializers

Supported initializers include:

- `"constant"`
- `"uniform"`
- `"glorot_uniform"`
- `"normal"`
- `"glorot_normal"`
- `"truncated_normal"`
- `"variance_scaling"`
## Data loader params

### Transformers
| Parameter Name | Description | Supported Models |
|---|---|---|
| `add_special_tokens` | Flag to add special tokens in the data loader. Special tokens are defined based on the data processor. | BERT, GPT2, GPT3 |
| `batch_size` | Batch size of the data. | All |
| `buckets` | A list of sequence-length boundaries used to bucket sequences together, in order to speed up VTS/VSL. | BERT (pre-training), T5, Transformer |
| `data_dir` | Path(s) to the data files to use. | All |
| `data_processor` | Name of the data processor to be used. | All |
| `do_lower` | Flag to lowercase the text. | BERT, Linformer, T5, Transformer |
| `mask_whole_word` | Flag to mask whole words. | BERT, Linformer |
| `max_predictions_per_seq` | Maximum number of masked tokens per sequence. | BERT, Linformer |
| `max_sequence_length` | Maximum sequence length of the input data. | All |
| `mixed_precision` | Flag to cast the input to fp16. | All |
| `n_parallel_reads` | Number of parallel reads to use when reading the input data files. | All |
| `repeat` | Flag to specify whether the dataset should be repeated. | All |
| `scale_mlm_weights` | Flag to scale the masked language model (MLM) loss weights. | BERT, Linformer |
| `shuffle` | Flag to enable data shuffling. | All |
| `shuffle_buffer` | Size of the shuffle buffer, in samples. | All |
| `shuffle_seed` | Shuffle seed. | All |
| `src_data_dir` | Path to the directory containing the tokenized data files for the source sequence. | T5, Transformer |
| `src_max_sequence_length` | Largest possible sequence length for the input source sequence. Longer sequences are truncated; all other sequences are padded to this length. | T5, Transformer |
| `src_vocab_file` | Path to the vocab file for the source input. | T5, Transformer |
| `tgt_data_dir` | Path to the directory containing the tokenized data files for the target sequence. | T5, Transformer |
| `tgt_max_sequence_length` | Largest possible sequence length for the input target sequence. Longer sequences are truncated; all other sequences are padded to this length. | T5, Transformer |
| `tgt_vocab_file` | Path to the vocab file for the target input. | T5, Transformer |
| `use_multiple_workers` | Flag to specify whether the dataset will be sharded across multiple workers. | All |
| `vocab_file` | Path to the vocab file. | BERT (pre-training, fine-tuning) |
| `vocab_size` | The size of the vocabulary used in the model. | BERT (pre-training, fine-tuning) |
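A hedged sketch of a data loader section follows. The section name `train_input` and the processor name used here are assumptions for illustration; the actual section and processor names come from the model's reference config.

```yaml
train_input:
  data_processor: "BertTfRecordsProcessor"  # hypothetical processor name
  data_dir: "./train_data/"
  vocab_file: "./vocab.txt"
  max_sequence_length: 128
  max_predictions_per_seq: 20
  batch_size: 256
  shuffle: True
  shuffle_buffer: 10000
  repeat: True
```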
## Optimizer params
| Parameter Name | Description |
|---|---|
| `initial_loss_scale` | Initial loss scale to be used in gradient scaling. |
| `learning_rate` | Learning rate scheduler to be used. See the supported learning rate schedulers below. |
| `log_summaries` | Flag to log the per-layer gradient norm in TensorBoard. |
| `loss_scaling_factor` | Loss scaling factor for the gradient calculation in the learning step. |
| `max_gradient_norm` | Max norm of the gradients for learnable parameters. Used for gradient clipping. |
| `min_loss_scale` | The minimum loss scale value that can be chosen by dynamic loss scaling. |
| `max_loss_scale` | The maximum loss scale value that can be chosen by dynamic loss scaling. |
| `optimizer_type` | Optimizer to be used. |
| `ws_summary` | Flag to add a weights summary to TensorBoard. |
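A sketch of an `optimizer` section using the keys above. The optimizer name and the `"dynamic"` loss-scaling value are assumptions for illustration; check the model's reference config for the values it actually supports.

```yaml
optimizer:
  optimizer_type: "adamw"         # assumed optimizer name
  learning_rate: 1.0e-4           # constant LR; see the schedulers section below
  max_gradient_norm: 1.0          # gradient clipping
  loss_scaling_factor: "dynamic"  # assumed; a constant float may also be accepted
  log_summaries: True
```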
### Supported learning rate schedulers

The following learning rate schedules are currently supported:

- constant
- cosine
- exponential
- linear
- polynomial
- piecewise constant

`learning_rate` can be specified in the YAML as one of the following (a sketch follows the list):

- a single float, for a constant learning rate
- a dict, representing a single decay schedule
- a list of dicts, for a series of decay schedules
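The three forms could look as follows. The schedule dict keys used here (`scheduler`, `initial_learning_rate`, `end_learning_rate`, `steps`) are illustrative placeholders; consult the model's reference config for the exact schema. The alternatives are shown as separate YAML documents.

```yaml
# Alternative 1: a single float for a constant learning rate.
learning_rate: 1.0e-4
---
# Alternative 2: a single dict describing one decay schedule.
learning_rate:
  scheduler: "linear"
  initial_learning_rate: 1.0e-4
  end_learning_rate: 0.0
  steps: 100000
---
# Alternative 3: a list of dicts, applied as a series of schedules.
learning_rate:
  - scheduler: "linear"           # warmup
    initial_learning_rate: 0.0
    end_learning_rate: 1.0e-4
    steps: 10000
  - scheduler: "cosine"           # decay
    initial_learning_rate: 1.0e-4
    end_learning_rate: 0.0
    steps: 90000
```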
## Runconfig params
| Key | Description | Supported mode |
|---|---|---|
| `enable_distributed` | Flag to enable distributed training on GPU. | GPU |
| `eval_steps` | Specifies the number of steps to run during model evaluation. | All |
| `keep_checkpoint_max` | Total number of the most recent checkpoints to keep in the `model_dir`. | All |
| `log_step_count_steps` | Specifies the number of steps between logging during training. | All |
| `max_steps` | Specifies the maximum number of steps for training. | All |
| `mode` | The mode of the job, for example train or eval. | All |
| `model_dir` | The directory where the model checkpoints and other metadata are saved during training. | All |
| `multireplica` | Whether to allow multiple replicas of the same graph. | CSX (pipeline mode) |
| `num_wgt_servers` | The number of weight servers to use in weight streaming execution. | CSX (weight streaming) |
| `save_checkpoints_steps` | The number of steps between saving model checkpoints during training. | All |
| `save_summary_steps` | Controls the number of steps between TensorBoard summaries. | All |
| `tf_random_seed` | The seed to use for random number generation, for reproducibility. | All |
| `use_cs_grad_accum` | Whether to use gradient accumulation to support larger batch sizes. | CSX |
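A sketch of a `runconfig` section using keys from the table above; all values are illustrative.

```yaml
runconfig:
  mode: "train"
  model_dir: "./model_dir"
  max_steps: 100000
  save_checkpoints_steps: 10000
  keep_checkpoint_max: 5
  log_step_count_steps: 100
  save_summary_steps: 500
  tf_random_seed: 1234
```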