Train an LLM using Maximal Update Parameterization#

Overview#

GPT-style models are commonly trained using Standard Parametrization (SP), where model weights are initialized from normal distributions with either a constant standard deviation or a standard deviation based on the shape of each layer. However, as models scale up, SP does not account for potential inter-layer interactions, which can make training unstable. This instability can cause costly restarts. Further, hyperparameters tuned at one model size do not transfer well as models scale up.

We can address these challenges with μP (Maximal Update Parameterization), which enables:

  • Stable training dynamics at large scale by controlling initialization, activation magnitudes, and layer-wise adaptive learning rates independently of model width

  • Zero-shot hyperparameter transfer from a smaller model to a larger model. Essentially, μP makes the model’s near-optimal hyperparameters invariant to width.

For example, the Cerebras-GPT family of models was trained using both SP and μP parametrizations. For the SP configurations, hyperparameters were chosen from the best results found in the literature. For the μP configurations, μTransfer enabled scaling hyperparameters from a 40M parameter model to models of up to 2.7B parameters, resulting in lower loss. These results highlight how μP improves large-model scaling, accuracy, and hyperparameter predictability at scale.

Note

Appendix G in the Cerebras-GPT paper contains the lessons learned by the Cerebras team while using μP parametrization, including implementation differences with respect to SP parametrization, μP hyperparameter search details, and advice for practitioners regarding the critical batch size.

In particular, we suggest looking at Table 14 (Cheat Sheet: All implementation details required to compare SP and μP) to identify the differences between the μP and SP parametrizations.

Procedure#

GPT-2, GPT-3, Bloom, Llama, Falcon, Starcoder, and MPT models with a decoder-only architecture and an autoregressive language modeling objective are supported by μP parameterization. μP works with different types of position embeddings, such as ALiBi, RoPE, relative, and fixed position embeddings; with efficient attention architectures like Multi-Query Attention; and with activation functions such as SwiGLU. All configuration files for these models are SP parametrized unless noted as <name_config>_mup.yaml. You can find SP and μP parametrizations in the Cerebras Model Zoo. For more information, refer to the configuration YAML files.

1. Execute the SP or μP config file using the run.py script of the preferred GPT-style model.

2. Specify the path to the config file using the --config flag, as shown in the example below. For more information on launching a Cerebras job, visit the Quickstart guide.
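
For example, to launch the μP configuration generated later in this guide (the path and file name are illustrative, and your run.py invocation may require additional arguments depending on your setup; see the Quickstart guide):

python run.py --config </path/to/config>/params_gpt3_2p7b_mup.yaml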

Differences between SP and μP parametrization#

Structure#

μP parametrization introduces additional hyperparameters that scale internal layers, including element-wise activation tensor scaling and layer-wise learning rate scaling for certain layers. These hyperparameters are specified in the config file as follows:

model: 
    ...
    # muP
    scale_qk_dot_by_d: True
    output_logits_scale: ...
    embeddings_scale: ...

optimizer:
    ...
    adjust_learning_rate:
        decoder_kernel: ...

Initialization values#

μP parametrization also requires adjusting the initializers of the affected layers. These hyperparameters are intended to be derived using the μTransfer approach rather than reproduced from results found in the literature. The next section explains how to compute these values with μTransfer.

Note

To learn more about the mathematical differences between μP and SP, we suggest reviewing Table 14 (Cheat Sheet: All implementation details required to compare SP and μP) in Appendix G of the Cerebras-GPT paper.

You can find examples of SP and μP parametrization in the configurations of Cerebras-GPT models of the same size, including the 2.7B parameter Cerebras-GPT model using both SP and μP parametrizations.

Transferring hyperparameters#

μTransfer is a hyperparameter transfer paradigm that makes zero-shot transfer of near-optimal hyperparameters possible from a small version of the model to a large model via μP (Maximal Update Parameterization). The “small model” is called the proxy-model for which the hyperparameters are tuned, and the “large model” is referred to as the target-model.

First, you will tune near-optimal hyperparameters for the “proxy-model”. In particular, the optimal learning rate, the initialization standard deviation, and the embedding multiplier are needed to derive the scaling formulas for the “target-model”. As an example, the Cerebras-GPT family used a 200-sample random hyperparameter search on a 40M parameter proxy model trained on 600M tokens with a batch size of 131k tokens. You can find the results and details on the optimal hyperparameters in Appendix G of the paper.

Second, you will scale up the hyperparameters based on the size of the target-model. For this, it is helpful to identify the different types of hyperparameters in a GPT-style model:

  • μTransferred Across: hyperparameters that define the training scale. Theoretically demonstrated: width. Empirically demonstrated: depth, batch size, training time, sequence length.

  • μTransferable: hyperparameters that can transfer from the small to the large model. These include optimization-related hyperparameters (learning rate (LR), momentum, Adam betas, LR schedule, etc.), initialization (per-layer init. variance), and parameter multipliers (multiplicative constants applied after weights/biases).

  • Not μTransferable: hyperparameters that do not transfer with μTransfer because they depend on model size and data size, such as regularization (dropout, weight decay, etc.).

Scaling the hyperparameter values from the proxy to the target model is defined by a series of equations depending on the layer width multiplier \(m_{width}=d_{target}/d_{proxy}\), where \(d\) stands for the width of each layer. For example, let’s use the 40M parameter Cerebras-GPT model as the proxy model and the 2.7B parameter Cerebras-GPT model as the target model. The 40M model has a hidden size of 256, and the 2.7B model has a hidden size of 2560. Thus \(d_{proxy}=256\), \(d_{target} = 2560\), and \(m_{width}=d_{target}/d_{proxy} = 10\). These equations are found in Table 14 in Appendix G of the Cerebras-GPT paper.
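
As a rough sketch of how the Table 14 rules play out for this example (assuming the base learning rate of 6e-3 and base initialization standard deviation of 0.08 from the Cerebras-GPT sweep, which are also the defaults of the conversion script below; treat Table 14 as the authoritative reference), the main width-dependent scalings are:

\[
\begin{aligned}
m_{width} &= d_{target}/d_{proxy} = 2560/256 = 10 \\
\eta_{decoder} &= \eta_{base}/m_{width} = 6\times10^{-3}/10 = 6\times10^{-4} \\
\sigma_{hidden} &= \sigma_{base}/\sqrt{m_{width}} = 0.08/\sqrt{10} \approx 0.025 \\
\text{output logits scale} &\propto 1/m_{width}
\end{aligned}
\]

Width-invariant hyperparameters found during the sweep, such as the embeddings multiplier, are transferred to the target model without additional width scaling.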

For convenience, the Cerebras Model Zoo contains the script convert_config_to_mup.py, which scales GPT-3 configurations based on a target-model configuration YAML file. This script has the following parameters:

  • --input_yaml or -i: [Required] Configuration YAML file of the target-model. This can be a standard configuration file, like the ones found in the Cerebras Model Zoo.

  • --base_layer_width or -d_base: [Optional] Proxy-model’s width; defaults to 256.

  • --base_lr or -lr_base: [Optional] Proxy-model’s learning rate, determined by the hyperparameter sweep; defaults to 6e-3. Currently, config generation is supported only for sequential Linear learning rate schedules: the first LR scheduler should perform linear warm-up and the second should perform linear decay.

  • --base_init_std or -std_base: [Optional] Proxy-model’s initialization standard deviation; defaults to 0.08.

  • --m_embed or -m_base: Proxy-model’s embeddings multiplier; defaults to 10.0.

  • --output_yaml or -o: [Optional] Output YAML file to save the μP config. If not provided, the config is stored under the same path as the input, but with a _mup tag.

Implementation notes#

The default values of these parameters are based on the findings in the Cerebras-GPT paper, with a proxy model of 40M parameters.

For example, you can use this script with the configuration file of a 2.7B parameter GPT-3 model, called params_gpt3_2p7b.yaml, and all of the default arguments as follows:

python convert_config_to_mup.py --input_yaml </path/to/config>/params_gpt3_2p7b.yaml
muP config saved to </path/to/config>/params_gpt3_2p7b_mup.yaml
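
Equivalently, you can make the proxy-model flags explicit; the values below are simply the documented defaults, repeated here for illustration:

python convert_config_to_mup.py --input_yaml </path/to/config>/params_gpt3_2p7b.yaml --base_layer_width 256 --base_lr 6e-3 --base_init_std 0.08 --m_embed 10.0 --output_yaml </path/to/config>/params_gpt3_2p7b_mup.yaml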

Best Practices#

When picking the best μP hyperparameter configuration from your sweep, it is recommended to select the top 10 configurations from the 40M base model with the best validation loss. If the goal is to transfer the hyperparameters to a 13B model, it is advised to first transfer them to an intermediate model size, such as 256M or 590M. At the 256M or 590M scale, train with the top 10 configs for 20 tokens per parameter and evaluate the models on a validation set.

When choosing the best configuration based on validation loss, do not look only at the lowest loss value; also inspect the training dynamics. In our experience, some configurations reach very low loss values but show instabilities in the training dynamics, which manifest as loss spikes or an exploding gradient norm. In that scenario, it is best to pick a configuration with the second- or third-best validation loss that exhibits stable training dynamics.

Consult the base μP configuration for the GPT-3 model, available at 40M config. This configuration was employed internally for our hyperparameter optimization processes. Subsequently, we applied μTransfer techniques to upscale to the 256M model. For different model architectures like Llama and Falcon, it’s necessary to develop a new 40M configuration that aligns with their specific structural requirements.

If there is a significant change in the model architecture, such as switching position embeddings from relative to RoPE or ALiBi, or changing the activation function from GeLU to SwiGLU, it is advised to redo the hyperparameter sweep at the 40M scale to pick the best configs. In such cases, using the optimal set of hyperparameters from the Cerebras-GPT paper is not recommended.