common.pytorch.optim package#

Subpackages#

common.pytorch.optim.sparse package

Submodules#

common.pytorch.optim.ASGD module#

common.pytorch.optim.Adadelta module#

common.pytorch.optim.Adafactor module#

common.pytorch.optim.Adagrad module#

common.pytorch.optim.AdamBase module#

class common.pytorch.optim.AdamBase.Adam#

Bases: common.pytorch.optim.AdamBase.AdamBase

__init__(params: Iterable[torch.nn.parameter.Parameter], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-06, weight_decay: float = 0.0, amsgrad: bool = False)#

load_state_dict(state_dict)#

Loads the optimizer state.

Parameters: state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict().

Adds checkpoint compatibility with the Adam from PyTorch

class common.pytorch.optim.AdamBase.AdamBase#

Bases: modelzoo.common.pytorch.optim.CSOptimizer.CSOptimizer

AdamW optimizer implemented to conform to execution within the constraints of the Cerebras WSE. This includes pre-initializing the optimizer state and gradually reducing the bias correction through exponential decay of beta1_power and beta2_power, rather than recomputing beta1^step at each step.
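
The running-power idea described above can be sketched for a single scalar parameter. This is a minimal illustration, not the Cerebras implementation: the function name adam_scalar_steps is hypothetical, and weight decay, amsgrad, and the exact update order used by AdamBase are not reproduced.

```python
import math


def adam_scalar_steps(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-6):
    """Apply Adam updates to one scalar parameter (illustrative only).

    beta1_power and beta2_power are kept as running products, so
    beta ** step is never recomputed from the step count.
    """
    param = 0.0
    exp_avg = 0.0       # first-moment state, allocated before the first step
    exp_avg_sq = 0.0    # second-moment state, allocated before the first step
    beta1_power = 1.0   # equals beta1 ** step after each update
    beta2_power = 1.0   # equals beta2 ** step after each update
    for grad in grads:
        beta1_power *= beta1
        beta2_power *= beta2
        exp_avg = beta1 * exp_avg + (1.0 - beta1) * grad
        exp_avg_sq = beta2 * exp_avg_sq + (1.0 - beta2) * grad * grad
        # Bias-corrected moment estimates
        m_hat = exp_avg / (1.0 - beta1_power)
        v_hat = exp_avg_sq / (1.0 - beta2_power)
        param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, beta1_power, beta2_power
```

Because the powers are ordinary state updated by one multiplication per step, they can be allocated up front together with the moment estimates.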

__init__(params: Iterable[torch.nn.parameter.Parameter], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-06, weight_decay: float = 0.0, l2_regularization_rate: float = 0.0, correct_bias: bool = True, amsgrad: bool = False)#

convert_state_dict_for_checkpoint(state_dict)#

Converts the state_dict for compatibility with AdamW from huggingface_common, which is the optimizer used by PyTorchBaseModel when not run on WSE and is otherwise API compatible.

Parameters: state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict().

Returns the modified state_dict.

load_state_dict(state_dict)#

Loads the optimizer state.

Parameters: state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict().

This overrides torch.optim.Optimizer to add checkpoint compatibility with the AdamW from huggingface_common, which is otherwise API compatible.

preinitialize()#: Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

state_names_to_sparsify()#

step(closure=None)#

Performs a single optimization step.

Parameters: closure (Callable, optional) – A closure that reevaluates the model and returns the loss.

class common.pytorch.optim.AdamBase.AdamW#

Bases: common.pytorch.optim.AdamBase.AdamBase

__init__(params: Iterable[torch.nn.parameter.Parameter], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-06, weight_decay: float = 0.0, correct_bias: bool = True, amsgrad: bool = False)#

load_state_dict(state_dict)#

Loads the optimizer state.

Parameters: state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict().

Adds checkpoint compatibility with the AdamW from HuggingFace

common.pytorch.optim.Adamax module#

common.pytorch.optim.CSOptimizer module#

Abstract base class for Cerebras Optimizers.

class common.pytorch.optim.CSOptimizer.CSOptimizer#

Bases: torch.optim.Optimizer, abc.ABC

Cerebras Base Optimizer class

The Cerebras base optimizer class handles preinitialization of optimizer states for non-CS runs, which makes an optimizer implementation compatible with both CS and non-CS runs. It also preinitializes the global step tensor and provides a method to retrieve the global step.

__init__(params, defaults, enable_global_step=False)#

load_state_dict(state_dict)#

post_load_state_dict()#: Actions to perform after initializing state and loading the state dict

abstract preinitialize()#: Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

abstract state_names_to_sparsify()#: Returns the names of per-parameter states that need to be sparsified when applying sparsity to the underlying parameters.

abstract step(closure=None)#

Performs a single optimization step.

Parameters: closure (callable, optional) – A closure that reevaluates the model and returns the loss.

to(device=None)#

Moves optimizer state onto the specified device, or onto the corresponding parameter's device if no device is specified.

Parameters

device (optional) – Device to move state tensors to. If not specified, the corresponding parameter's device will be used.

Returns

self

common.pytorch.optim.Lamb module#

common.pytorch.optim.NAdam module#

common.pytorch.optim.RAdam module#

common.pytorch.optim.RMSprop module#

common.pytorch.optim.Rprop module#

common.pytorch.optim.SGD module#

common.pytorch.optim.lr_scheduler module#

class common.pytorch.optim.lr_scheduler.ChainedScheduler#

Bases: torch.optim.lr_scheduler.ChainedScheduler

Chains a list of learning rate schedulers. It takes a list of chainable learning rate schedulers and calls their step() functions consecutively with a single call.

__init__(*args, **kwargs)#

lr_function(global_step)#

state_dict()#

step()#

class common.pytorch.optim.lr_scheduler.ConstantLR#

Bases: common.pytorch.optim.lr_scheduler.LRScheduler

Maintains a constant learning rate for each parameter group (no decaying).

Parameters

optimizer – The optimizer to schedule
val – The learning_rate value to maintain
decay_steps – The number of steps to decay for

__init__(optimizer: torch.optim.Optimizer, val: float, decay_steps: Optional[int] = None, disable_lr_steps_reset: bool = False)#

class common.pytorch.optim.lr_scheduler.CosineAnnealingLR#

Bases: common.pytorch.optim.lr_scheduler.LRScheduler

Set the learning rate of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial lr and \(T_{cur}\) is the number of steps since the last restart in SGDR:

\[\begin{split}\begin{aligned} \eta_t & = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right), & T_{cur} \neq (2k+1)T_{max}; \\ \eta_{t+1} & = \eta_{t} + \frac{1}{2}(\eta_{max} - \eta_{min}) \left(1 - \cos\left(\frac{1}{T_{max}}\pi\right)\right), & T_{cur} = (2k+1)T_{max}. \end{aligned}\end{split}\]

Notice that because the schedule is defined recursively, the learning rate can be simultaneously modified outside this scheduler by other operators. If the learning rate is set solely by this scheduler, the learning rate at each step becomes:

\[\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right)\]

It has been proposed in SGDR: Stochastic Gradient Descent with Warm Restarts. Note that this only implements the cosine annealing part of SGDR, and not the restarts.

This class is similar to the PyTorch CosineAnnealingLR learning rate scheduler.

Parameters

optimizer – The optimizer to schedule
initial_learning_rate – The initial learning rate.
T_max – Maximum number of iterations.
eta_min – Minimum learning rate.

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, T_max: int, eta_min: float, disable_lr_steps_reset: bool = False)#
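
The closed-form expression above (the case where the learning rate is set solely by this scheduler) can be evaluated directly. The sketch below is a plain-Python illustration with a hypothetical function name. It reuses the parameter names from the signature and assumes that T_cur equals the global step.

```python
import math


def cosine_annealing_lr(step, initial_learning_rate, T_max, eta_min):
    """Closed-form cosine annealing; eta_max is the initial learning rate."""
    cosine = math.cos(math.pi * step / T_max)
    return eta_min + 0.5 * (initial_learning_rate - eta_min) * (1.0 + cosine)


# Learning rate at the start, midpoint, and end of a 100-step schedule
schedule = [cosine_annealing_lr(s, 0.1, 100, 0.001) for s in (0, 50, 100)]
```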

class common.pytorch.optim.lr_scheduler.CosineAnnealingWarmRestarts#

Bases: common.pytorch.optim.lr_scheduler.LRScheduler

Set the learning rate of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial lr, \(T_{cur}\) is the number of steps since the last restart and \(T_{i}\) is the number of steps between two warm restarts in SGDR:

\[\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{i}}\pi\right)\right)\]

When \(T_{cur}=T_{i}\), set \(\eta_t = \eta_{min}\). When \(T_{cur}=0\) after restart, set \(\eta_t=\eta_{max}\).

It has been proposed in SGDR: Stochastic Gradient Descent with Warm Restarts.

This class is similar to the PyTorch CosineAnnealingWarmRestarts learning rate scheduler.

Parameters

optimizer – The optimizer to schedule
initial_learning_rate – The initial learning rate.
T_0 – Number of iterations for the first restart.
T_mult – The factor by which T_i increases after a restart. Currently, T_mult must be set to 1.0.
eta_min – Minimum learning rate.

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, T_0: int, T_mult: int, eta_min: float, disable_lr_steps_reset: bool = False)#
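
With T_mult fixed at 1, every cycle has length T_0, so T_cur can be taken as the step modulo T_0. The sketch below rests on that assumption. The function name is hypothetical, and restarting exactly at multiples of T_0 is a simplification rather than a statement about the Cerebras implementation.

```python
import math


def cosine_warm_restarts_lr(step, initial_learning_rate, T_0, eta_min):
    """Cosine annealing with warm restarts for the T_mult == 1 case."""
    t_cur = step % T_0  # steps since the last restart
    cosine = math.cos(math.pi * t_cur / T_0)
    return eta_min + 0.5 * (initial_learning_rate - eta_min) * (1.0 + cosine)
```

At step T_0 the value returns to the initial learning rate, which is the warm restart.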

class common.pytorch.optim.lr_scheduler.CosineDecayLR#

Bases: common.pytorch.optim.lr_scheduler.LRScheduler

Applies the cosine decay schedule as described in the Keras CosineDecay class.

Parameters

optimizer – The optimizer to schedule
initial_learning_rate – The initial learning rate.
end_learning_rate – The final learning rate
decay_steps – Number of steps to perform the decay

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, end_learning_rate: float, decay_steps: int, disable_lr_steps_reset: bool = False)#

class common.pytorch.optim.lr_scheduler.CyclicLR#

Bases: common.pytorch.optim.lr_scheduler.LRScheduler

Sets the learning rate of each parameter group according to cyclical learning rate policy (CLR). The policy cycles the learning rate between two boundaries with a constant frequency, as detailed in the paper Cyclical Learning Rates for Training Neural Networks. The distance between the two boundaries can be scaled on a per-iteration or per-cycle basis.

Cyclical learning rate policy changes the learning rate after every batch. step should be called after a batch has been used for training.

This class has three built-in policies, as put forth in the paper:

“triangular”: A basic triangular cycle without amplitude scaling.
“triangular2”: A basic triangular cycle that scales initial amplitude by half each cycle.
“exp_range”: A cycle that scales initial amplitude by \(\text{gamma}^{\text{cycle iterations}}\) at each cycle iteration.

This class is similar to the PyTorch CyclicLR learning rate scheduler.

Parameters

optimizer – The optimizer to schedule.
base_lr – Initial learning rate which is the lower boundary in the cycle.
max_lr – Upper learning rate boundaries in the cycle.
step_size_up – Number of training iterations in the increasing half of a cycle.
step_size_down – Number of training iterations in the decreasing half of a cycle.
mode – One of {‘triangular’, ‘triangular2’, ‘exp_range’}.
gamma – Constant in ‘exp_range’ scaling function: gamma**(cycle iterations).
scale_mode – {‘cycle’, ‘iterations’} Defines whether scale_fn is evaluated on cycle number or cycle iterations.

__init__(optimizer: torch.optim.Optimizer, base_lr: float, max_lr: float, step_size_up: int, step_size_down: int, mode: str, gamma: float, scale_mode: str, disable_lr_steps_reset: bool = False)#

set_scale_fn()#
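
The "triangular" and "triangular2" policies can be sketched in plain Python as follows. This is an illustrative reading of the policy descriptions above, not the class's code: the function name is hypothetical, and the "exp_range" mode and scale_mode handling are omitted.

```python
def cyclic_lr(step, base_lr, max_lr, step_size_up, step_size_down,
              mode="triangular"):
    """Triangular cyclical learning rate (sketch; no exp_range support)."""
    total = step_size_up + step_size_down
    cycle = step // total   # 0-based index of the current cycle
    pos = step % total      # position inside the current cycle
    if pos <= step_size_up:
        fraction = pos / step_size_up             # rising half
    else:
        fraction = (total - pos) / step_size_down  # falling half
    amplitude = max_lr - base_lr
    if mode == "triangular2":
        amplitude /= 2 ** cycle  # halve the amplitude every cycle
    elif mode != "triangular":
        raise ValueError(f"unsupported mode in this sketch: {mode}")
    return base_lr + amplitude * fraction
```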

class common.pytorch.optim.lr_scheduler.ExponentialLR#

Bases: common.pytorch.optim.lr_scheduler.LRScheduler

Decays the learning rate of each parameter group by decay_rate every step.

This class is similar to the PyTorch ExponentialLR learning rate scheduler.

Parameters

optimizer – The optimizer to schedule
initial_learning_rate – The initial learning rate.
decay_steps – Number of steps to perform the decay
decay_rate – The decay rate
staircase – If True decay the learning rate at discrete intervals

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, decay_steps: int, decay_rate: float, staircase: bool = False, disable_lr_steps_reset: bool = False)#
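
One plausible reading of decay_steps and staircase is the Keras-style exponential decay, where the exponent is step / decay_steps and staircase floors it to an integer. The sketch below assumes that reading, which the documentation above does not state explicitly. The function name is hypothetical.

```python
def exponential_lr(step, initial_learning_rate, decay_steps, decay_rate,
                   staircase=False):
    """Exponential decay, assuming a Keras-style exponent (an assumption)."""
    if staircase:
        exponent = step // decay_steps  # decay at discrete intervals
    else:
        exponent = step / decay_steps   # continuous decay
    return initial_learning_rate * decay_rate ** exponent
```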

class common.pytorch.optim.lr_scheduler.InverseExponentialTimeDecayLR#

Bases: common.pytorch.optim.lr_scheduler.LRScheduler

Decays the learning rate inverse-exponentially over time, as described in the Keras InverseTimeDecay class.

Parameters

optimizer – The optimizer to schedule
initial_learning_rate – The initial learning rate.
step_exponent – Exponential value.
decay_steps – Number of steps to perform the decay.
decay_rate – The decay rate.
staircase – If True decay the learning rate at discrete intervals.

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, step_exponent: int, decay_steps: int, decay_rate: float, staircase: bool = False, disable_lr_steps_reset: bool = False)#

class common.pytorch.optim.lr_scheduler.InverseSquareRootDecayLR#

Bases: common.pytorch.optim.lr_scheduler.LRScheduler

Decays the learning rate by the inverse square root of the step count, as described in the following equation:

\[\begin{aligned} lr_t & = \frac{\text{scale}}{\sqrt{\max\{t, \text{warmup_steps}\}}}. \end{aligned}\]

Parameters

optimizer – The optimizer to schedule
initial_learning_rate – The initial learning rate.
scale – Multiplicative factor to scale the result.
warmup_steps – Number of initial steps during which initial_learning_rate is used.

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, scale: float, warmup_steps: int, disable_lr_steps_reset: bool = False)#
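
The equation above translates directly to code. The sketch below implements only the displayed formula, with a hypothetical function name. How initial_learning_rate interacts with it inside the class is not shown in the documentation and is left out.

```python
import math


def inverse_sqrt_decay_lr(step, scale, warmup_steps):
    """lr_t = scale / sqrt(max(t, warmup_steps)), as in the equation."""
    return scale / math.sqrt(max(step, warmup_steps))
```

The max() keeps the rate flat at scale / sqrt(warmup_steps) until the step count exceeds warmup_steps, after which it decays.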

class common.pytorch.optim.lr_scheduler.LRScheduler#

Bases: torch.optim.lr_scheduler.LambdaLR, abc.ABC

Cerebras specific learning rate scheduler base class.

The learning rate schedulers implemented in this file are specifically designed to be run on a Cerebras system. This means that there are certain caveats to these custom schedulers that differ from a typical LR scheduler found in core PyTorch.

The learning rate schedulers here are intended to be stepped at every iteration. This means lr_scheduler.step() should be called after every optimizer.step(). Hence, the learning rate schedulers operate on a step-by-step basis. Having said that, there are some variables used such as last_epoch that might indicate otherwise. The only reason these variables are used is to match what is used in core PyTorch. It does not indicate that things are operating on an epoch-by-epoch basis.

Also, note that the above means that our LR schedulers are incompatible with the LR schedulers found in core PyTorch. The state cannot simply be transferred between the two. So, one of the LR schedulers defined here must be used in order to have LR scheduling on the Cerebras system.

__init__(optimizer, decay_steps: Optional[int] = None, disable_lr_steps_reset: bool = False)#

get_lr()#

global_start_step = 0#

initial_epoch = 0#

load_state_dict(state_dict: dict)#

lr_function(global_step)#

state_dict()#

step(*args, **kwargs)#

Steps the scheduler and computes the latest learning rate

Only sets the last_epoch if running on CS

class common.pytorch.optim.lr_scheduler.LambdaLR#

Bases: common.pytorch.optim.lr_scheduler.LRScheduler

Sets the learning rate of each parameter group to the initial lr times a given function (which is specified by overriding set_lr_lambda).

Parameters

optimizer – The optimizer to schedule
initial_learning_rate – The initial learning rate.

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, disable_lr_steps_reset: bool = False)#

set_lr_lambda()#

class common.pytorch.optim.lr_scheduler.MultiStepLR#

Bases: common.pytorch.optim.lr_scheduler.LRScheduler

Decays the learning rate of each parameter group by gamma once the number of steps reaches one of the milestones. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler.

This class is similar to the PyTorch MultiStepLR learning rate scheduler.

Parameters

optimizer – The optimizer to schedule
initial_learning_rate – The initial learning rate.
gamma – Multiplicative factor of learning rate decay.
milestones – List of step indices. Must be increasing.

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, gamma: float, milestones: List[int], disable_lr_steps_reset: bool = False)#
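
In closed form, the rate is the initial value multiplied by gamma once for every milestone already reached. The sketch below assumes, as in stock PyTorch, that the decay applies from the milestone step itself. The function name is hypothetical.

```python
from bisect import bisect_right


def multi_step_lr(step, initial_learning_rate, gamma, milestones):
    """Multiply by gamma once per milestone that has been reached."""
    reached = bisect_right(milestones, step)  # milestones must be increasing
    return initial_learning_rate * gamma ** reached
```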

class common.pytorch.optim.lr_scheduler.MultiplicativeLR#

Bases: common.pytorch.optim.lr_scheduler.LRScheduler

Multiply the learning rate of each parameter group by the supplied coefficient.

Parameters

optimizer – The optimizer to schedule
initial_learning_rate – The initial learning rate.
coefficient – Multiplicative factor of learning rate.

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, coefficient: float, disable_lr_steps_reset: bool = False)#

set_lr_lambda()#

class common.pytorch.optim.lr_scheduler.OneCycleLR#

Bases: common.pytorch.optim.lr_scheduler.LRScheduler

Sets the learning rate of each parameter group according to the 1cycle learning rate policy. The 1cycle policy anneals the learning rate from an initial learning rate to some maximum learning rate and then from that maximum learning rate to some minimum learning rate much lower than the initial learning rate. This policy was initially described in the paper Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates.

This scheduler is not chainable.

This class is similar to the PyTorch OneCycleLR learning rate scheduler.

Parameters

optimizer – The optimizer to schedule
initial_learning_rate – Initial learning rate. Compared with PyTorch, this is equivalent to max_lr / div_factor.
max_lr – Upper learning rate boundaries in the cycle.
total_steps – The total number of steps in the cycle.
pct_start – The percentage of the cycle (in number of steps) spent increasing the learning rate.
final_div_factor – Determines the minimum learning rate via min_lr = initial_lr/final_div_factor.
three_phase – If True, use a third phase of the schedule to annihilate the learning rate
anneal_strategy – Specifies the annealing strategy: “cos” for cosine annealing, “linear” for linear annealing.

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, max_lr: float, total_steps: int, pct_start: float, final_div_factor: float, three_phase: bool, anneal_strategy: str, disable_lr_steps_reset: bool = False)#
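
The two-phase variant of the policy (three_phase=False) can be sketched as follows. The phase boundaries mirror the layout used by stock PyTorch's OneCycleLR, and it is an assumption that the Cerebras class places them identically. Both function names are hypothetical, and the sketch expects pct_start * total_steps to be greater than 1.

```python
import math


def _anneal(start, end, pct, strategy):
    """Interpolate from start to end as pct goes from 0 to 1."""
    if strategy == "cos":
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)
    if strategy == "linear":
        return (end - start) * pct + start
    raise ValueError(f"unknown anneal_strategy: {strategy}")


def one_cycle_lr(step, initial_learning_rate, max_lr, total_steps, pct_start,
                 final_div_factor, anneal_strategy="cos"):
    """Two-phase 1cycle schedule (sketch; three_phase is not covered)."""
    min_lr = initial_learning_rate / final_div_factor
    peak_step = float(pct_start * total_steps) - 1  # last step of phase 1
    last_step = total_steps - 1                     # last step of phase 2
    if step <= peak_step:
        pct = step / peak_step
        return _anneal(initial_learning_rate, max_lr, pct, anneal_strategy)
    pct = (step - peak_step) / (last_step - peak_step)
    return _anneal(max_lr, min_lr, pct, anneal_strategy)
```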

class common.pytorch.optim.lr_scheduler.PiecewiseConstantLR#

Bases: common.pytorch.optim.lr_scheduler.SequentialLR

Adjusts the learning rate to a predefined constant at each milestone and holds this value until the next milestone. Notice that such adjustment can happen simultaneously with other changes to the learning rate from outside this scheduler.

Parameters

optimizer – The optimizer to schedule
learning_rates – List of learning rates to maintain before/during each milestone.
milestones – List of step indices. Must be increasing.

__init__(optimizer, learning_rates: List[float], milestones: List[int], disable_lr_steps_reset: bool = False)#
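
A piecewise-constant schedule amounts to a lookup of the interval that contains the current step. The sketch below assumes that learning_rates has one more entry than milestones and that a new rate takes effect at the milestone step itself. Neither detail is confirmed by the text above, and the function name is hypothetical.

```python
from bisect import bisect_right


def piecewise_constant_lr(step, learning_rates, milestones):
    """Return the constant rate for the interval containing step."""
    if len(learning_rates) != len(milestones) + 1:
        raise ValueError("expected one more learning rate than milestones")
    return learning_rates[bisect_right(milestones, step)]
```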

class common.pytorch.optim.lr_scheduler.PolynomialLR#

Bases: common.pytorch.optim.lr_scheduler.LRScheduler

Decays the learning rate of each parameter group using a polynomial function in the given decay_steps.

This class is similar to the PyTorch PolynomialLR learning rate scheduler.

Parameters

optimizer – The optimizer to schedule
initial_learning_rate – The initial learning rate.
end_learning_rate – The final learning rate
decay_steps – Number of steps to perform the decay
power – Exponent applied to the ratio of step completion, that is, the “x” in y=mx+b (1 for linear). Default: 1.0 (only linear is supported at the moment).
cycle – Whether to cycle

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, end_learning_rate: float, decay_steps: int, power: float = 1.0, cycle: bool = False, disable_lr_steps_reset: bool = False)#
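
A polynomial decay of this shape is commonly written as in the Keras PolynomialDecay schedule. The sketch below assumes that form, including its treatment of cycle, which is an assumption about this class. The function name is hypothetical. With power=1.0, the only setting the documentation reports as supported, the decay is linear.

```python
import math


def polynomial_lr(step, initial_learning_rate, end_learning_rate, decay_steps,
                  power=1.0, cycle=False):
    """Polynomial decay from the initial to the end learning rate."""
    if cycle:
        # Stretch the horizon to the next multiple of decay_steps
        horizon = decay_steps * max(1, math.ceil(step / decay_steps))
    else:
        horizon = decay_steps
        step = min(step, decay_steps)  # hold the end value after decay_steps
    remaining = 1.0 - step / horizon
    span = initial_learning_rate - end_learning_rate
    return span * remaining ** power + end_learning_rate
```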

class common.pytorch.optim.lr_scheduler.SequentialLR#

Bases: torch.optim.lr_scheduler.SequentialLR

Receives a list of schedulers that are expected to be called sequentially during the optimization process, along with milestone points that define the exact intervals at which each scheduler is called.

This class is a wrapper around the PyTorch SequentialLR learning rate scheduler.

Parameters

optimizer – Wrapped optimizer
schedulers (list) – List of chained schedulers.
milestones (list) – List of integers that reflects milestone points.
last_epoch (int) – The index of last epoch. Default: -1.

__init__(optimizer, *args, **kwargs)#

load_state_dict(state_dict: dict)#

lr_function(global_step)#

state_dict()#

step()#

class common.pytorch.optim.lr_scheduler.StepLR#

Bases: common.pytorch.optim.lr_scheduler.LRScheduler

Decays the learning rate of each parameter group by gamma every step_size. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler.

This class is similar to the PyTorch StepLR learning rate scheduler.

Parameters

optimizer – The optimizer to schedule
initial_learning_rate – The initial learning rate.
step_size – Period of learning rate decay.
gamma – Multiplicative factor of learning rate decay.

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, step_size: int, gamma: float, disable_lr_steps_reset: bool = False)#
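
In closed form, the rate after a given number of steps is the initial value multiplied by gamma once per completed period. The sketch below is a plain-Python illustration with a hypothetical function name. It assumes no other component modifies the learning rate.

```python
def step_lr(step, initial_learning_rate, step_size, gamma):
    """Decay by gamma once every step_size steps."""
    return initial_learning_rate * gamma ** (step // step_size)
```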

common.pytorch.optim.utils module#

Module contents#
