cerebras_pytorch.optim#

Available optimizers in the cerebras_pytorch package

optim.ASGD#

class cerebras_pytorch.optim.ASGD[source]#

ASGD optimizer implemented to conform to execution within the constraints of the Cerebras WSE, including pre-initializing optimizer state.

For more details, see https://dl.acm.org/citation.cfm?id=131098

Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

__init__(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0, maximize: bool = False)[source]#
Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

preinitialize()[source]#

Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

state_names_to_sparsify()[source]#

Return the names of per-parameter states that need to be sparsified when applying sparsity to the underlying parameters.

step(closure=None)#

Performs a single optimization step.

Parameters

closure (Callable, optional) – A closure that reevaluates the model and returns the loss.

optim.Adadelta#

class cerebras_pytorch.optim.Adadelta[source]#

Adadelta optimizer implemented to perform the required pre-initialization of the optimizer state.

Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

__init__(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0, maximize: bool = False)[source]#
Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

preinitialize()[source]#

Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

state_names_to_sparsify()[source]#

Return the names of per-parameter states that need to be sparsified when applying sparsity to the underlying parameters.

step(closure: Optional[Callable] = None)#

Performs a single optimization step.

Parameters

closure – A closure that reevaluates the model and returns the loss.

optim.Adafactor#

class cerebras_pytorch.optim.Adafactor[source]#

Adafactor optimizer implemented to conform to execution within the constraints of the Cerebras WSE.

Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

__init__(params, lr, eps=(1e-30, 0.001), clip_threshold=1.0, decay_rate=-0.8, beta1=None, weight_decay=0.0, scale_parameter=True, relative_step=False, warmup_init=False)[source]#
Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

preinitialize()[source]#

Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

state_names_to_sparsify()[source]#

Return the names of per-parameter states that need to be sparsified when applying sparsity to the underlying parameters.

step(closure=None)#

Performs a single optimization step.

Parameters
  • closure (Callable, optional) – A closure that reevaluates the model and returns the loss.

optim.Adagrad#

class cerebras_pytorch.optim.Adagrad[source]#

Adagrad optimizer implemented to conform to execution within the constraints of the Cerebras WSE.

Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 1e-2)

  • lr_decay (float, optional) – learning rate decay (default: 0)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-6)

  • maximize (bool, optional) – maximize the params based on the objective, instead of minimizing (default: False)

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization: http://jmlr.org/papers/v12/duchi11a.html

Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

__init__(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0, eps=1e-06, maximize: bool = False)[source]#
Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

preinitialize()[source]#

Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

state_names_to_sparsify()[source]#

Return the names of per-parameter states that need to be sparsified when applying sparsity to the underlying parameters.

step(closure=None)#

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

optim.AdamBase#

cerebras_pytorch.optim.AdamBase#

Alias of the cerebras_pytorch.optim.AdamBase module.

optim.Adam#

class cerebras_pytorch.optim.Adam[source]#

Adam specific overrides to AdamBase

Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

__init__(params: Iterable[torch.nn.Parameter], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-06, weight_decay: float = 0.0, amsgrad: bool = False)[source]#
Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

load_state_dict(state_dict)[source]#

Loads the optimizer state.

Parameters

state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict.

Adds checkpoint compatibility with the Adam from PyTorch

optim.AdamW#

class cerebras_pytorch.optim.AdamW[source]#

AdamW specific overrides to AdamBase

Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

__init__(params: Iterable[torch.nn.Parameter], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-06, weight_decay: float = 0.0, correct_bias: bool = True, amsgrad: bool = False)[source]#
Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

load_state_dict(state_dict)[source]#

Loads the optimizer state.

Parameters

state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict.

Adds checkpoint compatibility with the AdamW from HuggingFace

optim.Adamax#

class cerebras_pytorch.optim.Adamax[source]#

Adamax optimizer implemented to perform the required pre-initialization of the optimizer state.

Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

__init__(params: Iterable[torch.nn.Parameter], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-06, weight_decay: float = 0.0, maximize: bool = False)[source]#
Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

preinitialize()[source]#

Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

state_names_to_sparsify()[source]#

Return the names of per-parameter states that need to be sparsified when applying sparsity to the underlying parameters.

step(closure: Optional[Callable] = None)#

Performs a single optimization step.

Parameters

closure – A closure that reevaluates the model and returns the loss.

optim.Lamb#

class cerebras_pytorch.optim.Lamb[source]#

Implements the Lamb algorithm, proposed in Large Batch Optimization for Deep Learning: Training BERT in 76 minutes.

Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-6)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • adam (bool, optional) – always use trust ratio = 1, which turns this into Adam. Useful for comparison purposes.

Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

__init__(params, lr=0.001, betas=(0.9, 0.999), eps=1e-06, weight_decay=0, adam=False)[source]#
Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

preinitialize()[source]#

Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

state_names_to_sparsify()[source]#

Return the names of per-parameter states that need to be sparsified when applying sparsity to the underlying parameters.

step(closure=None)#

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

optim.Lion#

class cerebras_pytorch.optim.Lion[source]#

Implements the Lion algorithm, proposed in Symbolic Discovery of Optimization Algorithms.

Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 1e-4)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.99))

  • weight_decay (float, optional) – weight decay coefficient (default: 0)

Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

__init__(params: Iterable[torch.nn.parameter.Parameter], lr: float = 0.0001, betas: Tuple[float, float] = (0.9, 0.99), weight_decay: float = 0.0)[source]#
Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

preinitialize()[source]#

Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

state_names_to_sparsify()[source]#

Return the names of per-parameter states that need to be sparsified when applying sparsity to the underlying parameters.

step(closure=None)#

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

optim.NAdam#

class cerebras_pytorch.optim.NAdam[source]#

Implements the NAdam algorithm to execute within the constraints of the Cerebras WSE, including pre-initializing optimizer state.

Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 2e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • momentum_decay (float, optional) – momentum decay (default: 4e-3)

  • foreach (bool, optional) – whether foreach implementation of optimizer is used (default: None)

For further details regarding the algorithm refer to Incorporating Nesterov Momentum into Adam: https://openreview.net/forum?id=OM0jvwB8jIp57ZJjtNEZ

Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

__init__(params: Iterable[torch.nn.Parameter], lr: float = 0.002, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, momentum_decay: float = 0.004)[source]#
Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

preinitialize()[source]#

Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

state_names_to_sparsify()[source]#

Return the names of per-parameter states that need to be sparsified when applying sparsity to the underlying parameters.

step(closure=None)#

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

optim.RAdam#

class cerebras_pytorch.optim.RAdam[source]#

RAdam optimizer implemented to conform to execution within the constraints of the Cerebras WSE.

Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-6)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

__init__(params: Iterable[torch.nn.Parameter], lr: float = 0.001, betas: Tuple[float, float] = (0.9, 0.999), eps: float = 1e-06, weight_decay: float = 0.0)[source]#
Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

preinitialize()[source]#

Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

state_names_to_sparsify()[source]#

Return the names of per-parameter states that need to be sparsified when applying sparsity to the underlying parameters.

step(closure=None)#

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

optim.RMSprop#

class cerebras_pytorch.optim.RMSprop[source]#

RMSprop optimizer implemented to perform the required pre-initialization of the optimizer state.

Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

__init__(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)[source]#
Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

preinitialize()[source]#

Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

state_names_to_sparsify()[source]#

Return the names of per-parameter states that need to be sparsified when applying sparsity to the underlying parameters.

step(closure=None)#

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

optim.Rprop#

class cerebras_pytorch.optim.Rprop[source]#

Rprop optimizer implemented to conform to execution within the constraints of the Cerebras WSE, including pre-initializing optimizer state.

Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 1e-3)

  • etas (Tuple[float, float], optional) – step size multipliers

  • step_sizes (Tuple[float, float], optional) – Tuple of min, max step size values. Step size is clamped to be between these values.

  • enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

__init__(params: Iterable[torch.nn.Parameter], lr: float = 0.001, etas: Tuple[float, float] = (0.5, 1.2), step_sizes: Tuple[float, float] = (1e-06, 50.0))[source]#
Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

preinitialize()[source]#

Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

state_names_to_sparsify()[source]#

Return the names of per-parameter states that need to be sparsified when applying sparsity to the underlying parameters.

step(closure=None)#

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

optim.SGD#

class cerebras_pytorch.optim.SGD[source]#

SGD optimizer implemented to conform to execution within the constraints of the Cerebras WSE, including pre-initializing optimizer state.

Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

__init__(params, lr, momentum=0, dampening=0, weight_decay=0, nesterov=False, maximize=False)[source]#
Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

preinitialize()[source]#

Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.

state_names_to_sparsify()[source]#

Return the names of per-parameter states that need to be sparsified when applying sparsity to the underlying parameters.

step(closure=None)#

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

optim.Optimizer#

class cerebras_pytorch.optim.Optimizer[source]#

The abstract Cerebras base optimizer class.

Enforces that the preinitialize method is implemented, in which the optimizer state should be initialized ahead of time.

Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

__init__(*args, enable_global_step: bool = False, **kwargs)[source]#
Parameters

enable_global_step – If True, the optimizer will keep track of the global step for each parameter.

increment_global_step(p)[source]#

Increases the global steps by 1 and returns the current value of global step tensor in torch.float32 format.

load_state_dict(state_dict)[source]#
abstract preinitialize()[source]#

The optimizer state must be initialized ahead of time in order to capture the full compute graph in the first iteration. This method must be overridden to perform the state preinitialization.

state_dict(*args, **kwargs)[source]#
abstract state_names_to_sparsify()[source]#

Return the names of per-parameter states that need to be sparsified when applying sparsity to the underlying parameters.

abstract step(closure=None)[source]#

Perform the optimizer step itself. Note that no new state should be created in this function. All state must be created ahead of time in preinitialize and only updated in this method.

visit_state(fn)[source]#

Applies a lambda to each stateful value.
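As a rough illustration of the contract between preinitialize and step described for this base class, the following pure-Python sketch allocates all per-parameter state up front and only updates it during a step. ToyMomentumOptimizer, its use of plain floats in place of tensors, and its momentum update are hypothetical stand-ins chosen for illustration; this is not the cerebras_pytorch implementation.

```python
class ToyMomentumOptimizer:
    """Hypothetical stand-in for illustration; not cerebras_pytorch.optim.Optimizer."""

    def __init__(self, params, lr=0.1, momentum=0.9):
        self.params = list(params)  # plain floats stand in for tensors
        self.lr = lr
        self.momentum = momentum
        self.state = {}
        # All state is created up front, before the first step.
        self.preinitialize()

    def preinitialize(self):
        # Allocate one momentum buffer per parameter ahead of time.
        for index in range(len(self.params)):
            self.state[index] = {"momentum_buffer": 0.0}

    def step(self, grads):
        # Only update existing state; a missing entry raises KeyError
        # instead of silently creating new state.
        for index, grad in enumerate(grads):
            entry = self.state[index]
            entry["momentum_buffer"] = self.momentum * entry["momentum_buffer"] + grad
            self.params[index] -= self.lr * entry["momentum_buffer"]


optimizer = ToyMomentumOptimizer([1.0, 2.0])
state_keys_before = set(optimizer.state)
optimizer.step([0.5, -0.5])
state_keys_after = set(optimizer.state)
```

The set of state keys is the same before and after the step, which is the property the base class asks subclasses to preserve.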

optim helpers#

cerebras_pytorch.optim.configure_optimizer(optimizer_type: str, params, **kwargs)[source]#

Configures and returns an Optimizer specified using the provided optimizer type.

The optimizer class’s signature is inspected and relevant parameters are extracted from the keyword arguments

Parameters
  • optimizer_type – The name of the optimizer to configure

  • params – The model parameters passed to the optimizer

For example,

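import cerebras_pytorch as cstorch

# `model` is assumed to be an existing torch.nn.Module.
optimizer_params = {
   "optimizer_type": "SGD",
   "lr": 0.001,
   "momentum": 0.5,
}
# Pop the optimizer name; the remaining entries are forwarded as keyword arguments.
optimizer = cstorch.optim.configure_optimizer(
   optimizer_type=optimizer_params.pop("optimizer_type"),
   params=model.parameters(),
   **optimizer_params
)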
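
The signature inspection mentioned above can be approximated with the standard inspect module. The sketch below is an assumption about the general technique, not the library's actual code: ToySGD and build_from_kwargs are hypothetical names, and the real configure_optimizer may treat unknown keys differently (for example, by raising an error).

```python
import inspect


class ToySGD:
    """Hypothetical optimizer class used only for this illustration."""

    def __init__(self, params, lr, momentum=0.0):
        self.params = list(params)
        self.lr = lr
        self.momentum = momentum


def build_from_kwargs(cls, params, **kwargs):
    # Look up the parameter names accepted by the class constructor.
    accepted = inspect.signature(cls.__init__).parameters
    # Keep only the keyword arguments the constructor understands.
    relevant = {name: value for name, value in kwargs.items() if name in accepted}
    return cls(params, **relevant)


toy_optimizer = build_from_kwargs(
    ToySGD, [1.0, 2.0], lr=0.001, momentum=0.5, unused_key="dropped"
)
```

Here unused_key is silently dropped because ToySGD.__init__ does not declare it.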
cerebras_pytorch.optim.configure_lr_scheduler(optimizer, learning_rate, adjust_learning_rate=None)[source]#

Configures a learning rate scheduler specified using the provided lr_scheduler type

The learning rate scheduler class's signature is inspected and relevant parameters are extracted from the keyword arguments.

Parameters
  • optimizer – The optimizer passed to the lr_scheduler

  • learning_rate – learning rate schedule

  • adjust_learning_rate (dict) – key: layer types, val: lr scaling factor

The following list describes the possible learning_rate parameter formats:

  • learning_rate is a Python scalar (int or float)

    In this case, configure_lr_scheduler returns an instance of ConstantLR with the provided value as the constant learning rate.

  • learning_rate is a dictionary

    In this case, the dictionary is expected to contain the key scheduler which contains the name of the scheduler you want to configure.

    The rest of the parameters in the dictionary are passed as keyword arguments to the specified scheduler's __init__ method.

  • learning_rate is a list of dictionaries

    In this case, we assume what is being configured is a SequentialLR unless any one of the dictionaries contains the key main_scheduler and the corresponding value is ChainedLR.

    In either case, each element of the list is expected to be a dictionary that follows the format outlined in the second case above (learning_rate is a dictionary).

    If what is being configured is indeed a SequentialLR, each dictionary entry is also expected to contain the key total_iters specifying the total number of iterations each scheduler should be applied for.
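
The three learning_rate formats listed above might look like the following. This is a sketch only: the scheduler name strings and the particular values are assumptions chosen for illustration, the keyword names are taken from the scheduler signatures in this reference, and describe_format is a hypothetical helper that merely mirrors the three cases.

```python
# Case 1: a Python scalar, which results in a constant learning rate.
constant_lr = 0.001

# Case 2: a dictionary with a "scheduler" key plus that scheduler's keyword arguments.
single_scheduler = {
    "scheduler": "CosineDecayLR",  # name string assumed for illustration
    "initial_learning_rate": 0.001,
    "end_learning_rate": 0.0001,
    "total_iters": 10000,
}

# Case 3: a list of dictionaries, each with total_iters, applied one after another.
scheduler_sequence = [
    {
        "scheduler": "LinearLR",  # name string assumed for illustration
        "initial_learning_rate": 0.0,
        "end_learning_rate": 0.001,
        "total_iters": 1000,
    },
    {
        "scheduler": "CosineDecayLR",
        "initial_learning_rate": 0.001,
        "end_learning_rate": 0.0001,
        "total_iters": 9000,
    },
]


def describe_format(learning_rate):
    # Hypothetical helper that classifies a value into one of the three cases.
    if isinstance(learning_rate, (int, float)):
        return "constant"
    if isinstance(learning_rate, dict):
        return "single scheduler"
    if isinstance(learning_rate, list):
        return "list of schedulers"
    raise TypeError(f"unsupported learning_rate type: {type(learning_rate).__name__}")
```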

Learning Rate Schedulers in cerebras_pytorch#

Available learning rate schedulers in the cerebras_pytorch package

ConstantLR

PolynomialLR

LinearLR

ExponentialLR

InverseExponentialTimeDecayLR

InverseSquareRootDecayLR

CosineDecayLR

SequentialLR

PiecewiseConstantLR

MultiStepLR

StepLR

CosineAnnealingLR

LambdaLR

CosineAnnealingWarmRestarts

MultiplicativeLR

ChainedScheduler

optim.lr_scheduler.LRScheduler#

class cerebras_pytorch.optim.lr_scheduler.LRScheduler[source]#

Cerebras specific learning rate scheduler base class.

The learning rate schedulers implemented in this file are specifically designed to be run on a Cerebras system. This means that there are certain caveats to these custom schedulers that differ from a typical LR scheduler found in core PyTorch.

The learning rate schedulers here are intended to be stepped at every iteration. This means lr_scheduler.step() should be called after every optimizer.step(). Hence, the learning rate schedulers operate on a step-by-step basis. Having said that, there are some variables used such as last_epoch that might indicate otherwise. The only reason these variables are used is to match what is used in core PyTorch. It does not indicate that things are operating on an epoch-by-epoch basis.

Also, note that the above means that our LR schedulers are incompatible with the LR schedulers found in core PyTorch. The state cannot simply be transferred between the two. So, one of the LR schedulers defined here must be used in order to have LR scheduling on the Cerebras system.

abstract _get_closed_form_lr()[source]#
__init__(optimizer, total_iters, last_epoch=-1)[source]#
get_lr()[source]#
increment_last_epoch()[source]#

Increments the last epoch by 1

load_state_dict(state_dict)[source]#
state_dict()[source]#
step(*args, **kwargs)[source]#

Steps the scheduler and computes the latest learning rate

Only sets the last_epoch if running on CS

update_groups(values)[source]#

Update the optimizer groups with the latest learning rate values

update_last_lr()[source]#

optim.lr_scheduler.ConstantLR#

class cerebras_pytorch.optim.lr_scheduler.ConstantLR[source]#

Maintains a constant learning rate for each parameter group (no decaying).

Parameters
  • optimizer – The optimizer to schedule

  • learning_rate – The learning rate value to maintain

  • total_iters – The number of steps to decay for

__init__(optimizer: torch.optim.Optimizer, learning_rate: float, total_iters: Optional[int] = None)[source]#

optim.lr_scheduler.PolynomialLR#

class cerebras_pytorch.optim.lr_scheduler.PolynomialLR[source]#

Decays the learning rate of each parameter group using a polynomial function in the given total_iters.

This class is similar to the PyTorch PolynomialLR LRS.

Parameters
  • optimizer – The optimizer to schedule

  • initial_learning_rate – The initial learning rate.

  • end_learning_rate – The final learning rate

  • total_iters – Number of steps to perform the decay

  • power – Exponent applied to the ratio of step completion (the “x” in y=mx+b); 1 gives linear decay. Default: 1.0 (only linear decay is supported at the moment)

  • cycle – Whether to cycle

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, end_learning_rate: float, total_iters: int, power: float = 1.0, cycle: bool = False)[source]#
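
For intuition, the decay can be written as a closed-form function of the step. The sketch below assumes the common polynomial decay formula, ignores the cycle option, and clamps the step at total_iters; polynomial_lr is a hypothetical helper, and the library's exact behavior may differ.

```python
def polynomial_lr(step, initial_learning_rate, end_learning_rate, total_iters, power=1.0):
    # Fraction of the decay completed, clamped so the rate stays at
    # end_learning_rate once total_iters steps have passed (an assumption).
    progress = min(step, total_iters) / total_iters
    span = initial_learning_rate - end_learning_rate
    return span * (1.0 - progress) ** power + end_learning_rate


lr_start = polynomial_lr(0, 0.1, 0.01, 100)
lr_middle = polynomial_lr(50, 0.1, 0.01, 100)
lr_end = polynomial_lr(100, 0.1, 0.01, 100)
```

With power=1.0 the rate moves linearly from initial_learning_rate to end_learning_rate.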

optim.lr_scheduler.LinearLR#

class cerebras_pytorch.optim.lr_scheduler.LinearLR[source]#

Alias for Polynomial LR scheduler with a power of 1

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, end_learning_rate: float, total_iters: int, cycle: bool = False)[source]#

optim.lr_scheduler.ExponentialLR#

class cerebras_pytorch.optim.lr_scheduler.ExponentialLR[source]#

Decays the learning rate of each parameter group by decay_rate every step.

This class is similar to the PyTorch ExponentialLR LRS.

Parameters
  • optimizer – The optimizer to schedule

  • initial_learning_rate – The initial learning rate.

  • total_iters – Number of steps to perform the decay

  • decay_rate – The decay rate

  • staircase – If True decay the learning rate at discrete intervals

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, total_iters: int, decay_rate: float, staircase: bool = False)[source]#

optim.lr_scheduler.InverseExponentialTimeDecayLR#

class cerebras_pytorch.optim.lr_scheduler.InverseExponentialTimeDecayLR[source]#

Decays the learning rate inverse-exponentially over time, as described in the Keras InverseTimeDecay class.

Parameters
  • optimizer – The optimizer to schedule

  • initial_learning_rate – The initial learning rate.

  • step_exponent – Exponential value.

  • total_iters – Number of steps to perform the decay.

  • decay_rate – The decay rate.

  • staircase – If True decay the learning rate at discrete intervals.

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, step_exponent: int, total_iters: int, decay_rate: float, staircase: bool = False)[source]#

optim.lr_scheduler.InverseSquareRootDecayLR#

class cerebras_pytorch.optim.lr_scheduler.InverseSquareRootDecayLR[source]#

Decays the learning rate in proportion to the inverse square root of the step, as described in the following equation:

\[\begin{aligned} lr_t & = \frac{\text{scale}}{\sqrt{\max\{t, \text{warmup_steps}\}}}. \end{aligned}\]
Parameters
  • optimizer – The optimizer to schedule

  • initial_learning_rate – The initial learning rate.

  • scale – Multiplicative factor to scale the result.

  • warmup_steps – use initial_learning_rate for the first warmup_steps.

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float = 1.0, scale: float = 1.0, warmup_steps: int = 1.0)[source]#
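
The equation above translates directly into a few lines of Python. inverse_sqrt_lr is a hypothetical helper that evaluates only the formula as written; it does not model how the scheduler uses initial_learning_rate.

```python
import math


def inverse_sqrt_lr(step, scale=1.0, warmup_steps=1):
    # lr_t = scale / sqrt(max(t, warmup_steps))
    return scale / math.sqrt(max(step, warmup_steps))


lr_during_warmup = inverse_sqrt_lr(1, scale=1.0, warmup_steps=4)
lr_after_warmup = inverse_sqrt_lr(16, scale=1.0, warmup_steps=4)
lr_scaled = inverse_sqrt_lr(100, scale=2.0, warmup_steps=4)
```

During the first warmup_steps steps the max keeps the denominator fixed, so the value is constant until the step count exceeds warmup_steps.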

optim.lr_scheduler.CosineDecayLR#

class cerebras_pytorch.optim.lr_scheduler.CosineDecayLR[source]#

Applies the cosine decay schedule as described in the Keras CosineDecay class.

Parameters
  • optimizer – The optimizer to schedule

  • initial_learning_rate – The initial learning rate.

  • end_learning_rate – The final learning rate

  • total_iters – Number of steps to perform the decay

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, end_learning_rate: float, total_iters: int)[source]#
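
A minimal sketch of a cosine decay between the two rates is shown below. It assumes the usual half-cosine interpolation and clamps the step at total_iters; cosine_decay_lr is a hypothetical helper, not the library's code.

```python
import math


def cosine_decay_lr(step, initial_learning_rate, end_learning_rate, total_iters):
    # Fraction of the schedule completed, clamped at 1.0 (an assumption).
    progress = min(step, total_iters) / total_iters
    # Half-cosine factor that goes from 1 at the start to 0 at the end.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return end_learning_rate + (initial_learning_rate - end_learning_rate) * cosine


cosine_start = cosine_decay_lr(0, 0.1, 0.01, 1000)
cosine_middle = cosine_decay_lr(500, 0.1, 0.01, 1000)
cosine_end = cosine_decay_lr(1000, 0.1, 0.01, 1000)
```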

optim.lr_scheduler.SequentialLR#

class cerebras_pytorch.optim.lr_scheduler.SequentialLR[source]#

Receives a list of schedulers that are expected to be called sequentially during the optimization process, together with milestone points that define the exact intervals indicating which scheduler is called at a given step.

This class is a wrapper around the PyTorch SequentialLR LRS.

Parameters
  • optimizer – Wrapped optimizer

  • schedulers (list) – List of chained schedulers.

  • milestones (list) – List of integers that reflects milestone points.

  • last_epoch (int) – The index of last epoch. Default: -1.

__init__(optimizer, schedulers, milestones, last_epoch=-1)[source]#
increment_last_epoch(*args, **kwargs)[source]#

Increments the last_epoch of the scheduler whose milestone we are on

load_state_dict(state_dict)[source]#

Loads the scheduler's state.

Parameters

state_dict (dict) – scheduler state. Should be an object returned from a call to state_dict.

state_dict()[source]#
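
The way milestones select the active scheduler can be sketched with a binary search. In this illustration each schedule is a plain function of its own local step; sequential_lr, warmup, and hold are hypothetical names, and the treatment of the milestone step itself (the next schedule takes over exactly at the milestone) is an assumption.

```python
from bisect import bisect_right


def sequential_lr(step, schedules, milestones):
    # Number of milestones already reached gives the index of the active schedule.
    index = bisect_right(milestones, step)
    # Each schedule counts steps from the milestone at which it became active.
    local_start = milestones[index - 1] if index > 0 else 0
    return schedules[index](step - local_start)


def warmup(local_step):
    # Linear warmup to 0.001 over 10 steps.
    return 0.001 * local_step / 10


def hold(local_step):
    # Constant rate after warmup.
    return 0.001


lr_in_warmup = sequential_lr(5, [warmup, hold], [10])
lr_at_milestone = sequential_lr(10, [warmup, hold], [10])
```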

optim.lr_scheduler.PiecewiseConstantLR#

class cerebras_pytorch.optim.lr_scheduler.PiecewiseConstantLR[source]#

Adjusts the learning rate to a predefined constant at each milestone and holds this value until the next milestone. Notice that such adjustment can happen simultaneously with other changes to the learning rate from outside this scheduler.

Parameters
  • optimizer – The optimizer to schedule

  • learning_rates – List of learning rates to maintain before/during each milestone.

  • milestones – List of step indices. Must be increasing.

__init__(optimizer, learning_rates: List[float], milestones: List[int])[source]#
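
As an illustration of the lookup, the sketch below assumes learning_rates has one more entry than milestones and that a new rate takes effect at the milestone step itself. piecewise_constant_lr is a hypothetical helper, and the boundary convention is an assumption.

```python
from bisect import bisect_right


def piecewise_constant_lr(step, learning_rates, milestones):
    # One rate per interval: before the first milestone, between milestones,
    # and after the last milestone.
    if len(learning_rates) != len(milestones) + 1:
        raise ValueError("expected one more learning rate than milestones")
    return learning_rates[bisect_right(milestones, step)]


rates = [0.1, 0.01, 0.001]
boundaries = [100, 200]
lr_before_first = piecewise_constant_lr(99, rates, boundaries)
lr_at_first = piecewise_constant_lr(100, rates, boundaries)
lr_after_last = piecewise_constant_lr(250, rates, boundaries)
```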

optim.lr_scheduler.MultiStepLR#

class cerebras_pytorch.optim.lr_scheduler.MultiStepLR[source]#

Decays the learning rate of each parameter group by gamma once the number of steps reaches one of the milestones. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler.

This class is similar to the PyTorch MultiStepLR LRS.

Parameters
  • optimizer – The optimizer to schedule

  • initial_learning_rate – The initial learning rate.

  • gamma – Multiplicative factor of learning rate decay.

  • milestones – List of step indices. Must be increasing.

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, gamma: float, milestones: List[int])[source]#
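
In closed form, the rate is the initial rate multiplied by gamma once for every milestone reached so far. multi_step_lr is a hypothetical helper sketching that rule, assuming no other changes to the learning rate from outside the scheduler.

```python
from bisect import bisect_right


def multi_step_lr(step, initial_learning_rate, gamma, milestones):
    # Count the milestones that have been reached at this step.
    milestones_reached = bisect_right(milestones, step)
    return initial_learning_rate * gamma ** milestones_reached


lr_before_milestones = multi_step_lr(29, 0.1, 0.1, [30, 80])
lr_after_first = multi_step_lr(30, 0.1, 0.1, [30, 80])
lr_after_second = multi_step_lr(80, 0.1, 0.1, [30, 80])
```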

optim.lr_scheduler.StepLR#

class cerebras_pytorch.optim.lr_scheduler.StepLR[source]#

Decays the learning rate of each parameter group by gamma every step_size. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler.

This class is similar to the PyTorch StepLR LRS.

Parameters
  • optimizer – The optimizer to schedule

  • initial_learning_rate – The initial learning rate.

  • step_size – Period of learning rate decay.

  • gamma – Multiplicative factor of learning rate decay.

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, step_size: int, gamma: float)[source]#
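
The same idea with a fixed period: the rate is multiplied by gamma once per completed step_size interval. step_lr is a hypothetical helper sketching that rule, assuming no other changes to the learning rate from outside the scheduler.

```python
def step_lr(step, initial_learning_rate, step_size, gamma):
    # Integer division counts how many full periods have elapsed.
    return initial_learning_rate * gamma ** (step // step_size)


lr_first_period = step_lr(9, 0.1, 10, 0.5)
lr_second_period = step_lr(10, 0.1, 10, 0.5)
lr_third_period = step_lr(25, 0.1, 10, 0.5)
```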

optim.lr_scheduler.CosineAnnealingLR#

class cerebras_pytorch.optim.lr_scheduler.CosineAnnealingLR[source]#

Set the learning rate of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial lr and \(T_{cur}\) is the number of steps since the last restart in SGDR:

\[\begin{split}\begin{aligned} \eta_t & = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right), & T_{cur} \neq (2k+1)T_{max}; \\ \eta_{t+1} & = \eta_{t} + \frac{1}{2}(\eta_{max} - \eta_{min}) \left(1 - \cos\left(\frac{1}{T_{max}}\pi\right)\right), & T_{cur} = (2k+1)T_{max}. \end{aligned}\end{split}\]

Notice that because the schedule is defined recursively, the learning rate can be simultaneously modified outside this scheduler by other operators. If the learning rate is set solely by this scheduler, the learning rate at each step becomes:

\[\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right)\]

It has been proposed in SGDR: Stochastic Gradient Descent with Warm Restarts. Note that this only implements the cosine annealing part of SGDR, and not the restarts.

This class is similar to the PyTorch CosineAnnealingLR LRS.

Parameters
  • optimizer – The optimizer to schedule

  • initial_learning_rate – The initial learning rate.

  • T_max – Maximum number of iterations.

  • eta_min – Minimum learning rate.

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, T_max: int, eta_min: float = 0.0)[source]#
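
The closed-form expression above, for the case where the learning rate is set solely by this scheduler, can be evaluated as follows. cosine_annealing_lr is a hypothetical helper; eta_max corresponds to initial_learning_rate.

```python
import math


def cosine_annealing_lr(t_cur, eta_max, t_max, eta_min=0.0):
    # eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * T_cur / T_max))
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_max))


annealed_start = cosine_annealing_lr(0, 0.1, 100, eta_min=0.01)
annealed_half = cosine_annealing_lr(50, 0.1, 100, eta_min=0.01)
annealed_end = cosine_annealing_lr(100, 0.1, 100, eta_min=0.01)
```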

optim.lr_scheduler.LambdaLR#

class cerebras_pytorch.optim.lr_scheduler.LambdaLR[source]#

Sets the learning rate of each parameter group to the initial lr times a given function (which is specified by overriding set_lr_lambda).

Parameters
  • optimizer – The optimizer to schedule

  • initial_learning_rate – The initial learning rate.

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float)[source]#
set_lr_lambda()[source]#

Sets the learning rate lambda functions.

optim.lr_scheduler.CosineAnnealingWarmRestarts#

class cerebras_pytorch.optim.lr_scheduler.CosineAnnealingWarmRestarts[source]#

Set the learning rate of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial lr, \(T_{cur}\) is the number of steps since the last restart and \(T_{i}\) is the number of steps between two warm restarts in SGDR:

\[\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{i}}\pi\right)\right)\]

When \(T_{cur}=T_{i}\), set \(\eta_t = \eta_{min}\). When \(T_{cur}=0\) after restart, set \(\eta_t=\eta_{max}\).

It was proposed in SGDR: Stochastic Gradient Descent with Warm Restarts.

This class is similar to the PyTorch CosineAnnealingWarmRestarts LRS.

Parameters
  • optimizer – The optimizer to schedule

  • initial_learning_rate – The initial learning rate.

  • T_0 – Number of iterations for the first restart.

  • T_mult – The factor by which \(T_{i}\) increases after a restart. Currently, T_mult must be set to 1.0.

  • eta_min – Minimum learning rate.

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, T_0: int, T_mult: int = 1, eta_min: float = 0.0)[source]#
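
With T_mult fixed at 1, every cycle has length T_0, so the formula above reduces to a cosine curve that repeats every T_0 steps. The sketch below evaluates it in plain Python under that assumption; warm_restarts_lr is a hypothetical helper, and eta_max stands in for initial_learning_rate.

```python
import math


def warm_restarts_lr(step, eta_max, t_0, eta_min=0.0):
    """Cosine annealing with restarts every t_0 steps (assumes T_mult == 1)."""
    # Steps since the last restart; the modulo wraps T_cur back to 0,
    # which is where the rate jumps back to eta_max.
    t_cur = step % t_0
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_0))


# Example: restart every 10 steps.
rates = [warm_restarts_lr(s, eta_max=0.1, t_0=10) for s in range(21)]
```

Here rates[0], rates[10], and rates[20] all equal eta_max, and the rate decays toward eta_min in between.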

optim.lr_scheduler.MultiplicativeLR#

class cerebras_pytorch.optim.lr_scheduler.MultiplicativeLR[source]#

Multiply the learning rate of each parameter group by the supplied coefficient.

Parameters
  • optimizer – The optimizer to schedule

  • initial_learning_rate – The initial learning rate.

  • coefficient – Multiplicative factor of learning rate.

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, coefficient: float)[source]#
set_lr_lambda()[source]#

Sets the learning rate lambda functions.
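
If the coefficient is applied once per step (an assumption; the description above does not state how often it is applied), the schedule is a geometric decay. The helper multiplicative_lr below is hypothetical and only illustrates that arithmetic.

```python
def multiplicative_lr(step, initial_learning_rate, coefficient):
    """Learning rate after multiplying by `coefficient` once per step (assumed)."""
    return initial_learning_rate * coefficient ** step


# Example: shrink the learning rate by 10% at every step.
decayed = [multiplicative_lr(s, 0.1, 0.9) for s in range(4)]
```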

optim.lr_scheduler.ChainedScheduler#

class cerebras_pytorch.optim.lr_scheduler.ChainedScheduler[source]#

Chains a list of learning rate schedulers. It takes a list of chainable learning rate schedulers and, with a single call, performs their step() functions consecutively.

__init__(schedulers)[source]#
load_state_dict(state_dict)[source]#

Loads the schedulers' state.

Parameters
  • state_dict – Scheduler state. Should be an object returned from a call to state_dict.

state_dict()[source]#

optim.lr_scheduler.CyclicLR#

class cerebras_pytorch.optim.lr_scheduler.CyclicLR[source]#

Sets the learning rate of each parameter group according to the cyclical learning rate (CLR) policy. The policy cycles the learning rate between two boundaries with a constant frequency, as detailed in the paper Cyclical Learning Rates for Training Neural Networks. The distance between the two boundaries can be scaled on a per-iteration or per-cycle basis.

The cyclical learning rate policy changes the learning rate after every batch. step should be called after a batch has been used for training.

This class has three built-in policies, as put forth in the paper:

  • “triangular”: A basic triangular cycle without amplitude scaling.

  • “triangular2”: A basic triangular cycle that scales initial amplitude by half each cycle.

  • “exp_range”: A cycle that scales initial amplitude by \(\text{gamma}^{\text{cycle iterations}}\) at each cycle iteration.

This class is similar to the PyTorch CyclicLR LRS.

Parameters
  • optimizer – The optimizer to schedule.

  • base_lr – Initial learning rate which is the lower boundary in the cycle.

  • max_lr – Upper learning rate boundary in the cycle.

  • step_size_up – Number of training iterations in the increasing half of a cycle.

  • step_size_down – Number of training iterations in the decreasing half of a cycle.

  • mode – One of {‘triangular’, ‘triangular2’, ‘exp_range’}.

  • gamma – Constant in ‘exp_range’ scaling function: gamma**(cycle iterations).

  • scale_mode – {‘cycle’, ‘iterations’} Defines whether scale_fn is evaluated on cycle number or cycle iterations.

__init__(optimizer: torch.optim.Optimizer, base_lr: float, max_lr: float, step_size_up: int = 2000, step_size_down: Optional[int] = None, mode: str = 'triangular', gamma: float = 1.0, scale_mode: str = 'cycle')[source]#
set_scale_fn()[source]#

Sets the scaling function
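
The three built-in policies can be sketched in plain Python. This follows the arithmetic of the PyTorch CyclicLR scheduler, which this class is described as similar to; whether the Cerebras implementation matches it step for step is an assumption. The helper cyclic_lr is hypothetical, and it ignores scale_mode by pairing each mode with the scaling basis PyTorch uses (cycle number for the triangular modes, iteration count for “exp_range”).

```python
import math


def cyclic_lr(step, base_lr, max_lr, step_size_up=2000, step_size_down=None,
              mode="triangular", gamma=1.0):
    """Learning rate at `step` under a cyclical policy (PyTorch-style arithmetic)."""
    if step_size_down is None:
        step_size_down = step_size_up  # symmetric cycle by default
    total_size = step_size_up + step_size_down
    step_ratio = step_size_up / total_size

    cycle = math.floor(1 + step / total_size)  # 1-based cycle index
    x = 1 + step / total_size - cycle          # position within the cycle, in [0, 1)
    if x <= step_ratio:
        scale_factor = x / step_ratio               # increasing half
    else:
        scale_factor = (x - 1) / (step_ratio - 1)   # decreasing half

    if mode == "triangular":
        amplitude = 1.0
    elif mode == "triangular2":
        amplitude = 1.0 / (2.0 ** (cycle - 1))  # halve the amplitude each cycle
    elif mode == "exp_range":
        amplitude = gamma ** step               # decay per iteration
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    return base_lr + (max_lr - base_lr) * scale_factor * amplitude


# Example: cycles of 4 steps up and 4 steps down between 0.1 and 1.0.
peak = cyclic_lr(4, 0.1, 1.0, step_size_up=4)
second_peak = cyclic_lr(12, 0.1, 1.0, step_size_up=4, mode="triangular2")
```

In the example, the first peak reaches max_lr, while the second peak under “triangular2” reaches only halfway between base_lr and max_lr.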

optim.lr_scheduler.OneCycleLR#

class cerebras_pytorch.optim.lr_scheduler.OneCycleLR[source]#

Sets the learning rate of each parameter group according to the 1cycle learning rate policy. The 1cycle policy anneals the learning rate from an initial learning rate to some maximum learning rate and then from that maximum learning rate to some minimum learning rate much lower than the initial learning rate. This policy was initially described in the paper Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates.

This scheduler is not chainable.

This class is similar to the PyTorch OneCycleLR LRS.

Parameters
  • optimizer – The optimizer to schedule

  • initial_learning_rate – Initial learning rate. Compared with PyTorch, this is equivalent to max_lr / div_factor.

  • max_lr – Upper learning rate boundary in the cycle.

  • total_steps – The total number of steps in the cycle.

  • pct_start – The fraction of the cycle (in number of steps) spent increasing the learning rate; for example, 0.3 means 30% of the steps.

  • final_div_factor – Determines the minimum learning rate via min_lr = initial_lr/final_div_factor.

  • three_phase – If True, use a third phase of the schedule to annihilate the learning rate

  • anneal_strategy – Specifies the annealing strategy: “cos” for cosine annealing, “linear” for linear annealing.

__init__(optimizer: torch.optim.Optimizer, initial_learning_rate: float, max_lr: float, total_steps: int = 1000, pct_start: float = 0.3, final_div_factor: float = 10000.0, three_phase: bool = False, anneal_strategy: str = 'cos')[source]#
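
The overall shape of the 1cycle policy can be sketched in plain Python. The sketch covers only the two-phase case (three_phase=False) and borrows its phase boundaries from the PyTorch OneCycleLR scheduler; that the Cerebras implementation places them identically is an assumption. The helper one_cycle_lr and its private annealing functions are hypothetical.

```python
import math


def _anneal_cos(start, end, pct):
    """Cosine interpolation from `start` (pct=0) to `end` (pct=1)."""
    return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1)


def _anneal_linear(start, end, pct):
    """Linear interpolation from `start` (pct=0) to `end` (pct=1)."""
    return start + (end - start) * pct


def one_cycle_lr(step, initial_learning_rate, max_lr, total_steps=1000,
                 pct_start=0.3, final_div_factor=1e4, anneal_strategy="cos"):
    """Two-phase 1cycle schedule: warm up to max_lr, then anneal to min_lr."""
    anneal = _anneal_cos if anneal_strategy == "cos" else _anneal_linear
    min_lr = initial_learning_rate / final_div_factor
    peak_step = pct_start * total_steps - 1  # last step of the increasing phase
    last_step = total_steps - 1

    if step <= peak_step:
        return anneal(initial_learning_rate, max_lr, step / peak_step)
    return anneal(max_lr, min_lr, (step - peak_step) / (last_step - peak_step))


# Example: a 10-step cycle that peaks at step 2.
cycle = [one_cycle_lr(s, 0.01, 0.1, total_steps=10) for s in range(10)]
```

The example starts at initial_learning_rate, reaches max_lr at the end of the first phase, and finishes at initial_learning_rate / final_div_factor.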