common.tf.optimizers package#

Submodules#

common.tf.optimizers.AdamWOptimizer module#

class common.tf.optimizers.AdamWOptimizer.AdamWOptimizer#

Bases: tensorflow.compat.v1.train.Optimizer

Adam Weight Decay optimizer (AdamW). Based on: https://github.com/google-research/bert/blob/master/optimization.py

__init__(learning_rate, weight_decay_rate=0.0, beta1=0.9, beta2=0.999, epsilon=1e-06, use_bias_correction=False, exclude_from_weight_decay=None, use_locking=False, name='AdamW')#
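The update rule can be sketched in plain Python for a single scalar parameter. This is a minimal illustration that assumes the class follows the referenced BERT implementation with use_bias_correction=False. The function name adamw_step is hypothetical, and exclude_from_weight_decay and use_locking are not modeled.

```python
import math


def adamw_step(param, grad, m, v, learning_rate, weight_decay_rate=0.0,
               beta1=0.9, beta2=0.999, epsilon=1e-6):
    """One AdamW update for a single scalar parameter (illustrative only)."""
    # Exponential moving averages of the gradient and of its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Adam direction without bias correction (assumed default behaviour).
    update = m / (math.sqrt(v) + epsilon)
    # Decoupled weight decay: added to the update, not to the loss.
    update += weight_decay_rate * param
    return param - learning_rate * update, m, v


# With a zero gradient, only the weight decay term moves the parameter.
new_param, m, v = adamw_step(param=1.0, grad=0.0, m=0.0, v=0.0,
                             learning_rate=0.1, weight_decay_rate=0.01)
print(new_param)
```

Because the decay term is added after the Adam normalization, it does not interact with the m and v statistics, which is the main difference from plain L2 regularization.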

common.tf.optimizers.GradAccumOptimizer module#

class common.tf.optimizers.GradAccumOptimizer.GradAccumOptimizer#

Bases: tensorflow.compat.v1.train.Optimizer

Gradient accumulation optimizer. Wraps the provided optimizer by adding gradient accumulation functionality.

This functionality enables a training mode where gradients are accumulated over grad_accum_steps consecutive steps. At every grad_accum_steps-th step, the accumulated gradients are used to update the weights. The accumulated gradients are then reset to zero, and the process repeats. This is equivalent to training with a batch size of grad_accum_steps*batch_size.

This makes it possible to train models with any batch size on a single GPU, provided that they fit into memory with batch size 1.
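The accumulation cycle described above can be sketched without TensorFlow. The GradAccumulator class below is a hypothetical stand-in for a single scalar weight, not the GradAccumOptimizer implementation. It assumes gradients are summed (not averaged) between updates.

```python
class GradAccumulator:
    """Toy gradient accumulation for a single scalar weight."""

    def __init__(self, grad_accum_steps):
        self.grad_accum_steps = grad_accum_steps
        self.accumulated = 0.0
        self.step = 0

    def accumulate(self, grad):
        """Add one micro-batch gradient; return the sum on update steps."""
        self.accumulated += grad
        self.step += 1
        if self.step % self.grad_accum_steps == 0:
            total = self.accumulated
            self.accumulated = 0.0  # reset once the sum has been handed out
            return total
        return None  # no weight update on this step


accum = GradAccumulator(grad_accum_steps=2)
weight, learning_rate = 1.0, 0.1
for micro_batch_grad in [1.0, 2.0, 3.0, 4.0]:
    total = accum.accumulate(micro_batch_grad)
    if total is not None:
        # The weight is only updated every grad_accum_steps steps.
        weight -= learning_rate * total
```

Whether the summed gradient matches a true large-batch gradient exactly depends on how the loss is normalized per batch, so treat the equivalence as holding up to that scaling.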

Initializes a GradAccumOptimizer.

Parameters

optimizer (tf.compat.v1.train.Optimizer) – Optimizer to be wrapped into gradient accumulation.
grad_accum_steps (int) – Number of gradient accumulation steps.

__init__(optimizer, grad_accum_steps)#

apply_gradients(grads_and_vars, global_step=None, name=None)#: Apply gradients to variables. See tf.compat.v1.train.Optimizer for description of input arguments.

compute_gradients(loss, var_list=None, gate_gradients=tensorflow.compat.v1.train.Optimizer.GATE_OP, aggregation_method=None, colocate_gradients_with_ops=False, grad_loss=None)#: Computes and accumulates gradients. Returns a list of (gradient, variable) pairs, where the new gradient is a sum of the gradient for the current batch and gradient values accumulated so far. The accumulated values are set to zero every grad_accum_steps. See tf.compat.v1.train.Optimizer for description of input arguments.

property loss_scale#

common.tf.optimizers.LossScale module#

Loss scaling

class common.tf.optimizers.LossScale.CSDynamicLossScale#

Bases: tensorflow.compat.v1.train.experimental.DynamicLossScale

Loss scale that dynamically adjusts itself.

Dynamic loss scaling works by adjusting the loss scale as training progresses. The goal is to keep the loss scale as high as possible without overflowing the gradients. As long as the gradients do not overflow, raising the loss scale never hurts.

The algorithm starts by setting the loss scale to an initial value. After every N consecutive steps with finite gradients, the loss scale is increased by some factor. However, if a NaN or Inf gradient is found, the gradients for that step are not applied, and the loss scale is decreased by the same factor. This process tends to keep the loss scale as high as possible without gradients overflowing.

Creates the dynamic loss scale.

Parameters

initial_loss_scale (float) – The loss scale to use at the beginning. It is better to start with a very high value, because a loss scale that is too high gets lowered far more quickly than a loss scale that is too low gets raised. The default is 2 ** 15, which is approximately half the maximum float16 value.
increment_period (int) – Increases the loss scale after every increment_period consecutive steps with finite gradients. If a nonfinite gradient is encountered, the count is reset to zero.
multiplier (float) – The multiplier to use when increasing or decreasing the loss scale.
min_loss_scale (float) – Smallest possible loss scale value.
max_loss_scale (float) – Largest possible loss scale value.
overflow_tolerance (float) – Overflow tolerance.
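The adjustment rule can be sketched as a small state machine in plain Python. The ToyDynamicLossScale class is hypothetical and only mirrors the description above. It ignores overflow_tolerance, whose exact semantics are not documented here, and it may differ from CSDynamicLossScale in other details.

```python
import math


class ToyDynamicLossScale:
    """Illustrative dynamic loss scale; not the CSDynamicLossScale code."""

    def __init__(self, initial_loss_scale=2.0 ** 15, increment_period=2000,
                 multiplier=2.0, min_loss_scale=2.0 ** -14,
                 max_loss_scale=2.0 ** 15):
        self.loss_scale = initial_loss_scale
        self.increment_period = increment_period
        self.multiplier = multiplier
        self.min_loss_scale = min_loss_scale
        self.max_loss_scale = max_loss_scale
        self.good_steps = 0  # consecutive steps with finite gradients

    def update(self, grads):
        """Adjust the scale; return True if the gradients should be applied."""
        if all(math.isfinite(g) for g in grads):
            self.good_steps += 1
            if self.good_steps >= self.increment_period:
                # Enough finite steps in a row: raise the scale, capped above.
                self.loss_scale = min(self.loss_scale * self.multiplier,
                                      self.max_loss_scale)
                self.good_steps = 0
            return True
        # NaN or Inf found: skip this step and lower the scale, capped below.
        self.loss_scale = max(self.loss_scale / self.multiplier,
                              self.min_loss_scale)
        self.good_steps = 0
        return False
```

A single nonfinite step lowers the scale immediately, while raising it takes increment_period finite steps, which is why a high initial value is the safer choice.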

__init__(initial_loss_scale=32768.0, increment_period=2000, multiplier=2.0, min_loss_scale=6.103515625e-05, max_loss_scale=32768.0, overflow_tolerance=0.05)#

get_config()#

update(unscaled_grads_vars)#: Dynamically updates the loss scale.

class common.tf.optimizers.LossScale.MixedPrecisionLossScaleOptimizerAdapter#

Bases: tensorflow.python.training.experimental.loss_scale_optimizer.MixedPrecisionLossScaleOptimizer

__init__(opt, loss_scale)#

property loss_scale#

property optimizer#

common.tf.optimizers.LossScale.wrap_optimizer(opt, loss_scale)#: Wraps an optimizer with a LossScaleOptimizer.
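The motivation for wrapping an optimizer with loss scaling can be illustrated with the standard library alone. The sketch below is not part of this package: to_float16 is a hypothetical helper that rounds a value to IEEE half precision using struct's "e" format, and the gradient value is made up for the example.

```python
import struct


def to_float16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


true_grad = 1e-8          # smaller than the smallest float16 subnormal
loss_scale = 2.0 ** 15

# Stored directly in float16, the gradient underflows to zero.
unscaled_fp16 = to_float16(true_grad)

# Scaling the loss scales the gradient, which now fits in float16.
scaled_fp16 = to_float16(true_grad * loss_scale)

# Unscaling in higher precision recovers the gradient approximately.
recovered = scaled_fp16 / loss_scale
print(unscaled_fp16, recovered)
```

Scaling by a power of two only shifts the exponent, so the recovered value differs from the true gradient by float16 rounding alone.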

common.tf.optimizers.Trainer module#

The trainer class.

class common.tf.optimizers.Trainer.Trainer#

Bases: object

The trainer class that builds train ops based on the given configuration parameters.

Parameters

params (dict) – Trainer configuration parameters.
tf_summary (bool) – Summaries flag.
mixed_precision (bool) – Mixed precision flag.

__init__(params, tf_summary=False, mixed_precision=False)#

build_optimizer()#

Set up the optimizer.

Returns: The optimizer

build_train_ops(loss)#

Set up the optimizer and build train ops.

Parameters: loss (Tensor) – The loss tensor
Returns: Train ops

clip_gradients(grads_vars, global_norm=None)#

Performs basic gradient clipping:

by global norm if self._max_gradient_norm is set
by value if self._max_gradient_value is set, to the symmetric range (-self._max_gradient_value, self._max_gradient_value)

Parameters: grads_vars (Tensor) – List of (grad, var) tuples
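Both clipping modes can be sketched in plain Python on nested lists of floats. The helper names below are hypothetical. The global-norm variant follows the usual definition (as in tf.clip_by_global_norm), and how Trainer combines or orders the two modes is not specified here.

```python
import math


def clip_by_global_norm(grads, max_gradient_norm):
    """Rescale all gradients so their joint L2 norm is at most the limit."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # The scale is 1.0 when the norm is already within the limit.
    scale = max_gradient_norm / max(global_norm, max_gradient_norm)
    return [[g * scale for g in grad] for grad in grads], global_norm


def clip_by_value(grads, max_gradient_value):
    """Clamp every gradient element to the symmetric range [-v, v]."""
    return [[min(max(g, -max_gradient_value), max_gradient_value)
             for g in grad] for grad in grads]


grads = [[3.0], [4.0]]  # global norm is 5.0
clipped, norm = clip_by_global_norm(grads, max_gradient_norm=1.0)
clamped = clip_by_value([[-2.0, 0.5]], max_gradient_value=1.0)
```

Global-norm clipping preserves the direction of the overall gradient, while value clipping can change it because each element is clamped independently.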

get_learning_rate()#

Define the learning rate schedule. Currently supports:

constant
exponential
linear
polynomial
piecewise constant
inverse exponential time decay (not supported natively)

learning_rate can be specified in yaml as:

a single float for a constant learning rate
a dict representing a single decay schedule
a list of dicts (for a series of decay schedules)

Returns: the learning rate tensor
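The three accepted shapes of learning_rate can be illustrated with a small dispatcher. This is only a sketch: the dictionary keys ("scheduler", "learning_rate", "initial_learning_rate", "end_learning_rate", "steps") are hypothetical placeholders rather than the documented YAML schema, and only constant and linear schedules are modeled.

```python
def _schedule_value(schedule, step):
    """Evaluate one schedule dict at a step local to that schedule."""
    kind = schedule["scheduler"]  # hypothetical key name
    if kind == "constant":
        return float(schedule["learning_rate"])
    if kind == "linear":
        steps = schedule["steps"]
        frac = min(max(step, 0), steps) / steps
        start = schedule["initial_learning_rate"]
        end = schedule["end_learning_rate"]
        return start + (end - start) * frac
    raise ValueError("unsupported scheduler: %r" % kind)


def learning_rate_at(spec, step):
    """Resolve a float, a dict, or a list of dicts to a learning rate."""
    if isinstance(spec, (int, float)):
        return float(spec)  # single float: constant learning rate
    if isinstance(spec, dict):
        return _schedule_value(spec, step)  # single decay schedule
    offset = 0
    for schedule in spec:  # series of schedules, run back to back
        steps = schedule.get("steps")
        if steps is None or step < offset + steps:
            return _schedule_value(schedule, step - offset)
        offset += steps
    last = spec[-1]  # past the end: hold the final value
    return _schedule_value(last, last["steps"])


warmup = {"scheduler": "linear", "initial_learning_rate": 0.0,
          "end_learning_rate": 1.0, "steps": 10}
hold = {"scheduler": "constant", "learning_rate": 1.0}
lr = learning_rate_at([warmup, hold], step=15)
```

In the list form, each schedule here is assumed to start where the previous one ends, which is one plausible reading of "a series of decay schedules".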

property grad_accum_steps#

property gradient_global_norm#

is_grad_accum()#

property log_summaries#

log_training_summaries(tensor, name, family)#

Make summaries for training. Plots summaries for:

Sparsity of tensor
Histogram of tensor (on log scale)
Denormals in tensor
Norm of tensor

Parameters

tensor (Tensor) – tensor to plot summaries for
name (str) – name of the tensor to plot summaries for
family (str) – family that the tensor belongs to (kernel / bias)

property loss_scale_value#

property optimizer#

uses_dynamic_loss_scaling()#

uses_loss_scaling()#

uses_static_loss_scaling()#

Module contents#
