common.tf.optimizers package#
Submodules#
common.tf.optimizers.AdamWOptimizer module#
- class common.tf.optimizers.AdamWOptimizer.AdamWOptimizer#
Bases:
tensorflow.compat.v1.train.Optimizer
Adam Weight Decay optimizer (AdamW). Based on: https://github.com/google-research/bert/blob/master/optimization.py
- __init__(learning_rate, weight_decay_rate=0.0, beta1=0.9, beta2=0.999, epsilon=1e-06, use_bias_correction=False, exclude_from_weight_decay=None, use_locking=False, name='AdamW')#
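A minimal usage sketch, assuming the package is importable as shown and that AdamWOptimizer is driven like any other tf.compat.v1.train.Optimizer; the toy variable and loss stand in for a real model:

```python
import tensorflow as tf

from common.tf.optimizers.AdamWOptimizer import AdamWOptimizer

tf.compat.v1.disable_eager_execution()

# Toy graph standing in for a real model: one trainable weight, quadratic loss.
w = tf.compat.v1.get_variable("w", shape=[10], initializer=tf.zeros_initializer())
loss = tf.reduce_sum(tf.square(w - 1.0))

# Decoupled weight decay; excluding LayerNorm/bias variables by name is a
# common convention (an assumption here, not mandated by this API).
optimizer = AdamWOptimizer(
    learning_rate=1e-4,
    weight_decay_rate=0.01,
    exclude_from_weight_decay=["LayerNorm", "bias"],
)
train_op = optimizer.minimize(
    loss, global_step=tf.compat.v1.train.get_or_create_global_step()
)

with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    sess.run(train_op)
```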
common.tf.optimizers.GradAccumOptimizer module#
- class common.tf.optimizers.GradAccumOptimizer.GradAccumOptimizer#
Bases:
tensorflow.compat.v1.train.Optimizer
Gradient accumulation optimizer. Wraps the provided optimizer by adding gradient accumulation functionality.
This functionality enables a training mode in which gradients are accumulated over grad_accum_steps steps. Every grad_accum_steps steps, the accumulated gradients are used to update the weights; the accumulators are then reset to zero and the process continues. This is equivalent to training with a batch size of grad_accum_steps * batch_size.
This makes it possible to train a model with any effective batch size on a single GPU, provided the model fits into memory with batch size 1.
Initializes a GradAccumOptimizer.
- Parameters
optimizer (tf.compat.v1.train.Optimizer) – Optimizer to be wrapped into gradient accumulation.
grad_accum_steps (int) – Number of gradient accumulation steps.
- __init__(optimizer, grad_accum_steps)#
Initializes a GradAccumOptimizer.
- Parameters
optimizer (tf.compat.v1.train.Optimizer) – Optimizer to be wrapped into gradient accumulation.
grad_accum_steps (int) – Number of gradient accumulation steps.
- apply_gradients(grads_and_vars, global_step=None, name=None)#
Apply gradients to variables. See tf.compat.v1.train.Optimizer for a description of the input arguments.
- compute_gradients(loss, var_list=None, gate_gradients=tensorflow.compat.v1.train.Optimizer.GATE_OP, aggregation_method=None, colocate_gradients_with_ops=False, grad_loss=None)#
Computes and accumulates gradients. Returns a list of (gradient, variable) pairs, where the new gradient is the sum of the gradient for the current batch and the gradient values accumulated so far. The accumulated values are reset to zero every grad_accum_steps steps. See tf.compat.v1.train.Optimizer for a description of the input arguments.
- property loss_scale#
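A minimal sketch of wrapping a base optimizer, assuming the import path matches the module name above; the base optimizer and toy loss are for illustration only:

```python
import tensorflow as tf

from common.tf.optimizers.GradAccumOptimizer import GradAccumOptimizer

tf.compat.v1.disable_eager_execution()

w = tf.compat.v1.get_variable("w", shape=[4], initializer=tf.zeros_initializer())
loss = tf.reduce_sum(tf.square(w - 1.0))

# Any tf.compat.v1.train.Optimizer can be wrapped; SGD is used only as an example.
base_opt = tf.compat.v1.train.GradientDescentOptimizer(learning_rate=0.1)
opt = GradAccumOptimizer(base_opt, grad_accum_steps=4)

# Gradients are accumulated across steps; the variables are updated (and the
# accumulators reset) every 4 steps, for an effective batch size of 4 * batch_size.
grads_and_vars = opt.compute_gradients(loss)
train_op = opt.apply_gradients(
    grads_and_vars, global_step=tf.compat.v1.train.get_or_create_global_step()
)
```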
common.tf.optimizers.LossScale module#
Loss scaling
- class common.tf.optimizers.LossScale.CSDynamicLossScale#
Bases:
tensorflow.compat.v1.train.experimental.DynamicLossScale
Loss scale that dynamically adjusts itself.
Dynamic loss scaling works by adjusting the loss scale as training progresses. The goal is to keep the loss scale as high as possible without overflowing the gradients. As long as the gradients do not overflow, raising the loss scale never hurts.
The algorithm starts by setting the loss scale to an initial value. Every N steps that the gradients are finite, the loss scale is increased by some factor. However, if a NaN or Inf gradient is found, the gradients for that step are not applied, and the loss scale is decreased by the factor. This process tends to keep the loss scale as high as possible without gradients overflowing.
Creates the dynamic loss scale.
- Parameters
initial_loss_scale (float) – The loss scale to use at the beginning. It’s better to start this at a very high number, because a loss scale that is too high gets lowered far more quickly than a loss scale that is too low gets raised. The default is 2 ** 15, which is approximately half the maximum float16 value.
increment_period (int) – Increases the loss scale every increment_period consecutive steps that finite gradients are encountered. If a nonfinite gradient is encountered, the count is reset back to zero.
multiplier (float) – The multiplier to use when increasing or decreasing the loss scale.
min_loss_scale (float) – Smallest possible loss scale value.
max_loss_scale (float) – Largest possible loss scale value.
overflow_tolerance (float) – Overflow tolerance.
- __init__(initial_loss_scale=32768.0, increment_period=2000, multiplier=2.0, min_loss_scale=6.103515625e-05, max_loss_scale=32768.0, overflow_tolerance=0.05)#
Creates the dynamic loss scale.
- Parameters
initial_loss_scale (float) – The loss scale to use at the beginning. It’s better to start this at a very high number, because a loss scale that is too high gets lowered far more quickly than a loss scale that is too low gets raised. The default is 2 ** 15, which is approximately half the maximum float16 value.
increment_period (int) – Increases the loss scale every increment_period consecutive steps that finite gradients are encountered. If a nonfinite gradient is encountered, the count is reset back to zero.
multiplier (float) – The multiplier to use when increasing or decreasing the loss scale.
min_loss_scale (float) – Smallest possible loss scale value.
max_loss_scale (float) – Largest possible loss scale value.
overflow_tolerance (float) – Overflow tolerance.
- get_config()#
- update(unscaled_grads_vars)#
Dynamically update the loss scaling.
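A construction sketch using the documented defaults (2 ** 15 = 32768.0 and 2 ** -14 ≈ 6.1e-05); how the instance is consumed downstream is shown with wrap_optimizer below:

```python
from common.tf.optimizers.LossScale import CSDynamicLossScale

# Start at 2 ** 15, double (multiplier=2.0) after every 2000 consecutive
# finite-gradient steps, halve on overflow, and stay within
# [min_loss_scale, max_loss_scale].
loss_scale = CSDynamicLossScale(
    initial_loss_scale=2.0 ** 15,
    increment_period=2000,
    multiplier=2.0,
    min_loss_scale=2.0 ** -14,
    max_loss_scale=2.0 ** 15,
    overflow_tolerance=0.05,
)
# loss_scale.update(grads_and_vars) adjusts the scale from a list of
# unscaled (grad, var) pairs, as described above.
```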
- class common.tf.optimizers.LossScale.MixedPrecisionLossScaleOptimizerAdapter#
Bases:
tensorflow.python.training.experimental.loss_scale_optimizer.MixedPrecisionLossScaleOptimizer
- __init__(opt, loss_scale)#
- property loss_scale#
- property optimizer#
- common.tf.optimizers.LossScale.wrap_optimizer(opt, loss_scale)#
Wraps an optimizer with a LossScaleOptimizer.
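A sketch of combining the pieces above; it assumes the returned wrapper behaves like the tf.compat.v1 MixedPrecisionLossScaleOptimizer it is based on (scale the loss, unscale the gradients, skip steps whose gradients overflow):

```python
import tensorflow as tf

from common.tf.optimizers.LossScale import CSDynamicLossScale, wrap_optimizer

base_opt = tf.compat.v1.train.AdamOptimizer(learning_rate=1e-4)
loss_scale = CSDynamicLossScale()  # documented defaults

# The wrapped optimizer is then used like any tf.compat.v1.train.Optimizer.
scaled_opt = wrap_optimizer(base_opt, loss_scale)
```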
common.tf.optimizers.Trainer module#
The trainer class.
- class common.tf.optimizers.Trainer.Trainer#
Bases:
object
The trainer class that builds train ops based on the given configuration parameters.
- Parameters
params (dict) – Trainer configuration parameters.
tf_summary (bool) – Summaries flag.
mixed_precision (bool) – Mixed precision flag.
- __init__(params, tf_summary=False, mixed_precision=False)#
- build_optimizer()#
Set up the optimizer.
- Returns
The optimizer
- build_train_ops(loss)#
Set up the optimizer and build the train ops.
- Parameters
loss (Tensor) – The loss tensor
- Returns
Train ops
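A usage sketch of the workflow implied above; the params keys are hypothetical, since the exact configuration schema is not documented here:

```python
import tensorflow as tf

from common.tf.optimizers.Trainer import Trainer

tf.compat.v1.disable_eager_execution()

w = tf.compat.v1.get_variable("w", shape=[8], initializer=tf.zeros_initializer())
loss = tf.reduce_sum(tf.square(w - 1.0))

# Hypothetical configuration; the real keys are defined by the Trainer's
# expected params schema.
params = {
    "optimizer_type": "adamw",
    "learning_rate": 1e-4,
    "max_gradient_norm": 1.0,
}

trainer = Trainer(params, tf_summary=False, mixed_precision=False)
# Sets up the optimizer internally and returns the train ops for the loss.
train_op = trainer.build_train_ops(loss)
```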
- clip_gradients(grads_vars, global_norm=None)#
- Performs basic gradient clipping:
by global norm if self._max_gradient_norm is set
by value if self._max_gradient_value is set, to the symmetric range (-self._max_gradient_value, self._max_gradient_value)
- Parameters
grads_vars (Tensor) – List of (grad, var) tuples
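The behavior described above maps onto the standard TensorFlow clipping ops; a minimal sketch of the equivalent logic (not the Trainer's implementation, and the argument names are assumptions):

```python
import tensorflow as tf

def clip_gradients_sketch(grads_vars, max_gradient_norm=None, max_gradient_value=None):
    """Illustrative equivalent of the clipping described above."""
    grads, tvars = zip(*grads_vars)
    grads = list(grads)
    if max_gradient_norm is not None:
        # Clip by the global norm across all gradients.
        grads, _ = tf.clip_by_global_norm(grads, max_gradient_norm)
    elif max_gradient_value is not None:
        # Clip each gradient element-wise to the symmetric range.
        grads = [
            tf.clip_by_value(g, -max_gradient_value, max_gradient_value)
            for g in grads
        ]
    return list(zip(grads, tvars))
```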
- get_learning_rate()#
Define the learning rate schedule. Currently supports:
- constant
- exponential
- linear
- polynomial
- piecewise constant
- inverse exponential time decay (not supported natively)
learning_rate can be specified in yaml as:
- a single float for a constant learning rate
- a dict representing a single decay schedule
- a list of dicts (for a series of decay schedules)
- Returns
The learning rate tensor
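Hypothetical Python equivalents of the three YAML forms described above; the keys inside each schedule dict are assumptions for illustration only:

```python
# A single float: constant learning rate.
lr_constant = 1e-4

# A dict: one decay schedule (keys are hypothetical).
lr_single_schedule = {
    "scheduler": "exponential",
    "initial_learning_rate": 1e-3,
    "decay_steps": 10000,
    "decay_rate": 0.9,
}

# A list of dicts: a series of decay schedules applied one after another.
lr_series = [
    {"scheduler": "constant", "learning_rate": 1e-3, "steps": 1000},
    {"scheduler": "linear", "initial_learning_rate": 1e-3,
     "end_learning_rate": 0.0, "steps": 9000},
]
```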
- property grad_accum_steps#
- property gradient_global_norm#
- is_grad_accum()#
- property log_summaries#
- log_training_summaries(tensor, name, family)#
Make summaries for training. Plots summaries for:
- Sparsity of tensor
- Histogram of tensor (on log scale)
- Denormals in tensor
- Norm of tensor
- Parameters
tensor (Tensor) – tensor to plot summaries for
name (str) – name of the tensor to plot summaries for
family (str) – family that the tensor belongs to (kernel / bias)
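A sketch of the four summary types listed above using standard tf.compat.v1.summary ops; this is an illustration, not the Trainer's actual implementation (the denormal threshold assumes float16):

```python
import tensorflow as tf

def training_summaries_sketch(tensor, name, family):
    """Illustrative versions of the summaries described above."""
    # Sparsity: fraction of exact zeros in the tensor.
    sparsity = tf.reduce_mean(tf.cast(tf.equal(tensor, 0.0), tf.float32))
    tf.compat.v1.summary.scalar(name + "/sparsity", sparsity, family=family)

    # Histogram of absolute values on a log scale.
    tf.compat.v1.summary.histogram(
        name + "/log_abs", tf.math.log(tf.abs(tensor) + 1e-30), family=family
    )

    # Denormals: nonzero values below the smallest normal float16 (2 ** -14).
    denormals = tf.reduce_sum(
        tf.cast(
            tf.logical_and(tf.not_equal(tensor, 0.0), tf.abs(tensor) < 2.0 ** -14),
            tf.float32,
        )
    )
    tf.compat.v1.summary.scalar(name + "/denormals", denormals, family=family)

    # Norm of the tensor.
    tf.compat.v1.summary.scalar(name + "/norm", tf.norm(tensor), family=family)
```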
- property loss_scale_value#
- property optimizer#
- uses_dynamic_loss_scaling()#
- uses_loss_scaling()#
- uses_static_loss_scaling()#