Train with sparse inputs#
Note
Computations with variable sequence length are only available in pipeline execution mode.
Variable Sequence Length (VSL) is a feature that allows computations on the CS system running in pipeline mode to process tensors which vary in shape from one element of a batch to the next.
In natural language processing applications, input data commonly consists of sequences of heterogeneous length. In that case, short samples are padded up to a user-defined maximum sequence length so that they can be batched together. Naive treatment of padded data can waste significant computation on padding tokens. VSL allows users to strip away this padding as samples enter the wafer and perform the model computation on non-padded, variable-length tensors. This leads to less wasted computation and faster training times. Typically, models written for GPU include logic to ensure that padding tokens do not contribute to the final loss of the model. In this case, enabling VSL has no effect on the model’s function other than increasing throughput.
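To make the cost of padding concrete, the following minimal sketch computes the fraction of token slots in a padded batch that hold padding. The helper name `padding_fraction` and the sample lengths are hypothetical and chosen only for illustration; this is not part of any Cerebras API.

```python
def padding_fraction(lengths, max_seq_len):
    """Return the fraction of token slots in a padded batch that hold padding."""
    if any(n > max_seq_len for n in lengths):
        raise ValueError("a sample is longer than max_seq_len")
    total_slots = len(lengths) * max_seq_len
    real_tokens = sum(lengths)
    return (total_slots - real_tokens) / total_slots


# Hypothetical batch: four samples padded to a maximum sequence length of 128.
lengths = [12, 40, 128, 76]
wasted = padding_fraction(lengths, max_seq_len=128)
print(f"{wasted:.1%} of token slots are padding")  # prints "50.0% of token slots are padding"
```

In this made-up batch, half of the token slots are padding, which is the work that stripping padding would avoid.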
TensorFlow Variable Sequence Length#
VSL is enabled through a combination of using certain pad id values in input tensors and setting flags in the Cerebras configuration according to the following rules.
You must set config.matching.kernel.use_legacy_vsl = True in the Cerebras configuration.
If an input tensor is first used in an embedding layer, then that input must be padded with a value that isn’t used for any purpose other than padding, and that padding value must be supplied to the Cerebras embedding layer implementation. Supplying a pad value of -1 to the attention layer signifies that VSL should not be used. When the embedding kernel sees a pad value other than -1, it strips away all tokens at the end of the input sequence it receives, starting with the pad value. For example, if you want to enable VSL in a BERT model, you could pad the input_ids tensor with -100 and specify to the embedding layer that the pad_id is -100. You would need to do the same for the other inputs to the model that are fed directly to embedding layers.
For a labels tensor fed directly into an MLM loss computation, this tensor must be padded with -1.
As long as the pad_id is correctly specified in any embedding layers, no change to model code is necessary to enable VSL, only the above changes to the data and the Cerebras config.
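The data-side padding rules can be sketched in plain Python as follows. The helper `pad_to_length`, the feature names, the token ids, and the two maximum lengths are assumptions made for illustration. Passing the pad_id to the Cerebras embedding layer and setting the config flag are separate steps that are not shown here.

```python
INPUT_PAD_ID = -100  # example pad id from the text; must not be a real token id
LABEL_PAD_ID = -1    # pad value the text requires for MLM labels


def pad_to_length(values, max_len, pad_value):
    """Right-pad a list of ints with pad_value up to max_len."""
    if len(values) > max_len:
        raise ValueError(f"sequence of length {len(values)} exceeds max_len={max_len}")
    return list(values) + [pad_value] * (max_len - len(values))


# Hypothetical tokenized sample; the ids and lengths are illustrative only.
sample = {"input_ids": [101, 7592, 103, 102], "mlm_labels": [2088]}
MAX_SEQ_LEN = 8
MAX_PREDICTIONS = 3

features = {
    # Inputs that feed an embedding layer use the dedicated pad id.
    "input_ids": pad_to_length(sample["input_ids"], MAX_SEQ_LEN, INPUT_PAD_ID),
    # Labels that feed the MLM loss directly are padded with -1.
    "mlm_labels": pad_to_length(sample["mlm_labels"], MAX_PREDICTIONS, LABEL_PAD_ID),
}
print(features)
```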
Limitations#
As mentioned above, VSL is limited in its generality. Accordingly, if a model does not compile with VSL turned on, we suggest attempting a compile without VSL. This feature is also prone to user bugs involving improperly set pad ids or not fully enabling VSL for all inputs of the model. If the pad_id specified is used in some part of the input other than padding, or if the lengths of input tensors or label tensors after removing padding don’t align, the model will compile without error but will stall at runtime.
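Because these mistakes only surface as a runtime stall, it may help to validate data on the host before training. The sketch below is one possible check, not a Cerebras utility. The helper names, the feature names, and the assumption that the listed tensors must share one unpadded length are all hypothetical.

```python
def unpadded_length(sequence, pad_id):
    """Length of `sequence` once the trailing run of pad_id is removed.

    Raises ValueError if pad_id also appears before the trailing padding.
    """
    length = len(sequence)
    while length > 0 and sequence[length - 1] == pad_id:
        length -= 1
    if pad_id in sequence[:length]:
        raise ValueError(f"pad_id {pad_id} appears outside the trailing padding")
    return length


def check_aligned(tensors):
    """tensors maps a name to (sequence, pad_id); unpadded lengths must match."""
    lengths = {name: unpadded_length(seq, pad) for name, (seq, pad) in tensors.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"unpadded lengths differ: {lengths}")
    return lengths


# Hypothetical sample with two embedding inputs padded with the same pad id.
lengths = check_aligned({
    "input_ids": ([101, 7592, 102, -100, -100], -100),
    "segment_ids": ([0, 0, 0, -100, -100], -100),
})
print(lengths)  # prints {'input_ids': 3, 'segment_ids': 3}
```

Which tensors need to agree in length depends on the model, so the grouping passed to `check_aligned` has to be chosen per model.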
PyTorch Variable Tensor Shape#
The Variable Tensor Shape (VTS) interface consists of two custom PyTorch ops: cerebras.framework.torch.nn.StripPadding and cerebras.framework.torch.nn.RestorePadding. The StripPadding function accepts the following arguments:
input: The input tensor to process.
mask: A mask that defines which portions of the input correspond to padding and should be stripped away. In particular, this mask defines where the end of the sequence is.
axis: The axis along which to strip away part of input.
When not running on a CS system, this operation does nothing. On a CS system, it produces a version of input with the end of the tensor stripped away along axis axis. The end of the tensor is defined by the first element of mask that is either 0 or False.
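The stripping rule described above can be illustrated with a pure-Python emulation. This is only a host-side sketch of the described semantics for a batch of sequences stripped along axis 1. It is not the Cerebras op, and the name `strip_padding_reference` and the sample ids are hypothetical.

```python
def strip_padding_reference(batch, mask):
    """Emulate the described stripping along the sequence axis (axis=1).

    For each row, everything from the first mask element that is 0 or False
    onward is dropped, so the result is a list of variable-length rows.
    """
    stripped = []
    for row, row_mask in zip(batch, mask):
        # Index of the first falsy mask entry, or the full length if there is none.
        end = next((i for i, m in enumerate(row_mask) if not m), len(row))
        stripped.append(list(row[:end]))
    return stripped


# Hypothetical padded batch and its attention mask.
input_ids = [
    [101, 7592, 2088, 102, 0, 0],
    [101, 2748, 102, 0, 0, 0],
]
attention_mask = [
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
]
stripped = strip_padding_reference(input_ids, attention_mask)
print(stripped)  # prints [[101, 7592, 2088, 102], [101, 2748, 102]]
```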
The RestorePadding function is not used as commonly as the StripPadding function, but it is useful when the user wants to ignore padding values only in some subset of a model. It accepts the following arguments:
input: The tensor to add padding back into.
axis: The axis along which to add in padding.
pad_value: A scalar value that will be used to pad the input tensor to the maximum shape.
As with StripPadding, this operation is the identity function when not run on a CS system. On a CS system, it performs the inverse operation to StripPadding. That is, it takes a variable-shape tensor and pads it out to its full shape. This input tensor can be the output of a StripPadding call or a tensor derived from passing the output of a StripPadding call through certain VTS-compatible operations. The Cerebras compiler stack infers the maximum shape of input using information from the shape of the input to the StripPadding operation from which input was derived. It then pads out input along axis axis using value pad_value to the shape derived by the compile stack.
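The inverse operation can be emulated on the host in the same spirit. In the sketch below, the maximum length is passed explicitly because there is no compiler to infer it. The name `restore_padding_reference` and the sample values are hypothetical, and this is not the Cerebras op.

```python
def restore_padding_reference(rows, max_len, pad_value):
    """Emulate the described padding along the sequence axis (axis=1).

    Each variable-length row is right-padded with pad_value up to max_len.
    On a CS system the compiler infers max_len; here it is an explicit argument.
    """
    restored = []
    for row in rows:
        if len(row) > max_len:
            raise ValueError(f"row of length {len(row)} exceeds max_len={max_len}")
        restored.append(list(row) + [pad_value] * (max_len - len(row)))
    return restored


# Hypothetical variable-length rows, such as those left after stripping padding.
variable_rows = [[101, 7592, 2088, 102], [101, 2748, 102]]
restored = restore_padding_reference(variable_rows, max_len=6, pad_value=0)
print(restored)  # prints [[101, 7592, 2088, 102, 0, 0], [101, 2748, 102, 0, 0, 0]]
```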
Example Usage in a Model#
Using the StripPadding and RestorePadding functions, it is easy to convert an existing model to use VTS. For example, you can enable VTS in a BERT model as follows:
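import torch
from cerebras.framework.torch.nn import StripPadding


class BertModel(torch.nn.Module):
    # model initialization code

    def forward(
        self,
        input_ids,
        attention_mask,
        masked_lm_positions,
        masked_lm_weights,
        mlm_labels,
    ):
        # Strip padding from the token ids, using the attention mask to mark the end of each sequence.
        input_ids = StripPadding(input_ids, attention_mask)
        # Strip padding from the MLM tensors, using the MLM weights as the mask.
        masked_lm_positions = StripPadding(masked_lm_positions, masked_lm_weights)
        labels = StripPadding(mlm_labels, masked_lm_weights)
        # remaining model code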
Limitations#
Variable Tensor Shape is a feature that is still maturing, and as such has several limitations in its current form. If a model fails Cerebras compile with VTS turned on, we suggest attempting a compile without VTS. The only supported axis for VTS is axis=1, which corresponds to a variable sequence dimension for common language modeling applications.