cerebras.modelzoo.common.utils.model.transformer_utils.create_broadcasted_autoregressive_mask#

cerebras.modelzoo.common.utils.model.transformer_utils.create_broadcasted_autoregressive_mask(batch_size: int, num_heads: int, tgt_seq_length: int, attention_span: Optional[torch.Tensor] = None, sliding_window_length: Optional[int] = None, device: Optional[torch.device] = None, dtype: torch.dtype = torch.float16, multiply_neg_inf: bool = True)[source]#

Create a broadcasted causal attention mask, optionally with variable sequence length (VSL) masking.

For VSL, attention_span is required, and past tokens outside the current sequence are additionally masked.
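As a rough illustration of the semantics (a minimal sketch, not the library's implementation): in a causal mask, position i may attend only to positions j ≤ i, and with multiply_neg_inf the masked positions carry large negative values so that softmax drives their attention weights to zero. A hypothetical construction of the causal and sliding-window parts might look like:

```python
import torch

# Minimal sketch of the causal-mask semantics; NOT the library implementation.
batch_size, num_heads, tgt_seq_length = 2, 4, 6  # hypothetical sizes

# True above the diagonal marks positions each token must NOT attend to.
masked = torch.triu(
    torch.ones(tgt_seq_length, tgt_seq_length, dtype=torch.bool), diagonal=1
)

# Optional sliding window: additionally mask tokens more than
# `sliding_window_length` steps in the past.
sliding_window_length = 3
masked |= torch.tril(
    torch.ones(tgt_seq_length, tgt_seq_length, dtype=torch.bool),
    diagonal=-(sliding_window_length + 1),
)

# Multiply by a large negative constant (the multiply_neg_inf behavior) and
# broadcast to [batch_size, num_heads, tgt_seq_length, tgt_seq_length].
mask = masked.to(torch.float16) * torch.finfo(torch.float16).min
mask = mask[None, None].expand(batch_size, num_heads, -1, -1)
```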

Parameters
  • batch_size (int) – Batch size.

  • num_heads (int) – Number of heads.

  • tgt_seq_length (int) – Target sequence length.

  • attention_span (torch.Tensor) – Attention span of keys for VSL, with shape [batch_size, tgt_seq_length].

  • sliding_window_length (int) – If specified, the current token would only attend to the current token and the sliding_window_length previous tokens.

  • device (torch.device) – The device of the input to the model, used for causal mask creation.

  • dtype (torch.dtype) – Dtype of the resulting mask; defaults to torch.float16.

  • multiply_neg_inf (bool) – Whether to multiply the resulting mask by a negative infinity constant; defaults to True.

Returns

The attention mask of shape [batch_size, num_heads, tgt_seq_length, tgt_seq_length].
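A usage sketch, assuming the signature documented above (the sizes and sliding window value here are hypothetical):

```python
import torch

from cerebras.modelzoo.common.utils.model.transformer_utils import (
    create_broadcasted_autoregressive_mask,
)

batch_size, num_heads, tgt_seq_length = 2, 8, 128  # hypothetical sizes

# Plain causal mask.
mask = create_broadcasted_autoregressive_mask(
    batch_size=batch_size,
    num_heads=num_heads,
    tgt_seq_length=tgt_seq_length,
    dtype=torch.float16,
)
assert mask.shape == (batch_size, num_heads, tgt_seq_length, tgt_seq_length)

# Causal mask further restricted to a sliding window of 16 previous tokens.
windowed = create_broadcasted_autoregressive_mask(
    batch_size=batch_size,
    num_heads=num_heads,
    tgt_seq_length=tgt_seq_length,
    sliding_window_length=16,
)
```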