tf.AttentionLayer module

class tf.AttentionLayer.AttentionLayer(*args: Any, **kwargs: Any)

Bases: modelzoo.common.layers.tf.BaseLayer.BaseLayer

Multi-head attention layer, based on the MLCommons model. A construction sketch follows the parameter list below.

Parameters
  • hidden_size (int) – Number of units in each projection output.

  • num_heads (int) – Number of attention heads.

  • use_projection_bias (bool) – Whether to use bias in the key, query, and value projections.

  • use_ffn_bias (bool) – Whether to use bias in the output projection.

  • initializer (str) – Projection kernel initializer. Defaults to glorot_uniform.

  • kernel_regularizer (Optional[Callable]) – Projection kernel regularizer. Defaults to None.

  • bias_regularizer (Optional[Callable]) – Projection bias regularizer. Defaults to None.

  • attention_type (str) – The attention variant to execute. Currently accepts dot_product and scaled_dot_product. Defaults to scaled_dot_product.

  • dropout_rate (float) – Dropout rate for key-query weights. Defaults to 0.0.

  • dropout_seed (Optional[int]) – Seed with which to initialize the dropout layer. Defaults to None.

  • use_relative_attention_bias (bool) – Whether to use relative position bias when calculating attention.

  • num_relative_attention_buckets (int) – Used to calculate relative position bias when use_relative_attention_bias is set to True.

  • bidirectional_relative_attention (bool) – Whether attention is bidirectional.

  • boundary_casting (bool) – If True, then outputs the values in half precision and casts the input values up to full precision.

  • tf_summary (bool) – If True, then saves the activations with summary_layer.
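Example (a hedged construction sketch, not taken from the source: the import path is inferred from the base-class path above, and the argument values are illustrative placeholders rather than library defaults):

    # The import path mirrors modelzoo.common.layers.tf.BaseLayer above and
    # may differ in your checkout of the model zoo.
    from modelzoo.common.layers.tf.AttentionLayer import AttentionLayer

    attn = AttentionLayer(
        hidden_size=768,                      # units in each projection output
        num_heads=12,                         # per-head size is hidden_size / num_heads
        use_projection_bias=True,
        use_ffn_bias=True,
        initializer="glorot_uniform",
        attention_type="scaled_dot_product",
        dropout_rate=0.1,
        use_relative_attention_bias=False,
        boundary_casting=False,
        tf_summary=False,
    )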

call(q, v, mask=None, past_kv=None, cache_present_kv=False, training=True, position_bias=None, cache_position_bias=False)

Applies the attention mechanism to queries q and values v. Keys are set to be the same as v.

Parameters
  • q (Tensor) – Queries, shape [batch_size, seq_length, hidden_size].

  • v (Tensor) – Values, shape [batch_size, seq_length, hidden_size].

  • mask (Tensor) – Attention mask. Can be 2D of shape [batch_size, seq_length], or 3D of shape [batch_size, query_length, seq_length].

  • past_kv (Tensor) – Past keys and values. Has shape [2, batch_size, num_heads, seq_length, hidden_size / num_heads]. The tensors in [0,:,:,:,:] and [1,:,:,:,:] contain the past keys and values, respectively. Defaults to None.

  • cache_present_kv (bool) – Specifies if the present keys and values must be cached and returned. Needed to speed up the computations when the decoder is called within an autoregressive loop. Defaults to False.

  • training (bool) – Whether the layer is being called in training mode. Needed so that dropout (applied after the softmax) runs in the appropriate mode.

  • position_bias (Tensor) – Tensor containing position bias to apply in attention.

  • cache_position_bias (bool) – Specifies if position bias must be cached and returned. Needed to speed up the computations when the decoder is called within an autoregressive loop. Defaults to False.

Returns

When both cache_present_kv and cache_position_bias are True, returns a tuple in which the 0th entry contains the attention output, the 1st entry contains a tensor of the keys and values computed at the current application of the attention layer, and the 2nd entry contains a tensor of the position bias computed at the current application of the attention layer.

If cache_present_kv is False, no entry for present keys and values is provided.

If cache_position_bias is False, no entry for position bias is provided.

If both cache_present_kv and cache_position_bias are set to False, returns a single tensor with shape equal to the shape of past_kv (see above).
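
Example (a hedged call sketch: attn refers to the construction example above, the layer is invoked as a Keras-style callable, and the tuple unpacking follows the Returns description rather than verified source code):

    import tensorflow as tf

    batch_size, seq_length, hidden_size = 2, 128, 768
    q = tf.random.uniform([batch_size, seq_length, hidden_size])
    v = tf.random.uniform([batch_size, seq_length, hidden_size])
    mask = tf.ones([batch_size, seq_length])  # 2D padding mask

    # No caching requested: a single attention-output tensor is returned.
    output = attn(q, v, mask=mask, training=True)

    # With cache_present_kv=True (and cache_position_bias left False), the
    # Returns description implies an (output, present_kv) tuple.
    output, present_kv = attn(q, v, mask=mask, cache_present_kv=True,
                              training=False)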

class tf.AttentionLayer.SelfAttentionLayer(*args: Any, **kwargs: Any)

Bases: tf.AttentionLayer.AttentionLayer

Multi-head self-attention layer.

call(x, mask=None, past_kv=None, cache_present_kv=False, training=True, position_bias=None, cache_position_bias=False)

Applies the self-attention mechanism to the input x; queries, keys, and values are all set to x.

Parameters
  • x (Tensor) – Input tensor, shape [batch_size, seq_length, hidden_size], used as queries, keys, and values.

  • mask (Tensor) – Attention mask. Can be 2D of shape [batch_size, seq_length], or 3D of shape [batch_size, query_length, seq_length].

  • past_kv (Tensor) – Past keys and values. Has shape [2, batch_size, num_heads, seq_length, hidden_size / num_heads]. The tensors in [0,:,:,:,:] and [1,:,:,:,:] contain the past keys and values, respectively. Defaults to None.

  • cache_present_kv (bool) – Specifies if the present keys and values must be cached and returned. Needed to speed up the computations when the decoder is called within an autoregressive loop. Defaults to False.

  • training (bool) – Whether the layer is being called in training mode. Needed so that dropout (applied after the softmax) runs in the appropriate mode.

  • position_bias (Tensor) – Tensor containing position bias to apply in attention.

  • cache_position_bias (bool) – Specifies if position bias must be cached and returned. Needed to speed up the computations when the decoder is called within an autoregressive loop. Defaults to False.

Returns

When both cache_present_kv and cache_position_bias are True, returns a tuple in which the 0th entry contains the attention output, the 1st entry contains a tensor of the keys and values computed at the current application of the attention layer, and the 2nd entry contains a tensor of the position bias computed at the current application of the attention layer.

If cache_present_kv is False, no entry for present keys and values is provided.

If cache_position_bias is False, no entry for position bias is provided.

If both cache_present_kv and cache_position_bias are set to False, returns a single tensor with shape equal to the shape of past_kv (see above).
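
Example (a hedged sketch of the caching path in an autoregressive loop: the constructor arguments are assumed to match AttentionLayer's, and whether present_kv already contains earlier steps or must be concatenated by the caller should be confirmed against the implementation):

    import tensorflow as tf
    from modelzoo.common.layers.tf.AttentionLayer import SelfAttentionLayer

    self_attn = SelfAttentionLayer(hidden_size=768, num_heads=12)

    x_step = tf.random.uniform([1, 1, 768])  # one new token per decode step
    past_kv = None
    for _ in range(4):
        # cache_present_kv=True returns the keys and values so the next step
        # can reuse them via past_kv instead of recomputing earlier positions.
        out, past_kv = self_attn(
            x_step,
            past_kv=past_kv,
            cache_present_kv=True,
            training=False,
        )
        # In a real decoder the next token embedding would come from the rest
        # of the model; reusing `out` here keeps the sketch self-contained.
        x_step = out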