tf.AttentionLayer module
- class tf.AttentionLayer.AttentionLayer(*args: Any, **kwargs: Any)
Bases: modelzoo.common.layers.tf.BaseLayer.BaseLayer
Multi-head attention layer. Based on MLCommons model.
- Parameters
hidden_size (int) – Number of units in each projection output.
num_heads (int) – Number of attention heads.
use_projection_bias (bool) – Whether to use bias in the key, query, and value projections.
use_ffn_bias (bool) – Whether to use bias in the output projection.
initializer (str) – Projection kernel initializer. Defaults to glorot_uniform.
kernel_regularizer (Optional[Callable]) – Projection kernel regularizer. Defaults to None.
bias_regularizer (Optional[Callable]) – Projection bias regularizer. Defaults to None.
attention_type (str) – The attention variant to execute. Currently accepts dot_product and scaled_dot_product. Defaults to scaled_dot_product.
dropout_rate (float) – Dropout rate for key-query weights. Defaults to 0.0.
dropout_seed (int) – Seed with which to initialize the dropout layer. Defaults to None.
use_relative_attention_bias (bool) – Whether to use relative position bias when calculating attention.
num_relative_attention_buckets (int) – Number of buckets used to calculate relative position bias when use_relative_attention_bias is set to True.
bidirectional_relative_attention (bool) – Whether attention is bidirectional.
boundary_casting (bool) – If True, outputs the values in half precision and casts the input values up to full precision.
tf_summary (bool) – If True, saves the activations with summary_layer.
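A minimal construction sketch (not part of the original reference): the import path is assumed from the documented base-class location modelzoo.common.layers.tf.BaseLayer.BaseLayer and may differ in your checkout; the argument values are illustrative only.

    # Hedged sketch: import path and argument values are assumptions,
    # not taken from the reference above.
    from modelzoo.common.layers.tf.AttentionLayer import AttentionLayer

    attention = AttentionLayer(
        hidden_size=768,               # units in each projection output
        num_heads=12,                  # number of attention heads
        use_projection_bias=True,      # bias in key/query/value projections
        use_ffn_bias=True,             # bias in the output projection
        initializer="glorot_uniform",  # documented default
        attention_type="scaled_dot_product",
        dropout_rate=0.1,              # dropout on key-query weights
    )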
- call(q, v, mask=None, past_kv=None, cache_present_kv=False, training=True, position_bias=None, cache_position_bias=False)
Applies the attention mechanism to queries q and values v. Keys will be set to be the same as v.
- Parameters
q (Tensor) – Queries, shape [batch_size, seq_length, hidden_size].
v (Tensor) – Values, shape [batch_size, seq_length, hidden_size].
mask (Tensor) – Attention mask. Can be 2D of shape [batch_size, seq_length], or 3D of shape [batch_size, query_length, seq_length].
past_kv (Tensor) – Past keys and values. Has shape [2, batch_size, num_heads, seq_length, hidden_size / num_heads]. The tensors in [0,:,:,:,:] and [1,:,:,:,:] contain the past keys and values, respectively. Defaults to None.
cache_present_kv (bool) – Specifies if the present keys and values must be cached and returned. Needed to speed up the computations when the decoder is called within an autoregressive loop. Defaults to False.
training (bool) – Whether the model is in training mode. Needed to run the dropout (applied after the softmax) in the appropriate mode.
position_bias (Tensor) – Tensor containing the position bias to apply in attention.
cache_position_bias (bool) – Specifies if the position bias must be cached and returned. Needed to speed up the computations when the decoder is called within an autoregressive loop. Defaults to False.
- Returns
When cache_present_kv is True and cache_position_bias is True, returns a tuple: the 0th entry contains the attention output, the 1st entry contains a tensor of the keys and values computed at the current application of the attention layer, and the 2nd entry contains a tensor of the position bias computed at the current application of the attention layer. If cache_present_kv is False, no entry for present keys and values is provided. If cache_position_bias is False, no entry for position bias is provided. If both cache_present_kv and cache_position_bias are set to False, only the attention output tensor is returned.
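A hedged call sketch building on the layer constructed above, assuming TensorFlow 2.x; the tuple unpacking follows the Returns description and should be checked against the modelzoo source.

    import tensorflow as tf

    batch_size, seq_length, hidden_size = 2, 128, 768
    q = tf.random.uniform([batch_size, seq_length, hidden_size])
    v = tf.random.uniform([batch_size, seq_length, hidden_size])
    mask = tf.ones([batch_size, seq_length])  # 2D padding mask

    # With both cache flags left at their defaults (False), only the
    # attention output is returned.
    output = attention(q, v, mask=mask, training=True)

    # With cache_present_kv=True (and cache_position_bias left False),
    # the Returns description implies an (output, present keys/values) pair.
    output, present_kv = attention(
        q, v, mask=mask, cache_present_kv=True, training=False
    )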
- class tf.AttentionLayer.SelfAttentionLayer(*args: Any, **kwargs: Any)
Bases: tf.AttentionLayer.AttentionLayer
Multi-head self-attention layer.
- call(x, mask=None, past_kv=None, cache_present_kv=False, training=True, position_bias=None, cache_position_bias=False)
Applies the self-attention mechanism to the input x. Queries, keys, and values are all derived from x.
- Parameters
x (Tensor) – Input tensor, shape [batch_size, seq_length, hidden_size]. Serves as the queries, keys, and values.
mask (Tensor) – Attention mask. Can be 2D of shape [batch_size, seq_length], or 3D of shape [batch_size, query_length, seq_length].
past_kv (Tensor) – Past keys and values. Has shape [2, batch_size, num_heads, seq_length, hidden_size / num_heads]. The tensors in [0,:,:,:,:] and [1,:,:,:,:] contain the past keys and values, respectively. Defaults to None.
cache_present_kv (bool) – Specifies if the present keys and values must be cached and returned. Needed to speed up the computations when the decoder is called within an autoregressive loop. Defaults to False.
training (bool) – Whether the model is in training mode. Needed to run the dropout (applied after the softmax) in the appropriate mode.
position_bias (Tensor) – Tensor containing the position bias to apply in attention.
cache_position_bias (bool) – Specifies if the position bias must be cached and returned. Needed to speed up the computations when the decoder is called within an autoregressive loop. Defaults to False.
- Returns
When cache_present_kv is True and cache_position_bias is True, returns a tuple: the 0th entry contains the attention output, the 1st entry contains a tensor of the keys and values computed at the current application of the attention layer, and the 2nd entry contains a tensor of the position bias computed at the current application of the attention layer. If cache_present_kv is False, no entry for present keys and values is provided. If cache_position_bias is False, no entry for position bias is provided. If both cache_present_kv and cache_position_bias are set to False, only the attention output tensor is returned.
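A sketch of incremental decoding with SelfAttentionLayer, under the same assumed import path as above and with constructor arguments assumed to match the parent class; feeding the cached present keys/values back in as past_kv is how the autoregressive speed-up described above would typically be used.

    import tensorflow as tf
    from modelzoo.common.layers.tf.AttentionLayer import SelfAttentionLayer

    self_attention = SelfAttentionLayer(hidden_size=768, num_heads=12)
    x = tf.random.uniform([2, 128, 768])

    # Plain forward pass: queries, keys, and values all come from x.
    y = self_attention(x, training=True)

    # First decode step: cache the present keys/values ...
    out, present_kv = self_attention(x, cache_present_kv=True, training=False)

    # ... then feed them back as past_kv for the next token so earlier
    # projections do not need to be recomputed.
    next_token = tf.random.uniform([2, 1, 768])
    next_out = self_attention(next_token, past_kv=present_kv, training=False)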