modelzoo.transformers.pytorch.t5.t5_model.T5ForConditionalGeneration#
- class modelzoo.transformers.pytorch.t5.t5_model.T5ForConditionalGeneration[source]#
Bases: torch.nn.Module
T5 Model with a language modeling head on top.
- Parameters
  - src_vocab_size (int, optional, defaults to 32128) – Source vocabulary size of the T5 model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling T5Model or TFT5Model.
  - tgt_vocab_size (int, optional, defaults to 32128) – Target vocabulary size of the T5 model. Only useful if set for a Transformer variant where source and target vocabularies can be different.
  - d_model (int, optional, defaults to 512) – Size of the encoder layers and the pooler layer.
  - d_kv (int, optional, defaults to 64) – Size of the key, query, value projections per attention head. d_kv does not have to be equal to d_model // num_heads.
  - d_ff (int, optional, defaults to 2048) – Size of the intermediate feed forward layer in each T5Block.
  - encoder_num_hidden_layers (int, optional, defaults to 6) – Number of hidden layers in the Transformer encoder.
  - decoder_num_hidden_layers (int, optional) – Number of hidden layers in the Transformer decoder. Will use the same value as encoder_num_hidden_layers if not set.
  - num_heads (int, optional, defaults to 8) – Number of attention heads for each attention layer in the Transformer encoder and decoder.
  - relative_attention_num_buckets (int, optional, defaults to 32) – The number of buckets to use for each attention layer.
  - norm_type (str, optional, defaults to "rmsnorm") – Determines which type of layernorm to use. RMSNorm is the same as T5-style layernorm (no mean subtraction and no bias correction).
  - dropout_rate (float, optional, defaults to 0.1) – The ratio for all dropout layers.
  - layer_norm_epsilon (float, optional, defaults to 1e-6) – The epsilon used by the layer normalization layers.
  - initializer_factor (float, optional, defaults to 1) – A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).
  - encoder_nonlinearity (string, optional, defaults to "relu") – Type of feed forward layer to be used in the encoder. Should be one of "relu", "geglu", or "gelu". T5v1.1 uses the "geglu" feed forward projection; the original T5 uses "relu".
  - decoder_nonlinearity (string, optional, defaults to "relu") – Type of feed forward layer to be used in the decoder. Should be one of "relu", "geglu", or "gelu". T5v1.1 uses the "geglu" feed forward projection; the original T5 uses "relu".
  - use_cache (bool, optional, defaults to False) – Whether or not the model should return the last key/values attentions (not used by all models).
  - position_embedding_type (string, optional, defaults to "relative") – The type of position embedding to use. Should be one of "fixed", "learned_absolute", "relative", or None. "fixed" uses a concatenation of sin curves to express relative position as used in the original Transformer paper. "learned_absolute" uses a learned vector for each position in the sequence. "relative" uses learned relative position embeddings as introduced in https://arxiv.org/abs/1803.02155, configured as done in the original T5 publication. None turns off position embedding altogether.
  - src_max_position_embeddings (int, optional, defaults to 512) – Maximum source sequence length used to train the model.
  - tgt_max_position_embeddings (int, optional, defaults to 512) – Maximum target sequence length used to train the model.
  - use_dropout_outside_residual_path (bool, optional, defaults to True) – Whether to set dropout calculations outside of the residual path. Set to True for T5, but False for Transformer.
  - share_encoder_decoder_embedding (bool, optional, defaults to True) – Whether to share the encoder/decoder embedding layer. Set to True for both T5 and Transformer models.
  - share_embedding_weights (bool, optional, defaults to True) – Whether to share embedding weights between the decoder and the language model head.
  - relu_dropout_rate (float, optional, defaults to 0.1) – Dropout rate utilized in the FFN layer after applying the relu activation function. This parameter is set to 0 for the Transformer model, and set to dropout_rate for the default T5 configuration. Transformer reference: https://github.com/tensorflow/tensor2tensor/blob/5623deb79cfcd28f8f8c5463b58b5bd76a81fd0d/tensor2tensor/models/transformer.py#L1811 T5 reference: https://github.com/huggingface/transformers/blob/v4.15.0/src/transformers/models/t5/modeling_t5.py#L261
  - use_pre_encoder_decoder_dropout (bool, optional, defaults to False) – Whether to use a dropout layer after the positional embedding layer and encoder/decoder. This is set to False for T5 and True for Transformer.
  - use_pre_encoder_decoder_layer_norm (bool, optional, defaults to True) – Whether to use layer norm before passing input tensors into the encoder/decoder. This is set to True for T5 and False for Transformer.
  - use_ffn_bias (bool, optional, defaults to False) – Whether to use bias in the hidden layer with relu activation. This is set to False for T5, and True for Transformer.
  - lm_loss_weight (float, optional, defaults to 1.0) – Value that scales the loss by the mean number of predictions per sequence in the dataset.
  - use_transformer_initialization (bool, optional, defaults to False) – The Transformer model tends to converge best with a scaled variant of Xavier uniform initialization used for linear layers. This contrasts with the initialization used for the original T5 paper, which uses He normal initialization for linear layers. Setting this flag to True switches the initialization to the Transformer-specific scaled Xavier initialization.
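The T5-versus-Transformer flags above are easiest to read side by side. The following is a minimal sketch (not an official configuration) that restates those descriptions as keyword arguments; all names and values are taken from the parameter list above, and the accepted arguments should be checked against the __init__ signature below.

t5_style_kwargs = dict(
    encoder_nonlinearity="relu",            # original T5; T5v1.1 would use "geglu"
    use_dropout_outside_residual_path=True,
    use_pre_encoder_decoder_dropout=False,
    use_pre_encoder_decoder_layer_norm=True,
    use_ffn_bias=False,
    relu_dropout_rate=0.1,                  # matches dropout_rate for default T5
    use_transformer_initialization=False,
)
transformer_style_kwargs = dict(
    encoder_nonlinearity="relu",
    use_dropout_outside_residual_path=False,
    use_pre_encoder_decoder_dropout=True,
    use_pre_encoder_decoder_layer_norm=False,
    use_ffn_bias=True,
    relu_dropout_rate=0.0,                  # FFN dropout disabled for Transformer
    use_transformer_initialization=True,    # scaled Xavier uniform initialization
)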
Methods

- compute_decoder_states
- compute_hidden_states
- compute_sequence_output – labels (torch.LongTensor of shape (batch_size,), optional)
- get_input_embeddings
- get_output_embeddings
- reset_parameters
- set_input_embeddings
- set_output_embeddings
- tie_weights – Tie the weights between the input embeddings and the output embeddings and (if enabled) tie encoder/decoder weights.
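For context, tying weights in PyTorch generally amounts to assigning the same Parameter object to both modules. A generic sketch of the pattern (not this class's exact implementation):

import torch.nn as nn

vocab_size, d_model = 32128, 512
embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
# After tying, both modules share one tensor, so updates to either affect both.
lm_head.weight = embedding.weight
assert lm_head.weight.data_ptr() == embedding.weight.data_ptr()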
- __call__(*args: Any, **kwargs: Any) → Any #
Call self as a function.
- __init__(src_vocab_size=32128, tgt_vocab_size=32128, d_model=512, d_kv=64, d_ff=2048, encoder_num_hidden_layers=6, decoder_num_hidden_layers=None, num_heads=8, relative_attention_num_buckets=32, norm_type='rmsnorm', dropout_rate=0.1, relu_dropout_rate=None, layer_norm_epsilon=1e-06, initializer_factor=1.0, encoder_nonlinearity='relu', decoder_nonlinearity='relu', use_projection_bias_in_attention=False, attention_softmax_fp32=True, use_cache=False, decoder_start_token_id=None, pad_token_id=0, position_embedding_type='relative', src_max_position_embeddings=512, tgt_max_position_embeddings=512, use_dropout_outside_residual_path=True, share_encoder_decoder_embedding=True, share_embedding_weights=True, tie_encoder_decoder=False, use_pre_encoder_decoder_dropout=False, use_pre_encoder_decoder_layer_norm=True, use_ffn_bias=False, label_smoothing=0.0, use_transformer_initialization=False, attention_module='aiayn_attention', extra_attention_params={}, **kwargs)[source]#
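As a quick sketch of direct construction (assuming the module is importable from the path in the page title), the defaults above can be passed explicitly:

>>> from modelzoo.transformers.pytorch.t5.t5_model import T5ForConditionalGeneration
>>> model = T5ForConditionalGeneration(
...     src_vocab_size=32128,
...     d_model=512,
...     d_kv=64,
...     d_ff=2048,
...     encoder_num_hidden_layers=6,
...     num_heads=8,
...     position_embedding_type="relative",
... )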
- static __new__(cls, *args: Any, **kwargs: Any) → Any #
- forward(input_ids=None, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, encoder_outputs=None, past_key_values=None, labels=None, use_cache=None, prepend_embeddings=None)[source]#
- labels (torch.LongTensor of shape (batch_size,), optional): Labels for computing the sequence classification/regression loss. Indices should be in [-100, 0, ..., config.vocab_size - 1]. All labels set to -100 are ignored (masked); the loss is only computed for labels in [0, ..., config.vocab_size - 1].
Returns:
Examples:
>>> from transformers import T5Tokenizer, T5ForConditionalGeneration
>>> tokenizer = T5Tokenizer.from_pretrained('t5-small')
>>> model = T5ForConditionalGeneration.from_pretrained('t5-small')
>>> # training
>>> input_ids = tokenizer('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt').input_ids
>>> labels = tokenizer('<extra_id_0> cute dog <extra_id_1> the <extra_id_2>', return_tensors='pt').input_ids
>>> outputs = model(input_ids=input_ids, labels=labels)
>>> loss = outputs.loss
>>> logits = outputs.logits
>>> # inference
>>> input_ids = tokenizer("summarize: studies have shown that owning a dog is good for you", return_tensors="pt").input_ids  # Batch size 1
>>> outputs = model.generate(input_ids)
>>> print(tokenizer.decode(outputs[0], skip_special_tokens=True))
>>> # studies have shown that owning a dog is good for you.
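Because positions set to -100 are excluded from the loss, padded target positions are typically masked before being passed as labels. A minimal sketch, assuming pad_token_id=0 as in the constructor defaults (the token ids below are hypothetical):

>>> import torch
>>> decoder_targets = torch.tensor([[13959, 1566, 12, 2968, 10, 1, 0, 0]])  # hypothetical ids, padded with 0
>>> labels = decoder_targets.clone()
>>> labels[labels == 0] = -100  # padded positions are ignored by the loss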