cerebras.modelzoo.data_preparation.nlp.t5.utils.create_transformer_input_features#

cerebras.modelzoo.data_preparation.nlp.t5.utils.create_transformer_input_features(src_tokens, tgt_tokens, src_max_sequence_length, tgt_max_sequence_length, input_pad_id, attn_mask_pad_id, labels_pad_id, tokenize, sos_token='<s>', eos_token='</s>')[source]#

Creates features for Transformer model input.

Parameters
  • src_tokens (list) – Input tokens to process.

  • tgt_tokens (list) – Target tokens to process.

  • src_max_sequence_length (int) – Maximum sequence length of the encoder input.

  • tgt_max_sequence_length (int) – Maximum sequence length of the decoder input.

  • input_pad_id (int) – Input sequence padding id.

  • attn_mask_pad_id (int) – Attention mask padding id.

  • labels_pad_id (int) – Labels padding id.

  • tokenize (callable) – Method to tokenize the input sequence.

  • sos_token (str) – Start-of-sequence token.

  • eos_token (str) – End-of-sequence token.

Returns

A dict which includes:

  • np.array[int32] input_ids: Numpy array with encoder input token indices.

    Shape: (src_max_sequence_length,).

  • np.array[int32] decoder_input_ids: Numpy array with decoder input token indices.

    Shape: (tgt_max_sequence_length,).

  • np.array[int32] attention_mask: Numpy array with the attention mask for the encoder.

    Shape: (src_max_sequence_length,).

  • np.array[int32] decoder_attention_mask: Numpy array with the attention mask for the decoder. 1 indicates a non-masked token, and 0 indicates a masked token.

    Shape: (tgt_max_sequence_length,).
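
Example

A minimal usage sketch. The toy vocabulary and the tokenize callable below are hypothetical stand-ins, and the assumption that tokenize maps token strings to vocabulary indices is illustrative; a real pipeline would supply a trained tokenizer.

    from cerebras.modelzoo.data_preparation.nlp.t5.utils import (
        create_transformer_input_features,
    )

    # Hypothetical toy vocabulary; a real pipeline would use a trained
    # tokenizer's vocabulary instead.
    vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "hello": 3, "world": 4}

    def tokenize(tokens):
        # Hypothetical tokenize callable: map each token string to its
        # vocabulary index (the exact contract is set by the data pipeline).
        return [vocab[t] for t in tokens]

    features = create_transformer_input_features(
        src_tokens=["hello", "world"],
        tgt_tokens=["world", "hello"],
        src_max_sequence_length=16,
        tgt_max_sequence_length=8,
        input_pad_id=vocab["<pad>"],
        attn_mask_pad_id=0,
        labels_pad_id=vocab["<pad>"],
        tokenize=tokenize,
        sos_token="<s>",
        eos_token="</s>",
    )

    # Per the Returns section above, each value is a fixed-length int32 array:
    print(features["input_ids"].shape)               # (16,)
    print(features["decoder_input_ids"].shape)       # (8,)
    print(features["attention_mask"].shape)          # (16,); 1 = real token, 0 = padding
    print(features["decoder_attention_mask"].shape)  # (8,)

Each returned array has the fixed length given by the corresponding *_max_sequence_length argument, which keeps all examples in a batch the same shape.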