cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.format_fim

cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.format_fim(segment_fim_format_pairs, max_seq_len, suffix_tok_id, prefix_tok_id, middle_tok_id, eos_tok_id, opt_bos_tok_id)

Takes a list of prefix/middle/suffix token lists, each paired with its FIM (or AR) format. Applies the transformation that each format calls for, adding the special tokens and rearranging the sections, then concatenates everything together.
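As a rough illustration of what "adding the special tokens and rearranging the sections" can look like, here is a minimal sketch of one common FIM layout, prefix-suffix-middle. The helper name `psm_sketch`, the token ordering, and all token ids are assumptions made for illustration; this page does not confirm the exact ordering that `format_fim` uses.

```python
import numpy as np


def psm_sketch(prefix, middle, suffix, prefix_tok_id, middle_tok_id, suffix_tok_id):
    """Hypothetical helper: lay out one segment in prefix-suffix-middle order.

    Assumed layout (not taken from the library):
    <prefix_tok> prefix <suffix_tok> suffix <middle_tok> middle
    """
    return np.concatenate(
        [
            [prefix_tok_id], prefix,
            [suffix_tok_id], suffix,
            [middle_tok_id], middle,
        ]
    ).astype(np.int64)


# Made-up token ids, chosen only to make the ordering easy to see.
tokens = psm_sketch(
    prefix=[10, 11],
    middle=[12],
    suffix=[13, 14],
    prefix_tok_id=1,
    middle_tok_id=2,
    suffix_tok_id=3,
)
print(tokens.tolist())
```

In this layout the middle section is moved to the end, so a left-to-right model sees both the prefix and the suffix before it has to produce the middle.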

Parameters
  • segment_fim_format_pairs (List[Tuple[List[List[int]], str]]) – List of tuples that stores the prefix/middle/suffix token-id lists and the corresponding FIM formats, to be used downstream in the FIM formatting.

  • max_seq_len (int) – Max sequence length that each sequence is expected to match

  • suffix_tok_id (int) – Id for suffix token

  • prefix_tok_id (int) – Id for prefix token

  • middle_tok_id (int) – Id for middle token

  • eos_tok_id (int) – Id for eos (end-of-sequence) token

  • opt_bos_tok_id (list) – A list containing the bos token id, or an empty list. An empty list is a no-op in the concatenation. The bos token is present only if the model’s tokenizer adds a bos token by default. In both cases the value has to be a list so that the NumPy concatenation works.
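The point about opt_bos_tok_id being a list can be checked with plain NumPy. The sketch below uses made-up token ids and does not call `format_fim` itself; it only shows that an empty list contributes nothing to `np.concatenate`, while a one-element list prepends the bos id.

```python
import numpy as np

body = [5, 6, 7]  # made-up token ids standing in for a formatted segment

# Tokenizer adds a bos token: pass a one-element list.
with_bos = np.concatenate([[0], body]).astype(np.int64)

# Tokenizer adds no bos token: pass an empty list, which is a no-op.
# An empty Python list becomes a float64 array, so the result is cast
# back to an integer dtype.
without_bos = np.concatenate([[], body]).astype(np.int64)

print(with_bos.tolist(), without_bos.tolist())
```

Passing a bare int instead of a list would fail, because `np.concatenate` does not accept zero-dimensional inputs.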

Returns
  • sample (np.array) – Array of token ids in the FIMed order, along with special tokens

  • mask (np.array) – Array of 1’s and 0’s corresponding to true tokens and padding respectively

  • label (np.array) – Token i of label corresponds to token i+1 in the sample array. Same elements as sample, except that label ends in the eos (end-of-sequence) token
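The relationship between sample, mask and label can be sketched as follows. This is not the library's implementation: the helper name `shift_and_pad_sketch`, the `pad_tok_id` argument, and the choice to pad on the right are all assumptions made for illustration, and the sketch assumes the input already fits within max_seq_len.

```python
import numpy as np


def shift_and_pad_sketch(token_ids, max_seq_len, eos_tok_id, pad_tok_id=0):
    """Hypothetical helper: build (sample, mask, label) from formatted token ids."""
    assert len(token_ids) <= max_seq_len, "sketch assumes no truncation is needed"

    full = list(token_ids) + [eos_tok_id]
    sample = np.array(full[:-1], dtype=np.int64)  # everything except the final eos
    label = np.array(full[1:], dtype=np.int64)    # shifted left by one, ends in eos

    n_pad = max_seq_len - len(sample)
    # 1 marks a true token, 0 marks padding.
    mask = np.concatenate([np.ones(len(sample)), np.zeros(n_pad)]).astype(np.int64)

    # pad_tok_id is an assumed parameter; format_fim does not take one.
    sample = np.pad(sample, (0, n_pad), constant_values=pad_tok_id)
    label = np.pad(label, (0, n_pad), constant_values=pad_tok_id)
    return sample, mask, label


sample, mask, label = shift_and_pad_sketch([1, 10, 11, 3], max_seq_len=6, eos_tok_id=9)
print(sample.tolist(), mask.tolist(), label.tolist())
```

Within the unpadded region, label[i] equals sample[i + 1], and the last true label position holds the eos id.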