cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.fim#

cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.fim(sample_array, sample_idx, tokenizer, fim_rate, spm_rate, suffix_tok_id, prefix_tok_id, middle_tok_id, fim_pad_tok_id, eos_tok_id, opt_bos_tok_id)[source]#

Takes a stacked array of input_ids, mask, and labels and, with some probability, performs the FIM operation to re-arrange each context into PSM or SPM format

Parameters
  • sample_array (np.array) – Stack of input_ids, mask, and labels after tokenization. Labels are off-by-one of input_ids, as in standard auto-regressive training

  • sample_idx (int) – Index of the sample in the dataset, used for logging

  • tokenizer (Tokenizer) – Tokenizer object

  • fim_rate (float) – Determines what percentage of contexts are FIM’ed

  • spm_rate (float) – Determines what percentage of FIM’ed contexts are in SPM format; the remaining 1 - spm_rate are in PSM format

  • suffix_tok_id (int) – Id for special token denoting suffix section in a FIM’ed context

  • prefix_tok_id (int) – Id for special token denoting prefix section in a FIM’ed context

  • middle_tok_id (int) – Id for special token denoting middle section in a FIM’ed context

  • fim_pad_tok_id (int) – Id for padding

  • eos_tok_id (int) – Id for the end-of-sequence token

  • opt_bos_tok_id (list) – A list containing the bos token id if the model’s tokenizer adds a bos token by default, otherwise an empty list. An empty list is a no-op in the concatenation.

Returns

Stack of input_ids, mask, and labels after FIM transformation. Mask and labels have been adjusted so that padding tokens are still filtered and each label still represents the following token, respectively.

Return type

fim_outputs (np.array)
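To illustrate the PSM/SPM rearrangement this function performs, the following is a minimal, self-contained sketch operating on a 1-D array of token ids. It is not the ModelZoo implementation: the special-token ids (`PREFIX`, `MIDDLE`, `SUFFIX`) are hypothetical placeholders for the `prefix_tok_id`, `middle_tok_id`, and `suffix_tok_id` parameters above, and the real function additionally handles the mask, labels, bos/eos tokens, and padding.

```python
import numpy as np

# Hypothetical special-token ids, stand-ins for prefix_tok_id,
# middle_tok_id, and suffix_tok_id; real values come from the tokenizer.
PREFIX, MIDDLE, SUFFIX = 50253, 50254, 50255

def fim_sketch(input_ids, fim_rate=0.9, spm_rate=0.5, rng=None):
    """Re-arrange a 1-D array of token ids into PSM or SPM FIM format.

    With probability 1 - fim_rate the context is returned unchanged.
    Otherwise it is split at two random boundaries into
    (prefix, middle, suffix) and re-assembled as:
      PSM: [PREFIX] prefix [SUFFIX] suffix [MIDDLE] middle
      SPM: [PREFIX, SUFFIX] suffix [MIDDLE] prefix middle
    """
    rng = rng or np.random.default_rng()
    if rng.random() > fim_rate:
        # Leave this context as plain auto-regressive text.
        return input_ids
    # Pick two cut points and split the context into three spans.
    lo, hi = sorted(rng.integers(0, len(input_ids) + 1, size=2))
    prefix, middle, suffix = input_ids[:lo], input_ids[lo:hi], input_ids[hi:]
    if rng.random() < spm_rate:
        # SPM: suffix first, then prefix and middle.
        return np.concatenate([[PREFIX, SUFFIX], suffix, [MIDDLE], prefix, middle])
    # PSM: prefix, then suffix, then middle.
    return np.concatenate([[PREFIX], prefix, [SUFFIX], suffix, [MIDDLE], middle])
```

Note that the FIM'ed output is three tokens longer than the input (one sentinel per section); the real preprocessing therefore re-truncates or re-pads so the sequence length stays fixed.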