cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.truncate_or_pad_helper

cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.truncate_or_pad_helper(segments_fim_format_pairs, diff, fim_pad_tok_id, sample_idx)[source]

Since we perform FIM at the character level, we may split the text in the middle of a word. This can lead to non-standard token sequences, and after re-tokenizing we may need to truncate or pad to recover the original context length. This function ensures that the outputs are returned to their original length.

Parameters
  • segments_fim_format_pairs (List[Tuple[List[List[int]], str]]) – This list of tuples stores the prefix/middle/suffix token-id lists and the corresponding FIM formats, to be used downstream in the FIM formatting.

  • diff (int) – The number of tokens to add or remove. A positive value means truncate; a negative value means pad.

  • fim_pad_tok_id (int) – ID of the padding token.

Returns

(List[Tuple[List[List[int]], str]]): The elements of the tuples will now be lists that are truncated or padded such that the concatenation of all these tokens, along with the special tokens, equals the original sequence length.
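The sketch below illustrates one way such a helper can behave; it is not the ModelZoo implementation, and the function name, the tail-first truncation order, and the choice to pad the final segment are all assumptions made for illustration. Note that the actual signature also accepts a sample_idx argument, which this sketch omits.

```python
from typing import List, Tuple


def truncate_or_pad_sketch(
    segments_fim_format_pairs: List[Tuple[List[List[int]], str]],
    diff: int,
    fim_pad_tok_id: int,
) -> List[Tuple[List[List[int]], str]]:
    """Minimal sketch: remove `diff` tokens from the tail when diff > 0,
    or append `-diff` pad tokens to the last segment when diff < 0."""
    # Copy the nested lists so the caller's data is not mutated in place.
    result = [
        ([list(seg) for seg in segments], fmt)
        for segments, fmt in segments_fim_format_pairs
    ]
    remaining = diff
    if remaining > 0:
        # Truncate: drop tokens from the end, walking segments backwards.
        for segments, _ in reversed(result):
            for seg in reversed(segments):
                take = min(remaining, len(seg))
                if take:
                    del seg[len(seg) - take:]
                    remaining -= take
                if remaining == 0:
                    break
            if remaining == 0:
                break
    elif remaining < 0:
        # Pad: append pad tokens to the last segment of the last sample.
        result[-1][0][-1].extend([fim_pad_tok_id] * -remaining)
    return result


pairs = [([[1, 2, 3], [4, 5], [6]], "PSM")]
print(truncate_or_pad_sketch(pairs, diff=2, fim_pad_tok_id=0))
# -> [([[1, 2, 3], [4], []], 'PSM')]   two tokens removed from the tail
print(truncate_or_pad_sketch(pairs, diff=-2, fim_pad_tok_id=0))
# -> [([[1, 2, 3], [4, 5], [6, 0, 0]], 'PSM')]   two pad tokens appended
```

The real implementation may distribute the adjustment differently (for example, truncating a specific segment rather than the tail); the invariant that matters is that the total token count across all segments, plus the special FIM tokens, matches the original sequence length.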