cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.format_fim

cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.format_fim(segment_fim_format_pairs, max_seq_len, suffix_tok_id, prefix_tok_id, middle_tok_id, eos_tok_id, opt_bos_tok_id)

Takes a list of prefix/middle/suffix token lists, each paired with its FIM (or AR) format. Applies the transformation that matches each format, adding the special tokens and reordering the sections, then concatenates everything together.
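The transformation described above can be sketched as follows. This is a minimal illustration, not the Model Zoo implementation: the helper name `sketch_fim_concat`, the format strings `"PSM"`, `"SPM"`, and `"AR"`, and the exact placement of the special tokens are all assumptions, borrowed from the usual prefix-suffix-middle and suffix-prefix-middle conventions for FIM.

```python
import numpy as np


def sketch_fim_concat(segments, fim_format, prefix_tok_id, middle_tok_id,
                      suffix_tok_id, eos_tok_id, opt_bos_tok_id):
    """Hypothetical helper: reorder one (prefix, middle, suffix) triple.

    The format names and token ordering are assumptions for illustration.
    """
    prefix, middle, suffix = segments
    if fim_format == "PSM":
        # prefix, then suffix, then middle, each introduced by its marker
        parts = [opt_bos_tok_id, [prefix_tok_id], prefix,
                 [suffix_tok_id], suffix, [middle_tok_id], middle,
                 [eos_tok_id]]
    elif fim_format == "SPM":
        # suffix, then prefix, then middle
        parts = [opt_bos_tok_id, [suffix_tok_id], suffix,
                 [prefix_tok_id], prefix, [middle_tok_id], middle,
                 [eos_tok_id]]
    else:
        # autoregressive (AR): original order, no FIM special tokens
        parts = [opt_bos_tok_id, prefix, middle, suffix, [eos_tok_id]]
    # Cast explicitly: an empty list would otherwise promote to float64.
    return np.concatenate(parts).astype(np.int64)


segments = ([10, 11], [20], [30, 31])
special = dict(prefix_tok_id=1, middle_tok_id=2, suffix_tok_id=3, eos_tok_id=0)
psm = sketch_fim_concat(segments, "PSM", opt_bos_tok_id=[], **special)
print(psm.tolist())  # [1, 10, 11, 3, 30, 31, 2, 20, 0]
```

In the AR branch the three sections stay in their original order, so the same helper covers both FIM and plain autoregressive samples.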

Parameters

segment_fim_format_pairs (List[Tuple[List[List[int]], str]]) – List of tuples that stores the prefix/middle/suffix token-id lists and the corresponding FIM formats, to be used downstream in the FIM formatting.
max_seq_len (int) – Max sequence length that each sequence is expected to match
suffix_tok_id (int) – Id for suffix token
prefix_tok_id (int) – Id for prefix token
middle_tok_id (int) – Id for middle token
eos_tok_id (int) – Id for eos (end-of-sequence) token
opt_bos_tok_id (list) – A list containing the bos token id, or an empty list if there is none. An empty list is a no-op in the concatenation. The bos token exists only if the model’s tokenizer adds it by default. In both cases the value must be a list so that the NumPy concatenation works.
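The no-op behaviour of an empty `opt_bos_tok_id` can be checked directly with `np.concatenate`. The snippet below is a small sketch with made-up token ids. It also shows a NumPy detail worth knowing: concatenating an empty Python list promotes the result to float64, so the explicit integer cast here is a suggested precaution, not something the documentation above prescribes.

```python
import numpy as np

tokens = [5, 6, 7]  # made-up token ids for illustration

# With a bos token: the one-element list is prepended.
with_bos = np.concatenate([[1], tokens]).astype(np.int64)

# Without a bos token: the empty list contributes no elements...
without_bos_raw = np.concatenate([[], tokens])
# ...but an empty list is converted to a float64 array, so cast back.
without_bos = without_bos_raw.astype(np.int64)

print(with_bos.tolist())      # [1, 5, 6, 7]
print(without_bos.tolist())   # [5, 6, 7]
print(without_bos_raw.dtype)  # float64
```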

Returns

sample (np.array): Array of token ids in the FIMed order, along with special tokens
mask (np.array): Array of 1’s and 0’s corresponding to true tokens and padding respectively
label (np.array): Token i of label corresponds to token i+1 in the sample array. Same elements, except that label ends in the eos (end-of-sequence) token

Return type

sample, mask, label (each np.array)
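The relationship between the three returned arrays can be sketched as below. This is an illustration under stated assumptions, not the library's code: the helper name `sketch_shift_and_mask`, the `pad_tok_id` argument, and the choice to pad on the right up to `max_seq_len` are all hypothetical.

```python
import numpy as np


def sketch_shift_and_mask(token_ids, max_seq_len, pad_tok_id):
    """Hypothetical helper: derive sample, mask, and label from one sequence.

    Assumes token_ids already ends in the eos token and has at most
    max_seq_len + 1 elements.
    """
    tokens = np.asarray(token_ids, dtype=np.int64)
    if len(tokens) > max_seq_len + 1:
        raise ValueError("sequence too long for max_seq_len")
    sample = tokens[:-1]  # everything except the final eos token
    label = tokens[1:]    # shifted by one, so label[i] == tokens[i + 1]
    n_real = len(sample)
    n_pad = max_seq_len - n_real
    # Right-pad both arrays to max_seq_len (assumed padding scheme).
    sample = np.pad(sample, (0, n_pad), constant_values=pad_tok_id)
    label = np.pad(label, (0, n_pad), constant_values=pad_tok_id)
    # 1 for true tokens, 0 for padding positions.
    mask = (np.arange(max_seq_len) < n_real).astype(np.int64)
    return sample, mask, label


sample, mask, label = sketch_shift_and_mask(
    [1, 10, 11, 3, 30, 2, 20, 0], max_seq_len=10, pad_tok_id=99
)
print(sample.tolist())  # [1, 10, 11, 3, 30, 2, 20, 99, 99, 99]
print(mask.tolist())    # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(label.tolist())   # [10, 11, 3, 30, 2, 20, 0, 99, 99, 99]
```

Before padding, the label holds the same elements as the sample shifted left by one position, and its last true token is the eos id (0 in this example).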
