cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.chunk

cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.chunk(sample, tokenizer, fim_rate, spm_rate)

FIM (fill-in-the-middle) is applied at the character level, so the sample must be detokenized, split at character boundaries, and re-tokenized after splitting. This function only chunks; it does not shuffle the pieces or add special tokens, because re-tokenizing after a character-level split can produce a different number of tokens than the original sequence, so the pieces may still need to be truncated or padded. If the sub-context is designated as an AR (auto-regressive) sequence rather than FIM, it is stored as [[], [], [sequence]] for convenience in the truncate_helper function.

Parameters

sample (np.array) –
tokenizer (Tokenizer) –
fim_rate (float) –
spm_rate (float) –

Returns

A list of three token lists holding the prefix, middle, and suffix tokens, or two empty lists followed by the whole sequence in the case of an auto-regressive (AR) sequence. Also returns a string naming the format of the sequence (SPM, PSM, or AR).

Return type

List[List[int]], str
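
The behaviour described above can be illustrated with a minimal sketch. It is not the library's implementation: chunk_sketch, CharTokenizer, and the rng argument are hypothetical names introduced here, and the sketch assumes that fim_rate is the probability of applying FIM to a sample and that spm_rate is the probability of labelling a FIM sample SPM rather than PSM. The real Tokenizer type and the exact sampling scheme may differ.

```python
import random
from typing import List, Sequence, Tuple


class CharTokenizer:
    """Hypothetical toy tokenizer: one token per character (its code point)."""

    def encode(self, text: str) -> List[int]:
        return [ord(ch) for ch in text]

    def decode(self, tokens: Sequence[int]) -> str:
        return "".join(chr(t) for t in tokens)


def chunk_sketch(
    sample: Sequence[int],
    tokenizer: CharTokenizer,
    fim_rate: float,
    spm_rate: float,
    rng: random.Random = None,
) -> Tuple[List[List[int]], str]:
    """Illustrative stand-in for chunk(); the sampling details are assumptions."""
    rng = rng or random.Random()

    # Assumption: with probability (1 - fim_rate) the sample stays auto-regressive.
    if rng.random() >= fim_rate:
        return [[], [], list(sample)], "AR"

    # Detokenize so that split points can fall between any two characters.
    text = tokenizer.decode(sample)

    # Two character boundaries; either may sit at 0 or len(text), and they may
    # coincide, so any of prefix/middle/suffix can be empty.
    lo, hi = sorted(rng.randint(0, len(text)) for _ in range(2))
    parts = [text[:lo], text[lo:hi], text[hi:]]

    # Assumption: spm_rate is the probability of the SPM format, else PSM.
    fim_format = "SPM" if rng.random() < spm_rate else "PSM"

    # Re-tokenize each piece; no shuffling and no special tokens at this stage.
    return [tokenizer.encode(p) for p in parts], fim_format


if __name__ == "__main__":
    tok = CharTokenizer()
    tokens = tok.encode("def add(a, b): return a + b")
    pieces, fmt = chunk_sketch(tokens, tok, fim_rate=0.9, spm_rate=0.5)
    print(fmt, [len(p) for p in pieces])
```

With this toy character tokenizer the three pieces always concatenate back to the original tokens. With a subword tokenizer they generally would not, which is why truncation and padding are left to a later step.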
