cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.truncate_helper#

cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.truncate_helper(samples_lst, diff, sample_idx)[source]#

The goal of our truncation scheme is to avoid removing tokens from the middle section. We first remove tokens from the end of the suffix, and then from the beginning of the prefix. We store the chunks in lists in their original order so that we can easily perform this truncation. Since each sub-context can have a different number of tokens in its suffix/prefix, we track, per sub-context, the index of the section currently being truncated. If one section runs out of tokens to remove, we switch to the next; this way we can be removing from the prefix of one context while still removing from the suffix of another. If a sub-context is AR (auto-regressive) rather than FIM, the AR sequence is stored as [[], [], [sequence]] so that a remove_idx of 2 works simultaneously for AR and FIM sequences.

Parameters
  • samples_lst (List[List[int]]) – List of lists that contain token ids

  • diff (int) – Number of tokens to remove during truncation

  • sample_idx (int) – Index for the sample from the dataset, for use in logging if we remove from the middle.

Returns

List of lists of token ids that have been truncated

Return type

(List[List[int]])
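
The snippet below is a minimal, hypothetical sketch of the suffix-first, prefix-second truncation order described above; it is not the modelzoo implementation. It assumes a single sub-context laid out as [prefix, middle, suffix] (with an AR sequence stored as [[], [], [sequence]]), uses an illustrative name truncate_sketch, and treats diff as the number of tokens to remove.

from typing import List
import logging

def truncate_sketch(chunks: List[List[int]], diff: int, sample_idx: int) -> List[List[int]]:
    # One sub-context as [prefix, middle, suffix]; an AR sequence is [[], [], [sequence]].
    prefix, middle, suffix = (list(c) for c in chunks)

    # 1) Remove from the end of the suffix first.
    take = min(diff, len(suffix))
    if take:
        suffix = suffix[: len(suffix) - take]
        diff -= take

    # 2) Then remove from the beginning of the prefix.
    take = min(diff, len(prefix))
    if take:
        prefix = prefix[take:]
        diff -= take

    # 3) Only if both are exhausted do we cut into the middle, and we log it.
    if diff > 0:
        logging.warning("Truncating middle section of sample %d", sample_idx)
        middle = middle[: max(len(middle) - diff, 0)]

    return [prefix, middle, suffix]

# Example: remove 5 tokens from a FIM sub-context.
fim_chunks = [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]
print(truncate_sketch(fim_chunks, 5, sample_idx=0))
# [[3], [4, 5, 6, 7], []]  (3 tokens taken from the suffix end, 2 from the prefix start)

# Example: the AR layout, where the whole sequence lives in slot 2.
ar_chunks = [[], [], [1, 2, 3, 4, 5, 6]]
print(truncate_sketch(ar_chunks, 2, sample_idx=1))
# [[], [], [1, 2, 3, 4]]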