cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.handle_bos_token_default

cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.handle_bos_token_default(tokenizer)

When performing fill-in-the-middle (FIM) transformation, we tokenize each chunk again after splitting the sequence. If the tokenizer adds a BOS token by default, each re-tokenized chunk gets its own BOS token, which leaves extra BOS tokens in the middle of the sequence. This function sets the tokenizer's BOS default to False and returns a flag indicating whether the BOS token needs to be added back in the final FIM formatting function.
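The behavior described above can be illustrated with a small sketch. This is not the library's implementation: `ToyTokenizer` and `handle_bos_token_default_sketch` are hypothetical names, and the sketch assumes the tokenizer exposes its default as a boolean `add_bos_token` attribute, which may not hold for every tokenizer.

```python
class ToyTokenizer:
    """Hypothetical stand-in tokenizer; not part of any real library."""

    bos_token_id = 1

    def __init__(self, add_bos_token=True):
        self.add_bos_token = add_bos_token

    def encode(self, text):
        # Toy encoding: one id per character, with an optional leading BOS id.
        ids = [ord(ch) for ch in text]
        return [self.bos_token_id] + ids if self.add_bos_token else ids


def handle_bos_token_default_sketch(tokenizer):
    # Assumption: the BOS default lives in a boolean `add_bos_token` attribute.
    if getattr(tokenizer, "add_bos_token", False):
        tokenizer.add_bos_token = False  # stop adding BOS to every chunk
        return True  # caller must add BOS once during final formatting
    return False


tokenizer = ToyTokenizer(add_bos_token=True)

# Without the fix, every re-tokenized chunk starts with its own BOS id.
naive_ids = tokenizer.encode("ab") + tokenizer.encode("cd")

# With the fix, chunks carry no BOS and the flag tells us to add it once.
needs_bos = handle_bos_token_default_sketch(tokenizer)
chunk_ids = tokenizer.encode("ab") + tokenizer.encode("cd")
final_ids = ([tokenizer.bos_token_id] if needs_bos else []) + chunk_ids
```

In this sketch, `naive_ids` contains the BOS id twice (once per chunk), while `final_ids` contains it only at the start. Note that the function mutates the tokenizer, so a second call on the same tokenizer returns False.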