cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.split_text_and_tokenize#

cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.split_text_and_tokenize(text, tokenizer, max_tok_len=2000, remove_bos_in_chunks=True)[source]#

Splits the text into smaller sequences of at most max_tok_len and then tokenizes each of them. This avoids performance issues with tokenizers such as LlamaTokenizer, which are slow on long sequences.

Parameters
  • text (str) – text to be tokenized

  • tokenizer (Tokenizer) – tokenizer to be used

  • max_tok_len (int, optional) – max length of each sequence. Defaults to 2000.

  • remove_bos_in_chunks (bool, optional) – whether to ignore the BOS token id in chunks. Defaults to True.

Returns

list of token ids for the text

Return type

tok_ids (list)
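
Example

A minimal usage sketch. The function name, signature, and import path come from this page; the Hugging Face AutoTokenizer and the "gpt2" model name are illustrative assumptions, not part of this API:

    from transformers import AutoTokenizer

    from cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils import (
        split_text_and_tokenize,
    )

    # Assumption: any Hugging Face tokenizer can be passed; LlamaTokenizer is
    # the motivating (slow) case mentioned in the description above.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    long_text = "lorem ipsum " * 5000  # stand-in for a long document

    # The text is split into chunks of up to max_tok_len before tokenizing,
    # and with remove_bos_in_chunks=True the BOS token id inside chunks is
    # ignored, so the result reads as one continuous sequence of token ids.
    tok_ids = split_text_and_tokenize(
        long_text,
        tokenizer,
        max_tok_len=2000,
        remove_bos_in_chunks=True,
    )
    print(f"{len(tok_ids)} token ids produced")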