cerebras.modelzoo.data_preparation.utils.text_to_tokenized_documents

cerebras.modelzoo.data_preparation.utils.text_to_tokenized_documents(data, tokenizer, multiple_docs_in_single_file, multiple_docs_separator, single_sentence_per_line, spacy_nlp)

Convert the input data into tokens.

Parameters
  • data (str) – Contains data read from a text file

  • tokenizer – Tokenizer object which contains functions to convert words to tokens

  • multiple_docs_in_single_file (bool) – Indicates whether there are multiple documents in the given data string

  • multiple_docs_separator (str) – String used to separate documents when there are multiple documents in data. The separator can be any string, for example a blank line or a special marker such as “—–”; the same separator string must be used for all documents in data (see the sketch after this parameter list).

  • single_sentence_per_line (bool) – Indicates whether the data contains one sentence in each line

  • spacy_nlp – spaCy language object loaded with spacy.load(); used to segment a string into sentences
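
For illustration, a minimal sketch of what data could look like when multiple_docs_in_single_file=True and single_sentence_per_line=True. The separator "[SEP_DOC]" is an arbitrary, hypothetical choice, not something the API prescribes:

```python
# Hypothetical raw text: two documents, one sentence per line,
# separated by an arbitrary user-chosen marker string.
multiple_docs_separator = "[SEP_DOC]"
data = (
    "The first sentence of document one.\n"
    "The second sentence of document one.\n"
    "[SEP_DOC]\n"
    "The only sentence of document two.\n"
)
```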

Returns
  documents (List[List[List]]) – Contains the tokens corresponding to the sentences in each document, structured as a list of lists of lists, e.g. [[[],[]], [[],[],[]]]. documents[i][j] is the list of tokens in sentence j of document i.
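
A minimal usage sketch, with assumptions labeled: BertTokenizer from Hugging Face transformers stands in for whatever tokenizer object this function expects (anything exposing a BERT-style word-to-token interface), and corpus.txt is a hypothetical input file:

```python
import spacy
from transformers import BertTokenizer  # assumption: a tokenizer with a BERT-style interface

from cerebras.modelzoo.data_preparation.utils import text_to_tokenized_documents

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
spacy_nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

with open("corpus.txt") as f:  # hypothetical input file
    data = f.read()

documents = text_to_tokenized_documents(
    data,
    tokenizer,
    multiple_docs_in_single_file=True,
    multiple_docs_separator="[SEP_DOC]",  # must match the marker used inside corpus.txt
    single_sentence_per_line=False,       # let spaCy segment sentences instead
    spacy_nlp=spacy_nlp,
)

# documents[i][j] -> token list for sentence j of document i
print(len(documents), "documents; first sentence tokens:", documents[0][0])
```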