cerebras.modelzoo.data_preparation.utils

Functions

convert_str_to_int_list

Converts a string of comma-separated values (e.g. from parsing CSV) into a list of ints.
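The exact signature is not shown here, but a minimal sketch of such a conversion, assuming a comma-separated input string with optional surrounding brackets, could look like:

```python
def convert_str_to_int_list(s):
    # Strip optional surrounding brackets (e.g. "[1, 2, 3]"), then
    # split on commas and convert each token to int.
    s = s.strip().strip("[]")
    return [int(tok) for tok in s.split(",") if tok.strip()]
```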

convert_to_unicode

Converts text to unicode, assuming UTF-8 input. Returns text encoded in a way suitable for print or tf.compat.v1.logging.
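A sketch of the usual bytes/str handling such a helper performs (not necessarily the library's exact implementation):

```python
def convert_to_unicode(text):
    # Decode UTF-8 bytes to str; pass str through unchanged.
    if isinstance(text, bytes):
        return text.decode("utf-8", "ignore")
    if isinstance(text, str):
        return text
    raise ValueError(f"Unsupported type: {type(text)}")
```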

count_total_documents

Counts the total number of documents in metadata_files.

:param str or list[str] metadata_files: Path or list of paths to metadata files.
:returns: Number of documents whose paths are contained in the metadata files.

create_masked_lm_predictions

Creates the predictions for the masked LM objective.

:param list tokens: List of tokens to process.
:param list vocab_words: List of all words present in the vocabulary.
:param bool mask_whole_word: If true, mask all the subtokens of a word.
:param int max_predictions_per_seq: Maximum number of masked LM predictions per sequence.
:param float masked_lm_prob: Masked LM probability.
:param rng: random.Random object with shuffle function.
:param Optional[list] exclude_from_masking: List of tokens to exclude from masking.
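The full function also handles whole-word masking and exclusion lists; the following is a simplified sketch of the core BERT-style selection logic (80% mask / 10% keep / 10% random replace), with names and special tokens assumed rather than taken from the library:

```python
import random
from collections import namedtuple

MaskedLmInstance = namedtuple("MaskedLmInstance", ["index", "label"])

def create_masked_lm_predictions_sketch(
    tokens, vocab_words, max_predictions_per_seq, masked_lm_prob, rng
):
    # Candidate positions: everything except special tokens.
    cand_indexes = [
        i for i, tok in enumerate(tokens) if tok not in ("[CLS]", "[SEP]")
    ]
    rng.shuffle(cand_indexes)
    num_to_predict = min(
        max_predictions_per_seq,
        max(1, int(round(len(tokens) * masked_lm_prob))),
    )
    output_tokens = list(tokens)
    masked = []
    for index in cand_indexes[:num_to_predict]:
        # BERT-style 80/10/10: mask, keep, or replace with a random word.
        p = rng.random()
        if p < 0.8:
            output_tokens[index] = "[MASK]"
        elif p < 0.9:
            pass  # keep the original token
        else:
            output_tokens[index] = vocab_words[
                rng.randint(0, len(vocab_words) - 1)
            ]
        masked.append(MaskedLmInstance(index=index, label=tokens[index]))
    masked.sort(key=lambda x: x.index)
    return output_tokens, masked
```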

get_files_in_metadata

Reads the file paths listed in the metadata files provided as input to the data generation scripts.

get_label_id_map

Loads the label-id mapping between output labels and ids.

:param str label_vocab_file: Path to the label vocab file.

get_output_type_shapes

get_vocab

Generates a vocabulary from the provided vocab_file_path.

pad_input_sequence

pad_instance_to_max_seq_length

split_list

Splits a list or string into chunks of size n.
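Fixed-size chunking can be sketched with slicing, which works for both lists and strings (a sketch, not necessarily the library's exact return type):

```python
def split_list(data, n):
    # Slice successive chunks of size n; the last chunk may be shorter.
    return [data[i : i + n] for i in range(0, len(data), n)]
```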

text_to_tokenized_documents

Converts the input data into tokens.

:param str data: Data read from a text file.
:param tokenizer: Tokenizer object with functions to convert words to tokens.
:param bool multiple_docs_in_single_file: Indicates whether there are multiple documents in the given data string.
:param str multiple_docs_separator: String used to separate documents if there are multiple documents in data. The separator can be anything, e.g. a blank line or a special string such as "-----"; there can only be one separator string for all the documents.
:param bool single_sentence_per_line: Indicates whether the data contains one sentence per line.
:param spacy_nlp: spaCy nlp module loaded with spacy.load(); used to segment a string into sentences.
:return List[List[List]] documents: Tokens corresponding to sentences in documents, as a list of lists of lists, e.g. [[[],[]], [[],[],[]]]; documents[i][j] is the list of tokens in document i, sentence j.
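The real function also supports spaCy-based sentence segmentation; the following simplified sketch covers only separator-based document splitting with one sentence per line, and takes any `str -> list[str]` callable as the tokenizer (names are assumptions, not the library's API):

```python
def text_to_tokenized_documents_sketch(
    data, tokenize, multiple_docs_in_single_file, multiple_docs_separator
):
    # Split the raw text into documents, then treat each non-empty line
    # as one sentence, and tokenize each sentence.
    if multiple_docs_in_single_file:
        raw_docs = data.split(multiple_docs_separator)
    else:
        raw_docs = [data]
    documents = []
    for doc in raw_docs:
        sentences = [line for line in doc.splitlines() if line.strip()]
        tokens = [tokenize(sent) for sent in sentences]
        if tokens:
            documents.append(tokens)
    return documents
```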

whitespace_tokenize

Splits a piece of text on whitespace characters.
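Whitespace tokenization is typically a strip-then-split, as in this sketch:

```python
def whitespace_tokenize(text):
    # Strip, then split on any run of whitespace; empty input yields [].
    text = text.strip()
    if not text:
        return []
    return text.split()
```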

Classes

maskedLmInstance

maskedLmInstance(index, label)