cerebras.modelzoo.data_preparation.utils

Functions

convert_str_to_int_list

Converts a string of comma-separated values (e.g. from parsing CSV) into a list of ints.
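The exact signature is not shown here, but a minimal sketch of such a conversion, assuming a comma-separated input string with optional surrounding brackets, could look like:

```python
def convert_str_to_int_list(s):
    # Strip optional surrounding brackets (e.g. "[1, 2, 3]"), then
    # split on commas and convert each token to int.
    s = s.strip().strip("[]")
    return [int(tok) for tok in s.split(",") if tok.strip()]
```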

convert_to_unicode

Converts text to unicode, assuming UTF-8 input. Returns text encoded in a way suitable for print or tf.compat.v1.logging.
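A sketch of the usual bytes/str handling such a helper performs (not necessarily the library's exact implementation):

```python
def convert_to_unicode(text):
    # Decode UTF-8 bytes to str; pass str through unchanged.
    if isinstance(text, bytes):
        return text.decode("utf-8", "ignore")
    if isinstance(text, str):
        return text
    raise ValueError(f"Unsupported type: {type(text)}")
```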

count_total_documents

Counts the total number of documents in metadata_files.

:param str or list[str] metadata_files: Path or list of paths to metadata files.
:returns: Number of documents whose paths are contained in the metadata files.

create_masked_lm_predictions

Creates the predictions for the masked LM objective.

:param list tokens: List of tokens to process.
:param list vocab_words: List of all words present in the vocabulary.
:param bool mask_whole_word: If true, mask all the subtokens of a word.
:param int max_predictions_per_seq: Maximum number of masked LM predictions per sequence.
:param float masked_lm_prob: Masked LM probability.
:param rng: random.Random object with shuffle function.
:param Optional[list] exclude_from_masking: List of tokens to exclude from masking.
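The full function also handles whole-word masking and exclusion lists; the following is a simplified sketch of the core BERT-style selection logic (80% mask / 10% keep / 10% random replace), with names and special tokens assumed rather than taken from the library:

```python
import random
from collections import namedtuple

MaskedLmInstance = namedtuple("MaskedLmInstance", ["index", "label"])

def create_masked_lm_predictions_sketch(
    tokens, vocab_words, max_predictions_per_seq, masked_lm_prob, rng
):
    # Candidate positions: everything except special tokens.
    cand_indexes = [
        i for i, tok in enumerate(tokens) if tok not in ("[CLS]", "[SEP]")
    ]
    rng.shuffle(cand_indexes)
    num_to_predict = min(
        max_predictions_per_seq,
        max(1, int(round(len(tokens) * masked_lm_prob))),
    )
    output_tokens = list(tokens)
    masked = []
    for index in cand_indexes[:num_to_predict]:
        # BERT-style 80/10/10: mask, keep, or replace with a random word.
        p = rng.random()
        if p < 0.8:
            output_tokens[index] = "[MASK]"
        elif p < 0.9:
            pass  # keep the original token
        else:
            output_tokens[index] = vocab_words[
                rng.randint(0, len(vocab_words) - 1)
            ]
        masked.append(MaskedLmInstance(index=index, label=tokens[index]))
    masked.sort(key=lambda x: x.index)
    return output_tokens, masked
```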

get_files_in_metadata

Reads the file paths listed in the metadata files provided as input to the data generation scripts.

get_label_id_map

Loads the label-id mapping between output labels and ids.

:param str label_vocab_file: Path to the label vocab file.

get_output_type_shapes

get_vocab

Generates a vocabulary from the provided vocab_file_path.

pad_input_sequence

pad_instance_to_max_seq_length

split_list

Splits a list or string into chunks of size n.
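Fixed-size chunking can be sketched with slicing, which works for both lists and strings (a sketch, not necessarily the library's exact return type):

```python
def split_list(data, n):
    # Slice successive chunks of size n; the last chunk may be shorter.
    return [data[i : i + n] for i in range(0, len(data), n)]
```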

text_to_tokenized_documents

Converts the input data into tokens.

:param str data: Data read from a text file.
:param tokenizer: Tokenizer object with functions to convert words to tokens.
:param bool multiple_docs_in_single_file: Indicates whether there are multiple documents in the given data string.
:param str multiple_docs_separator: String used to separate documents if there are multiple documents in data. The separator can be anything, e.g. a blank line or a special string such as "-----"; there can only be one separator string for all the documents.
:param bool single_sentence_per_line: Indicates whether the data contains one sentence per line.
:param spacy_nlp: spaCy nlp module loaded with spacy.load(); used to segment a string into sentences.
:return List[List[List]] documents: Tokens corresponding to sentences in documents, as a list of lists of lists, e.g. [[[],[]], [[],[],[]]]; documents[i][j] is the list of tokens in document i, sentence j.
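The real function also supports spaCy-based sentence segmentation; the following simplified sketch covers only separator-based document splitting with one sentence per line, and takes any `str -> list[str]` callable as the tokenizer (names are assumptions, not the library's API):

```python
def text_to_tokenized_documents_sketch(
    data, tokenize, multiple_docs_in_single_file, multiple_docs_separator
):
    # Split the raw text into documents, then treat each non-empty line
    # as one sentence, and tokenize each sentence.
    if multiple_docs_in_single_file:
        raw_docs = data.split(multiple_docs_separator)
    else:
        raw_docs = [data]
    documents = []
    for doc in raw_docs:
        sentences = [line for line in doc.splitlines() if line.strip()]
        tokens = [tokenize(sent) for sent in sentences]
        if tokens:
            documents.append(tokens)
    return documents
```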

whitespace_tokenize

Splits a piece of text on whitespace characters.
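Whitespace tokenization is typically a strip-then-split, as in this sketch:

```python
def whitespace_tokenize(text):
    # Strip, then split on any run of whitespace; empty input yields [].
    text = text.strip()
    if not text:
        return []
    return text.split()
```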

Classes

maskedLmInstance

maskedLmInstance(index, label)