modelzoo.transformers.data_processing.scripts.chunk_preprocessing.fim_data_token_generator.FIMTokenGenerator#
- class modelzoo.transformers.data_processing.scripts.chunk_preprocessing.fim_data_token_generator.FIMTokenGenerator[source]#
-
Initialize the FIMTokenGenerator class with params from the config file.
Methods
- encode – Tokenize and encode the data for auto-regressive language modeling.
- encode_leftover_prefix – Processes the leftover prefix, a list of ndarray tokens, into chunks based on max sequence length.
- get_token_id – Get the token ID for the given token.
- process_chunks – Processes chunks of tokenized text and returns processed features along with the total padding added.
- tokenize_text – Tokenize the provided text.
- tokenize_text_auto_lm – Tokenize the text and create features for auto-regressive language modeling.
- __init__(params, tokenizer, eos_id, pad_id)[source]#
Initialize the FIMTokenGenerator class.
- Parameters
params (args) – Params from config file
- encode(data: str) List[numpy.ndarray] [source]#
Tokenize and encode the data for auto-regressive language modeling.
- Parameters
data (str) – Text to tokenize
- Returns
Encoded features for auto-regressive language modeling.
- Return type
List[np.ndarray]
- encode_leftover_prefix(prefix: List[numpy.ndarray]) Tuple[List[numpy.ndarray], int] #
Processes the leftover prefix, which is a list of ndarray tokens, into chunks based on the max sequence length.
The last chunk is handled specially if it is shorter than the max sequence length. If the last chunk has fewer than two tokens, it is discarded.
- Parameters
prefix (List[np.ndarray]) – The prefix list of token arrays to process.
- Returns
A tuple containing the processed token chunks as a list of ndarrays and the number of padding tokens added.
- Return type
Tuple[List[np.ndarray], int]
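The chunking behavior described above can be sketched as follows. This is a hypothetical illustration, not the modelzoo implementation: the max sequence length, pad ID, and function name are assumptions chosen for the example.

```python
import numpy as np

# Illustrative stand-ins for values the real generator reads from its config.
MSL = 8     # assumed max sequence length
PAD_ID = 0  # assumed pad token ID

def chunk_leftover_prefix(prefix, msl=MSL, pad_id=PAD_ID):
    """Split a leftover prefix (list of token ndarrays) into MSL-sized chunks.

    A short final chunk is padded up to `msl`; if it holds fewer than two
    tokens it is discarded, mirroring the documented behavior.
    """
    tokens = np.concatenate(prefix)
    chunks, pad_count = [], 0
    for start in range(0, len(tokens), msl):
        chunk = tokens[start:start + msl]
        if len(chunk) < msl:
            if len(chunk) < 2:
                break  # too short to form an input/label pair; drop it
            pad_count = msl - len(chunk)
            chunk = np.pad(chunk, (0, pad_count), constant_values=pad_id)
        chunks.append(chunk)
    return chunks, pad_count

# Ten tokens with MSL=8 yield one full chunk plus one padded 2-token chunk.
chunks, pads = chunk_leftover_prefix([np.arange(1, 11)])
```

The fewer-than-two-tokens cutoff makes sense for auto-regressive training: a single token cannot supply both an input and a next-token label.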
- get_token_id(token: str) int #
Get the token ID for the given token.
- Parameters
token (str) – Token for which the ID is needed.
- Returns
Token ID.
- Return type
int
- process_chunks(tokenized_text_chunks: List[List[int]]) Tuple[List[Any], int] #
Processes chunks of tokenized text and returns processed features along with the total padding added.
- Parameters
tokenized_text_chunks (List[List[int]]) – A list of tokenized text chunks, where each chunk is represented as a list of integers.
- Returns
A tuple containing a list of processed results and the total number of padding tokens added.
- Return type
Tuple[List[Any], int]
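A minimal sketch of the chunk-processing contract: each chunk is padded to a fixed max sequence length and the padding is accumulated into a single total. The `MSL` and `PAD_ID` values are illustrative assumptions, not the library's configuration.

```python
# Assumed config values for the sketch; the real generator takes these
# from its params and tokenizer (pad_id is passed to __init__).
MSL = 6
PAD_ID = 0

def process_chunks(tokenized_text_chunks):
    """Pad each tokenized chunk to MSL and return (results, total padding)."""
    results, total_pad = [], 0
    for chunk in tokenized_text_chunks:
        pad = MSL - len(chunk)
        total_pad += pad
        results.append(chunk + [PAD_ID] * pad)  # right-pad with PAD_ID
    return results, total_pad

features, total_pad = process_chunks([[1, 2, 3], [4, 5, 6, 7, 8]])
```

Tracking the padding total alongside the features is useful downstream for reporting how many loss-masked positions a dataset contains.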
- tokenize_text(text: str) List[int] #
Tokenize the provided text.
- Parameters
text (str) – Text to tokenize.
- Returns
List of token IDs.
- Return type
List[int]
- tokenize_text_auto_lm(text: str) List[numpy.ndarray] #
Tokenize the text and create features for auto-regressive language modeling.
- Parameters
text (str) – Text to tokenize.
- Returns
Features created for auto-regressive language modeling.
- Return type
List[np.ndarray]
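The standard way to turn one token sequence into auto-regressive features is to pair each position with the next token as its label. The sketch below uses a toy whitespace tokenizer and vocabulary as stand-ins for the real tokenizer; only the shift-by-one pattern is the point.

```python
import numpy as np

# Toy vocabulary and tokenizer — assumptions for the sketch, not the
# tokenizer the FIMTokenGenerator is constructed with.
VOCAB = {"hello": 1, "world": 2, "<eos>": 3}

def tokenize_text_auto_lm(text):
    """Tokenize text and build (input_ids, labels) for next-token prediction."""
    token_ids = [VOCAB[w] for w in text.split()] + [VOCAB["<eos>"]]
    ids = np.array(token_ids)
    # Labels are the inputs shifted left by one position.
    return [ids[:-1], ids[1:]]

input_ids, labels = tokenize_text_auto_lm("hello world")
```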