cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.create_features_llava_phase1

cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.create_features_llava_phase1(doc_obj, max_sequence_length, num_patches=None, pad_id=0, eos_id=0, bos_id=None, eos_after_prompt=False, sep_id=None, inverted_mask=False, handle_default_bos_token=False, input_ids_dtype='int32', input_mask_dtype='int32', labels_dtype='int32')

Given a list of VSL sequences, generate input features and labels.

Parameters
  • doc_obj (list(sequence)) – List of VSL sequences.

  • max_sequence_length (int) – Maximum sequence length for data writes.

  • num_patches (int) – Number of patches. Defaults to None.

  • pad_id (int) – Id for pad token. Defaults to 0.

  • eos_id (int) – Id for end of sequence token. Defaults to 0.

  • bos_id (int) – Id for beginning of sequence token. Defaults to None.

  • sep_id (int) – Id for separator token. Defaults to None.

  • inverted_mask (bool) – Invert mask if specified for runtime execution. Defaults to False.

  • input_ids_dtype (str) – Dtype as string for input ids. Defaults to int32.

  • input_mask_dtype (str) – Dtype as string for input mask. Defaults to int32.

  • labels_dtype (str) – Dtype as string for labels. Defaults to int32.

Returns

Tuple containing the features and labels.
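
To make the roles of pad_id, inverted_mask, and the three dtype strings more concrete, here is a minimal sketch of padding a token sequence to max_sequence_length and building a matching mask. It is not the Model Zoo implementation: the helper name pad_and_mask_sketch, the dictionary layout of the features, the next-token label shift, and the convention that 1 marks a real token before inversion are all assumptions made for illustration, and it ignores num_patches, bos_id, eos_id, and sep_id entirely.

```python
import numpy as np


def pad_and_mask_sketch(
    token_ids,
    max_sequence_length,
    pad_id=0,
    inverted_mask=False,
    input_ids_dtype="int32",
    input_mask_dtype="int32",
    labels_dtype="int32",
):
    """Hypothetical helper for illustration only (not the library function)."""
    # Assumed next-token setup: inputs drop the last token, labels drop the first.
    input_ids = list(token_ids[:-1])[:max_sequence_length]
    labels = list(token_ids[1:])[:max_sequence_length]

    # Pad both sequences on the right up to max_sequence_length.
    num_pad = max_sequence_length - len(input_ids)
    mask = [1] * len(input_ids) + [0] * num_pad  # assumed: 1 = real token
    input_ids = input_ids + [pad_id] * num_pad
    labels = labels + [pad_id] * num_pad

    mask = np.array(mask, dtype=input_mask_dtype)
    if inverted_mask:
        # Flip the convention so that 1 marks padding instead.
        mask = np.equal(mask, 0).astype(input_mask_dtype)

    features = {
        "input_ids": np.array(input_ids, dtype=input_ids_dtype),
        "input_mask": mask,
    }
    return features, np.array(labels, dtype=labels_dtype)


features, labels = pad_and_mask_sketch([5, 6, 7, 8], max_sequence_length=5)
print(features["input_ids"].tolist())   # [5, 6, 7, 0, 0]
print(features["input_mask"].tolist())  # [1, 1, 1, 0, 0]
print(labels.tolist())                  # [6, 7, 8, 0, 0]
```

Passing inverted_mask=True to the sketch flips the mask to [0, 0, 0, 1, 1]. The dtype arguments are plain strings handed to NumPy, which is why values such as 'int32' can be used directly.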