cerebras.modelzoo.data.nlp.bert.BertTokenClassifierDataProcessor.create_ner_features

cerebras.modelzoo.data.nlp.bert.BertTokenClassifierDataProcessor.create_ner_features(tokens_list, labels_list, label_map, max_sequence_length, input_pad_id, attn_mask_pad_id, labels_pad_id, include_padding_in_loss, tokenize)[source]

Creates the features dict for the token classifier model.

Parameters
  • tokens_list (list) – Tokens to process

  • labels_list (list) – Labels to process

  • label_map (dict) – Dictionary mapping label strings to integer ids.

  • max_sequence_length (int) – Maximum sequence length.

  • input_pad_id (int) – Input sequence padding id.

  • attn_mask_pad_id (int) – Attention mask padding id.

  • labels_pad_id (int) – Labels padding id.

  • include_padding_in_loss (bool) – If True, the loss mask is all ones (padding positions contribute to the loss); otherwise the loss mask equals the attention mask.

  • tokenize (callable) – Method to tokenize the input sequence.

Returns

A features dict with the following keys, each a NumPy array of shape (max_sequence_length,) and dtype int32:

  • 'input_ids': input token indices.

  • 'attention_mask': attention mask.

  • 'loss_mask': equal to the attention mask if include_padding_in_loss is False, else all ones.

  • 'token_type_ids': segment ids.

  • 'labels': label ids.
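To illustrate the shape and padding behavior described above, here is a minimal, simplified sketch of how such a features dict could be built. This is not the Model Zoo implementation: the function name, the per-token `tokenize` contract (returning a list of integer ids), and the toy vocabulary are all assumptions made for the example.

```python
import numpy as np

def create_ner_features_sketch(
    tokens_list,
    labels_list,
    label_map,
    max_sequence_length,
    input_pad_id=0,
    attn_mask_pad_id=0,
    labels_pad_id=0,
    include_padding_in_loss=False,
    tokenize=None,
):
    """Simplified sketch of NER feature creation (not the library code)."""
    input_ids, labels = [], []
    for token, label in zip(tokens_list, labels_list):
        # Assumption: `tokenize` maps one word to a list of integer ids
        # (a real tokenizer may split a word into several sub-tokens).
        ids = tokenize(token)
        input_ids.extend(ids)
        # Label only the first sub-token; pad labels for the rest.
        labels.extend([label_map[label]] + [labels_pad_id] * (len(ids) - 1))

    # Truncate to the maximum sequence length.
    input_ids = input_ids[:max_sequence_length]
    labels = labels[:max_sequence_length]

    num_real = len(input_ids)
    num_pad = max_sequence_length - num_real

    attention_mask = [1] * num_real + [attn_mask_pad_id] * num_pad
    # Loss mask is all ones when padding is included in the loss,
    # otherwise it matches the attention mask.
    loss_mask = (
        [1] * max_sequence_length
        if include_padding_in_loss
        else list(attention_mask)
    )
    input_ids = input_ids + [input_pad_id] * num_pad
    labels = labels + [labels_pad_id] * num_pad

    return {
        "input_ids": np.array(input_ids, dtype=np.int32),
        "attention_mask": np.array(attention_mask, dtype=np.int32),
        "loss_mask": np.array(loss_mask, dtype=np.int32),
        # Single-sequence input, so all segment ids are zero.
        "token_type_ids": np.zeros(max_sequence_length, dtype=np.int32),
        "labels": np.array(labels, dtype=np.int32),
    }

# Usage with a toy vocabulary and label map (both hypothetical):
vocab = {"[UNK]": 1, "John": 2, "lives": 3}
features = create_ner_features_sketch(
    ["John", "lives"],
    ["B-PER", "O"],
    {"O": 0, "B-PER": 1},
    max_sequence_length=8,
    tokenize=lambda t: [vocab.get(t, 1)],
)
```

Every array in the returned dict has length max_sequence_length, so batches can be stacked directly without further padding.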