common.tf.metrics package#

Submodules#

common.tf.metrics.accuracy module#

Classification accuracy script, a simple relative of the cross-entropy loss.

common.tf.metrics.accuracy.calculate_accuracy(logits, labels, seq_lens=None)#

Calculate accuracy, a simple relative of the cross-entropy loss, by counting how many argmax predictions of the logits are correct (top-1 accuracy).

Parameters
  • logits (Tensor) – has the size [batch_size, output_vocab_size] or [batch_size, max_seq_len, output_vocab_size]

  • labels (Tensor) – has the size [batch_size] or [batch_size, max_seq_len]

  • seq_lens (Tensor) – Defaults to None. If not None, represents the lengths of the sequences represented by logits and has size [batch_size]. If logits has a sequence dimension and this Tensor is None, all sequences are assumed to have the maximum sequence length.

Returns

integer top-1 accuracy
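For reference, a minimal sketch of an equivalent top-1 accuracy computation with plain TensorFlow ops (assuming graph-mode TF; an illustration, not the library implementation):

    import tensorflow as tf

    def top1_accuracy_sketch(logits, labels, seq_lens=None):
        # Argmax over the vocabulary dimension gives the top-1 prediction.
        predictions = tf.argmax(logits, axis=-1, output_type=labels.dtype)
        correct = tf.cast(tf.equal(predictions, labels), tf.float32)
        if seq_lens is not None:
            # Mask out padded positions beyond each sequence's length.
            mask = tf.sequence_mask(
                seq_lens, maxlen=tf.shape(labels)[1], dtype=tf.float32)
            return tf.reduce_sum(correct * mask) / tf.reduce_sum(mask)
        return tf.reduce_mean(correct)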

common.tf.metrics.bits_per_x module#

common.tf.metrics.bits_per_x.bits_per_x_metric(total_loss_per_batch, num_tokens, bits_per_type='per_byte', dataset='pile', metrics_collections=None, updates_collections=None, name=None)#

Custom TF evaluation metric for calculating bits per x over the validation set, where x is one of byte, character, or word.

Pass to the Estimator through eval_metric_ops; TF will accumulate loss/token over the entire validation set and use that value to calculate bits per byte, character, or word.
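For example, inside an Estimator model_fn the metric can be wired into the evaluation spec roughly as follows (a sketch; summed_loss and num_target_tokens stand in for tensors the model_fn would compute):

    import tensorflow as tf
    from common.tf.metrics.bits_per_x import bits_per_x_metric

    # summed_loss and num_target_tokens are hypothetical tensors produced by the model_fn.
    eval_metric_ops = {
        "eval/bits_per_byte": bits_per_x_metric(
            summed_loss,
            num_target_tokens,
            bits_per_type="per_byte",
            dataset="pile",
        ),
    }
    spec = tf.estimator.EstimatorSpec(
        mode=tf.estimator.ModeKeys.EVAL,
        loss=summed_loss,
        eval_metric_ops=eval_metric_ops,
    )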

common.tf.metrics.bits_per_x.calculate_bits_per_x(total_loss_per_step, total_target_tokens_per_step, bits_per_type='per_byte', dataset='pile')#
Calculates the bits per type per target token, where type is one of byte, character, or word.

Parameters
  • total_loss_per_step – The total loss summed over a train step or perhaps a longer period.

  • total_target_tokens_per_step (int) – The total number of target tokens seen in that period.

  • bits_per_type (str) – The type of bits_per metric to compute. This is one of per_byte, per_character, or per_word. Defaults to per_byte.

  • dataset (str) – The dataset to compute metric for. This is needed for the bits_per_byte metric. Defaults to pile.

Returns

The bits_per metric computed based on the specified type and dataset

common.tf.metrics.dice_coefficient module#

Dice coefficient metric to be used with TF Estimator.

common.tf.metrics.dice_coefficient.dice_coefficient_metric(labels, predictions, num_classes, weights=None, metrics_collections=None, updates_collections=None, name=None)#

Calculate per-step Dice Coefficient. Dice Coefficient is a common evaluation metric for semantic image segmentation. Dice Coefficient is defined as follows:

Dice = 2 * true_positive / (2 * true_positive + false_positive + false_negative).

The predictions are accumulated in a confusion matrix, weighted by weights, and the Dice coefficient is then calculated from it. For estimation of the metric over a stream of data, the function creates an update_op operation that updates these variables and returns the dice_coefficient. If weights is None, weights default to 1. Use weights of 0 to mask values.

Returns

A Tensor representing the dice coefficient. update_op: An operation that increments the confusion matrix.

Return type

dice_coefficient

Raises
  • ValueError – If predictions and labels have mismatched shapes, or if weights is not None and its shape doesn't match predictions, or if either metrics_collections or updates_collections are not a list or tuple.

  • RuntimeError – If eager execution is enabled.

Parameters
  • labels (Tensor) – A Tensor of ground truth labels with shape [batch size] and of type int32 or int64. The tensor will be flattened if its rank > 1.

  • predictions (Tensor) – A Tensor of prediction results for semantic labels, whose shape is [batch size] and type int32 or int64. The tensor will be flattened if its rank > 1.

  • num_classes (int) – The possible number of labels the prediction task can have. This value must be provided, since a confusion matrix of dimension = [num_classes, num_classes] will be allocated.

  • weights (Tensor) – Optional Tensor whose rank is either 0, or the same rank as labels, and must be broadcastable to labels (i.e., all dimensions must be either 1, or the same as the corresponding labels dimension).

  • metrics_collections (List) – An optional list of collections that dice_coefficient should be added to.

  • updates_collections (List) – An optional list of collections update_op should be added to.

  • name (string) – An optional variable_scope name.
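To make the definition concrete, the following NumPy sketch shows how per-class Dice can be derived from an accumulated confusion matrix (illustrative only; the counts and the per-class averaging are assumptions, not the streaming TF implementation):

    import numpy as np

    # Accumulated confusion matrix: rows = true class, columns = predicted class.
    cm = np.array([[50.0,  5.0],
                   [10.0, 35.0]])

    tp = np.diag(cm)                 # true positives per class
    fp = cm.sum(axis=0) - tp         # false positives per class
    fn = cm.sum(axis=1) - tp         # false negatives per class

    # Dice = 2 * TP / (2 * TP + FP + FN), per class.
    dice_per_class = 2 * tp / (2 * tp + fp + fn)
    dice = dice_per_class.mean()     # how classes are aggregated is an assumption here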

common.tf.metrics.ece_loss_metric module#

common.tf.metrics.ece_loss_metric.ece_loss_metric(total_loss_per_batch, num_tokens, metrics_collections=None, updates_collections=None, name=None)#
Custom TF evaluation metric for calculating the Exponential Cross Entropy (ECE) loss metric over the validation set.

Usage: Pass to Estimator through eval_metric_ops; TF will accumulate loss/token over the entire validation set and use that value to calculate ECE. Based on the TensorFlow mean metric code (TF 2.1.0): tensorflow/python/ops/metrics_impl.py#L315-L393.

common.tf.metrics.f1_score module#

common.tf.metrics.f1_score.f1_score_metric(labels, predictions, weights=None, metrics_collections=None, updates_collections=None, name=None)#

Calculate the f1 score for binary classification.

The f1 score is defined as follows:

f1 = 2 * (precision * recall) / (precision + recall)

Uses fbeta metric implementation underneath.

The predictions are accumulated in a confusion matrix, weighted by weights, and f1 score is then calculated from it. For estimation of the metric over a stream of data, the function creates an update_op operation that updates these variables and returns the f1. If weights is None, weights default to 1. Use weights of 0 to mask values.

Returns

A scalar representing f1. update_op: An operation that increments the confusion matrix.

Return type

f1

Raises
  • ValueError – If predictions and labels have mismatched shapes, or if weights is not None and its shape doesn't match predictions, or if either metrics_collections or updates_collections are not a list or tuple.

  • RuntimeError – If eager execution is enabled.

  • InvalidArgumentError – If labels and predictions are not binary, i.e. values not in [0, 1].

Parameters
  • labels (Tensor) – A Tensor of binary ground truth labels with shape [batch size] and of type int32 or int64. The tensor will be flattened if its rank > 1.

  • predictions (Tensor) – A Tensor of binary prediction results for semantic labels, whose shape is [batch size] and type int32 or int64. The tensor will be flattened if its rank > 1.

  • weights (Tensor) – Optional Tensor whose rank is either 0, or the same rank as labels, and must be broadcastable to labels (i.e., all dimensions must be either 1, or the same as the corresponding labels dimension).

  • metrics_collections (List) – An optional list of collections that f1_metric should be added to.

  • updates_collections (List) – An optional list of collections update_op should be added to.

  • name (string) – An optional variable_scope name.
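A quick numeric check of the relationship between this definition and the fbeta formula used underneath (arbitrary precision/recall values, for illustration only):

    # f1 is the special case of fbeta with beta = 1.
    precision, recall = 0.75, 0.60
    f1 = 2 * (precision * recall) / (precision + recall)            # ~0.6667
    beta = 1.0
    fbeta = (1 + beta**2) * (precision * recall) / (beta**2 * precision + recall)
    assert abs(f1 - fbeta) < 1e-12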

common.tf.metrics.fbeta_score module#

common.tf.metrics.fbeta_score.fbeta_score_metric(labels, predictions, num_classes, ignore_labels=None, beta=1, weights=None, average_type='micro', metrics_collections=None, updates_collections=None, name=None)#

Calculate fbeta for a multi-class scenario.

fbeta is defined as follows:

fbeta = (1 + beta**2) * (precision * recall) / ((beta**2 * precision) + recall)

The predictions are accumulated in a confusion matrix, weighted by weights, and fbeta is then calculated from it. For estimation of the metric over a stream of data, the function creates an update_op operation that updates these variables and returns the fbeta. If weights is None, weights default to 1. Use weights of 0 to mask values.

Returns

A scalar representing fbeta. update_op: An operation that increments the confusion matrix.

Return type

fbeta

Raises
  • ValueError – If predictions and labels have mismatched shapes, or if weights is not None and its shape doesn't match predictions, or if either metrics_collections or updates_collections are not a list or tuple.

  • ValueError – If average_type is not either “micro” or “macro”.

  • RuntimeError – If eager execution is enabled.

Parameters
  • labels (Tensor) – A Tensor of ground truth labels with shape [batch size] and of type int32 or int64. The tensor will be flattened if its rank > 1.

  • predictions (Tensor) – A Tensor of prediction results for semantic labels, whose shape is [batch size] and type int32 or int64. The tensor will be flattened if its rank > 1.

  • num_classes (int) – The possible number of labels the prediction task can have. This value must be provided, since a confusion matrix of dimension = [num_classes, num_classes] will be allocated.

  • ignore_labels (Tensor) – Optional Tensor which specifies the labels to be ignored when computing the metric.

  • beta (int) – Optional beta parameter for the F-beta score. Defaults to 1.

  • weights (Tensor) – Optional Tensor whose rank is either 0, or the same rank as labels, and must be broadcastable to labels (i.e., all dimensions must be either 1, or the same as the corresponding labels dimension).

  • average_type (str) – Optional string specifying the type of averaging performed on the data.

    “micro”: Calculate metrics globally by counting the total true positives, false negatives, and false positives.

    “macro”: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

  • metrics_collections (List) – An optional list of collections that fbeta_metric should be added to.

  • updates_collections (List) – An optional list of collections update_op should be added to.

  • name (string) – An optional variable_scope name.

Example

y_true = [0,0,0,0,0,0,0,1,1,1,1,2,2]
y_pred = [0,0,1,1,1,2,2,0,0,1,2,0,2]

confusion_matrix(y_true, y_pred) -> row id = true label, column id = predicted label

array([[2, 3, 2],
       [2, 1, 1],
       [1, 0, 1]])

True_Positives = [2, 1, 1]
Predicted_Pos = [5, 4, 4]
Actual_Pos = [7, 4, 2]

            precision   recall   f1-score
0           0.40        0.29     0.33
1           0.25        0.25     0.25
2           0.25        0.50     0.33
macro avg   0.30        0.35     0.31

accuracy = 4/13 = 0.307

A. average_type = “micro”, ignore_labels = None
   precision = (2+1+1)/(5+4+4) = accuracy
   recall = (2+1+1)/(7+4+2)

B. average_type = “macro”, ignore_labels = None
   per_class_precision = [(2/5), (1/4), (1/4)]
   precision_macro = mean(per_class_precision)
   per_class_recall = [(2/7), (1/4), (1/2)]
   recall = mean(per_class_recall)
   fb = mean([fb_0, fb_1, fb_2])

C. average_type = “micro”, ignore_labels = [1]
   precision = (2+1)/(5+4)
   recall = (2+1)/(7+2)

D. average_type = “macro”, ignore_labels = [1]
   precision = mean([(2/5), (1/4)])
   recall = mean([(2/7), (1/2)])
   fb = mean([fb_0, fb_2])
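The worked example above can be reproduced with a few lines of NumPy (a sketch for cases A and B with beta = 1; variable names are illustrative and not part of the library):

    import numpy as np

    cm = np.array([[2.0, 3.0, 2.0],
                   [2.0, 1.0, 1.0],
                   [1.0, 0.0, 1.0]])        # rows = true label, columns = predicted label

    tp = np.diag(cm)                        # [2, 1, 1]
    predicted_pos = cm.sum(axis=0)          # [5, 4, 4]
    actual_pos = cm.sum(axis=1)             # [7, 4, 2]
    beta = 1.0

    # Case A: "micro" averaging counts totals globally.
    p_micro = tp.sum() / predicted_pos.sum()             # 4/13, equal to accuracy
    r_micro = tp.sum() / actual_pos.sum()                # 4/13
    fb_micro = (1 + beta**2) * p_micro * r_micro / (beta**2 * p_micro + r_micro)

    # Case B: "macro" averaging takes the unweighted mean of per-class scores.
    p_class = tp / predicted_pos                         # [0.40, 0.25, 0.25]
    r_class = tp / actual_pos                            # [0.29, 0.25, 0.50]
    fb_class = (1 + beta**2) * p_class * r_class / (beta**2 * p_class + r_class)
    fb_macro = fb_class.mean()                           # ~0.31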

common.tf.metrics.mcc module#

Matthews Correlation Coefficient metric to be used with TF Estimator.

common.tf.metrics.mcc.mcc_metric(labels, predictions, weights=None, metrics_collections=None, updates_collections=None, name=None)#

Custom TF evaluation metric for calculating MCC when performing binary classification.

Usage: Pass outputs to Estimator through eval_metric_ops.

The predictions are accumulated in a confusion matrix, weighted by weights, and MCC is then calculated from it.

If weights is None, weights default to 1. Use weights of 0 to mask values.

Returns

A Tensor representing the Matthews Correlation Coefficient. update_op: An operation that increments the confusion matrix.

Return type

mcc

Parameters
  • labels (Tensor) – A Tensor of binary labels with shape [batch size] and of type int32 or int64. The tensor will be flattened if its rank > 1.

  • predictions (Tensor) – A Tensor of binary predictions, whose shape is [batch size] and type int32 or int64. The tensor will be flattened if its rank > 1.

  • weights (Tensor) – Optional Tensor whose rank is either 0, or the same rank as labels, and must be broadcastable to labels (i.e., all dimensions must be either 1, or the same as the corresponding labels dimension).

  • metrics_collections (List) – An optional list of collections that mcc should be added to.

  • updates_collections (List) – An optional list of collections update_op should be added to.

  • name (string) – An optional variable_scope name.
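For reference, MCC follows from the accumulated binary confusion matrix via the standard formula (a NumPy sketch with made-up counts, not the streaming implementation):

    import numpy as np

    # Binary confusion matrix: rows = true label (0, 1), columns = predicted label (0, 1).
    cm = np.array([[40.0, 10.0],
                   [ 5.0, 45.0]])
    tn, fp = cm[0]
    fn, tp = cm[1]

    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0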

common.tf.metrics.perplexity module#

common.tf.metrics.perplexity.calculate_perplexity(total_loss_per_step, total_target_tokens_per_step)#

Calculates the perplexity per target token. The _safe_exp function is needed to avoid float overflow when calculating the perplexity in the very first training steps. The loss at the start can be many 1000s, and when exponentiated to give perplexity, it overflows. This caps the perplexity to be representable as a float. Any model that trains reasonably will not hit this limit after the first couple of train steps. Inspired by the tf/nmt perplexity calculation: https://github.com/tensorflow/nmt/blob/master/nmt/model_helper.py#L637

Parameters
  • total_loss_per_step – The total loss summed over a train step or perhaps a longer period.

  • total_target_tokens_per_step (int) – The total number of target tokens seen in that period.

Returns

Perplexity per token
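The capping described above amounts to something like the following sketch (the exact cap value used internally by _safe_exp is an assumption here):

    import math

    def _safe_exp_sketch(x):
        # Cap the exponent so exp() stays representable as a float64;
        # the precise threshold used by the library is assumed.
        return math.exp(min(x, 700.0))

    def calculate_perplexity_sketch(total_loss_per_step, total_target_tokens_per_step):
        loss_per_token = total_loss_per_step / total_target_tokens_per_step
        return _safe_exp_sketch(loss_per_token)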

common.tf.metrics.perplexity.perplexity_metric(total_loss_per_batch, num_tokens, metrics_collections=None, updates_collections=None, name=None)#
Custom TF evaluation metric for calculating perplexity over the validation set.

Usage: Pass to Estimator through eval_metric_ops; TF will accumulate loss/token over the entire validation set and use that value to calculate perplexity. Based on the TensorFlow mean metric code (TF 2.1.0): tensorflow/python/ops/metrics_impl.py#L315-L393

common.tf.metrics.rouge_score module#

common.tf.metrics.rouge_score.rouge_score_metric(hypothesis, references, max_n=1, metrics_collections=None, updates_collections=None, name=None)#

Custom TF evaluation metric for calculating rouge score when performing text summarization.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation).

  • ROUGE-n is the fraction of n-grams from the reference abstracts that are included in the generated summary.

\begin{equation}
\mathrm{ROUGE\text{-}n}(s) = \frac{\sum_{r \in R}\sum_{w} [w \in s][w \in r]}{\sum_{r \in R} \sum_{w} [w \in r]}
\end{equation}

  • $r in R$ – set of abstracts, written by humans.

  • $s$ – abstract, built by the system.

  • higher the better – for all metrics of ROUGE family.

  • $n$ – order of n-gram:
    • $n=1$ – unigrams, $n=2$ – bigrams, etc.

    • with increase of $n$, you achieve more accurate results.

    • with $n =$ len_of_abstract, we require a full match of the predicted text and the one written by humans.

Usage: Pass to Estimator through eval_metric_ops, TF will accumulate loss over the entire validation set and use that value to calculate rouge score.

The num_matched_ngrams, num_reference_ngrams, num_hypothesis_ngrams are accumulated in a rouge matrix, and rouge score (f1, precision, recall) is then calculated from it.

For estimation of the metric over a stream of data, the function creates an update_op operation that updates these variables and returns the f1, precision and recall.

Parameters
  • hypothesis (Tensor) – A Tensor of predicted summarization tokens with shape [batch size, max_sequence_length] and of type tf.string.

  • references (Tensor) – A Tensor of tokens from target summarization, whose shape is [batch size, max_sequence_length] and type tf.string.

  • max_n (int) – Optional maximum size of n-grams to consider. Default is 1.

  • metrics_collections (List) – An optional list of collections that rouge_score should be added to.

  • updates_collections (List) – An optional list of collections update_op should be added to.

  • name (string) – An optional variable_scope name.

Returns

rouge_score: A dict with Tensors representing f1, precision, and recall for the rouge score. update_op: An operation that increments the rouge matrix.

Return type

tuple
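To make the equation concrete, a plain-Python sketch of ROUGE-n recall for one hypothesis against a set of reference abstracts, following the indicator form of the formula above (illustrative only; the metric itself operates on batched tf.string tensors):

    def ngrams(tokens, n):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def rouge_n_recall(hypothesis_tokens, reference_token_lists, n=1):
        # Fraction of distinct reference n-grams that also occur in the hypothesis,
        # summed over all references r in R.
        hyp = ngrams(hypothesis_tokens, n)
        matched = total = 0
        for ref_tokens in reference_token_lists:
            ref = ngrams(ref_tokens, n)
            matched += len(ref & hyp)
            total += len(ref)
        return matched / total if total else 0.0

    rouge_n_recall("the cat sat on the mat".split(),
                   ["the cat lay on the mat".split()], n=1)   # -> 4/5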

common.tf.metrics.rouge_score.streaming_rouge_matrix(hypothesis, references, max_n=1)#

Calculate a streaming rouge matrix. The num_matched_ngrams, num_reference_ngrams, and num_hypothesis_ngrams are accumulated in a rouge matrix. For estimation over a stream of data, the function creates an update_op operation.

Parameters
  • hypothesis (Tensor) – A Tensor of predicted summarization tokens with shape [batch size, max_sequence_length] and of type tf.string.

  • references (Tensor) – A Tensor of tokens from target summarization, whose shape is [batch size, max_sequence_length] and type tf.string.

  • max_n (int) – Optional maximum size of n-grams to consider. Default is 1.

Returns

total_rm: A Tensor representing the rouge matrix of shape (3, ), which stores num_matched_ngrams, num_reference_ngrams, num_hypothesis_ngrams. update_op: An operation that increments the rouge matrix.

common.tf.metrics.utils module#

common.tf.metrics.utils.aggregate_across_replicas(metrics_collections, metric_value_fn, *args)#

Aggregate metric value across replicas.

common.tf.metrics.utils.metric_variable(shape, dtype, validate_shape=True, name=None)#

Create variable in GraphKeys.(LOCAL|METRIC_VARIABLES) collections.

If running in a DistributionStrategy context, the variable will be “sync on read”. This means:

  • The returned object will be a container with separate variables per replica of the model.

  • When writing to the variable, e.g. using assign_add in a metric update, the update will be applied to the variable local to the replica.

  • To get a metric’s result value, we need to sum the variable values across the replicas before computing the final answer. Furthermore, the final answer should be computed once instead of in every replica. Both of these are accomplished by running the computation of the final result value inside distribution_strategy_context.get_replica_context().merge_call(fn). Inside the merge_call(), ops are only added to the graph once, and access to a sync on read variable in a computation returns the sum across all replicas.

Returns

A (non-trainable) variable initialized to zero, or if inside a DistributionStrategy scope a sync on read variable container.

Parameters
  • shape (int) – Shape of the created variable.

  • dtype (int) – Type of the created variable.

  • validate_shape (bool) – (Optional) Whether shape validation is enabled for the created variable.

  • name (string) – (Optional) String name of the created variable.
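As an illustration of the pattern, a custom streaming accumulator might use metric_variable roughly like this (a sketch assuming tf.compat.v1-style graph ops; streaming_token_count is a hypothetical metric, not part of this module):

    import tensorflow as tf
    from common.tf.metrics.utils import metric_variable

    def streaming_token_count(num_tokens):
        # Scalar accumulator in GraphKeys.(LOCAL|METRIC_VARIABLES); under a
        # DistributionStrategy it becomes a sync-on-read variable, as described above.
        total = metric_variable([], tf.float32, name="total_tokens")
        update_op = tf.compat.v1.assign_add(
            total, tf.reduce_sum(tf.cast(num_tokens, tf.float32)))
        return total, update_op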

common.tf.metrics.utils.streaming_confusion_matrix(labels, predictions, num_classes, weights=None)#

Calculate a streaming confusion matrix.

Calculates a confusion matrix. For estimation over a stream of data, the function creates an update_op operation.

Returns

A Tensor representing the confusion matrix. update_op: An operation that increments the confusion matrix.

Return type

total_cm

Parameters
  • labels (Tensor) – A Tensor of ground truth labels with shape [batch size] and of type int32 or int64. The tensor will be flattened if its rank > 1.

  • predictions (Tensor) – A Tensor of prediction results for semantic labels, whose shape is [batch size] and type int32 or int64. The tensor will be flattened if its rank > 1.

  • num_classes (int) – The possible number of labels the prediction task can have. This value must be provided, since a confusion matrix of dimension = [num_classes, num_classes] will be allocated.

  • weights (Tensor) – Optional Tensor whose rank is either 0, or the same rank as labels, and must be broadcastable to labels (i.e., all dimensions must be either 1, or the same as the corresponding labels dimension).
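A plausible shape of such an implementation, for orientation (a sketch built on tf.math.confusion_matrix and metric_variable; details such as scalar-weight broadcasting are simplified and the actual code may differ):

    import tensorflow as tf
    from common.tf.metrics.utils import metric_variable

    def streaming_confusion_matrix_sketch(labels, predictions, num_classes, weights=None):
        # Accumulated [num_classes, num_classes] matrix: rows = true label, columns = prediction.
        total_cm = metric_variable(
            [num_classes, num_classes], tf.float64, name="total_confusion_matrix")

        # Flatten inputs of rank > 1 before computing the per-batch confusion matrix.
        labels = tf.reshape(labels, [-1])
        predictions = tf.reshape(predictions, [-1])
        if weights is not None:
            # Scalar weights would need broadcasting first; omitted for brevity.
            weights = tf.cast(tf.reshape(weights, [-1]), tf.float64)

        batch_cm = tf.math.confusion_matrix(
            labels, predictions, num_classes=num_classes, weights=weights, dtype=tf.float64)
        update_op = tf.compat.v1.assign_add(total_cm, batch_cm)
        return total_cm, update_op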

Module contents#