# Summarize scalars and tensors in PyTorch
It is often useful to track values of interest during training. This section introduces APIs for summarizing scalar and non-scalar tensors in a PyTorch model and inspecting them during or after a run.
## Scalar Summaries

### Motivation
It is often useful to visualize scalar values, such as the learning rate or gradient norms, during training. For this, we provide the `scalar_summary` API, which allows you to summarize scalar model tensors. These summaries are written to TensorBoard events files and can be visualized using TensorBoard.
### How to Use Scalar Summaries
The scalar summary API can be imported as follows:
```python
from modelzoo.common.pytorch.summaries import scalar_summary
```
To summarize a scalar tensor `S`, add the following statement to the model definition code:

```python
scalar_summary("my_scalar_tensor", S)
```
During training, the value of `S` will be periodically written to the TensorBoard events file and can be visualized in TensorBoard.
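For instance, a hypothetical model might record the mean of its logits as a scalar. The module and summary name below are illustrative only, not part of Model Zoo:

```python
import torch.nn as nn

from modelzoo.common.pytorch.summaries import scalar_summary

# Hypothetical module, for illustration only.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 4)

    def forward(self, x):
        logits = self.fc(x)
        # Recorded as a scalar named "logits_mean" in TensorBoard.
        scalar_summary("logits_mean", logits.mean())
        return logits
```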
**Note:** Some common scalar tensors that are generally desirable to track during training are already implemented in Cerebras Model Zoo. They are disabled by default; to enable them, set `log_summaries: True` in the `runconfig` section of the params file passed to a Cerebras Model Zoo run.
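For reference, a minimal sketch of the relevant portion of such a params file (all other fields and sections omitted):

```yaml
runconfig:
  log_summaries: True
```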
## Tensor Summaries

### Motivation
The section above describes how to summarize scalar values for visualization in TensorBoard. However, there are cases where it is desirable to summarize tensors of arbitrary shape. Since TensorBoard only supports scalar summaries, we provide a separate API, which is very similar to the `scalar_summary` API but summarizes tensors of arbitrary shape.
### How to Use Tensor Summaries
The tensor summary API can be imported as follows:
```python
from modelzoo.common.pytorch.summaries import tensor_summary
```
To summarize a tensor `T`, add the following statement to the model definition code:

```python
tensor_summary("my_custom_name", T)
```
Under the hood, we mark the provided tensor as an output of the graph and fetch its value at every log step (similar to losses and other scalar summaries). This value is then written to a file and can later be retrieved through the `TensorSummaryReader` API (see below).
Here’s a simple example where we’d like to summarize the input features and last layer’s logits of a fully connected network:
```python
import torch.nn as nn

from modelzoo.common.pytorch.summaries import tensor_summary

class FC(nn.Module):
    def __init__(self, in_features=16, out_features=4):  # sizes illustrative
        super().__init__()
        self.fc_layer = nn.Linear(in_features, out_features)

    def forward(self, features):
        tensor_summary("features", features)  # summarize the input features
        logits = self.fc_layer(features)
        tensor_summary("last_layer_logits", logits)  # summarize final logits
        return logits
```
To retrieve the saved values of these tensors during or after a run, use the `TensorSummaryReader` API, which supports listing all available tensor names and fetching a tensor by name at a given step. A `TensorSummaryReader` object takes a single argument: the path to a TensorBoard events file or a directory containing TensorBoard events files. The location of the tensor summaries is inferred from these events files, since there is a one-to-one mapping between TensorBoard events files and tensor summary directories.
In the example above, we added summaries for `features` and `last_layer_logits`. We can then use the `TensorSummaryReader` API to load the summarized values of these tensors at a given step:
```python
>>> from modelzoo.common.pytorch.summaries import TensorSummaryReader
>>> reader = TensorSummaryReader("model_dir/train")
>>> reader.names()  # Grab all tensor summary names
['features', 'last_layer_logits']
>>>
>>> reader.load("features", 2)  # Load tensor "features" from step 2
TensorDescriptor(step=2, tensor=tensor([[2, 4],
        [6, 8]]), utctime='2023-02-07T05:45:29.017264')
>>>
>>> reader.load("non_existing", 100)  # Load a non-existing tensor
WARNING:root:No tensor with name non_existing has been summarized at step 100
```
`TensorSummaryReader.load()` returns one or more `TensorDescriptor` objects. `TensorDescriptor` is a POD structure which holds:

- `step`: The step at which this tensor was summarized.
- `utctime`: The UTC time at which the value was saved.
- `tensor`: The summarized value.
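As a minimal sketch of reading these fields from a loaded descriptor, assuming (per the warning in the session above) that `load()` returns nothing when no matching summary exists:

```python
from modelzoo.common.pytorch.summaries import TensorSummaryReader

reader = TensorSummaryReader("model_dir/train")
descriptor = reader.load("features", 2)
if descriptor:  # assumption: falsy when no summary exists at that step
    print(descriptor.step)     # 2
    print(descriptor.utctime)  # '2023-02-07T05:45:29.017264'
    print(descriptor.tensor)   # the summarized tensor value
```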
## Limitations
Currently, only tensors smaller than 2GB can be summarized. This limit applies to each tensor individually: summarizing a single 4GB tensor is not supported, but summarizing two or more tensors that are each under 2GB is.
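If a tensor you want to inspect exceeds this limit, one possible workaround is to split it and summarize the pieces under separate names. The helper below is a hypothetical sketch, not part of the Model Zoo API:

```python
import torch

from modelzoo.common.pytorch.summaries import tensor_summary

# Hypothetical helper: summarize an oversized tensor as several chunks,
# each of which must individually stay under the 2GB limit.
def tensor_summary_chunked(name, tensor, num_chunks):
    for i, chunk in enumerate(torch.chunk(tensor, num_chunks, dim=0)):
        tensor_summary(f"{name}_chunk{i}", chunk)
```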
Adding tensor summaries may change how the graph is lowered and can create a different compile. This is because marking a tensor as an output may prevent it from being pruned out in certain operation fusions. From an overall computation standpoint, however, the graphs should be identical. The only difference is how the computation is represented.