Measure throughput of your model#
It is often desirable to measure the throughput of a model. To provide this information out of the box, Cerebras Model Zoo runs print a couple of throughput metrics to the console and also write them to events files that are viewable in TensorBoard. This section describes what these metrics are, how they are calculated, and the intricacies to be aware of when interpreting them.
Out-of-the-box throughput metrics from Cerebras Model Zoo#
There are two throughput metrics that are printed to the console by default
when running a model using Cerebras Model Zoo: GlobalRate and Rate.
It is important to note that these metrics are measured by the user node. While they are useful for getting an overall picture of throughput, they are not exact measurements of the throughput seen by the Cerebras Wafer-Scale Cluster. This is due to the asynchronous nature of execution on the Cerebras Wafer-Scale Cluster, where input workers stream data to the wafer quasi-independently of the user node that receives the outputs.
GlobalRate measures the average throughput of the entire training run. It
does so by dividing the total number of samples for which outputs have been
received by the total time since the executable was fully loaded onto the
Wafer-Scale Engine. GlobalRate is also logged to events files and is viewable
in TensorBoard.
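The calculation described above can be sketched as follows. This is an illustrative example only, not the actual Cerebras implementation; the class name and the injectable timestamps are assumptions made for the sake of a deterministic example.

```python
class GlobalRateTracker:
    """Illustrative sketch of a GlobalRate-style metric: total samples for
    which outputs have been received, divided by the time elapsed since the
    executable finished loading."""

    def __init__(self, start_time: float):
        # start_time: moment (in seconds) the executable was fully loaded
        self.start_time = start_time
        self.total_samples = 0

    def update(self, num_samples: int, now: float) -> float:
        # Record newly received outputs and return the run-wide average
        # throughput in samples per second.
        self.total_samples += num_samples
        return self.total_samples / (now - self.start_time)
```

For instance, under these assumptions, receiving outputs for 32 samples one second after load and another 32 samples a second later yields a steady 32 samples/s.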
Rate measures a smoothed-out version of GlobalRate: at each sampling interval
(i.e., logging step), a smoothing factor (default of 0.4) is applied to the
previously calculated Rate, and the result is combined with the local
throughput measured since the last sampling point. Rate is also logged to
events files and is viewable in TensorBoard. While Rate is more susceptible
to spikes than GlobalRate, it is more representative of the current
throughput measured by the user node.
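One plausible reading of the smoothing described above is a standard exponential moving average. The sketch below assumes the complementary weight (1 − 0.4) is applied to the local throughput; the function name and the handling of the first logging step are likewise assumptions for illustration.

```python
def smoothed_rate(prev_rate, local_rate, smoothing=0.4):
    """One update of a Rate-style smoothed throughput (illustrative only).

    prev_rate:  the previously calculated Rate, or None at the first step
    local_rate: samples/sec measured since the last sampling point
    smoothing:  weight applied to the previous Rate (default 0.4)
    """
    if prev_rate is None:
        # First logging step: no history to smooth against.
        return local_rate
    return smoothing * prev_rate + (1 - smoothing) * local_rate
```

Under this reading, a previous Rate of 100 samples/s combined with a local throughput of 50 samples/s gives 0.4 × 100 + 0.6 × 50 = 70 samples/s.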
Ephemeral throughput spikes after checkpointing#
During a checkpointing step, the wafer stops processing samples until the
checkpoint has been taken. Once checkpointing on the Wafer-Scale Cluster is
complete, the input workers immediately resume streaming samples. However,
while the WSE is processing samples after the checkpoint step, the user node
may still be downloading the checkpoint from the Wafer-Scale Cluster, which
can take some time, especially for large checkpoints. As such, the WSE may
have computed outputs for a number of samples that the user node will only
fetch once it has fully downloaded the checkpoint. As a result, once
checkpointing on the user node is complete and it starts fetching outputs
from the Wafer-Scale Cluster, a large number of outputs may be readily
available, which can produce a large spike in the local throughput (i.e.,
Rate). Once the user node catches up to the wafer, throughput stabilizes and
returns to normal.
This effect is less pronounced in
GlobalRate when doing long training
runs since it’s amortized over the entire training duration.
Throughput in Weight Streaming execution#
In Weight Streaming execution, outputs (such as losses, summaries, etc.) are
received as soon as their values have been computed by the wafer (except after
a checkpointing step, as described above). As such, there is a close
one-to-one correspondence between the throughput achieved by the wafer and
what the user node sees.
Having said that, the first few logging steps may present outlier throughputs due to the difference between when the clock is started on the user node and when the wafer actually starts processing data. This effect is short-lived, and steady-state throughput is reached quickly thereafter.
Throughput in Pipeline execution#
Pipeline execution is a bit more complicated than Weight Streaming execution. Before understanding how to interpret throughput in Pipeline execution, it’s worth understanding how outputs are received in this execution mode.
Pipeline execution is better suited for smaller models, whose throughput is generally higher than that of the larger models appropriate for Weight Streaming execution. As such, it is possible that the user node may not be able to keep up with the wafer if it were to fetch outputs one batch at a time. This is especially true if the batch size is small. Therefore, to amortize the connection overhead of receiving outputs and optimize performance for such cases, the user node fetches outputs from the Wafer-Scale Cluster in blocks of samples, where a block can contain many batches. This is done automatically by the Cerebras software and is not adjustable by the user.
For example, suppose a user has chosen batch size 32 and would like to log outputs at every step. Further assume that the Cerebras software has chosen an optimal block size of 3200 samples (i.e., 100 batches). In this case, when the user node attempts to fetch outputs for the first 32 samples for logging, the Cerebras software will actually wait until at least 3200 samples have been processed and their outputs are available before returning the first 32 samples. When the user node then attempts to fetch the next 32 samples, since the outputs have already been received and are readily available, they are returned immediately.
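The fetch pattern described above can be sketched with a small simulation. This is purely illustrative; the function name and the fixed per-block wait are assumptions, and real block sizes are chosen internally by the Cerebras software.

```python
def step_latencies(num_steps, batches_per_block, block_wait_s):
    """Illustrative: time the user node spends waiting to fetch outputs for
    each logging step when outputs arrive in blocks of batches_per_block
    batches. The batch at a block boundary waits for the whole block to be
    computed; batches within a block return almost immediately."""
    latencies = []
    for step in range(num_steps):
        if step % batches_per_block == 0:
            latencies.append(block_wait_s)  # boundary: wait for a full block
        else:
            latencies.append(0.0)  # outputs already received, return at once
    return latencies
```

With a block of 2 batches and a 1-second block wait, five steps see waits of [1.0, 0.0, 1.0, 0.0, 1.0]: the per-step Rate dips at every block boundary and spikes in between, matching the fluctuation described below.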
This behavior causes the local throughput (i.e., Rate) seen by the user node
to fluctuate quite drastically: it will be low for batches that fall at the
boundary of a block and much higher for batches that fall within a block.
Similar to the spikes during checkpointing, this effect is less pronounced in
GlobalRate when doing long training runs since it’s amortized over the
entire training duration.