# Sparsifying models#

## Overview#

Sparse models require fewer FLOPs per step and less memory to store, making them compelling architectures for deep learning research and application. Sparse training algorithms set portions of model weights to zero and often adjust these proportions throughout training. The Cerebras PyTorch API enables practitioners to easily implement and extend sparsity algorithms and schedules, and directly train with sparse weights on the Cerebras system. This process, which involves setting a proportion of the model’s weights to zero, not only streamlines the model by focusing on essential features but also aids in regularization, potentially improving generalization to new data. Various strategies exist to implement sparsity, each suitable for different models and tasks, making sparsity a versatile tool in optimizing machine learning models for performance and efficiency.

The Cerebras PyTorch API enables practitioners to train with sparse weights, leading to a significant reduction in FLOPs per step.

To learn more about training models with weight sparsity using the Cerebras Model Zoo, click here.

## How to sparsify your model#

Sparsifying a model is straightforward with the cstorch API. Here’s an example where 30% of the values in every parameter (such as weights, biases, embeddings, among others) are set to zero prior to training:

```
import cerebras.pytorch as cstorch
# Initialize your backend
backend = cstorch.backend(...)
# Define your model within the backend's device context
with backend.device:
model: torch.nn.Module = ...
# Compile the model with the backend
compiled_model = cstorch.compile(model, backend)
# Initialize a sparsity algorithm with 30% sparsity
sparsity = cstorch.sparse.Static(sparsity=0.3)
# Apply the sparsity algorithm to the model
model.apply(sparsity)
```

After the call to `model.apply(sparsity)`

, your model parameters are
sparsified, enhancing training efficiency.

**Important considerations**

**1.** Only once the model has been compiled, apply sparsity with `cstorch.compile`

, ensuring all parameters are on the Cerebras device.

**2.** To exclude certain parameters from sparsity, set `param.requires_dense = True`

. If a parameter does not have this attribute, the algorithm assumes that it is `False`

.

### Sparsifying optimizers#

For training, simply sparsifying the model’s parameters is insufficient; the optimizer must be sparsified as well. To extend sparsity to your optimizer:

```
optimizer.apply(sparsity)
```

When you sparsify an optimizer, you’re not only adjusting the optimizer’s state
but also setting up mechanisms such as installing various hooks to ensure that
sparsity patterns are maintained and updated appropriately during training,
specifically when `optimizer.step()`

is executed.

Executing `optimizer.apply(sparsity)`

transforms your optimizer into a sparse
optimizer.

**Important considerations**

**1.** The sparsity algorithm targets all optimizer states associated with a parameter, assuming the state tensor matches the parameter’s shape. To exempt certain state tensors from sparsification, designate them as requiring to be dense:

```
optmizer.state[p][name].requires_dense = True
```

If a state tensor does not have this attribute, the algorithm assumes that it is `False`

.

**2.** Sparsity algorithms typically include a hook that updates the sparsity pattern after each `optimizer.step()`

call. This automatic update feature can be deactivated if necessary:

```
sparsity.autoupdate = False
```

## Sparsity algorithms#

The cstorch API offers several out-of-the-box sparsity algorithms, including:

## Composing Sparsity Algorithms#

You can apply distinct sparsity strategies to various parameter groups within your model. For instance, one group of weights might be statically reduced to 30% of its original values, while another group undergoes dynamic sparsity adjustments using the SET algorithm. This can be achieved by composing different sparsity algorithms into a composable strategy with the
`Group`

class.

```
class Model(torch.nn.Module):
def __init__(self):
super().__init__()
self.fc1 = torch.nn.Linear(25, 20)
self.fc2 = torch.nn.Linear(20, 10)
...
sparsity = cstorch.sparse.Group({
"fc1.*": cstorch.sparse.Static(sparsity=0.3)
"fc2.*": cstorch.sparse.SET(
sparsity=0.9, update={"freq": 1000}, drop_fraction=0.3
)
})
model.apply(sparsity)
```

This grouped sparsity algorithm applies static sparsity to all model parameters
corresponding to the `fc1.*`

glob pattern, while employing the SET sparsity
algorithm for parameters that match the `fc2.*`

glob pattern.

## Writing custom sparsity algorithms#

All sparsity algorithms must inherit from the base
`SparsityAlgorithm`

.

```
class CustomSparsity(cstorch.sparse.SparsityAlgorithm):
def __init__(self, ..., **kwargs):
...
super().__init__(**kwargs)
...
def update(self):
...
for sparse_param in self.sparse_params.values():
p = sparse_param.data
mask = sparse_param.mask
...
sparse_params.mask = new_mask
...
```

The only abstract method that must be overriden is
`update`

which takes care of updating the sparsity patterns for all sparse parameters.

For algorithms that dynamically change the sparsity pattern, there is a convenient `DynamicSparsityAlgorithm`

class that you can
inherit from that takes care of many of the implementation details required to facilitate dynamic sparsity.

```
class CustomDynamicSparsity(cstorch.sparse.DynamicSparsityAlgorithm):
def __init__(self, ..., **kwargs):
...
super().__init__(**kwargs)
...
def update_mask(p, mask, sparsity) -> torch.Tensor:
...
```

`DynamicSparsityAlgorithm`

already implements
`update`

, but it exposes a
new abstract method
`update_mask`

that
must be overriden instead.
`update_mask`

takes
in the existing sparsity pattern in the form of a mask tensor and must
return the new sparsity pattern in the form of a mask tensor as well.

See `GMP`

,
`SET`

, and
`RigL`

for examples of how to implement
`update_mask`

.

In addition, there are many building blocks that are provided that can be used directly, inherited from, or composed to help build new `DynamicSparsityAlgorithm`

subclasses. See
Customizing Sparsity & Reference for more details.

Once you’ve written your custom sparsity algorithm, as long as it’s available in the global scope, you can use it directly or even through a call to
`configure`

by setting the `algorithm`

to be the name of your custom sparsity algorithm class. By extension, this means that
you can use it in ModelZoo in a similar way by setting the `algorithm`

to be the name of your custom sparsity algorithm class in your params YAML file (see Sparsity via YAML for more details).

## Implementation notes#

The Cerebras Wafer-Scale Cluster natively implements sparse computations in the Compressed Sparse Row (CSR) format. For user convenience, sparse models are represented as a combination of dense tensors and masks at the PyTorch level, with the compiler seamlessly converting between these representations.

While PyTorch provides tools for representing sparse tensors and utilities for pruning
networks,
these features might not fully align with the needs of the Cerebras Wafer Scale
Engine (WSE). Sparse tensors in PyTorch require specialized kernels and may not
be entirely compatible with existing models and utilities. Notably, a
`torch.nn.Parameter`

cannot directly accommodate a `torch.sparse.Tensor`

without specific adjustments. The `torch.prune`

utilities are convenient, but
the asynchronous and precompiled nature of computation on the WSE requires a
custom solution.

Similar to how `torch.prune`

handles its mask tensors, when the sparsity algorithm
is applied to the model, every parameter that is sparsified has a mask tensor
registered as a stateful buffer next to it in the module that owns the parameter.

For example, take the following simple model:

```
model = torch.nn.Linear(2, 2, bias=False)
```

Initially, the model’s state dictionary appears as follows:

```
{
"weight": torch.Tensor(...)
}
```

After applying sparsity, the state dictionary is augmented to include mask tensors, illustrating the model’s transition to a sparsified state:

```
{
"weight": torch.Tensor(...)
"weight_mask": torch.Tensor(...)
}
```

Here, the `weight`

and `weight_mask`

tensors collectively represent the sparsified `weight`

, showing how sparsity is represented within the model’s architecture.

## Conclusion#

Sparsifying your models and optimizers in the Cerebras ecosystem can lead to substantial performance gains and efficiency improvements. By following this guide, you can integrate sparsity into your training loop, tailoring the sparsity patterns to meet your model’s specific needs.