Kernel autogeneration with AutoGen#

Concept#

Sometimes the handwritten kernel library does not provide every operation needed to run a model. In other cases, a large compound operation (such as a loss) can be built from primitive operations in the kernel library, but with some overhead compared to a specialized fused implementation. In both situations it can be necessary, or simply beneficial, to let the compiler generate kernel implementations for the Cerebras hardware on the fly. These are known as AutoGen kernels and are handled by the Distributed Task Generator (DTG).

In 1.9.0, we enable the following AutoGen capabilities:

Support for PyTorch operations (e.g., nonlinearities)
Improving performance of PyTorch losses through fused kernels
Support for user-defined loss functions

By default, AutoGen kernels handle operations that have no existing implementation, as well as some operations that would otherwise be executed on the CPU. This is equivalent to the “medium” autogen_policy described below.

How to enable AutoGen#

You may enable AutoGen by modifying the runconfig portion of the parameters yaml file of the model. To learn more about the parameters yaml files, visit Cerebras Model Zoo YAML parameters.

runconfig:
  ...
  autogen_policy: "medium"
  ...

The autogen_policy flag can be one of:

disabled: no AutoGen ops
default, medium: when possible, try to autogenerate kernels on the wafer instead of executing them on the CPU host.
aggressive: autogenerate some kernels even if existing handwritten wafer kernels exist. This is primarily for debugging purposes.

Note

In most practical cases, choose the medium policy to use AutoGen.
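As a minimal sketch, the policy values listed above can be checked before they are written into the runconfig section. The helper name set_autogen_policy and the plain-dictionary representation of the parameters file are assumptions made for illustration; they are not part of the Cerebras tooling.

```python
# Policy names taken from the list above.
AUTOGEN_POLICIES = ("disabled", "default", "medium", "aggressive")


def set_autogen_policy(params, policy="medium"):
    """Hypothetical helper: validate a policy and store it under runconfig.

    `params` is assumed to be the parameters yaml file loaded as a dict.
    """
    if policy not in AUTOGEN_POLICIES:
        raise ValueError(
            f"autogen_policy must be one of {AUTOGEN_POLICIES}, got {policy!r}"
        )
    # Create the runconfig section if it is missing, then set the flag.
    params.setdefault("runconfig", {})["autogen_policy"] = policy
    return params


params = set_autogen_policy({"runconfig": {}}, "medium")
print(params)
```

Rejecting unknown values early keeps a typo in the policy name from reaching compilation.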

Usage examples#

Autogenerate kernels for non-loss operations#

You may wish to modify a base model within the Cerebras Model Zoo, for example to use GPT-3 with a LeakyRelu nonlinearity rather than Gelu. To do this, change the nonlinearity in the model section of the parameters yaml file and set autogen_policy: "medium" in the runconfig section:

model:
  nonlinearity: "leaky_relu"
  ...
runconfig:
  autogen_policy: "medium"
  ...
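The same change can be sketched programmatically, which may help when several variants are derived from one base configuration. This assumes the parameters yaml file has already been loaded into a nested dictionary; the helper with_overrides and the base value "gelu" are hypothetical and used only for illustration.

```python
import copy


def with_overrides(base_params, overrides):
    """Return a deep copy of base_params with per-section overrides applied."""
    merged = copy.deepcopy(base_params)  # leave the base configuration untouched
    for section, values in overrides.items():
        merged.setdefault(section, {}).update(values)
    return merged


# Hypothetical base parameters; a real file contains many more keys.
base = {"model": {"nonlinearity": "gelu"}, "runconfig": {}}

variant = with_overrides(
    base,
    {
        "model": {"nonlinearity": "leaky_relu"},
        "runconfig": {"autogen_policy": "medium"},
    },
)
print(variant)
```

Copying before merging keeps the base parameters reusable for other experiments.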

Autogenerate fused kernels for loss operations#

AutoGen can improve the performance of PyTorch losses by creating autogenerated fused graphs of losses that were previously covered by primitive kernels. AutoGen is an experimental feature and may result in unexpected compilation failures, even for the losses in the supported list.

To implement a PyTorch loss via AutoGen, import the PyTorch loss from our Model Zoo and set use_autogen=True. The default value of use_autogen is False.

from modelzoo.common.pytorch.layers import BCELoss

loss = BCELoss(reduction='mean', use_autogen=True)

List of supported losses:

BCELoss
CrossEntropyLoss
Loss.BCEWithLogitsLoss
Loss.GaussianNLLLoss
Loss.HingeEmbeddingLoss
Loss.HuberLoss
Loss.KLDivLoss
Loss.L1Loss
Loss.MarginRankingLoss
Loss.MSELoss
Loss.MultiLabelSoftMarginLoss
Loss.MultiMarginLoss
Loss.NLLLoss
Loss.PoissonNLLLoss
Loss.SmoothL1Loss
Loss.TripletMarginLoss
Loss.TripletMarginWithDistanceLoss

Unsupported losses:

Loss.CosineEmbeddingLoss (compiles to primitive kernels, so performance will be slower)

Note

If autogen_policy is enabled in the parameters yaml file but use_autogen is set to False in the loss, the loss will be implemented via some combination of primitive operations (e.g. handwritten, CPU, or AutoGen kernels) instead of attempting to generate a fused kernel. This may result in a performance penalty.

Autogenerate kernels for customized losses#

Creating custom losses may result in compilation failure due to a graph mismatch. If this occurs, enable AutoGen for the customized loss by adding the AutoGen wrapper @autogen_loss as a decorator for the loss class. Once the custom loss is defined, follow the steps in Autogenerate fused kernels for loss operations to enable generation of fused kernels.

import torch.nn as nn

from modelzoo.common.pytorch.layers.utils import autogen_loss

# Decorate the custom loss class so that AutoGen can handle it.
@autogen_loss
class CustomLoss(nn.Module):
    def __init__(self, ...):
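For readers unfamiliar with class decorators, the sketch below shows in plain Python how a decorator can add a keyword such as use_autogen to a class constructor. The decorator fake_autogen_loss is a hypothetical stand-in written for this illustration: it is not the Model Zoo's autogen_loss, performs no kernel generation, and the assumption that the real wrapper exposes use_autogen in this way is inferred from the steps above rather than confirmed.

```python
import functools


def fake_autogen_loss(cls):
    """Hypothetical stand-in: add a `use_autogen` keyword to cls.__init__."""
    original_init = cls.__init__

    @functools.wraps(original_init)
    def __init__(self, *args, use_autogen=False, **kwargs):
        # Run the user-defined constructor, then record the extra flag.
        original_init(self, *args, **kwargs)
        self.use_autogen = use_autogen

    cls.__init__ = __init__
    return cls


@fake_autogen_loss
class CustomLoss:
    """Toy scaled absolute-error loss on plain floats (no PyTorch needed)."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def __call__(self, prediction, target):
        return self.scale * abs(prediction - target)


loss = CustomLoss(scale=2.0, use_autogen=True)
print(loss.use_autogen, loss(3.0, 1.0))
```

The class body stays unchanged; only the constructor signature gains the extra keyword, which is why a decorator is a convenient way to opt a custom loss into this kind of behavior.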
