Kernel autogeneration with AutoGen#

Concept#

For neural networks to run on Cerebras hardware, they must first be lowered from a higher-level framework such as PyTorch or TensorFlow into a CIRH compute graph. CIRH is the internal graph representation used as the input to the Cerebras compilation stack. Subgraphs of the CIRH graph are matched to kernels, which are subgraphs with known implementations. Anything that is not matched becomes part of the “unmatched” or “leftover” graph, which is passed to the Distributed Task Generator (DTG).

AutoGen handles leftover nodes during compilation, removing the need for handwritten kernels. It can also be called explicitly to handle parts of the graph (e.g., losses) or to improve the performance of primitive operations, either by handling operations that would normally execute on the CPU host or by fusing primitive operations together (e.g., some nonlinearities such as LeakyReLU). In release 1.8.0 we enable the following AutoGen capabilities:

  • Support for PyTorch operations (e.g., nonlinearities)

  • Improved performance of PyTorch losses through fused kernels

  • Support for user-defined loss functions

Usage examples#

Changing a non-loss operation#

You may wish to modify a base model from the Cerebras Model Zoo (e.g., use GPT-3 with a LeakyReLU rather than a GELU nonlinearity).

In addition to setting the nonlinearity in the parameters YAML file:

model:
  nonlinearity: "leaky_relu"
  ...

You may enable AutoGen by modifying the runconfig portion of the YAML to ensure that this model compiles. In Weight Streaming appliance mode, autogen_policy can be one of:

  • default, disabled: no autogenerated kernels.

  • medium: where possible, autogenerate kernels to run on the wafer instead of executing the corresponding operations on the CPU host.

  • aggressive: autogenerate some kernels even where handwritten wafer kernels exist. This is intended primarily for debugging.

For most practical cases, choose the medium policy.

runconfig:
  ...
  autogen_policy: "medium"
  ...

Changing loss operations#

AutoGen can improve the performance of PyTorch losses by creating autogenerated fused graphs of losses that were previously covered by primitive kernels. AutoGen is an experimental feature and may result in unexpected compilation failures, even for the losses listed as supported below.

To enable AutoGen for an existing PyTorch loss, import the loss from the Cerebras Model Zoo and set use_autogen=True (the default is False).

from modelzoo.common.pytorch.layers import BCELoss

# use_autogen=True requests an autogenerated fused kernel for this loss.
loss = BCELoss(reduction='mean', use_autogen=True)
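
As a quick sanity check, here is a minimal sketch of calling the fused loss on tensors. The tensor shapes and the sigmoid step are illustrative assumptions; the Model Zoo BCELoss is assumed to follow the torch.nn.BCELoss call convention (probabilities and binary targets of matching shape).

import torch

from modelzoo.common.pytorch.layers import BCELoss

# Illustrative inputs: probabilities in (0, 1) and binary targets of the same shape.
predictions = torch.sigmoid(torch.randn(8, 1))
targets = torch.randint(0, 2, (8, 1)).float()

loss = BCELoss(reduction='mean', use_autogen=True)
loss_value = loss(predictions, targets)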

List of supported losses:

  • BCELoss

  • CrossEntropyLoss

  • BCEWithLogitsLoss

  • GaussianNLLLoss

  • HingeEmbeddingLoss

  • HuberLoss

  • KLDivLoss

  • L1Loss

  • MarginRankingLoss

  • MSELoss

  • MultiLabelSoftMarginLoss

  • MultiMarginLoss

  • NLLLoss

  • PoissonNLLLoss

  • SmoothL1Loss

  • TripletMarginLoss

  • TripletMarginWithDistanceLoss

Unsupported losses:

  • CosineEmbeddingLoss (will compile to primitive kernels, and performance will be slower)

Using a customized loss#

Creating custom losses may result in compilation failure due to a graph mismatch. If this occurs, enable AutoGen for the customized loss by adding the AutoGen wrapper @autogen_loss as a decorator for the loss class.

import torch.nn as nn

from modelzoo.common.pytorch.layers.utils import autogen_loss

@autogen_loss
class CustomLoss(nn.Module):
    def __init__(self, ...):
        ...
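
For illustration, a complete sketch of a decorated custom loss might look like the following. The SquaredErrorLoss class, its reduction argument, and the example tensors are hypothetical and stand in for your own loss logic; only the @autogen_loss decorator and its import path come from the snippet above.

import torch
import torch.nn as nn

from modelzoo.common.pytorch.layers.utils import autogen_loss

@autogen_loss
class SquaredErrorLoss(nn.Module):
    # Hypothetical custom loss used only for illustration.
    def __init__(self, reduction="mean"):
        super().__init__()
        self.reduction = reduction

    def forward(self, predictions, targets):
        # Elementwise squared error, reduced according to the chosen mode.
        err = (predictions - targets) ** 2
        return err.mean() if self.reduction == "mean" else err.sum()

loss = SquaredErrorLoss(reduction="mean")
loss_value = loss(torch.randn(4, 1), torch.randn(4, 1))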