Kernel autogeneration with AutoGen#

Overview#

To ensure optimal model performance on Cerebras hardware, we use a combination of handwritten kernels and automatically generated kernels, known as AutoGen kernels.

Handwritten kernels form a core library, covering many common operations. However, certain scenarios may require AutoGen kernels:

  • Missing Operations: When a model requires operations not available in the handwritten library, AutoGen kernels fill the gap.

  • Performance Optimization: Even for operations with handwritten implementations, AutoGen can often create specialized, fused kernels that outperform them. This is particularly beneficial for large compound operations like losses.

The Distributed Task Generator (DTG) seamlessly handles AutoGen kernel creation, ensuring efficient execution on Cerebras hardware.

By default (the “default” policy, which behaves like “medium”), the compiler generates custom kernels on the fly both for operations missing from the handwritten library and for operations that would otherwise execute on the CPU host.

How to enable AutoGen#

Activate AutoGen by setting the autogen_policy flag to “medium” or “aggressive” in the “runconfig” section of the model’s parameters YAML file. For more detail on these parameters, see Cerebras Model Zoo YAML parameters.

runconfig:
  ...
  autogen_policy: "medium"
  ...

The autogen_policy flag accepts one of the following values:

  • disabled: no AutoGen kernels are generated.

  • default, medium: try to autogenerate kernels on the wafer instead of executing the corresponding operations on the CPU host.

  • aggressive: autogenerate some kernels even where handwritten wafer kernels already exist. This is primarily for debugging purposes.
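For example, to force AutoGen even where handwritten wafer kernels already exist, which can help when debugging a suspected kernel issue, the run configuration would be:

```yaml
runconfig:
  autogen_policy: "aggressive"
```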

Note

For most cases, use AutoGen with the “medium” autogen_policy.

Usage examples#

Autogenerate kernels for non-loss operations#

Suppose you modify a base model in the Cerebras Model Zoo, for example, tailoring GPT-3 by swapping GeLU for LeakyReLU. This requires two key adjustments in the parameters YAML file:

1. Update the nonlinearity: change the nonlinearity setting to “leaky_relu” in the model section.

2. Enable AutoGen: in the runconfig section, set autogen_policy to “medium” so that the necessary LeakyReLU kernel is generated.

model:
  nonlinearity: "leaky_relu"
  ...
runconfig:
  autogen_policy: "medium"
  ...

Autogenerate fused kernels for loss operations#

AutoGen improves the performance of PyTorch losses by creating autogenerated fused graphs for losses that were previously implemented with primitive kernels.

Note

AutoGen is an experimental feature and may result in unexpected compilation failures, even for the list of supported losses.

To implement a PyTorch loss using AutoGen:

  • Import the PyTorch loss from the Cerebras Model Zoo.

  • Set use_autogen=True when constructing the loss. The default value of use_autogen is False.

from modelzoo.common.pytorch.layers import BCELoss

loss = BCELoss(reduction='mean', use_autogen=True)

List of supported losses:

  • BCELoss

  • CrossEntropyLoss

  • BCEWithLogitsLoss

  • GaussianNLLLoss

  • HingeEmbeddingLoss

  • HuberLoss

  • KLDivLoss

  • L1Loss

  • MarginRankingLoss

  • MSELoss

  • MultiLabelSoftMarginLoss

  • MultiMarginLoss

  • NLLLoss

  • PoissonNLLLoss

  • SmoothL1Loss

  • TripletMarginLoss

  • TripletMarginWithDistanceLoss

Unsupported losses:

  • CosineEmbeddingLoss (will compile to primitive kernels, and performance will be slower)

Note

To ensure optimal performance gains from AutoGen, verify that both autogen_policy is enabled in the parameters YAML file and use_autogen is set to True for the specific loss. Disabling use_autogen for a loss will revert to a combination of primitive operations, potentially sacrificing performance benefits.
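For example, when constructing a loss with use_autogen=True, the parameters YAML file for the run must also contain:

```yaml
runconfig:
  autogen_policy: "medium"
```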

Autogenerate kernels for customized losses#

Creating custom losses may result in compilation failure due to a graph mismatch. If this occurs, enable AutoGen for the customized loss by adding the AutoGen wrapper @autogen_loss as a decorator for the loss class. Once the custom loss is defined, follow the steps in Autogenerate fused kernels for loss operations to enable the generation of fused kernels.

import torch.nn as nn

from modelzoo.common.pytorch.layers.utils import autogen_loss

@autogen_loss
class CustomLoss(nn.Module):
    def __init__(self, ...):
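The decorator pattern itself can be illustrated with a pure-Python sketch. Here autogen_loss is replaced by a toy stub that merely tags the class, and CustomSquaredErrorLoss is a hypothetical loss; the real Model Zoo decorator does considerably more work to make the loss compile to a fused kernel:

```python
# Toy stand-in for modelzoo.common.pytorch.layers.utils.autogen_loss:
# it only tags the class, whereas the real decorator prepares the loss
# for AutoGen kernel fusion at compile time.
def autogen_loss(cls):
    cls._autogen = True  # hypothetical marker attribute
    return cls

@autogen_loss
class CustomSquaredErrorLoss:
    """Hypothetical custom loss: squared error over paired values."""

    def __init__(self, reduction="mean"):
        self.reduction = reduction

    def __call__(self, preds, targets):
        errors = [(p - t) ** 2 for p, t in zip(preds, targets)]
        total = sum(errors)
        return total / len(errors) if self.reduction == "mean" else total

loss_fn = CustomSquaredErrorLoss()
print(loss_fn([1.0, 2.0], [0.0, 2.0]))  # → 0.5
print(CustomSquaredErrorLoss._autogen)  # → True
```

Note that decorating the class leaves its forward computation unchanged; only the compile-time treatment of the loss differs.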

Implementation notes#

Release 1.9.1#

In Release 1.9.1, we enabled the following AutoGen capabilities:

  • Support for PyTorch operations (e.g., nonlinearities)

  • Improved performance of PyTorch losses through fused kernels

  • Support for user-defined loss functions