Cluster monitoring with Grafana#

Overview#

Grafana dashboards let you visualize, query, and explore system metrics, and give you access to system logs and traces. Cerebras offers two Grafana dashboards tailored to its hardware and cluster management needs:

Cluster Management dashboard
- Helps users and administrators visualize and manage the Cerebras Wafer-Scale cluster
WsJob dashboard
- Monitors and manages individual jobs running on the Cerebras Wafer-Scale cluster

Cluster Management Dashboard#

The Cluster Management dashboard shows the overall state of the cluster. It includes the following:

CS-2 systems
- Overall CS-2 system status and errors
Nodes
- Kubernetes node warnings and errors
- Space usage health
Network
- Hardware NIC errors
- Kubernetes CNI errors
Cluster Management
- Errors for Cerebras Cluster Management services
- Kubernetes system services
Alerts
- Current alerts on the cluster

The following figure displays Cerebras’s Wafer-scale cluster management dashboard:

../../_images/cluster-mgmt-dashboard.png

WsJob Dashboard#

The WsJob Dashboard provides job-level metrics, logs, and traces, allowing users to closely monitor the progress and resource utilization of their specific jobs.

There are six panes in this dashboard:

Job overview
- Displays an overview of memory, CPU, and network bandwidth numbers for all replicas of the selected job
Job associated software errors
- Displays job runtime errors (currently only shows OOMKilled status)
Job associated hardware errors
- Displays any NIC, CS-2, or physical node that is assigned to this job and reports errors during job execution
Replica view
- Displays memory, CPU, and network bandwidth numbers for each replica_id of the selected replica_type in each chart. A replica_type is a type of service process for a given job. It can be one of these types: weight, command, activation, broadcastreduce, chief, worker, coordinator. A replica_id identifies a specific replica of a given replica_type within a job
Assigned nodes
- Displays the status of the physical nodes assigned to the chosen replica_type and replica_id
MemX performance
- Shows iteration-based performance, iteration time, cross-iteration time, and backward iteration time

The following figure displays Cerebras’s WsJob dashboard:

../../_images/perf_dashboard_full_view.png

On the left you can find options to search for particular metrics and view metric details.

There are also filters for users to select:

wsjob

Indicates the ID of the weight-streaming run, which is used to select between different runs on a particular system
replica_type

Selects between activation server metrics, weight server metrics, or metrics for all servers
assigned_systems

Indicates the system name being shown in the logs

Other useful fields are model, job_type, and replica_id.
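Grafana can preselect dashboard variables through `var-<name>=<value>` URL query parameters, which is convenient for sharing a link to one job. The following is a minimal sketch, assuming the filters above are exposed as dashboard variables with the same names (wsjob, replica_type, assigned_systems). The dashboard path, including its UID, is a hypothetical placeholder; copy the real path from your browser's address bar.

```python
from urllib.parse import urlencode


def wsjob_dashboard_url(base_url, dashboard_path, wsjob,
                        replica_type=None, assigned_systems=None):
    """Build a dashboard link with filters preselected.

    Assumes the dashboard variables are named after the filters in this
    document; dashboard_path is a placeholder, not a real Cerebras path.
    """
    params = {"var-wsjob": wsjob}
    if replica_type is not None:
        params["var-replica_type"] = replica_type
    if assigned_systems is not None:
        params["var-assigned_systems"] = assigned_systems
    # urlencode percent-escapes any characters that are unsafe in a query string.
    return f"{base_url.rstrip('/')}/{dashboard_path.strip('/')}?{urlencode(params)}"


if __name__ == "__main__":
    # All argument values here are illustrative placeholders.
    print(wsjob_dashboard_url(
        "https://grafana.my-cluster.example.com",
        "/d/HYPOTHETICAL_UID/wsjob-dashboard",
        "wsjob-abc123",
        replica_type="weight",
    ))
```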

Prerequisites#

You have access to the user node in the Cerebras Wafer-Scale cluster. Contact your system administrator if you run into any issues with the system configuration.

You can run a port-forwarding SSH session through the user node from your machine with this command:

$ ssh -L 8443:grafana.<cluster-name>.<domain>.com:443 myUser@usernode
Note

This command uses the local port 8443 to forward the traffic. You can choose any unoccupied port on your machine.
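If port 8443 is already in use, the sketch below asks the operating system for a free local port and assembles the same ssh command around it. The function names and the example cluster, domain, user, and host values are hypothetical placeholders and are not part of any Cerebras tooling.

```python
import socket


def find_free_port() -> int:
    """Ask the OS for a currently unoccupied local TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port
        return sock.getsockname()[1]


def tunnel_command(local_port, cluster_name, domain, user, usernode):
    """Build the ssh port-forwarding command as an argument list."""
    grafana_host = f"grafana.{cluster_name}.{domain}.com"
    return ["ssh", "-L", f"{local_port}:{grafana_host}:443", f"{user}@{usernode}"]


if __name__ == "__main__":
    port = find_free_port()
    # Placeholder values; substitute your own cluster, domain, and user.
    print(" ".join(tunnel_command(port, "my-cluster", "example", "myUser", "usernode")))
```

The port is released before ssh starts, so another process could claim it in between; if ssh reports that the address is in use, run the script again.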

How to get access?#

Links are accessible from the General/Cerebras tab. The following figure displays a Cerebras dashboard:

Steps to get access#

1. Ask your system administrator to set up the Grafana database. URLs come in the format grafana.CLUSTER-NAME.DOMAIN.com, for example: grafana.mb-systemf102.cerebras.com

2. Get authentication credentials for Grafana (username and password) from your system administrator.

3. Add the Grafana TLS certificate to your browser keychain. The certificate is located at /opt/cerebras/certs/grafana_tls.crt on the user node, where it is copied during the user node installation process. Download it to your local machine and add it to your browser keychain.

In Chrome on macOS:

Go to Preferences -> Privacy and Security -> Security -> Manage Certificates

Add grafana_tls.crt to the System keychain certificates. Make sure to set Always Trust for this certificate.

Next, edit your local machine’s /etc/hosts file to point the IP of the user node to Grafana: <USERNODE_IP> grafana.<cluster-name>.<domain>.com

Finally, navigate in your browser to the URL https://grafana.<cluster-name>.<domain>.com to access the Grafana Dashboards
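Before opening the browser, you can check the hosts entry and the certificate from a script. The sketch below formats the /etc/hosts line described above and queries Grafana's `/api/health` endpoint while trusting only the downloaded certificate. The function names and example values are hypothetical, and the check assumes that the certificate file is sufficient on its own to validate the server, which may not hold for every deployment.

```python
import json
import ssl
import urllib.request


def hosts_entry(usernode_ip: str, cluster_name: str, domain: str) -> str:
    """Format the /etc/hosts line that maps the Grafana hostname to the user node."""
    return f"{usernode_ip} grafana.{cluster_name}.{domain}.com"


def grafana_health(base_url: str, cert_path: str, timeout: float = 10.0) -> dict:
    """Query Grafana's /api/health endpoint, trusting only the given certificate file."""
    # Passing cafile means the system's default CA bundle is not loaded.
    context = ssl.create_default_context(cafile=cert_path)
    url = f"{base_url.rstrip('/')}/api/health"
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Placeholder values; substitute your own IP, cluster, and domain.
    print(hosts_entry("10.0.0.5", "my-cluster", "example"))
    # Uncomment once the hosts entry and certificate are in place:
    # print(grafana_health("https://grafana.my-cluster.example.com", "grafana_tls.crt"))
```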

Viewing performance metrics using the WsJob dashboard#

You can view cluster iteration-performance metrics by tracking update times across the weight servers.

Our current dashboard implementation shows iteration time, forward-iteration time, backward-iteration time, cross-iteration time, and input starvation.

Iteration time

Indicates the time from the end of iteration “i-1” on the weight servers to the end of iteration “i” on the weight servers.
Forward-iteration time

Indicates the time spent in iteration “i” during the forward pass.
Backward-iteration time

Indicates the time spent in iteration “i” during the backward pass.
Cross-iteration time

Indicates the time from the last gradient receive of an iteration to the first weight send. A high value indicates an optimizer performance bottleneck.
Input starvation

Indicates the time spent waiting on the framework to receive activations.
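To make two of these definitions concrete, the sketch below derives iteration time and cross-iteration time from per-iteration timestamps. It is an illustration only, not the dashboard's implementation: the `IterationEvents` record and its field names are hypothetical, and it assumes that "the first weight send" means the first weight send of the following iteration.

```python
from dataclasses import dataclass


@dataclass
class IterationEvents:
    """Hypothetical weight-server timestamps for one iteration, in milliseconds."""
    first_weight_send_ms: int
    last_gradient_receive_ms: int
    end_ms: int


def iteration_times(events):
    """Time from the end of iteration i-1 to the end of iteration i."""
    return [cur.end_ms - prev.end_ms for prev, cur in zip(events, events[1:])]


def cross_iteration_times(events):
    """Time from the last gradient receive of an iteration to the first
    weight send of the next one (an assumption about the definition)."""
    return [
        nxt.first_weight_send_ms - cur.last_gradient_receive_ms
        for cur, nxt in zip(events, events[1:])
    ]


if __name__ == "__main__":
    # Made-up timestamps for three consecutive iterations.
    events = [
        IterationEvents(0, 900, 1000),
        IterationEvents(1200, 2000, 2100),
        IterationEvents(2500, 3300, 3400),
    ]
    print(iteration_times(events))
    print(cross_iteration_times(events))
```

In this made-up trace the cross-iteration gap grows from one iteration to the next, which is the pattern that would point toward the optimizer step as the bottleneck.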

These statistics are shown in the following image and can be used to identify performance bottlenecks in the training process:

Viewing utilization metrics using the WsJob Dashboard#

The Replica view metric displays memory/cpu/network bandwidth numbers for each replica_id of this replica_type in each chart. Replica_type represents a type of service process for a given job. It can be one of these types: weight, command, activation, broadcastreduce, chief, worker, and coordinator.

1. Transmit bandwidth indicates the maximum and mean network egress speeds for each activation server. The transmission speed of a lagging node can help you determine whether a job is network-bound.

The following figure shows that most weight servers achieve a network transmit speed of ~420 MB/s:

../../_images/perf_dashboard_transmit_bw.png

2. Receive bandwidth denotes the ingress speeds for each supporting server. In this example, the weight servers have an average ingress speed of around 220 MB/s.

The following figure shows the receive bandwidth metric:

../../_images/perf_dashboard_receive_bw.png

3. CPU usage shows the CPU percentage utilization for each weight server. In this case, the CPUs are only 2-3% utilized.

The following figure shows the CPU usage metric:

4. Memory usage indicates the maximum and mean amounts of memory each weight server uses over time. This can be useful in debugging whether the weight servers are memory bound. For more information on memory requirements, visit Resource requirements for parallel training and compilation.

The following figure shows the memory usage metric:

5. You can use the Grafana interface to show individual metrics for each node. For example, these are the views for CPU and memory usage per node:

The following figure shows the CPU usage per node metric:

../../_images/perf_dashboard_cpu_node.png

The following figure shows the memory usage per node metric:

../../_images/perf_dashboard_memory_node.png
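The maximum and mean bandwidth figures in the transmit and receive panels can be reproduced from cumulative byte counters, and comparing per-server means is one way to spot a lagging node. The sketch below assumes samples of the form (time in seconds, cumulative bytes) and treats a server as lagging when its mean rate falls below a chosen fraction of the median. The sample format, the function names, and the 0.5 default are illustrative assumptions and do not describe how the dashboard computes its panels.

```python
from statistics import mean, median


def rates_mb_per_s(samples):
    """Convert (time_s, cumulative_bytes) samples into per-interval MB/s rates."""
    return [
        (b1 - b0) / (t1 - t0) / 1e6
        for (t0, b0), (t1, b1) in zip(samples, samples[1:])
    ]


def summarize(samples):
    """Return the maximum and mean rate over all sampling intervals."""
    rates = rates_mb_per_s(samples)
    return {"max": max(rates), "mean": mean(rates)}


def lagging_servers(mean_rates, fraction=0.5):
    """Names of servers whose mean rate is below fraction * median rate."""
    cutoff = fraction * median(mean_rates.values())
    return sorted(name for name, rate in mean_rates.items() if rate < cutoff)


if __name__ == "__main__":
    # Made-up counter samples taken 10 seconds apart.
    print(summarize([(0, 0), (10, 4_200_000_000), (20, 6_300_000_000)]))
    # Made-up per-server mean transmit rates in MB/s.
    print(lagging_servers({"weight-0": 420.0, "weight-1": 415.0, "weight-2": 120.0}))
```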
