Cerebras job scheduling and monitoring#
All jobs submitted to the Cerebras cluster are queued and assigned to resources on a first-come, first-served basis. You can interact with the resource manager using csctl, a CLI tool preinstalled on the user node, to:
Job Tracking: Inspect the state of submitted jobs, and cancel your own jobs if necessary.
Job Labelling: Apply labels to a job.
Queue Tracking: Review which jobs are queued and which jobs are running on the Cerebras cluster.
Get Configured Volumes: Get a list of configured volumes on the Cerebras cluster. These volumes can be used to stage code and training data.
Log Export: Export Cerebras cluster logs of a given job to the user node. These logs can be useful when debugging a job failure and working with the Cerebras support team.
Worker SSD Cache: Query worker SSD cache usage.
Grafana Dashboard: View a Grafana dashboard showing job resource usage and possible software and hardware errors relevant to a job.
Use the csctl tool directly from the terminal of your user node. For example, to get the help message, run:
$ csctl --help
Cerebras cluster command line tool.
Usage:
csctl [command]
Available Commands:
cancel Cancel job
config View csctl config files
get Get resources
label Label resources
log-export Gather and download logs.
types Display resource types
Flags:
-d, --debug int higher debug values will display more fields in output objects
-h, --help help for csctl
Use "csctl [command] --help" for more information about a command.
Job Tracking#
Each training job submitted to the Cerebras cluster launches two sequential jobs. First, a compilation job is launched. When compilation is completed, an execution job is launched. Each of these jobs is identified by a jobID. The jobID for each job is printed on the terminal after it starts running on the Cerebras cluster.
Extracting the model from framework. This might take a few minutes.
WARNING:root:The following model params are unused: precision_opt_level, loss_scaling
2023-02-05 02:00:00,450 INFO: Compiling the model. This may take a few minutes.
2023-02-05 02:00:00,635 INFO: Initiating a new compile wsjob against the cluster server.
2023-02-05 02:00:00,761 INFO: Compile job initiated
...
2023-02-05 02:02:00,899 INFO: Ingress is ready.
2023-02-05 02:02:00,899 INFO: Cluster mgmt job handle: {'job_id': 'wsjob-aaaaaaaaaa000000000', 'service_url': 'cluster-server.cerebras.local:443', 'service_authority': 'wsjob-aaaaaaaaaa000000000-coordinator-0.cluster-server.cerebras.local', 'compile_dir_absolute_path': '/cerebras/cached_compile/cs_0000000000111111'}
2023-02-05 02:02:00,901 INFO: Creating a framework GRPC client: cluster-server.cerebras.local:443
2023-02-05 02:07:00,112 INFO: Compile successfully written to cache directory: cs_000000000011111
2023-02-05 02:07:30,118 INFO: Compile for training completed successfully!
2023-02-05 02:07:30,120 INFO: Initiating a new execute wsjob against the cluster server.
2023-02-05 02:07:30,248 INFO: Execute job initiated
...
2023-02-05 02:08:00,321 INFO: Ingress is ready.
2023-02-05 02:08:00,321 INFO: Cluster mgmt job handle: {'job_id': 'wsjob-bbbbbbbbbbb11111111', 'service_url': 'cluster-server.cerebras.local:443', 'service_authority': 'wsjob-bbbbbbbbbbb11111111-coordinator-0.cluster-server.cerebras.local', 'compile_artifact_dir': '/cerebras/cached_compile/cs_0000000000111111'}
...
The jobID is also recorded in a file named run_meta.json in every directory from which you have submitted a job. All jobIDs are appended to run_meta.json. run_meta.json contains two sections: compile_jobs and execute_jobs. Once a training job is submitted and before compilation is done, the compile job is recorded under compile_jobs. For this example you will see
{ "compile_jobs": [ { "id": "wsjob-aaaaaaaaaa000000000", "log_path": "/cerebras/workdir/wsjob-aaaaaaaaaa000000000", "start_time": "2023-02-05T02:00:00Z", }, ] }
After the compilation job has completed and the training job is scheduled, the compile job reports additional log information and the jobID of the training job is recorded under execute_jobs. To correlate a compilation job with its training job, compare the available time of the compilation job with the start time of the training job. For this example, you will see
{ "compile_jobs": [ { "id": "wsjob-aaaaaaaaaa000000000", "log_path": "/cerebras/workdir/wsjob-aaaaaaaaaa000000000", "start_time": "2023-02-05T02:00:00Z", "cache_compile": { "location": "/cerebras/cached_compile/cs_0000000000111111", "available_time": "2023-02-05T02:02:00Z" } } ], "execute_jobs": [ { "id": "wsjob-bbbbbbbbbbb11111111", "log_path": "/cerebras/workdir/wsjob-bbbbbbbbbbb11111111", "start_time": "2023-02-05T02:02:00Z" } ] }
Using the jobID, you can query information about the status of a job in the system using
csctl [-d int] get job <jobID> [-o json|yaml]
where:
| Flag | Default | Description |
|---|---|---|
| -o | table | Output Format: table, json, yaml |
| -d, --debug | 0 | Debug level. Choosing a higher level of debug prints more fields in the output objects. Only applicable to json or yaml output format. |
For example, with debug level equal to zero, the output is
$ csctl -d0 get job wsjob-000000000000 -oyaml
meta:
createTime: "2022-12-07T05:10:16Z"
labels:
label: customed_label
user: user1
name: wsjob-000000000000
type: job
spec:
user:
gid: "1001"
uid: "1000"
volumeMounts:
- mountPath: /data
name: data-volume-000000
subPath: ""
- mountPath: /dev/shm
name: dev-shm
subPath: ""
status:
phase: SUCCEEDED
systems:
- systemCS2_1
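The same query in JSON format can be piped into standard tools for scripting. Below is a minimal sketch, assuming jq is available and that the JSON fields mirror the YAML structure shown above, that prints only the job phase:
$ csctl get job wsjob-000000000000 -ojson | jq -r '.status.phase'
SUCCEEDED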
Note
Compilation and execution jobs are queued and executed sequentially in the Cerebras cluster. This means that the compilation job is completed before the execution job is scheduled. Compilation jobs do not require CS-2 resources, but they do require some resources on the server nodes. In 1.8, we allow only one concurrent compilation running in the cluster. Execution jobs require CS-2 resources, so they will be queued until sufficient CS-2 resources are available. Compilation and execution jobs have different jobIDs.
Job Termination#
You can terminate any compilation or execution job before completion by providing its jobID. See Job Tracking for more details on jobIDs. To cancel a job, use
csctl cancel job <jobID>
Terminating a job releases all resources and sets the job to a cancelled state. Example output when cancelling a job:
$ csctl cancel job wsjob-000000000000
Job cancelled success
In 1.8, this command might cause the client logs to print
cerebras_appliance.errors.ApplianceUnknownError: Received unexpected gRPC error (StatusCode.UNKNOWN) : 'Stream removed' while monitoring Coordinator for Runtime server errors
This is expected.
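To confirm that a cancelled job has actually left the running state, you can query its status again as described in Job Tracking; for example (the job ID is a placeholder):
$ csctl get job wsjob-000000000000 -oyaml | grep phase
phase: CANCELLED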
Job Labelling#
You can add labels to your jobs to help categorize them better. There are two ways to add labels to your jobs.
One way is to use the flag --job_labels when you submit your training job. Provide a list of job labels as key-value pairs, with each key and value separated by an equal sign.
For example, to assign job labels when training an FCMNIST model using PyTorch, you would use
python run.py --job_labels framework=pytorch model=FCMNIST --params params.yaml --num_csx=1 --model_dir=model_dir --mode train --mount_dirs <paths to data> --python_paths <paths to modelzoo and other python code if used>
And to assign a job label when training an FCMNIST model using TensorFlow, you would use
python run.py --job_labels framework=tensorflow --params params.yaml --num_csx=1 --model_dir model_dir --compile_only --mode train --mount_dirs <paths to data> --python_paths <paths to modelzoo and other python code if used>
The other way to add labels to your jobs is through the csctl label command:
csctl label job wsjob-000000000000 framework=tensorflow
To remove a label from your job, use
csctl label job wsjob-000000000000 framework-
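To verify which labels are currently attached to a job, you can inspect the job object; the labels appear under meta.labels in the output, as in the example in Job Tracking:
$ csctl get job wsjob-000000000000 -oyaml | grep -A3 'labels:'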
Queue Tracking#
To obtain a full list of completed, running, and queued jobs on the Cerebras cluster, you can use
csctl get jobs
By default, this command produces a table including:
| Field | Description |
|---|---|
| Name | jobID identification |
| Age | Time since job submission |
| Duration | How long the job ran |
| Phase | One of QUEUED, RUNNING, SUCCEEDED, FAILED, CANCELLED |
| Systems | CS-2 systems used in this job |
| User | User that started this job |
| Labels | User-defined labels |
| Dashboard | Grafana dashboard link for this job |
For example,
$ csctl get jobs
NAME AGE DURATION PHASE SYSTEMS USER LABELS DASHBOARD
wsjob-000000000000 43h 6m27s SUCCEEDED systemCS2_1 user1 model=gpt3xl https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000000
wsjob-000000000001 18h 20s RUNNING systemCS2_1, systemCS2_2 user2 model=gpt3-tiny https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000001
wsjob-000000000002 1h 6m25s QUEUED user2 model=neox,team=ml https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000002
Executing the command directly prints out a long list of current and past jobs. You can use the -l option to return only jobs that match a given set of labels:
$ csctl get jobs -l model=neox,team=ml
NAME AGE DURATION PHASE SYSTEMS USER LABELS DASHBOARD
wsjob-000000000002 1h 6m25s QUEUED user2 model=neox,team=ml https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000002
You can also use grep to extract relevant information about which jobs are queued versus running and how many systems are occupied.
When you grep 'RUNNING', you see a list of jobs that are currently running on the cluster. For example, as shown below, there is one job running.
$ csctl get jobs | grep 'RUNNING'
wsjob-000000000001 18h 20s RUNNING systemCS2_1, systemCS2_2 user2 model=gpt3-tiny https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000001
When you grep 'QUEUED', you see a list of jobs that are currently queued and waiting for system availability to start training. For example, at the same time as the above running job, another job is currently queued, as shown below:
$ csctl get jobs | grep 'QUEUED'
wsjob-000000000002 1h 6m25s QUEUED user2 model=neox,team=ml https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000002
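Because the phase appears as a single column, grep -c gives a quick count of jobs in each state. With the example jobs above, this reports one queued and one running job:
$ csctl get jobs | grep -c 'QUEUED'
1
$ csctl get jobs | grep -c 'RUNNING'
1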
Get Configured Volumes#
After installing the Cerebras cluster, the system admin will configure a few volumes to be used by your jobs to access code and training data. To get a list of mounted volumes on the Cerebras cluster, use
csctl get volume
For example,
$ csctl get volume
NAME TYPE CONTAINERPATH SERVER SERVERPATH READONLY
training-data-volume nfs /ml 10.10.10.10 /ml false
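The CONTAINERPATH column is the path under which the volume is exposed to your job. For example, if your training data is staged under the /ml path of the training-data-volume above, a job submission might reference it as follows (the paths here are illustrative; see Job Labelling for the full run.py invocation):
python run.py --params params.yaml --num_csx=1 --model_dir=model_dir --mode train --mount_dirs /ml --python_paths <paths to modelzoo and other python code if used>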
Log Export#
To download Cerebras cluster logs of a given job to the user node, you can use
csctl log-export <jobID> [-b]
with optional flags:
| Flag | Default Value | Description |
|---|---|---|
| -b, --binaries | False | Include binary debugging artifacts |
| -h, --help | | Informative message for log-export |
For example:
$ csctl log-export wsjob-example-0
Gathering log data within cluster...
Starting a fresh download of log archive.
Downloaded 0.55 MB.
Logs archive: ./wsjob-example-0.zip
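The exported archive is a regular zip file, so you can inspect what was collected; for example, assuming unzip is available on the user node:
$ unzip -l ./wsjob-example-0.zip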
Cerebras cluster logs can be useful when debugging a job failure and working with Cerebras support.
Worker SSD Cache#
To speed up the processing of large amounts of input data, users can stage their data in the worker nodes’ local SSD cache. This cache is shared among different users.
Get Worker Cache Usage#
Use this command to obtain the current worker cache usage on each worker node:
$ csctl get worker-cache
NODE DISK USAGE
worker-01 57.86%
worker-02 50.84%
worker-03 49.47%
worker-04 63.56%
worker-05 63.56%
worker-06 63.71%
worker-07 63.22%
worker-09 65.80%
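To quickly spot worker nodes that are close to filling their cache, you can filter this output. Below is a minimal sketch with an illustrative 60% threshold (awk strips the % sign before comparing):
$ csctl get worker-cache | awk 'NR>1 { gsub("%","",$2); if ($2+0 > 60) print $1, $2"%" }'
worker-04 63.56%
worker-05 63.56%
worker-06 63.71%
worker-07 63.22%
worker-09 65.80%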
If the cache is full, please contact your system admin to clear the cache.
Grafana Dashboard#
We provide a Cerebras-tailored Grafana dashboard, the WsJob Dashboard, to display job-related metrics.
There are 5 panes in this dashboard:
Job overview: displays an overview of memory/CPU/network bandwidth numbers for all replicas of the selected job.
Job associated software errors: displays job runtime errors; currently only shows OOMKilled status.
Job associated hardware errors: displays any NIC, CS-2, or physical node that is assigned to this job and has errors during the job execution.
Replica view: displays memory/CPU/network bandwidth numbers for each replica_id of this replica_type in each chart. Replica_type represents a type of service process for a given job. It can be one of these types: weight, command, activation, broadcastreduce, chief, worker, coordinator. Replica_id corresponds to the specific replica for a job and a replica type.
Assigned nodes: displays the status of the physical nodes that are assigned to the chosen replica_type and replica_id.
This dashboard is useful for checking job resource usage and any software or hardware errors relevant to the job.
There is also a Cluster Management Dashboard, which shows the overall state of the cluster.
Access Grafana Dashboards through the User Node#
This setup assumes that you can reach the user node from your machine. You can run a port-forwarding SSH session through the user node from your machine with this command:
ssh -L 8443:grafana.<cluster-name>.example.com:443 myUser@usernode
Note that this command uses the local port 8443 to forward the traffic. You can choose any unoccupied port on your machine.
The Grafana TLS certificate should be located at /opt/cerebras/certs/grafana_tls.crt on the user node; it is copied there during the user node installation process. Download this certificate to your local machine and add it to your browser keychain.
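One way to download the certificate is with scp from your local machine, reusing the same user and host as in the SSH command above:
scp myUser@usernode:/opt/cerebras/certs/grafana_tls.crt .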
To add the certificate in a Chrome browser on macOS, for example, go to Preferences -> Privacy and Security -> Security -> Manage Certificates:
Add grafana-tls.crt into the System keychain certificates. Make sure to set Always Trust when using this certificate.
Next, edit your local machine’s /etc/hosts file to point the IP of the user node to Grafana:
<USERNODE_IP> grafana.<cluster-name>.example.com
Finally, navigate in your browser to the URL https://grafana.<cluster-name>.example.com to access the Grafana Dashboards.
Please contact your sysadmin for the username and password.