.. _job-scheduler:

Cerebras job scheduling and monitoring
======================================

Resource management of the Cerebras Wafer-Scale Cluster is handled by its management node. All jobs submitted to the Cerebras cluster are queued and assigned to resources on a first-come, first-served basis. You can interact with the resource manager through ``csctl``, a CLI tool preinstalled on the user node, to:

+ :ref:`job-tracking`: Inspect the state of submitted jobs, and cancel your own jobs if necessary.
+ :ref:`queue-tracking`: Review which jobs are queued and which jobs are running on the Cerebras cluster.
+ :ref:`get-volumes`: Get a list of mounted volumes on the Cerebras cluster. These volumes can be used to stage code and training data.
+ :ref:`export-logs`: Export Cerebras cluster logs of a given job to the user node. These logs can be useful when debugging a job failure and working with the Cerebras support team.

Use the ``csctl`` tool directly from the terminal of your user node. For example, to get the help message, you can do ::

    $ csctl --help
    Cerebras cluster command line tool.

    Usage:
      csctl [command]

    Available Commands:
      cancel      Cancel job
      config      Modify csctl config files
      get         Get resources
      label       Label resources
      log-export  Gather and download logs.
      types       Display resource types

    Flags:
          --csconfig string   config file (default is $HOME/.cs/config) (default "$HOME/.cs/config")
      -d, --debug int         higher debug values will display more fields in output objects
      -h, --help              help for csctl

    Use "csctl [command] --help" for more information about a command.

Configuration
-------------

``csctl`` requires a cluster configuration file. The cluster configuration file is saved as ``/opt/cerebras/config`` when the user node installer is run. You can specify the path to the configuration file with the flag ``--csconfig``, as in ::

    csctl --csconfig /opt/cerebras/config

If the flag is not specified, ``csctl`` uses the configuration file at the default path ``$HOME/.cs/config``.
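To avoid passing ``--csconfig`` on every invocation, you can place the cluster configuration at the default path instead. A minimal sketch, assuming the installer wrote ``/opt/cerebras/config`` and your user has read access to it:

.. code-block:: bash

    # Copy the cluster configuration to the path csctl reads by default,
    # so the --csconfig flag can be omitted from subsequent commands.
    mkdir -p "$HOME/.cs"
    cp /opt/cerebras/config "$HOME/.cs/config"
    csctl get jobs   # no --csconfig needed now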
.. _job-tracking:

Job Tracking
------------

Each training job submitted to the Cerebras cluster launches two sequential jobs: first, a compilation job is launched; when compilation completes, an execution job is launched. Each of these jobs is identified by a ``jobID``, which is printed on the terminal once the corresponding job starts running on the Cerebras cluster. In the following example, the lines reporting the ``jobID`` of the compilation and execution jobs are highlighted.

.. code-block:: bash
    :emphasize-lines: 8, 16

    Extracting the model from framework. This might take a few minutes.
    WARNING:root:The following model params are unused: precision_opt_level, loss_scaling
    2023-02-05 02:00:00,450 INFO: Compiling the model. This may take a few minutes.
    2023-02-05 02:00:00,635 INFO: Initiating a new compile wsjob against the cluster server.
    2023-02-05 02:00:00,761 INFO: Compile job initiated
    ...
    2023-02-05 02:02:00,899 INFO: Ingress is ready.
    2023-02-05 02:02:00,899 INFO: Cluster mgmt job handle: {'job_id': 'wsjob-aaaaaaaaaa000000000', 'service_url': 'cluster-server.cerebras.com:443', 'service_authority': 'wsjob-aaaaaaaaaa000000000-coordinator-0.cluster-server.cerebras.com', 'compile_dir_absolute_path': '/cerebras/cached_compile/cs_0000000000111111'}
    2023-02-05 02:02:00,901 INFO: Creating a framework GRPC client: cluster-server.cerebras.com:443, , wsjob-aaaaaaaaaa000000000-coordinator-0.cluster-server.cerebras.com
    2023-02-05 02:07:00,112 INFO: Compile successfully written to cache directory: cs_000000000011111
    2023-02-05 02:07:30,118 INFO: Compile for training completed successfully!
    2023-02-05 02:07:30,120 INFO: Initiating a new execute wsjob against the cluster server.
    2023-02-05 02:07:30,248 INFO: Execute job initiated
    ...
    2023-02-05 02:08:00,321 INFO: Ingress is ready.
    2023-02-05 02:08:00,321 INFO: Cluster mgmt job handle: {'job_id': 'wsjob-bbbbbbbbbbb11111111', 'service_url': 'cluster-server.cerebras.com:443', 'service_authority': 'wsjob-bbbbbbbbbbb11111111-coordinator-0.cluster-server.cerebras.com', 'compile_artifact_dir': '/cerebras/cached_compile/cs_0000000000111111'}
    ...

The ``jobID`` is also recorded in the file ``run_meta.json`` during job submission. You will find a ``run_meta.json`` file in every directory from which you have submitted a job, and all ``jobID`` values are appended to it. ``run_meta.json`` contains two sections: ``compile_jobs`` and ``execute_jobs``. Once a training job is submitted, and before compilation is done, the compile job is recorded under ``compile_jobs``. For this example, you will see

.. code-block:: json
    :emphasize-lines: 4

    {
        "compile_jobs": [
            {
                "id": "wsjob-aaaaaaaaaa000000000",
                "log_path": "/cerebras/workdir/wsjob-aaaaaaaaaa000000000",
                "start_time": "2023-02-05T02:00:00Z"
            }
        ]
    }

After the compilation job has completed and the training job is scheduled, the compile job reports additional log information, and the ``jobID`` of the training job is recorded under ``execute_jobs``. To correlate a compilation job with its training job, match the available time of the compilation job with the start time of the training job. For this example, you will see

.. code-block:: json
    :emphasize-lines: 4,9,15,17

    {
        "compile_jobs": [
            {
                "id": "wsjob-aaaaaaaaaa000000000",
                "log_path": "/cerebras/workdir/wsjob-aaaaaaaaaa000000000",
                "start_time": "2023-02-05T02:00:00Z",
                "cache_compile": {
                    "location": "/cerebras/cached_compile/cs_0000000000111111",
                    "available_time": "2023-02-05T02:02:00Z"
                }
            }
        ],
        "execute_jobs": [
            {
                "id": "wsjob-bbbbbbbbbbb11111111",
                "log_path": "/cerebras/workdir/wsjob-bbbbbbbbbbb11111111",
                "start_time": "2023-02-05T02:02:00Z"
            }
        ]
    }
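Because ``run_meta.json`` is plain JSON, you can extract ``jobID`` values from it with standard tools. A minimal sketch using ``jq`` (assuming ``jq`` is installed on the user node and that you run it from the directory where the job was submitted):

.. code-block:: bash

    # Read the jobID of the most recent execution job from run_meta.json.
    JOB_ID=$(jq -r '.execute_jobs[-1].id' run_meta.json)
    echo "latest execution job: ${JOB_ID}"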
Using the ``jobID``, you can query the status of a job in the system using ::

    csctl [--csconfig path] [-d int] get job <jobID> [-o json|yaml]

where:

+-------------+---------+--------------------------------------------------------------------------+
| Flag        | Default | Description                                                              |
+=============+=========+==========================================================================+
| -o          | `table` | Output Format: `table`, `json`, `yaml`                                   |
+-------------+---------+--------------------------------------------------------------------------+
| -d, --debug | 0       | Debug level; higher debug values print more fields in the output objects |
+-------------+---------+--------------------------------------------------------------------------+

For example, with the debug level equal to zero, the output is ::

    $ csctl -d0 get job wsjob-000000000000 -oyaml
    meta:
      createTime: "2022-12-07T05:10:16Z"
      labels:
        label: customed_label
        user: user1
      name: wsjob-000000000000
      type: job
    spec:
      user:
        gid: "1001"
        uid: "1000"
      volumeMounts:
      - mountPath: /data
        name: data-volume-000000
        subPath: ""
      - mountPath: /dev/shm
        name: dev-shm
        subPath: ""
    status:
      phase: SUCCEEDED
      systems:
      - systemCS2_1

.. note::
    Compilation and execution jobs are queued and executed sequentially in the Cerebras cluster: the compilation job completes before the execution job is scheduled. Compilation jobs do not require CS-2 resources, so they are executed immediately after the job is launched. Execution jobs require CS-2 resources, so they are queued until sufficient CS-2 resources are available. Compilation and execution jobs have different ``jobID`` values.

Job Termination
---------------

You can terminate any compilation or execution job before completion by providing its ``jobID``; see :ref:`job-tracking` for more details on the ``jobID``. To cancel a job, you can use ::

    csctl [--csconfig path] cancel job <jobID>

Terminating a job releases all resources and sets the job to a cancelled state. An example output of cancelling a job is ::

    $ csctl cancel job wsjob-000000000000
    Job cancelled success

.. _queue-tracking:

Queue Tracking
--------------

To obtain a full list of jobs completed, running, and queued on the Cerebras cluster, you can use ::

    csctl get jobs

By default, this command produces a table including:

+--------+-------------------------------------------+
| Field  | Description                               |
+========+===========================================+
| Name   | jobID identification                      |
+--------+-------------------------------------------+
| Age    | Time since job submission                 |
+--------+-------------------------------------------+
| Phase  | One of QUEUED, RUNNING, SUCCEEDED, FAILED |
+--------+-------------------------------------------+
| Labels | Customized labels by user                 |
+--------+-------------------------------------------+

For example, ::

    $ csctl get jobs
    NAME                AGE  PHASE      SYSTEMS                   USER   LABELS
    wsjob-000000000000  43h  SUCCEEDED  systemCS2_1               user1  label=custom_label_1
    wsjob-000000000001  18h  RUNNING    systemCS2_1, systemCS2_2  user2  label=custom_label_2
    wsjob-000000000002  1h   QUEUED                               user2  label=custom_label_3

To assign labels to your jobs, use the flag ``--job_labels`` when you submit your training job. Labels are given as a list of equal-sign-separated key-value pairs. For example, to assign a job label when training an FCMNIST model using **PyTorch**, you would use

.. code-block:: bash

    python run.py --appliance --job_labels custom_label --params params.yaml --num_csx=1 --model_dir=model_dir --mode train --credentials_path= --mount_dirs --python_paths

And to assign a job label when training an FCMNIST model using **TensorFlow**, you would use

.. code-block:: bash

    python run-appliance.py --job_labels custom_label --params params.yaml --num_csx=1 --model_dir model_dir --compile_only --mode train --credentials_path= --mount_dirs --python_paths

Directly executing ``csctl get jobs`` prints out a long list of current and past jobs. You can use ``grep`` to extract information on which jobs are queued versus running, and on how many systems are occupied; see the examples below.
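For instance, a lightweight way to keep a live view of the queue is to combine ``csctl get jobs`` with standard shell tools. A minimal sketch (assuming the ``watch`` utility is available on your user node):

.. code-block:: bash

    # Refresh the list of queued and running jobs every 30 seconds.
    watch -n 30 "csctl get jobs | grep -E 'QUEUED|RUNNING'"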
When you ``grep 'RUNNING'``, you see a list of jobs that are currently running on the cluster. For example, as shown below, there is one job running. ::

    $ csctl get jobs | grep 'RUNNING'
    wsjob-000000000001  18h  RUNNING  systemCS2_1, systemCS2_2  user2  label=custom_label_2

When you ``grep 'QUEUED'``, you see a list of jobs that are currently queued and waiting for system availability to start training. For example, at the same time as the above running job, there is another job currently queued, as shown below. ::

    $ csctl get jobs | grep 'QUEUED'
    wsjob-000000000002  1h  QUEUED  user2  label=custom_label_3

.. _get-volumes:

Get Mounted Volumes
-------------------

To get a list of mounted volumes on the Cerebras cluster, you can use ::

    csctl get volume

For example, ::

    $ csctl get volume
    NAME                  TYPE  CONTAINERPATH  SERVER       SERVERPATH  READONLY
    training-data-volume  nfs   /ml            10.10.10.10  /ml         false

These volumes can be used to stage code and training data.

.. _export-logs:

Log Export
----------

To download Cerebras cluster logs of a given job to the user node, you can use ::

    csctl [--csconfig path] log-export <jobID> [-b] [-p path]

with optional flags:

+---------------------+---------------+--------------------------------------------------------+
| Flag                | Default Value | Description                                            |
+=====================+===============+========================================================+
| -b, --binaries      | False         | Include binary debugging artifacts                     |
+---------------------+---------------+--------------------------------------------------------+
| -p, --path          | "."           | Specify the path where the log archive is downloaded.  |
+---------------------+---------------+--------------------------------------------------------+
| -h, --help          |               | Informative message for log-export                     |
+---------------------+---------------+--------------------------------------------------------+

For example::

    $ csctl log-export wsjob-example-0
    Gathering log data within cluster...
    Starting a fresh download of log archive.
    Downloaded 0.55 MB.
    Logs archive: ./wsjob-example-0.zip

Cerebras cluster logs can be useful when debugging a job failure and working with the Cerebras support team.
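To script this step, you can combine the ``run_meta.json`` lookup from :ref:`job-tracking` with ``log-export``. A minimal sketch (assuming ``jq`` is installed and the command is run from the directory where the job was submitted):

.. code-block:: bash

    # Export logs, including binary debugging artifacts, for the most
    # recent execution job recorded in run_meta.json.
    JOB_ID=$(jq -r '.execute_jobs[-1].id' run_meta.json)
    mkdir -p ./cluster-logs
    csctl log-export "$JOB_ID" -b -p ./cluster-logs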