Running EleutherAI’s Evaluation Harness

Overview

We provide support for running EleutherAI’s Evaluation Harness (EEH) on the Cerebras Wafer-Scale cluster. EEH is a popular framework for evaluating large language models across a wide range of datasets and tasks.

Running Evaluation Harness on CS-2

To run EEH tasks on CS-2, use the new modelzoo/common/pytorch/run_cstorch_eval_harness.py script from the Cerebras Model Zoo. This script is similar to the run.py scripts normally used to launch model evaluation or training. It accepts the following command-line arguments:

python run_cstorch_eval_harness.py CSX [-h] [--tasks TASKS] [--num_fewshot NUM_FEWSHOT] [--output_path OUTPUT_PATH] [--hf_cache_dir HF_CACHE_DIR] [--keep_data_dir] -p PARAMS [-m {eval}] [-o MODEL_DIR]
      [--checkpoint_path CHECKPOINT_PATH] [--disable_strict_checkpoint_loading] [--load_checkpoint_states LOAD_CHECKPOINT_STATES] [--logging LOGGING] [--wsc_log_level WSC_LOG_LEVEL [WSC_LOG_LEVEL ...]]
      [--max_steps MAX_STEPS] [--eval_steps EVAL_STEPS] [--config CONFIG] [--compile_only | --validate_only] [--num_workers_per_csx NUM_WORKERS_PER_CSX] [-c COMPILE_DIR] [--job_labels JOB_LABELS [JOB_LABELS ...]]
      [--debug_args_path DEBUG_ARGS_PATH] [--mount_dirs MOUNT_DIRS [MOUNT_DIRS ...]] [--python_paths PYTHON_PATHS [PYTHON_PATHS ...]] [--credentials_path CREDENTIALS_PATH] [--mgmt_address MGMT_ADDRESS]
      [--job_time_sec JOB_TIME_SEC] [--disable_version_check] [--num_csx NUM_CSX] [--num_wgt_servers NUM_WGT_SERVERS] [--num_act_servers NUM_ACT_SERVERS] [--transfer_processes TRANSFER_PROCESSES]

|Eleuther Eval Harness Arguments|Description|
|-------------------------------|-----------|
|--tasks TASKS|Comma-separated string specifying Eleuther Eval Harness tasks|
|--num_fewshot NUM_FEWSHOT|Number of examples to be added to the few-shot context string. Defaults to 0|
|--output_path OUTPUT_PATH|Path to the directory where eval harness output results will be written|
|--hf_cache_dir HF_CACHE_DIR|Path to the directory for caching Hugging Face downloaded data|
|--keep_data_dir|Specifies whether dumped data samples should be kept for reuse. Defaults to False, i.e. data samples are deleted after the run|

|Required arguments|Description|
|------------------|-----------|
|-p PARAMS, --params PARAMS|Path to the .yaml file with model parameters|

The CSX arguments are the same as in our other flows for running training and evaluation jobs on CS-2 with run.py scripts. For the EEH flow, the following arguments are important:

  • --params: Specifies the path to the .yaml file defining the model architecture. This argument is required.

    NOTE: The config entry params.eval_input.data_dir must be specified. This is the path to a mounted directory, visible to the worker containers, where EEH task data samples are dumped after preprocessing. Use the --mount_dirs argument to specify the directory mount, as in our existing flows.

  • --checkpoint_path: Specifies the path to the checkpoint file from which model weights are loaded. If no checkpoint path is provided, this flow autoloads the latest checkpoint file from the specified model_dir. A checkpoint file is required to run EEH; if none is found, the run exits with an error.

  • --tasks, --num_fewshot, and --output_path: These settings specify the EEH evaluation tasks, the number of few-shot examples that the model sees, and the path where evaluation results are written. See here for more information.

  • --keep_data_dir: This option, if set, preserves the preprocessed data samples generated by the DataLoader for the EEH task data, i.e. the directory specified under params.eval_input.data_dir.
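The arguments above can also be assembled programmatically. The following is a minimal sketch, not part of the Model Zoo: the helper names check_params and build_command are hypothetical, the params dictionary stands in for a loaded .yaml file and assumes eval_input is a top-level key, and the paths are placeholders. Only flags listed in the usage above are emitted.

```python
import shlex


def check_params(params):
    """Return eval_input.data_dir, or raise if it is missing (hypothetical helper)."""
    data_dir = (params.get("eval_input") or {}).get("data_dir")
    if not data_dir:
        raise ValueError("params.eval_input.data_dir must be specified")
    return data_dir


def build_command(script, params_path, tasks, num_fewshot=0,
                  checkpoint_path=None, output_path=None, keep_data_dir=False):
    """Build the argument list for run_cstorch_eval_harness.py (hypothetical helper)."""
    cmd = [
        "python", script, "CSX",
        "--params", params_path,
        "--tasks", ",".join(tasks),        # --tasks takes one comma-separated string
        "--num_fewshot", str(num_fewshot),
    ]
    if checkpoint_path:
        cmd += ["--checkpoint_path", checkpoint_path]
    if output_path:
        cmd += ["--output_path", output_path]
    if keep_data_dir:
        cmd.append("--keep_data_dir")      # boolean switch, takes no value
    return cmd


# Placeholder values; the dict mirrors what a params .yaml file is assumed to contain.
params = {"eval_input": {"data_dir": "/mnt/eeh_data"}}
check_params(params)
cmd = build_command(
    "run_cstorch_eval_harness.py",
    "params.yaml",
    ["winogrande", "boolq"],
    checkpoint_path="model.ckpt",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run; building it as a list rather than a single string avoids shell quoting issues with the comma-separated task names.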

Example

The following example runs the multiple-choice evaluation task winogrande:

python <path_to_modelzoo>/common/pytorch/run_cstorch_eval_harness.py CSX \
  --params <path_to_params.yaml> \
  --tasks "winogrande" \
  --num_fewshot 0 \
  --checkpoint_path <path_to_checkpoint_file> \
  --python_paths <path(s)_to_export_to_PYTHONPATH_in_appliance_containers> \
  --mount_dirs <path(s)_to_mount_to_appliance_containers> \
  --logging "info"

If everything is set up properly, the output logs should resemble the following:

...
2023-11-28 11:21:15,674 INFO:   | Eval Device=CSX, Step=630, Rate=324.73 samples/sec, GlobalRate=166.46 samples/sec
2023-11-28 11:21:15,701 INFO:   | Eval Device=CSX, Step=632, Rate=325.12 samples/sec, GlobalRate=166.69 samples/sec
2023-11-28 11:21:15,727 INFO:   | Eval Device=CSX, Step=634, Rate=323.48 samples/sec, GlobalRate=166.93 samples/sec
2023-11-28 11:21:19,348 INFO:   Heartbeat thread stopped for executejob-002.
2023-11-28 11:21:19,403 INFO:   {
"results": {
    "winogrande": {
    "acc": 0.4956590370955012,
    "acc_stderr": 0.014051956064076911
    }
},
"versions": {
    "winogrande": 0
}
}
2023-11-28 11:21:19,791 INFO:
|   Task   |Version|Metric|Value |   |Stderr|
|----------|------:|------|-----:|---|-----:|
|winogrande|      0|acc   |0.4957|±  |0.0141|
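The acc_stderr value in this log can be sanity-checked from the accuracy alone. The sketch below assumes the harness reports the standard error of the mean of per-example 0/1 scores using the sample (n - 1) variance, and that the winogrande evaluation split has 1267 examples; neither is stated in this document. Under those assumptions, the result should land close to the logged value.

```python
import math


def accuracy_stderr(acc, n):
    """Standard error of the mean for n 0/1 scores, using the sample (n - 1) variance."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


acc = 0.4956590370955012  # "acc" from the log above
n = 1267                  # assumption: number of winogrande evaluation examples
stderr = accuracy_stderr(acc, n)

# Formatted the same way as the summary table row (four decimal places).
print(f"{acc:.4f} ± {stderr:.4f}")
```

A stderr of about 0.014 means the logged accuracy is within one standard error of 0.5, so this particular checkpoint is not distinguishable from chance on a two-choice task.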

To run more tasks, you may update the command to the following:

python <path_to_modelzoo>/common/pytorch/run_cstorch_eval_harness.py CSX \
  --params <path_to_params.yaml> \
  --tasks "arc_challenge,arc_easy,boolq,hellaswag,openbookqa,race,truthfulqa_mc,winogrande" \
  --num_fewshot 0 \
  --checkpoint_path <path_to_checkpoint_file> \
  --python_paths <path(s)_to_export_to_PYTHONPATH_in_appliance_containers> \
  --mount_dirs <path(s)_to_mount_to_appliance_containers> \
  --logging "info"

The output logs should resemble the following:

...
2023-11-28 12:13:11,117 INFO:   | Eval Device=CSX, Step=19902, Rate=231.31 samples/sec, GlobalRate=166.34 samples/sec
2023-11-28 12:13:13,662 INFO:   Heartbeat thread stopped for executejob-002.
2023-11-28 12:13:14,285 INFO:   {
"results": {
    "openbookqa": {
    "acc": 0.276,
    "acc_stderr": 0.02001121929807354,
    "acc_norm": 0.276,
    "acc_norm_stderr": 0.02001121929807354
    },
    "race": {
    "acc": 0.25933014354066986,
    "acc_stderr": 0.01356402513863557
    },
    "winogrande": {
    "acc": 0.4956590370955012,
    "acc_stderr": 0.014051956064076911
    },
    "arc_challenge": {
    "acc": 0.22696245733788395,
    "acc_stderr": 0.012240491536132861,
    "acc_norm": 0.22696245733788395,
    "acc_norm_stderr": 0.012240491536132861
    },
    "boolq": {
    "acc": 0.3782874617737003,
    "acc_stderr": 0.008482001133931005
    },
    "truthfulqa_mc": {
    "mc1": 1.0,
    "mc1_stderr": 0.0,
    "mc2": 0.4480984024255681,
    "mc2_stderr": 0.004443196346930688
    },
    "hellaswag": {
    "acc": 0.2504481179047998,
    "acc_stderr": 0.004323856300539177,
    "acc_norm": 0.2504481179047998,
    "acc_norm_stderr": 0.004323856300539177
    },
    "arc_easy": {
    "acc": 0.25084175084175087,
    "acc_stderr": 0.008895183010487395,
    "acc_norm": 0.25084175084175087,
    "acc_norm_stderr": 0.008895183010487395
    }
},
"versions": {
    "openbookqa": 0,
    "race": 1,
    "winogrande": 0,
    "arc_challenge": 0,
    "boolq": 1,
    "truthfulqa_mc": 1,
    "hellaswag": 0,
    "arc_easy": 0
}
}
2023-11-28 12:13:14,542 INFO:
|    Task     |Version| Metric |Value |   |Stderr|
|-------------|------:|--------|-----:|---|-----:|
|openbookqa   |      0|acc     |0.2760|±  |0.0200|
|             |       |acc_norm|0.2760|±  |0.0200|
|race         |      1|acc     |0.2593|±  |0.0136|
|winogrande   |      0|acc     |0.4957|±  |0.0141|
|arc_challenge|      0|acc     |0.2270|±  |0.0122|
|             |       |acc_norm|0.2270|±  |0.0122|
|boolq        |      1|acc     |0.3783|±  |0.0085|
|truthfulqa_mc|      1|mc1     |1.0000|±  |0.0000|
|             |       |mc2     |0.4481|±  |0.0044|
|hellaswag    |      0|acc     |0.2504|±  |0.0043|
|             |       |acc_norm|0.2504|±  |0.0043|
|arc_easy     |      0|acc     |0.2508|±  |0.0089|
|             |       |acc_norm|0.2508|±  |0.0089|

In addition to these logs, the output metrics are also written to the directory specified by the --output_path command-line argument.
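For post-processing, the metrics can be flattened into rows like those in the summary table. This is a minimal sketch that assumes the dumped metrics use the same JSON structure as the logged "results" and "versions" blocks; the file name and layout under --output_path are not specified here, so the example parses an inline string instead. The function name summarize is hypothetical.

```python
import json


def summarize(results_json):
    """Flatten harness-style results into (task, version, metric, value, stderr) rows."""
    data = json.loads(results_json)
    rows = []
    for task, metrics in data["results"].items():
        for name, value in metrics.items():
            if name.endswith("_stderr"):
                continue  # stderr entries are attached to their metric below
            stderr = metrics.get(f"{name}_stderr")
            rows.append((
                task,
                data["versions"].get(task),
                name,
                round(value, 4),
                round(stderr, 4) if stderr is not None else None,
            ))
    return rows


# Inline sample copied from the winogrande log output above.
sample = """
{
  "results": {
    "winogrande": {"acc": 0.4956590370955012, "acc_stderr": 0.014051956064076911}
  },
  "versions": {"winogrande": 0}
}
"""

rows = summarize(sample)
for task, version, metric, value, stderr in rows:
    print(f"{task} v{version} {metric}={value} ± {stderr}")
```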

Supported Models

Evaluation Harness on CS-2 is supported for several Model Zoo models including GPT2, GPT3, BTLM, BLOOM, LaMDA, LLaMA, Mistral, MPT, OPT, StarCoder, and SantaCoder. Use the params .yaml file to specify the desired model architecture.

Implementation notes

Currently, only non-generative Evaluation Harness tasks are supported.

Release Notes

Release 2.1.0 supports EEH version v0.3.0. Support for the latest version and for autoregressive (generative) tasks will be added in subsequent releases.