Running jobs in parallel with the same model_dir causes issues

Observed error

Running several jobs in parallel with the same model_dir may cause issues.

Explanation

When multiple parallel runs share a model_dir, each run attempts to write its file-backed tensors to the same directory. This can cause a race condition, in which two operations act on the same object at the same time.

Workaround

  • Use different model directories, or

  • Add a unique artifact sub-directory for each run by generating a UUID.
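
The second option can be sketched in Python using only the standard library. This is a minimal illustration, not the tool's own mechanism: the helper name `unique_artifact_dir` and the `artifacts_` prefix are hypothetical, and how the resulting path is passed to each job depends on your setup.

```python
import uuid
from pathlib import Path


def unique_artifact_dir(model_dir: str) -> Path:
    """Create and return a uniquely named sub-directory of model_dir.

    Hypothetical helper: the name and the "artifacts_" prefix are
    illustrative choices, not part of any existing API.
    """
    # uuid4() produces a random identifier, so parallel runs get distinct names.
    run_dir = Path(model_dir) / f"artifacts_{uuid.uuid4().hex}"
    # exist_ok=False makes an (extremely unlikely) name collision fail loudly
    # instead of silently sharing a directory with another run.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir
```

Each parallel job would call the helper once at startup and write its artifacts under the returned path, so no two runs write to the same directory.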