Running on alihlt #45

@saganatt

(Just keeping notes for myself. Maybe someone else will find these useful as well.)

Running the script on alihlt

Version with virtualenv

Prerequisites, to be installed by an admin:

  • python3-virtualenv
  • graphviz
  • ROOT prerequisites

Running

  1. Install ROOT 6. Add to ~/.bashrc:

       export PATH=/opt/rocm/bin:$PATH
       export ALIBUILD_WORK_DIR="$HOME/alice/sw"
       eval "`alienv shell-helper`"

     Reload the shell.
  2. Add PYTHONPATH=/home/${LOGNAME}/.virtualenvs/tpcwithdnn/lib/python3.6/site-packages/:$PYTHONPATH to load.sh:89 and comment out the LD_LIBRARY_PATH line.
  3. Copy the input data from aliceml and change the paths in database*.yml (/home/mkabus/data/...).
  4. Enter the ROOT environment, source load.sh, and swap the TensorFlow package for the ROCm build:

       alienv enter ROOT/latest
       source load.sh
       pip uninstall tf-nightly-gpu
       pip install tensorflow-rocm

  5. In utilities_dnn.py:58 replace pool_type with 1. This forces AveragePooling3D; MaxPooling3D causes the error "3D pooling doesn't support workspace index mask mode".
  6. Change run_parallel to true in database*.yml.
  7. In dnn_optimiser.py:58 set the devices explicitly. For 6 devices: self.strategy = MirroredStrategy(devices=["/gpu:0", "/gpu:1", "/gpu:2", "/gpu:3", "/gpu:4", "/gpu:5"])
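
The explicit device list for MirroredStrategy does not have to be typed by hand; it can be generated from the number of GPUs. A minimal sketch: gpu_device_names is a hypothetical helper, not something from the repository, and the count of 6 is only the example used in these notes.

```python
def gpu_device_names(num_gpus):
    """Return device strings "/gpu:0" ... "/gpu:<num_gpus - 1>".

    Hypothetical helper for illustration; not part of the project code.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [f"/gpu:{i}" for i in range(num_gpus)]


if __name__ == "__main__":
    devices = gpu_device_names(6)
    print(devices)
    # With TensorFlow installed, the list can be passed on directly:
    # import tensorflow as tf
    # strategy = tf.distribute.MirroredStrategy(devices=devices)
```

This keeps the number of GPUs in one place, so moving to a machine with a different GPU count only means changing the argument.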

Debugging

Comments to a tensorflow issue
ROCM guide on HIP debugging
ROCM guide on system-level debugging

Profiling

ROCM guide
