Output for adjacency matrices, node features #2

@anastasiiakim

Hi, I fixed some import-related bugs and the code ran afterwards. On a single GPU it produced results and images of the generated graphs.

  1. Does it save the adjacency matrices and node features of the generated graphs to a file anywhere? I only see images, stats, checkpoints, and an empty 'chains' folder.

  2. There also seems to be an error when using multiple GPUs. Could you look into this? The command and the resulting log are:

python src/main.py +experiment=comm20.yaml general.gpus=[0,1]

[rank0]:[E423 16:29:23.556699680 ProcessGroupNCCL.cpp:632] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800067 milliseconds before timing out.
[rank0]:[E423 16:29:23.557495842 ProcessGroupNCCL.cpp:2268] [PG ID 0 PG GUID 0(default_pg) Rank 0] failure detected by watchdog at work sequence id: 1347 PG status: last enqueued work: 1347, last completed work: 1346
[rank0]:[E423 16:29:23.557515023 ProcessGroupNCCL.cpp:670] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank0]:[E423 16:29:23.557554875 ProcessGroupNCCL.cpp:2103] [PG ID 0 PG GUID 0(default_pg) Rank 0] First PG on this rank to signal dumping.
[rank0]:[E423 16:29:24.358637843 ProcessGroupNCCL.cpp:1743] [PG ID 0 PG GUID 0(default_pg) Rank 0] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 1347, last completed NCCL work: 1346.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
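
The log itself suggests enabling the FlightRecorder through TORCH_NCCL_TRACE_BUFFER_SIZE. One way to rerun the failing command with more diagnostics is sketched below. This is a hypothetical helper, not part of the repository: the function names and the buffer size of 2000 are assumptions chosen for illustration. NCCL_DEBUG and TORCH_DISTRIBUTED_DEBUG are standard NCCL and PyTorch environment variables, and the command line is the one from this report.

```python
import os
import subprocess


def nccl_debug_env(trace_buffer_size=2000):
    """Return environment overrides that make a hung collective easier to trace.

    The buffer size is an arbitrary non-zero value (assumption); the log only
    says it must be non-zero to enable the FlightRecorder.
    """
    return {
        "TORCH_NCCL_TRACE_BUFFER_SIZE": str(trace_buffer_size),
        # Verbose logging from the NCCL library itself.
        "NCCL_DEBUG": "INFO",
        # Extra consistency checks on collectives in torch.distributed.
        "TORCH_DISTRIBUTED_DEBUG": "DETAIL",
    }


def launch_command(gpus):
    """Build the command from the report for the given GPU indices."""
    gpu_list = ",".join(str(g) for g in gpus)
    return [
        "python",
        "src/main.py",
        "+experiment=comm20.yaml",
        f"general.gpus=[{gpu_list}]",
    ]


def run_with_nccl_debug(gpus):
    """Run the training command with the debug variables added to the environment."""
    env = {**os.environ, **nccl_debug_env()}
    return subprocess.run(launch_command(gpus), env=env, check=False)
```

Calling `run_with_nccl_debug([0, 1])` from the repository root would repeat the two-GPU run. If the timeout comes from ranks issuing collectives in a different order or a different number of times, which the watchdog message lists as a likely cause, the extra output should show which rank diverges.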

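On the first question, the kind of output being asked for could look like the sketch below. It is only an illustration under stated assumptions: it assumes each generated graph is available as a pair of arrays (node features, adjacency matrix), and the function names and the `x_i` / `adj_i` key scheme are made up here, not taken from the repository. PyTorch tensors would first need converting, for example with `.cpu().numpy()`.

```python
import numpy as np


def save_generated_graphs(graphs, path):
    """Write (node_features, adjacency) pairs to one compressed .npz archive.

    `graphs` is assumed to be a list of (X, A) array-likes; graphs may have
    different sizes because each array is stored under its own key.
    """
    arrays = {}
    for i, (x, adj) in enumerate(graphs):
        arrays[f"x_{i}"] = np.asarray(x)
        arrays[f"adj_{i}"] = np.asarray(adj)
    np.savez_compressed(path, **arrays)
    return len(graphs)


def load_generated_graphs(path):
    """Read the archive back into a list of (node_features, adjacency) pairs."""
    with np.load(path) as data:
        n_graphs = len(data.files) // 2  # two arrays are stored per graph
        return [(data[f"x_{i}"], data[f"adj_{i}"]) for i in range(n_graphs)]
```

Storing one key per array avoids padding graphs to a common size, at the cost of needing the matching loader to pair features with adjacency matrices again.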