From 7d48a4217ae1c4ac01325340560e45763a164464 Mon Sep 17 00:00:00 2001
From: Nicholas Sielicki
Date: Thu, 5 Dec 2024 13:31:55 -0800
Subject: [PATCH] Reapply "defaults: make dmabuf opt-in"

This reverts commit 224593f119a62f614106eac6718e2e1769f21b29.

Our shared development cluster seems to have issues with dmabuf when
running NCCL tests, for a handful of niche situations, ie: two nodes,
with MPI_Comm_split equal to the number of GPUs, at 16GB+. Other
environments seem not to have issues with the same workload, but out of
an abundance of caution and due to a lack of root cause, this is being
reverted again.

Signed-off-by: Nicholas Sielicki
---
 include/nccl_ofi_param.h | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/nccl_ofi_param.h b/include/nccl_ofi_param.h
index fe02afdf8..4464c6bad 100644
--- a/include/nccl_ofi_param.h
+++ b/include/nccl_ofi_param.h
@@ -272,14 +272,15 @@ OFI_NCCL_PARAM_INT(disable_gdr_required_check, "DISABLE_GDR_REQUIRED_CHECK", 0);
  * Unfortunately, the plugin needs to signal DMABUF support or lack thereof back
  * to NCCL prior to having an opportuntiy to make any any memory registrations.
  * This ultimately means that the plugin will opimistically assume DMA-BUF is
- * viable on all FI_HMEM providers beyond libfabric 1.20.
+ * viable on all FI_HMEM providers beyond libfabric 1.20, if not for this param.
  *
  * If dmabuf registrations fail, (ie: if ibv_reg_dmabuf_mr cannot be resolved),
  * the plugin has no freedom to renegotiate DMABUF support with NCCL, and so it
- * is fatal. Under those conditions, users should set this environment variable
- * to force NCCL to avoid providing dmabuf file desciptors.
+ * is fatal. Under those conditions, users should ensure that they have set this
+ * environment variable to '1' to force NCCL to avoid providing dmabuf file
+ * desciptors. This is the default, pending perf investigations.
  */
-OFI_NCCL_PARAM_INT(disable_dmabuf, "DISABLE_DMABUF", 0);
+OFI_NCCL_PARAM_INT(disable_dmabuf, "DISABLE_DMABUF", 1);
 
 /*
  * Messages sized larger than this threshold will be striped across multiple rails
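
The opt-in behaviour described in the patched comment can be sketched roughly as follows. This is a minimal illustration and not the plugin's actual macro expansion. The helper `param_int_from_env` is hypothetical, and the full variable name `OFI_NCCL_DISABLE_DMABUF` assumes the macro prepends an `OFI_NCCL_` prefix to the `"DISABLE_DMABUF"` string shown in the diff.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the plugin): read an integer from the
 * environment, falling back to default_value when the variable is unset,
 * empty, or not a valid integer.
 */
static long param_int_from_env(const char *name, long default_value)
{
    const char *raw = getenv(name);
    if (raw == NULL || *raw == '\0')
        return default_value;

    char *end = NULL;
    errno = 0;
    long value = strtol(raw, &end, 0);
    if (errno != 0 || end == raw || *end != '\0')
        return default_value;

    return value;
}

/*
 * With a default of 1, dmabuf stays disabled unless the user explicitly
 * sets the variable to 0. The variable name is an assumption (see above).
 */
static int dmabuf_disabled(void)
{
    return param_int_from_env("OFI_NCCL_DISABLE_DMABUF", 1) != 0;
}
```

Under these assumptions, leaving the variable unset keeps dmabuf off, and exporting it as `0` before launching the job opts back in.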