Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add NVTX tracing support #377

Merged
merged 5 commits into from
Apr 17, 2024
Merged

Add NVTX tracing support #377

merged 5 commits into from
Apr 17, 2024

Conversation

rauteric
Copy link
Contributor

@rauteric rauteric commented Apr 4, 2024

Add support for NVTX tracepoints (RDMA protocol only). Enabled at compile-time with --with-nvtx.

Re-vamped version of #363 from Nick.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@rauteric rauteric requested review from bwbarrett and a team as code owners April 4, 2024 22:07
@rauteric
Copy link
Contributor Author

rauteric commented Apr 4, 2024

Oops, this clearly isn't building when NVTX support isn't enabled, need to fix that.

@rauteric rauteric marked this pull request as draft April 4, 2024 22:10
@rauteric rauteric force-pushed the nvtx-poc-simplified branch from e341109 to 690851e Compare April 4, 2024 23:05
@rauteric
Copy link
Contributor Author

rauteric commented Apr 4, 2024

Build should be working now.

Remaining TODOs include:

  1. Address comments from feat(tracing): add nvtx provider #363
  2. Improve documentation of the PER_COMM and PER_DEV options
  3. Note in commit msg that only RDMA protocol is supported

@rauteric rauteric force-pushed the nvtx-poc-simplified branch from 690851e to 32cd720 Compare April 4, 2024 23:07
@rauteric
Copy link
Contributor Author

rauteric commented Apr 4, 2024

(Force push to 32cd720 is a rebase on master)

@rauteric rauteric force-pushed the nvtx-poc-simplified branch from 32cd720 to 2473dda Compare April 4, 2024 23:32
@rauteric
Copy link
Contributor Author

rauteric commented Apr 4, 2024

Resolved some comments from the previous PR.

@rauteric rauteric marked this pull request as ready for review April 4, 2024 23:37
@rauteric rauteric force-pushed the nvtx-poc-simplified branch from 2473dda to 5869612 Compare April 4, 2024 23:40
include/nccl_ofi_rdma.h Show resolved Hide resolved
include/tracing_impl/nvtx.h Show resolved Hide resolved
include/nccl_ofi_tracepoint.h Outdated Show resolved Hide resolved
include/tracing_impl/nvtx.h Show resolved Hide resolved
@rauteric rauteric force-pushed the nvtx-poc-simplified branch from 5869612 to 2338346 Compare April 8, 2024 19:23
@rauteric
Copy link
Contributor Author

rauteric commented Apr 8, 2024

Updated documentation on NCCL_OFI_NVTX_* options and moved them into nvtx.h.

Copy link
Contributor

@aws-nslick aws-nslick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move build-time configuration into config.h, not hardcoded defines which must be manually modified at the top of nvtx.h

@rauteric rauteric force-pushed the nvtx-poc-simplified branch from 2338346 to 8c32c34 Compare April 9, 2024 22:41
@rauteric rauteric requested a review from aws-nslick April 9, 2024 22:42
@rauteric
Copy link
Contributor Author

rauteric commented Apr 9, 2024

Please move build-time configuration into config.h, not hardcoded defines which must be manually modified at the top of nvtx.h

Did this

src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
@aws-nslick
Copy link
Contributor

There are a few guards that I would like to be refined but feel this can land. Surely we're going to need to iterate on this again.

@rauteric rauteric force-pushed the nvtx-poc-simplified branch from 8c32c34 to 6d326fc Compare April 10, 2024 21:44
@rauteric rauteric requested a review from aws-nslick April 10, 2024 21:46
@rauteric rauteric force-pushed the nvtx-poc-simplified branch from 6d326fc to 04d5d66 Compare April 11, 2024 18:24
@rauteric
Copy link
Contributor Author

Force-push to 04d5d66 is a rebase on master.

m4/check_pkg_nvtx.m4 Show resolved Hide resolved
m4/check_pkg_nvtx.m4 Outdated Show resolved Hide resolved
include/tracing_impl/nvtx.h Show resolved Hide resolved
@rauteric rauteric force-pushed the nvtx-poc-simplified branch from 04d5d66 to e61639d Compare April 11, 2024 23:23
@rauteric
Copy link
Contributor Author

Fixed a bug where RDMA segments and eager used same trace, but the latter was never completed. Split eager into a separate trace type for NVTX.

@rauteric rauteric force-pushed the nvtx-poc-simplified branch from e61639d to a9ab67c Compare April 11, 2024 23:44
@rauteric
Copy link
Contributor Author

  • Made domains_per_comm a define
  • Check that only one of per_comm/per_dev is defined

@rauteric rauteric force-pushed the nvtx-poc-simplified branch from a9ab67c to 055250e Compare April 11, 2024 23:50
@rauteric
Copy link
Contributor Author

Force-push to 055250e: fix broken non-NVTX build

@rauteric rauteric dismissed aws-nslick’s stale review April 11, 2024 23:57

Resolved comments

aws-nslick and others added 5 commits April 15, 2024 14:25
This doesn't belong here and is likely a copy/paste mistake. It's
unclear why this didn't break anything: this could potentially cause
very bizarre header inclusions under neuron when tracing is enabled.
Resolve nvToolsExt and hook into configuration. Implementation follows
in future commits.

Signed-off-by: Eric Raut <[email protected]>
Separate each tracepoint/profiling provider (currently, just the one)
into its own header.

Signed-off-by: Eric Raut <[email protected]>
For upcoming NVTX support, it will make sense to have separate tracing
functions for SENDRECV and RDMA protocols. We might consider placing
them in separate files in the future.

Signed-off-by: Eric Raut <[email protected]>
Hook nvtx on existing lttng macros.

We figured out how to structure this in a way that
aligns the required usages of nvtx with cases
like NCCL_OFI_TRACE_SEND_WRITE_SEG COMPLETE/START. We use the NVTX
start/end API for ranges, and mark API for events.

Only supports RDMA protocol for now, SENDRECV protocol NVTX support will
be added in the future.

Signed-off-by: Eric Raut <[email protected]>
@rauteric rauteric force-pushed the nvtx-poc-simplified branch from 055250e to ecc2fdb Compare April 15, 2024 21:26
@rauteric
Copy link
Contributor Author

Rebase on master

@rauteric rauteric merged commit a3aea9e into aws:master Apr 17, 2024
13 checks passed
@rauteric rauteric deleted the nvtx-poc-simplified branch April 17, 2024 03:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants