Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prepare for hybrid full chain with wc/ls stage separate from follow-on pure-WCT stage #131

Merged
merged 23 commits into from
Nov 11, 2021

Conversation

brettviren
Copy link
Member

This starts by adding depo tar streams to gain better compression than what cnpy provides (none).

It also includes one small tweak to DepoBagger which will bag all depos regardless of their time if the time "gate" is [0,0]. For existing configs, this will not change behavior.

Draft for now as I work out any more issues along the way to a full hybrid job.

…torch.

This is probably not the last on libtorch suport
Needed to directly load into the GPU.

This is with libtorch 1.9.
…orch script node

TorchService should use less memory and be thread safe but needs more testing
…on HDF5 to remove illegal cross dependency on hio
… the anode plane

Note, this is a broken concept in general and will only work accidentally with some detectors.
There is a stand-alone PlaneSelector to do this and DNNROIFinding
changes from cbeg/cend to using the same pattern internally.

This is based on a channel selection from OmnibusSigProc that is moved
into a new Aux/PlaneTools.h.  But, OmnibusSigProc still is left as-is.
It should be factored to use this moved code.
@brettviren
Copy link
Member Author

This is ready for others to look at. In particular, @HaiwangYu

Here is one easy way to exercise things:

wire-cell -l stdout -L debug -c test/test-pdsp-sim-sp-dnnroi.jsonnet
wirecell-plot ntier-frames -o foo.pdf test-pdsp-ssd-orig0.tar.bz2 test-pdsp-ssd-wiener0-gauss0.tar.bz2 test-pdsp-ssd-dnnsp0.tar.bz2

The Jsonnet is in cfg/test/ and the job will need the torch script model file unet-l23-cosmic500-e50.ts in current directory. (Hmm, something for later: should make .ts files loadable via WIRECELL_PATH)

The wirecell-plot comes from the most recent wire-cell-python.

Resulting PDF should show ADC, gauss and dnnroi frames with the simple cross-detector line source.

I have some config in my frappe and toyzero packages that will read in depos dumped from larsoft, but I think this simple test gives the main gist. Here is the pgraph:

pdspssd

Note, this runs torch on the CPU with concurrency 1. To try gpu or gpucpu (try GPU, fail over to CPU) or with higher concurrency, for now, just hack these values for the TorchService config object in the main Jsonet file.

@brettviren brettviren marked this pull request as ready for review October 28, 2021 20:06
@brettviren brettviren requested a review from HaiwangYu October 28, 2021 20:07
@HaiwangYu
Copy link
Member

Hi @brettviren, sorry for the delay.

One thing I noticed first is that the gpvm UPS version of libtorch doesn't have cuda support. So maybe we need to add #ifdef HAVE_CUDA_INC around here and here?

Otherwise, the compiling would fail with the gpvm libtorch

In file included from /cvmfs/larsoft.opensciencegrid.org/products/libtorch/v1_6_0b/Linux64bit+3.10-2.17-e20/include/c10/cuda/CUDACachingAllocator.h:4,
                 from ../pytorch/src/TorchService.cxx:99:
/cvmfs/larsoft.opensciencegrid.org/products/libtorch/v1_6_0b/Linux64bit+3.10-2.17-e20/include/c10/cuda/CUDAStream.h:6:10: fatal error: cuda_runtime_api.h: No such file or directory
    6 | #include <cuda_runtime_api.h>
      |          ^~~~~~~~~~~~~~~~~~~~

I am still working on other testings and evaluations of this PR.

@brettviren
Copy link
Member Author

Oh, good catch. Actually, based on our discussion, I meant to remove that cache purge which will also remove this explicit dependency. Since I've been developing on top of this branch for the upcoming IDFT component and we won't make a release immediately, I'll commit a fix there. Unless not having it here is a blocker.

@HaiwangYu
Copy link
Member

Since I've been developing on top of this branch for the upcoming IDFT component and we won't make a release immediately, I'll commit a fix there.

For sure, that works too.

Copy link
Member

@HaiwangYu HaiwangYu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @brettviren, I marked several small things. Most are just for myself.
I think we can merge it and fix them later.

Did some tests here:
https://www.phy.bnl.gov/~yuhw/wct-analysis/ml-sp/hybrid-wcls-test


local app_plugins = {
'TbbFlow': ["WireCellTbb"],
'PGrapher': ["WireCellPgraph"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be Pgrapher

@@ -5,17 +5,89 @@
#ifndef WIRECELLPYTORCH_TSMODEL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for this inconsistent name of the include guard. Should change it next time.


## when building with __DEBUG__ defined in DNNROIFinding you need to
## temporarily add HDF5 as a dependency.
# bld.smplpkg('WireCellPytorch', use='WireCellAux LIBTORCH HDF5')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need both HDF5 and H5CPP

{
std::lock_guard<std::mutex> guard(Hio::g_h5cpp_mutex);
h5::fd_t fd = h5::open(m_cfg["evalfile"].asString(), H5F_ACC_RDWR);
h5::fd_t fd = h5::open(m_cfg.debugevalfile, H5F_ACC_RDWR);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be debugfile

@@ -283,35 +287,25 @@ bool Pytorch::DNNROIFinding::operator()(const IFrame::pointer& inframe, IFrame::
aname = String::format("/%d/frame_%s%d", m_save_count, "dlcharge", m_anode->ident());
h5::write<float>(fd, aname, sp_charge.data(), h5::count{ncols, nrows}, h5::chunk{ncols, nrows} | h5::gzip{2});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m_anode was removed. need to change to Factory::find_tn<IAnodePlane>(m_cfg.anode)

@HaiwangYu HaiwangYu merged commit a966fcb into master Nov 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants