Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: initial stab at TorchScript fallback #7401

Merged
merged 4 commits into from
Mar 9, 2022
Merged

RFC: initial stab at TorchScript fallback #7401

merged 4 commits into from
Mar 9, 2022

Conversation

t-vi
Copy link
Contributor

@t-vi t-vi commented Feb 3, 2021

This patch adds a support for calling TorchScript. This can be used fallback for when torch operators are not yet implemented or if one wants to incorporate bespoke PyTorch custom ops into TVM with ease.

It adds

  • a new relay torchop that takes a variable number of inputs and executes a provided TorchScript (aka PyTorch JIT) function,
  • a backend module class that calls into LibTorch (aka the C++ bindings underlying PyTorch), executing a TorchScript function.

The key addition to the relay infrastructure is that while leaving num_inputs == -1 on operator registration is documented to indicate a variable number of inputs, the type inference pass is not prepared to deal with it and instead requires the number of arguments provided to match the number of arguments with add_argument on operator registration. The proposed change to the type inference is then to match the declared arguments but to allow additional arguments if num_inputs is -1.

The other detail is that it uses a string attribute in the call node's attributes to take the serialized TorchScript. This is a bit fishy as the serialized representation is binary. I used serialization to get TorchScript into TVM at the C++ level as it is tricky to interoperate between PyBind-wrapped objects (TorchScript in PyTorch) and the TVM FFI, but we might pass things around as handles later (but I'm not sure if that works well with attributes). I would be glad to have your advice on this.

Currently I only support one output tensor and FP32, but that is straightforward to make flexible, and I would do this in parallel to the more fundamental discussion e.g. around the things above.

Even though this is in draft state, I'm opening this PR to make discussions concrete in terms of code.

I also posted a discussion to the forum.

@masahi
Copy link
Member

masahi commented Feb 3, 2021

This is an interesting use case of byoc, cc @zhiics @comaniac

@masahi
Copy link
Member

masahi commented Feb 3, 2021

I'm curious how it integrates with PyTorch frontend. Do we convert every op not supported to relay.torchop, run BYOC flow to get TorchScript subgraphs, and send them to libtorch? Sounds interesting!

@t-vi
Copy link
Contributor Author

t-vi commented Feb 4, 2021

I'm curious how it integrates with PyTorch frontend. Do we convert every op not supported to relay.torchop, run BYOC flow to get TorchScript subgraphs, and send them to libtorch? Sounds interesting!

This is how I'd like it to work out. I've been thinking what the best "level" is and while the operator level might seem attractive, but I I'm not sure there is an easy way to run individual operators.
It won't help with out favourite inplace problems, though.

@comaniac
Copy link
Contributor

comaniac commented Feb 4, 2021

I have the same question as masahi. IIUC, after this PR, the PyTorch frontend has the capablility to convert all unsupported ops to torchop so that we can guarantee the flow would work. This is an interesting idea and this would be the first BYOC use case that could potentially incopreate two BYOC backends (Torch and others like TensorRT).

cc @zhiics

@t-vi
Copy link
Contributor Author

t-vi commented Feb 5, 2021

Yeah, the general idea is to use this as the fallback. I can add the fallback generation here in the PR if that is better.
Also I added a bit of a pro-con discussion regarding single op vs. program on the forum thread, if you have opinions, I'd be very grateful if you could chime in.

@PhilippvK
Copy link
Contributor

Hey @t-vi, the idea of a fallback for unsupported TorchScript Ops is great. I am currently pursuing a similar approach for unsupported (and custom) TFLite Ops.

I also stumbled over the issue that num_inpust == -1 leads to problems in the type_infer step and "solved" it in a quite bad way by just packing all my inputs in a single Tuple and using num_inputs = 1. I would really appreciate getting at least your fix to solve this issue merged into upstream. Maybe in a separate PR at this is not really related to the TorchScript use case.

@t-vi
Copy link
Contributor Author

t-vi commented May 31, 2021

I would really appreciate getting at least your fix to solve this issue merged into upstream. Maybe in a separate PR at this is not really related to the TorchScript use case.

I'm all for it, but I wouldn't know how to add tests in lieu of something using it. If you or @masahi has any opinion on this, I'd be glad to split it out...

@t-vi
Copy link
Contributor Author

t-vi commented Nov 18, 2021

Just a quick note that when I tried to revive this back in the summer it got a bit stalled around pytorch/pytorch#65047 .

@masahi
Copy link
Member

masahi commented Jan 9, 2022

@t-vi Is this still WIP? I'm happy to merge whenever you think it is ready.

@t-vi
Copy link
Contributor Author

t-vi commented Jan 10, 2022

@masahi So I had hoped to get the dlpack header version in PyTorch bumped (see the linked bug) but Facebook has internal uses that let it insist on the old one.
I wonder if we could work around it by providing a "dlpack-compat" header that defines the names for the types / fields? Or I could try to somehow separate the two.

@masahi
Copy link
Member

masahi commented Jan 10, 2022

I wonder if we could work around it by providing a "dlpack-compat" header

Does this mean the same thing as pytorch/pytorch#65047 (comment)? (which sounds good to me). Anyway it seems we cannot count on the PyTorch-side to change, so I'd say anything that can unblock you from our side should be a good plan. cc @tqchen @junrushao1994

@masahi
Copy link
Member

masahi commented Jan 10, 2022

As commented by the author of PyTorchTVM, apache/tvm-rfcs#25 (comment), many people are interested in this feature. Also people are actively talking about deeper integration of TVM into PyTorch workflow. So we should definitely land this.

As for me personally, I want to try out this feature to import models having fft-related ops (specifically, this cool model https://github.com/raoyongming/GFNet). FFT ops won't come to TVM any time soon.

@t-vi
Copy link
Contributor Author

t-vi commented Jan 10, 2022

So I thought, I could wait it out, but I'll look into working around the version discrepancy in the next few weeks.

@masahi
Copy link
Member

masahi commented Jan 11, 2022

Another interesting use case this fallback could enable is mmdetection https://github.com/open-mmlab/mmdetection. It has a lot of cool detection models but rely on many custom ops that cannot convert to relay.

@t-vi
Copy link
Contributor Author

t-vi commented Jan 21, 2022

M. Ruberry of the PyTorch team re-landed the update of the dlpack.h in PyTorch. If this still holds next week, it'll be exciting to bring this up to date. :)

@t-vi t-vi requested a review from icemelon as a code owner March 8, 2022 06:46
@t-vi
Copy link
Contributor Author

t-vi commented Mar 8, 2022

Hi, so I rebased this finally and it all compiles and runs one test against a current PyTorch master, so I think I'm back in business with this PR (unless it has been obsoleted, but from what I understand, the bridge is in the other direction).

@masahi
Copy link
Member

masahi commented Mar 8, 2022

Great! Can you remove WIP from the title now?

@t-vi
Copy link
Contributor Author

t-vi commented Mar 8, 2022

Are you on the tvm discord or so to quickly discuss?

@t-vi t-vi changed the title WIP/RFC: initial stab at TorchScript fallback RFC: initial stab at TorchScript fallback Mar 8, 2022
@masahi
Copy link
Member

masahi commented Mar 8, 2022

yeah I've just logged in.

@masahi masahi merged commit fe33ed6 into apache:main Mar 9, 2022
@t-vi
Copy link
Contributor Author

t-vi commented Mar 9, 2022

Thank you @masahi !

pfk-beta pushed a commit to pfk-beta/tvm that referenced this pull request Apr 11, 2022
* initial stab at TorchScript fallback

* make types more flexible

* Easy review bits. Thank you @masahi
@sunggg
Copy link
Contributor

sunggg commented Sep 29, 2022

Hi, @t-vi. Thank you for the great contribution! This seems very interesting.
I'm trying to check out this feature but having some ABI link issue during setup - ld.lld: error: undefined symbol.
This seems like a known issue (link), so I followed their suggestion of adding -D_GLIBCXX_USE_CXX11_ABI=0".
However, this change makes TVM llvm codegen fails with ABI link.
Have you faced this issue before?
I'm wondering what will be the general guideline to fix such ABI conflicts.
Thank you!

For more detailed info, I installed pytorch package with conda and downloaded prebuilt binary by following the instruction.
Then, I modified my config.make to set(USE_LIBTORCH ~/libtorch).

@t-vi
Copy link
Contributor Author

t-vi commented Sep 29, 2022

The fundamental problem is that (pre-compiled) PyTorch python modules use the pre C++-11 string ABI to better blend into the Python ecosystem or so. TVM does not, so it needs to link to LibTorch with the "new" C++ string ABI.
But these two versions clash.

One option is to use self-compiled PyTorch (which is what I do and so got us into this mess...).

I've been trying on and off to look into linking LibTorch such that all symbols are hidden (and you can load PyTorch with whatever C++ ABI later), but I have not managed yet.

@sunggg
Copy link
Contributor

sunggg commented Sep 29, 2022

@t-vi , thank you for the quick reply!
Okay, I wanted to avoid compiling PyTorch by myself but seems like there is no other option now haha
Let me try! Appreciate it :)

@shingjan
Copy link

@sunggg This pr (#12232) used a different way to solve the version mismatch of c++ ABI between TVM and pytorch, maybe worth looking into.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants