[export] Capture tensor.to() under export. #123732
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/123732
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 24014a8 with merge base 674e15a.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D55969674
Force-pushed from 3cb9c77 to 794ba77.
Summary: We used to skip tensor.to() during tracing when the device is the same. This brings some performance improvement in eager mode, but it makes graph capture lose the semantics of the original model. In this diff, we add an additional condition that skips the fast path when the tensor has no actual data, which is the case when we're using FakeTensor / FunctionalTensor to trace the model. This has no perf impact on existing eager models while making sure we can capture the _to_copy() node in the graph.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r device_to

Differential Revision: D55969674
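A minimal sketch of the behavior this diff targets (the module, input shape, and names here are hypothetical; the exact node name in the printed graph may vary by version):

```python
import torch


class M(torch.nn.Module):
    def forward(self, x):
        # Same-device .to(): eager takes the aliasing fast path, but under
        # export the device conversion should remain visible in the graph.
        return x.to("cpu") + 1


example_inputs = (torch.randn(3),)
ep = torch.export.export(M(), example_inputs)
# After this change, the exported graph is expected to contain a
# _to_copy node for the .to() call instead of dropping it entirely.
print(ep.graph_module.code)
```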
ooh very cool
test failures look fine to fix. will update this later
```diff
@@ -388,7 +388,7 @@ static inline Tensor to_impl(
     c10::optional<c10::MemoryFormat> optional_memory_format) {

   // fast path
-  if (to_will_alias(self, dtype, layout, device, copy, optional_memory_format)) {
+  if (to_will_alias(self, dtype, layout, device, copy, optional_memory_format) && self.const_data_ptr()) {
```
Add a comment for const_data_ptr(); it's not obvious that this check is for FakeTensor/FunctionalTensor.
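For illustration, one possible version of the requested comment, sketched over the fast path from the diff above (the `return self;` body and surrounding `to_impl()` context are assumed, not taken from this PR):

```cpp
// fast path: .to() may alias `self` only when the tensor owns real storage.
// FakeTensor / FunctionalTensor used while tracing (export / compile) carry
// no data, so const_data_ptr() is null there and we fall through to
// _to_copy(), which keeps the device-conversion node in the captured graph.
if (to_will_alias(self, dtype, layout, device, copy, optional_memory_format) &&
    self.const_data_ptr()) {
  return self;
}
```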
Also, this is a proxy check for being under FakeMode?
Maybe we can turn this into an explicit check for FakeMode?
This changes the semantics of user code to make Tensor.to always copy under torch.export (and also torch.compile, it looks like?). I don't think that's what we want. A test case is the following (also try replacing torch.compile with torch.export):
```python
import torch

@torch.compile
def f(x):
    y = x.clone()
    z = y.to("cpu")  # on CPU, eager .to() aliases y rather than copying
    y.sin_()         # so this in-place op should also be visible through z
    return z

x = torch.randn(3)
out = f(x)
assert torch.allclose(out, x.sin())
```
@zou3519 I understand that in this particular case we want to preserve the old behavior, so I don't think we necessarily want to always dispatch to _to_copy. Instead, we just want to preserve the device conversion semantics. I saw there's actually an aten::to which does this: https://fburl.com/code/8ax41xzx. Could we dispatch to these ops instead?
Is your eventual goal that you'd like to export a graph that is device agnostic, and Tensor.to is one of the things preventing this from working?
Yes. I also want to provide a little more context on the device issue: the export team concluded in the design meeting that we will bake in the device from time to time due to how PT2 works, BUT for a case like tensor.to() which is authored by the user, we want to preserve it in the captured graph. That means at least one op should be captured by torch.export() in this case; otherwise it would be a big surprise to people who rely on the graph. For device bake-ins caused by other reasons, e.g. op decompositions, we don't care that much right now; the biggest thing we want to fix is tensor.to(device) from user code.
Our constraint is that we need to preserve the semantics of the user program. This is made complicated by how PT2 IR is functional -- a call to something like aten::to isn't functional and therefore needs to be decomposed. One middle ground is, when …
@zou3519 If we could do the middle-ground approach you proposed, that would be helpful as well.
nvm I think I understand it now, sorry.
@bdhirsh, @zhxchen17, and I talked offline. In general, it seems like wrong behaviour to support mutation on the result of tensor.to, because depending on whether you used a CUDA tensor or a CPU tensor, the behaviour of the exported program will be different. Consider the following example:
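(A minimal sketch of the kind of example being described; the function name and the choice of in-place op are hypothetical, and any mutation of the .to() result shows the same divergence.)

```python
import torch

def f(x):
    y = x.to("cpu")  # aliases x when x is already a CPU tensor; copies when x is on CUDA
    y.add_(1)        # the in-place add reaches the original input only in the aliasing (CPU) case
    return y

# f(torch.randn(3)) mutates its argument; f(torch.randn(3, device="cuda")) does not.
```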
In the above code, we will mutate the input if x is a CPU tensor, while we won't if it is a CUDA tensor. As a result, I think it makes sense to always decompose aten.to to aten._to_copy and ban mutation on the output of aten.to.
Banning the mutation seems good as long as we can raise an error on it. The tricky thing is being sure that we will actually raise the exception if a mutation happens.
Force-pushed from 794ba77 to 0c2239d.
@pytorchbot rebase
Force-pushed from 0c2239d to 1b37acc.
Force-pushed from e09d7c5 to cdee395.
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Force-pushed from cdee395 to cb01f9f.
Force-pushed from cb01f9f to 23a3d32.
Summary: Pull Request resolved: pytorch#123732

We used to skip tensor.to() during tracing when the device is the same. This brings some performance improvement in eager mode, but it makes graph capture lose the semantics of the original model. In this diff, we add an additional condition that skips the fast path when the tensor has no actual data, which is the case when we're using FakeTensor / FunctionalTensor to trace the model. This has no perf impact on existing eager models while making sure we can capture the _to_copy() node in the graph.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r device_to

Reviewed By: tugsbayasgalan, angelayi

Differential Revision: D55969674
Force-pushed from 23a3d32 to 24014a8.
@pytorchbot merge
Merge failed. Reason: This PR needs a … label. If not, please add the … label. To add a label, you can comment to pytorchbot, for example …. For more information, see …. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -f 'Landed internally' (Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use …. Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.