-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Hexagon] Run single RPC server on Android in each testing session #11547
[Hexagon] Run single RPC server on Android in each testing session #11547
Conversation
98fb3b0
to
2848ce7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seem to be several things going on at the same time. Ideally, each self-contained change would go into its own PR, but if you want to combine them into one, could you list them in more detail in the PR description?
My two questions here are (1) whether we can do the same thing for hw and simulator, and (2) whether del
is a reliable way to close an RPC session.
android_serial_number, | ||
) -> HexagonLauncherRPC: | ||
"""Initials and returns hexagon launcher which reuses RPC info and Android serial number.""" | ||
if hexagon_server_process._serial_number != "simulator": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is there a difference here between hw and simulator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think simulator is more dependent to have all the execution files in the same workspace and since it was already working fine, I tried not to change it. We could also change that in this PR or a follow up PR.
@@ -93,7 +93,8 @@ def __enter__(self): | |||
raise exception | |||
|
|||
def __exit__(self, exc_type, exc_value, exc_traceback): | |||
pass | |||
# close session to the tracker | |||
del self._rpc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how well this works in python for actually destroying objects. It will unbind the name, but I don't think that there is any guarantee that the underlying object will actually be destroyed. @Lunderberg: do you have any opinions about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming that this is the only reference to self._rpc
, this would destroy the object, because CPython uses a reference count in addition to generational garbage collection. That said, that's more of an implementation detail, and is pretty fragile to rely upon.
It looks like the key function we want to call is RPCEndpoint::Shutdown
, usually called when the RPCEndpoint
object is destructed. Unfortunately, I don't see any method in either RPCClientSession
or RPCSession
. I think we could add a close
method to the python RPCSession
object, which would then be called from __exit__
.
def close(self):
self._sess.__del__()
self._sess = None
@@ -109,7 +110,7 @@ def device(self): | |||
|
|||
return self._device | |||
|
|||
def upload(self, local_path: Union[str, pathlib.Path], remote_filename: str): | |||
def upload(self, local_path: Union[str, pathlib.Path], remote_filename: str) -> pathlib.Path: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like an unrelated change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need this change to make this approach work. It is used in load_module
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this related to the change from allowing just filenames to requiring a full remote path, so that the caller would have somewhere from which they could learn the full remove path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, previously some of the paths were only working if we were using the same working directory for HexagonLauncher and Session.
@@ -194,33 +196,36 @@ def get_graph_executor( | |||
self._set_device_type(graph_mod) | |||
return tvm.contrib.graph_executor.create(graph_json, graph_mod, self.device) | |||
|
|||
def get_aot_executor( | |||
def get_graph_debug_executor( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like another unrelated change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately github didn't present this correctly. I removed get_aot_executor
function because it didn't work correctly and also it was not used anywhere in the codebase. Therefore it will remain broken if we keep it and others might try to use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I changed get_graph_debug_executor
to work with the current approach. Just realized we are missing a test for that as well. I suggest to add a test for it since this api is useful for debugging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add a test for the get_aot_executor
function, rather than removing it entirely. The current tests use get_executor_from_factory
, which calls into _aot_executor_from_factory
. This is exactly what we need for CI, where each model is uploaded once and run once, but isn't as useful when a model is uploaded once and run many times, potentially across multiple executions.
The get_aot_executor
path is intended to support that use case, where an AOT module has been compiled and uploaded, and in each future occurrence only needs to be loaded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could either add a test for get_aot_executor or rework _aot_executor_from_factory
to use get_aot_executor
function. I prefer the latter.
python/tvm/contrib/hexagon/build.py
Outdated
@@ -58,7 +62,9 @@ def _get_hexagon_rpc_lib_dir() -> pathlib.Path: | |||
|
|||
def _get_test_directory_name() -> str: | |||
"""Generate a time-stamped name for use as a test directory name.""" | |||
return datetime.datetime.now().strftime("%Y-%m-%dT%H-%M-%S") | |||
date_str = datetime.datetime.now().strftime("%Y-%m-%dT%H-%M-%S") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I think this is a good change to add to ensure unique folders even if a test concludes in less than a second, I think we should separate it out into an independent PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for spinning it out, and #11593 is reviewed and approved!
or a full path in the remote system. If it is a file name, | ||
the file must already have been uploaded to the remote, | ||
and be placed in the remote workspace. | ||
be a full path in the remote system. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the symmetry of loading a file using the same path as returned from upload
, but are there cases where we'd want to work within the existing workspace, without requiring the full path? The use cases coming to mind would be uploading a model once, then running several independent tests on it. Within a single process, the model to run should be passed to each independent target, so this change wouldn't affect it. If the upload happens in one process, then the independent tests happen in subsequent processes, it might cause an issue, but then the workspace itself would need to be passed through.
So I don't think I see any major concerns, but reducing previously functional behavior does give me pause.
@@ -194,33 +196,36 @@ def get_graph_executor( | |||
self._set_device_type(graph_mod) | |||
return tvm.contrib.graph_executor.create(graph_json, graph_mod, self.device) | |||
|
|||
def get_aot_executor( | |||
def get_graph_debug_executor( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add a test for the get_aot_executor
function, rather than removing it entirely. The current tests use get_executor_from_factory
, which calls into _aot_executor_from_factory
. This is exactly what we need for CI, where each model is uploaded once and run once, but isn't as useful when a model is uploaded once and run many times, potentially across multiple executions.
The get_aot_executor
path is intended to support that use case, where an AOT module has been compiled and uploaded, and in each future occurrence only needs to be loaded.
@@ -109,7 +110,7 @@ def device(self): | |||
|
|||
return self._device | |||
|
|||
def upload(self, local_path: Union[str, pathlib.Path], remote_filename: str): | |||
def upload(self, local_path: Union[str, pathlib.Path], remote_filename: str) -> pathlib.Path: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this related to the change from allowing just filenames to requiring a full remote path, so that the caller would have somewhere from which they could learn the full remove path?
@Lunderberg I changed |
7a82d6b
to
9ac0be9
Compare
9ac0be9
to
c6bec0d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the updates to the AOT executor handling, I think this looks good to me. As discussed, we should follow-up with an issue for explicit closing of an RPCSession
, because the del self._rpc
relies on the python garbage collection and so the timing of it is a bit fragile for long term use.
This PR changes Hexagon launcher in pytest config to reuse existing RPC server on Android side and only create new RPC server on the first test.
cc @Lunderberg @kparzysz-quic