Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Hexagon] Run single RPC server on Android in each testing session #11547

Merged
merged 5 commits into from
Jun 10, 2022

Conversation

mehrdadh
Copy link
Member

@mehrdadh mehrdadh commented Jun 2, 2022

This PR changes Hexagon launcher in pytest config to reuse existing RPC server on Android side and only create new RPC server on the first test.

cc @Lunderberg @kparzysz-quic

@mehrdadh mehrdadh marked this pull request as ready for review June 2, 2022 17:45
@mehrdadh mehrdadh marked this pull request as draft June 2, 2022 17:50
@mehrdadh mehrdadh force-pushed the hexagon/single_hexagon_session branch 2 times, most recently from 98fb3b0 to 2848ce7 Compare June 3, 2022 17:03
@mehrdadh mehrdadh marked this pull request as ready for review June 3, 2022 17:03
Copy link
Contributor

@kparzysz-quic kparzysz-quic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seem to be several things going on at the same time. Ideally, each self-contained change would go into its own PR, but if you want to combine them into one, could you list them in more detail in the PR description?
My two questions here are (1) whether we can do the same thing for hw and simulator, and (2) whether del is a reliable way to close an RPC session.

android_serial_number,
) -> HexagonLauncherRPC:
"""Initials and returns hexagon launcher which reuses RPC info and Android serial number."""
if hexagon_server_process._serial_number != "simulator":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there a difference here between hw and simulator?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think simulator is more dependent to have all the execution files in the same workspace and since it was already working fine, I tried not to change it. We could also change that in this PR or a follow up PR.

@@ -93,7 +93,8 @@ def __enter__(self):
raise exception

def __exit__(self, exc_type, exc_value, exc_traceback):
pass
# close session to the tracker
del self._rpc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how well this works in python for actually destroying objects. It will unbind the name, but I don't think that there is any guarantee that the underlying object will actually be destroyed. @Lunderberg: do you have any opinions about this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming that this is the only reference to self._rpc, this would destroy the object, because CPython uses a reference count in addition to generational garbage collection. That said, that's more of an implementation detail, and is pretty fragile to rely upon.

It looks like the key function we want to call is RPCEndpoint::Shutdown, usually called when the RPCEndpoint object is destructed. Unfortunately, I don't see any method in either RPCClientSession or RPCSession. I think we could add a close method to the python RPCSession object, which would then be called from __exit__.

def close(self):
    self._sess.__del__()
    self._sess = None

@@ -109,7 +110,7 @@ def device(self):

return self._device

def upload(self, local_path: Union[str, pathlib.Path], remote_filename: str):
def upload(self, local_path: Union[str, pathlib.Path], remote_filename: str) -> pathlib.Path:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an unrelated change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need this change to make this approach work. It is used in load_module function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this related to the change from allowing just filenames to requiring a full remote path, so that the caller would have somewhere from which they could learn the full remove path?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, previously some of the paths were only working if we were using the same working directory for HexagonLauncher and Session.

@@ -194,33 +196,36 @@ def get_graph_executor(
self._set_device_type(graph_mod)
return tvm.contrib.graph_executor.create(graph_json, graph_mod, self.device)

def get_aot_executor(
def get_graph_debug_executor(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like another unrelated change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately github didn't present this correctly. I removed get_aot_executor function because it didn't work correctly and also it was not used anywhere in the codebase. Therefore it will remain broken if we keep it and others might try to use it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I changed get_graph_debug_executor to work with the current approach. Just realized we are missing a test for that as well. I suggest to add a test for it since this api is useful for debugging.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a test for the get_aot_executor function, rather than removing it entirely. The current tests use get_executor_from_factory, which calls into _aot_executor_from_factory. This is exactly what we need for CI, where each model is uploaded once and run once, but isn't as useful when a model is uploaded once and run many times, potentially across multiple executions.

The get_aot_executor path is intended to support that use case, where an AOT module has been compiled and uploaded, and in each future occurrence only needs to be loaded.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could either add a test for get_aot_executor or rework _aot_executor_from_factory to use get_aot_executor function. I prefer the latter.

@@ -58,7 +62,9 @@ def _get_hexagon_rpc_lib_dir() -> pathlib.Path:

def _get_test_directory_name() -> str:
"""Generate a time-stamped name for use as a test directory name."""
return datetime.datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
date_str = datetime.datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I think this is a good change to add to ensure unique folders even if a test concludes in less than a second, I think we should separate it out into an independent PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for spinning it out, and #11593 is reviewed and approved!

or a full path in the remote system. If it is a file name,
the file must already have been uploaded to the remote,
and be placed in the remote workspace.
be a full path in the remote system.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the symmetry of loading a file using the same path as returned from upload, but are there cases where we'd want to work within the existing workspace, without requiring the full path? The use cases coming to mind would be uploading a model once, then running several independent tests on it. Within a single process, the model to run should be passed to each independent target, so this change wouldn't affect it. If the upload happens in one process, then the independent tests happen in subsequent processes, it might cause an issue, but then the workspace itself would need to be passed through.

So I don't think I see any major concerns, but reducing previously functional behavior does give me pause.

@@ -194,33 +196,36 @@ def get_graph_executor(
self._set_device_type(graph_mod)
return tvm.contrib.graph_executor.create(graph_json, graph_mod, self.device)

def get_aot_executor(
def get_graph_debug_executor(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a test for the get_aot_executor function, rather than removing it entirely. The current tests use get_executor_from_factory, which calls into _aot_executor_from_factory. This is exactly what we need for CI, where each model is uploaded once and run once, but isn't as useful when a model is uploaded once and run many times, potentially across multiple executions.

The get_aot_executor path is intended to support that use case, where an AOT module has been compiled and uploaded, and in each future occurrence only needs to be loaded.

@@ -109,7 +110,7 @@ def device(self):

return self._device

def upload(self, local_path: Union[str, pathlib.Path], remote_filename: str):
def upload(self, local_path: Union[str, pathlib.Path], remote_filename: str) -> pathlib.Path:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this related to the change from allowing just filenames to requiring a full remote path, so that the caller would have somewhere from which they could learn the full remove path?

@mehrdadh
Copy link
Member Author

mehrdadh commented Jun 7, 2022

@Lunderberg I changed _aot_executor_from_factory to use get_aot_executor. PTKL, thanks!

@mehrdadh mehrdadh force-pushed the hexagon/single_hexagon_session branch from 7a82d6b to 9ac0be9 Compare June 8, 2022 16:58
@mehrdadh mehrdadh force-pushed the hexagon/single_hexagon_session branch from 9ac0be9 to c6bec0d Compare June 8, 2022 17:08
Copy link
Contributor

@Lunderberg Lunderberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the updates to the AOT executor handling, I think this looks good to me. As discussed, we should follow-up with an issue for explicit closing of an RPCSession, because the del self._rpc relies on the python garbage collection and so the timing of it is a bit fragile for long term use.

@kparzysz-quic kparzysz-quic merged commit dc522a6 into apache:main Jun 10, 2022
@mehrdadh mehrdadh deleted the hexagon/single_hexagon_session branch June 14, 2022 16:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants