Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GPU processing not working with gda.mx.hyperion.xrc.compare_cpu_and_gpu=true #738

Closed
rtuck99 opened this issue Jan 9, 2025 · 5 comments · Fixed by #746
Closed

GPU processing not working with gda.mx.hyperion.xrc.compare_cpu_and_gpu=true #738

rtuck99 opened this issue Jan 9, 2025 · 5 comments · Fixed by #746
Assignees

Comments

@rtuck99
Copy link
Contributor

rtuck99 commented Jan 9, 2025

During testing of mx-bluesky 1.4.1 (see #697) it was discovered that although gda.mx.hyperion.xrc.compare_cpu_and_gpu=true was enabled in domain.properties, GPU gridscan processing was not occurring because dev_shm wasn't being enabled; this was causing a long wait for the GPU zocalo results which never arrived until the wait times out. The remainder of testing had to proceed with the GPU setting disabled.

Acceptance Criteria

  • The above flag works correctly with GPU processing.
@rtuck99 rtuck99 changed the title GPU GPU processing not working with gda.mx.hyperion.xrc.compare_cpu_and_gpu=true Jan 9, 2025
@rtuck99 rtuck99 changed the title GPU processing not working with gda.mx.hyperion.xrc.compare_cpu_and_gpu=true GPU processing not working with gda.mx.hyperion.xrc.compare_cpu_and_gpu=true Jan 9, 2025
@rtuck99 rtuck99 changed the title GPU processing not working with gda.mx.hyperion.xrc.compare_cpu_and_gpu=true GPU processing not working with gda.mx.hyperion.xrc.compare_cpu_and_gpu=true Jan 9, 2025
@rtuck99 rtuck99 added this to Hyperion Jan 9, 2025
@rtuck99 rtuck99 moved this to Todo This Sprint in Hyperion Jan 9, 2025
@shihab-dls shihab-dls self-assigned this Jan 10, 2025
@shihab-dls shihab-dls moved this from Todo This Sprint to In Progress in Hyperion Jan 10, 2025
@shihab-dls
Copy link
Collaborator

shihab-dls commented Jan 13, 2025

See Gerrit change: https://gerrit.diamond.ac.uk/c/gda/gda-mx/+/43129

This PR relates to these changes

@rtuck99
Copy link
Contributor Author

rtuck99 commented Jan 13, 2025

This parameter was indeed being sent to hyperion, see
https://graylog.diamond.ac.uk/search/678530773022d06c784f1dea
The request is started with

 "features": {
    "compare_cpu_and_gpu_zocalo": true
  },

and the comparison is started, but the GPU results are never received

2024-12-04 13:20:33.929	i03-control.diamond.ac.uk
[32m[I 241204 13:20:33.929 zocalo_results:285] Zocalo results from CPU processing: found 2 crystals.�[0m
2024-12-04 13:20:33.928	i03-control.diamond.ac.uk
[33m[W 241204 13:20:33.928 zocalo_results:276] Zocalo results from GPU timed out. Using results from CPU�[0m
2024-12-04 13:19:03.928	i03-control.diamond.ac.uk
[33m[W 241204 13:19:03.927 zocalo_results:239] Received zocalo results from CPU before GPU�[0m

I think it is because these lines are never called with enable_dev_shm == 1

    def change_dev_shm(self, enable_dev_shm: bool):
        LOGGER.info(f"{'Enabling' if enable_dev_shm else 'Disabling'} dev shm")
        return self.odin.fan.dev_shm_enable.set(1 if enable_dev_shm else 0)

@olliesilvester
Copy link
Contributor

There are two tests which I would've expected to catch this

def test_feature_flags_overriden_if_supplied(minimal_3d_gridscan_params):
    test_params = HyperionThreeDGridScan(**minimal_3d_gridscan_params)
    assert test_params.features.use_panda_for_gridscan is False
    assert test_params.features.compare_cpu_and_gpu_zocalo is False
    minimal_3d_gridscan_params["features"] = {
        "use_panda_for_gridscan": True,
        "compare_cpu_and_gpu_zocalo": True,
    }
    test_params = HyperionThreeDGridScan(**minimal_3d_gridscan_params)
    assert test_params.features.compare_cpu_and_gpu_zocalo
    assert test_params.features.use_panda_for_gridscan
    # Config server shouldn't update values which were explicitly provided
    test_params.features.update_self_from_server()
    assert test_params.features.compare_cpu_and_gpu_zocalo
    assert test_params.features.use_panda_for_gridscan


@patch("mx_bluesky.common.parameters.components.os")
def test_gpu_enabled_if_use_gpu_or_compare_gpu_enabled(_, minimal_3d_gridscan_params):
    minimal_3d_gridscan_params["detector_distance_mm"] = 100

    grid_scan = HyperionThreeDGridScan(**minimal_3d_gridscan_params)
    assert not grid_scan.detector_params.enable_dev_shm

    minimal_3d_gridscan_params["features"] = {
        "compare_cpu_and_gpu_zocalo": True,
    }
    grid_scan = HyperionThreeDGridScan(**minimal_3d_gridscan_params)
    assert grid_scan.detector_params.enable_dev_shm

I wonder what's missing

@olliesilvester
Copy link
Contributor

olliesilvester commented Jan 13, 2025

Two problems:

  1. Our RobotLoadThenCentre params doesn't use Features since it inherits from GridCommon rather than HyperionThreeDGridScan
  2. For some reason we've set our top level param BaseModel to allow extra fields without erroring. (Git has blamed me for this 😢 .) Maybe we wanted it at some point, but it's definitely better to error if there's extra fields now

@olliesilvester
Copy link
Contributor

The third problem was that the parameter model which was being used when the Eiger was being armed, RobotLoadThenCentre, used the DetectorParams from GridCommon, which didn't give the option for dev_shm. This has now been addressed in #746

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants