Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support parallel and lazy connection of Devices #415

Closed
coretl opened this issue Apr 4, 2024 · 20 comments
Closed

Support parallel and lazy connection of Devices #415

coretl opened this issue Apr 4, 2024 · 20 comments

Comments

@coretl
Copy link
Collaborator

coretl commented Apr 4, 2024

At the moment we have something like:

def dcm(wait_for_connection: bool = True, fake_with_ophyd_sim: bool = False) -> DCM:
"""Get the i03 DCM device, instantiate it if it hasn't already been.
If this is called when already instantiated in i03, it will return the existing object.
"""
return device_instantiation(
DCM,
"dcm",
"",
wait_for_connection,
fake_with_ophyd_sim,
daq_configuration_path=DAQ_CONFIGURATION_PATH,
)

This allows optional serial connection and creation of fake devices instead of real devices via device_instantiation. The ophyd-async mechanism for this is using the connect method:

async def connect(self, sim: bool = False, timeout: float = DEFAULT_TIMEOUT):

(you could argue that fake would be a better name than sim, so that could also be changed)

I would like to be able to:

  • Create all the Devices for a beamline
  • Have blueapi know what those Devices are
  • Connect those Devices in the fastest possible way so we can get to running a plan quickly

I will make some suggestions and I'd like to encourage a vigorous discussion of the alternatives here!

Suggestion 1: Background idempotent connect

Rather than an explicit connect when creating the Device, add it to a background queue so it will be connected eventually. I would also suggest we make connect() idempotent so that it will complete immediately if it has already been run successfully. If we also use Devices as arguments to plans then blueapi could call connect() on all the Devices that are plan ags (or some other decorator solution a-la hyperion composites).

# ophyd-async
class Device:
    # None if connect hasn't started, an Event if it has, a set Event if it's done
    _connect_task: Optional[asyncio.Task] = None
    async def connect(self, sim: bool = False, timeout: float = DEFAULT_TIMEOUT):
        if not self._connect_task or (self._connect_task.done() and self._connect_task.exception()):
            # connect never run, or it errored, so try to connect now
            coros = {
                name: child_device.connect(sim, timeout=timeout)
                for name, child_device in self.children()
            }
            self._connect_task = asyncio.create_task(wait_for_connection(**coros))
        # Wait for it to complete
        await self._connect_task
# blueapi
async def run_plan_sketch(plan, kwargs):
    coros = {name: device.connect() for name, device in get_devices(plan, kwargs)}
    await wait_for_connection(**coros)
    RE(plan(**kwargs))

Suggestion 2: Remove device factories

If connect is no longer happening at Device instantiation then it should be safe to always have them created at startup, then we can just import them with the minimum boilerplate. Connection can be done outside this function.

# dodal/beamlines/i22.py
BL = "BL22I"
panda1 = PandAHDF(f"{BL}-EA-PANDA-01")
panda2 = PandAHDF(f"{BL}-EA-PANDA-02")
i0 = TetrammHDF(f"{BL}-EA-TTRM-01")
# blueapi
async def connect_and_name(bl, sim, timeout):
    coros = {}
    for name, obj in inspect.getmembers(bl):
        if isinstance(obj, Device):
            obj.set_name(name)
            coros[name] = obj.connect(sim, timeout)
    # Also add the device to some registry
    return asyncio.create_task(wait_for_connection(**coros))

Not sure if connect_and_name should be done at the bottom of i22.py or not...

Acceptance Criteria

  • Dodal can do parallel connect of all devices in abeamline
  • Blueapi can fine all those devices
@Tom-Willemsen
Copy link
Contributor

Observation: this connection logic seems like an area with significant hidden complexity/code churn. When we decide on an approach, can we make sure we write an ADR documenting exactly what we considered and why we chose that solution?


I think we need to consider both "optional" and "permanent" devices in this connection logic - where a given device might be optional or not on any given beamline. I previously wrote a similar issue in blueapi, but I think it makes sense to consider it here if we're going to refactor the connection logic.

In fact, I think there might be several cases:

  • Failure to connect to "permanent" devices should cause a startup crash
    • e.g. I understand this is what hyperion wants
  • Failure to connect to "permanent" devices should be "fatal", but not cause a startup crash
    • e.g. so that blueapi can propagate a sensible error back to GDA - if it crashes it can't serve an error back to GDA. But it's fine for it to be unhappy.
  • Failure to connect to an "optional" device causes plans using that one device to be unavailable / to error out when starting the plan.
    • e.g. linkam on i22, it won't usually be present, so that device failing to connect shouldn't stop unrelated plans from running.
    • This probably effectively means that a blueapi calls connect on all relevant devices just before running a plan to ensure they exist at the point the plan is being run.
  • Is there ever a use case where we'd want to defer connection until part way through a plan - i.e. explicitly not connect a device until the last moment when it's needed?
    • I haven't actually come across a use case for this, but could imagine a plan like:
def plan():
    mode = yield from bps.rd("some config pv")
    if mode == "A":
        explicit_connect(device_a)
        the_value = yield from bps.rd(device_a)  # Where device b isn't available in mode "A"
    else:
        explicit_connect(device_b)
        the_value = yield from bps.rd(device_b)  # Where device a isn't available in mode "B"

    # Do something with the_value
    ...

Thinking out loud here, I'm thinking of a syntax along the lines of:

# dodal/i22.py

@connection.EAGER
def eiger():
    # Eager connection when blueapi starts, failure to connect causes blueapi to be "unhappy"
    return Eiger("some_prefix")

@connection.DEFERRED  
def linkam():
    # connection only when a plan actually needs it - blueapi can start without these devices
    return Linkam("some_prefix")

@connection.EXPLICIT
def weird_device():
    # Never automatically connected by blueapi, has to be connected "manually", either in a plan
    # or "manually" before running a specific plan
    # e.g. could cover last use case above
    ...

This would also leave us scope for different connection behaviours later on if we've missed a use-case. Happy to bikeshed over names and/or what the default/unannotated should be.


Some more specific comments:

Connect those Devices in the fastest possible way so we can get to running a plan quickly

I think this is true for hyperion, but less true for i22. For i22, as the plans are in general pretty slow compared to connection times, I'd be willing to trade some performance for better / more ergonomic error reporting. i22 are also significantly more likely to run into these errors due to the nature of all their temporary equipment.

Rather than an explicit connect when creating the Device, add it to a background queue so it will be connected eventually

At what point would connection errors be reported if the connection is happening in a background queue? When you first try to use that device in a plan?

I would also suggest we make connect() idempotent so that it will complete immediately if it has already been run successfully.

Hmmm. Thinking about the i22 case for example, let's say I start blueAPI at the start of run. when the linkam is available, and then later unplug it / switch off it's IOC. I probably don't want subsequent calls to connect to work.

Maybe we need separate concepts for connect versus ensure_connected - where the former would be idempotent, but the latter would really check that the device is still available at the point it's run (or try to reconnect if it wasn't previously)? Name bikeshedding welcome.

Q: If all the PV connections are happening in parallel via asyncio, does connect for a device actually take significant time? (I can see how it would take a long time in the sequential, non-aio case)


Sorry if this is a bit long... but you did

encourage a vigorous discussion

;)

@d-perl
Copy link
Contributor

d-perl commented Apr 5, 2024

Is there ever a use case where we'd want to defer connection until part way through a plan - i.e. explicitly not connect a device until the last moment when it's needed?

Yes, I think your example is somewhat likely to be desired, but I think it's fine for that to be case where it's up to you to do the explicit connection, it could hardly be otherwise

At what point would connection errors be reported if the connection is happening in a background queue? When you first try to use that device in a plan?

In my discussion with @coretl yesterday we imagined something like "at the start of the plan, move required & not-yet-connected devices to the front of the queue and block until they are connected/fail", and having everything be "semi-eager" i.e., BlueApi/Dodal will attempt to connect to everything (eventually), but if it can't, it will only report those errors and not exit. It will then only be an error if you attempt to run a plan which uses a device which failed to connect.

This seemed good from the point of view of having fast application startup with minimal faff, and the required flexibility for optional devices. I guess it could end up with a lot of logs if there are lots of optional devices?

Q: If all the PV connections are happening in parallel via asyncio, does connect for a device actually take significant time?

If there are thousands of them (a whole beamline worth) maybe yeah

@DominicOram
Copy link
Contributor

DominicOram commented Apr 8, 2024

Observation: this connection logic seems like an area with significant hidden complexity/code churn. When we decide on an approach, can we make sure we write an ADR documenting exactly what we considered and why we chose that solution?

Definitely agree with this, this is my 3rd time having this discussion.

Create all the Devices for a beamline

As Tom said, we might not want all the devices for a beamline.

Failure to connect to "permanent" devices should cause a startup crash
e.g. I understand this is what hyperion wants
Failure to connect to "permanent" devices should be "fatal", but not cause a startup crash
e.g. so that blueapi can propagate a sensible error back to GDA - if it crashes it can't serve an error back to GDA. But it's fine for it to be unhappy.

I think the 2nd would be better for Hyperion too. In fact I think the GDA behavior that we have on some beamlines of not being able to start if some devices aren't connected is

Connect those Devices in the fastest possible way so we can get to running a plan quickly

I think this is true for hyperion...

This isn't actually that necessary for Hyperion either. We don't care if connect takes some time, as long as it's only done once at start up. We shouldn't be restarting Hyperion often so the time it takes to start up is irrelevant. The logic in dodal is there because we originally were connecting at the start of every plan and connect was an operation that took time and slowed us down.

Maybe we need separate concepts for connect versus ensure_connected - where the former would be idempotent, but the latter would really check that the device is still available at the point it's run (or try to reconnect if it wasn't previously)? Name bikeshedding welcome.

Our view on this would probably be that if the check to see if we're connected takes a small amount of time (<100ms for all devices) then we can block on it at the start of a plan and happily sit there and wait for it to be connected. Otherwise we would probably rather proceed without the check and take the hit when we crash later.

I guess circling back to Tom C's original suggestions:

  • Suggestion 1: In favor of making connect idempotent (or fast enough in the happy path that it may as well be). I don't think background connection is completely necessary, we would probably always want it to be blocking the first time
  • Suggestion 2: The original idea of the factory functions is that:
    • We're guaranteed not to do work on import (I think there's still a danger here devices are doing work on import, and in fact sometimes that would be preferable)
    • When scientists are writing scripts (outside of Blueapi) they can easily get to a device that is already connected.

I think I stand by both of those statements still and the suggestion breaks them both.

I do think we're going to struggle to get to a one-size-fits all solution here. I do like Tom's decorators on the devices but we might need more customisability as well

I think this needs broad discussion, probably in person, and a comprehensive final document on it.

@coretl
Copy link
Collaborator Author

coretl commented Apr 8, 2024

I think this needs broad discussion, probably in person, and a comprehensive final document on it.

Agreed, I will set this up, during the hackathon or after?

Create all the Devices for a beamline

As Tom said, we might not want all the devices for a beamline.

But I'd like to explore this question first, if Device.__init__ cannot do any I/O, so therefore is always fast and safe to do even if the hardware is not present, why would we not create the Device? I understand why we wouldn't connect() it, but is there a reason not to create it in the first place? If so, then what should control whether it is created or not?

@DominicOram
Copy link
Contributor

DominicOram commented Apr 8, 2024

Agreed, I will set this up, during the hackathon or after?

I vote after, there's already enough going on.

But I'd like to explore this question first, if Device.init cannot do any I/O

Sorry @coretl, I think I updated my comment above answering some of these question more.

@DominicOram
Copy link
Contributor

We're guaranteed not to do work on import (I think there's still a danger here devices are doing work on import, and in fact sometimes that would be preferable)

An example of this currently is the OAV. The OAV function takes a config object (based on a file) on __init__, this file is needed for the OAV to operate correctly so from that Device's perspective it makes sense for it to be a required parameter in the init, it means that we can be sure for the lifetime of the Device we have some valid config. So to fix this under the suggested change the options are:

  • Do I/O on import :(
  • Create some new configure function on OAV (which is what we have in GDA), consequences of this are:
    • All plans must always call it, if they forget the device is invalid and you may not know until you use it
    • It's now optional, every time I use it I have to check it to appease the type checker

@coretl
Copy link
Collaborator Author

coretl commented Apr 8, 2024

We're guaranteed not to do work on import (I think there's still a danger here devices are doing work on import, and in fact sometimes that would be preferable)

An example of this currently is the OAV. The OAV function takes a config object (based on a file) on __init__, this file is needed for the OAV to operate correctly so from that Device's perspective it makes sense for it to be a required parameter in the init, it means that we can be sure for the lifetime of the Device we have some valid config. So to fix this under the suggested change the options are:

* Do I/O on import :(

* Create some new `configure` function on `OAV` (which is what we have in GDA), consequences of this are:
  
  * All plans must always call it, if they forget the device is invalid and you may not know until you use it
  * It's now optional, every time I use it I have to check it to appease the type checker

Ok, that is a good example, I think I'm sold on the need for device factories and some variant of @Tom-Willemsen's connection decorators. Will try and propose something in advance of the meeting and we can argue about it then

@coretl
Copy link
Collaborator Author

coretl commented Apr 8, 2024

When scientists are writing scripts (outside of Blueapi) they can easily get to a device that is already connected.

I would like to explore this requirement, as there are tricky interactions with the RunEngine event loop that we need to consider, are these Devices ever used without the RE being present, or can we say that this is a requirement?

@DominicOram
Copy link
Contributor

Ok, that is a good example, I think I'm sold on the need for device factories

To play devil's advocate. There is a 3rd solution to the OAV issue where it takes a configuration object but this doesn't actually do the file I/O until it is triggered in the connect of OAV. This may solve the issue but it's very much not immediately obvious to a developer why you would do it.

I think I basically worry that without the factories future developers will quickly fall into the patterns of work on import or configure functions, especially given that these both exist in GDA already. I think both of those solutions are worse than the factory function (which at the moment my only dislike of is that it has too much boilerplate).

I would like to explore this requirement, as there are tricky interactions with the RunEngine event loop that we need to consider, are these Devices ever used without the RE being present, or can we say that this is a requirement?

I think we can say that the RE is a requirement, as you'll probably be using one anyway. I would basically like to keep the barrier to entry of writing some basic plans as low as possible, and at the moment it is:

from dodal.my_beamline import device
from bluesky import RunEngine

RE = RunEngine()
my_device = device()

def my_plan():
    yield from ...

RE(my_plan())

Introducing BlueAPI as a dependency here gives nothing to the scientist that wants to try something out quickly

@Tom-Willemsen
Copy link
Contributor

I agree with Dom's comment about minimising the barrier of entry to basic plans.

I would like to explore this requirement, as there are tricky interactions with the RunEngine event loop that we need to consider, are these Devices ever used without the RE being present, or can we say that this is a requirement?

From a support perspective, I'd really like to try and minimise "tricky interactions" - whether that's for devs or users. So if making the RE a hard requirement eliminates those complexities, then I think I'm probably in favour. To be honest I'm struggling to think of a user-facing use case for not having the runengine present, am I missing something?

@coretl
Copy link
Collaborator Author

coretl commented Apr 8, 2024

from dodal.my_beamline import device
from bluesky import RunEngine

RE = RunEngine()
my_device = device()

def my_plan():
    yield from ...

RE(my_plan())

This makes sense, and was the only non-blueapi solution I was planning to explicitly support. If we do something like:

def device_factory(connect=DeviceConnection.EAGER, set_name=True):
    def wrapper(f):
        def factory(sim=False, timeout=DEFAULT_TIMEOUT):
            if f in global_devices:
                return global_devices[f]
            device = f()
            global_devices[f] = device
            if set_name:
                device.set_name(f.__name__)
            call_in_bluesky_event_loop(device.connect(sim, timeout))
            return device
        global_factories[factory] = connect
        return factory
    return wrapper

@device_factory(connect=DeviceConnection.EAGER)
def device():
    return MyDevice("MY-EPICS-PREFIX")

Then in the absence of blueapi it will serially connect the device on first call of the factory. We can also provide functions like connect_all_eager_devices() that blueapi can call to do a parallel connect upfront. The only limitation with this approach is that factory() can only do device connection in the main thread. We would have to provide some explicit_connect() plan or async function if we need to do that in the run engine event loop (like in a plan).

@coretl
Copy link
Collaborator Author

coretl commented Apr 23, 2024

We had a meeting on this, and based on this I now propose:

  • Idempotent Device.connect seems like a good idea as long as we can optionally pass force=True to it
  • There will be 2 types of Device
    • EAGER: Blueapi will connect at startup
    • LAZY: Blueapi will connect when used as plan args
  • Blueapi will ensure that all Devices passed as plan args (and default args) are connected
  • There will be an ensure_connected(device, force=False) plan stub that can be used to explicitly make and connect Devices in plans
  • There is an opt-in connect=True argument to the device factory that can be passed in for basic plan development usage
  • DeviceConnector goes away in favour of the connect=True argument

This means we have:

# i22.py
@device_factory(connect=DeviceConnection.EAGER):
def saxs() -> PilatusDetector:
    return PilatusDetector("BL22I-EA-PILAT-01")

# my_plans.py
from dodal.beamlines import i22

def no_arg_plan():
    # plan does connect
    saxs = yield from ensure_connected(i22.saxs())
    yield from bps.count([saxs])

def default_arg_plan(det = i22.saxs()):
    # blueapi does connect
    yield from bps.count([det])

def require_arg_plan(det: Readable):
    # blueapi does connect
    yield from bps.count([det])

# my_testing_script.py
from dodal.beamlines import i22
from bluesky import RunEngine

RE = RunEngine()
saxs = i22.saxs(connect=True)

def my_plan(det: Readable):
    yield from bps.count([det])

RE(my_plan(saxs))

Is everyone happy with this?

@coretl
Copy link
Collaborator Author

coretl commented Apr 23, 2024

I guess you could do composites like this:

def get_connected_devices() -> Tuple[Eiger, Panda, Goniometer]:
    devices = yield from ensure_connected(ixx.eiger(), ixx.panda(), ixx.goniometer)
    return devices

@d-perl
Copy link
Contributor

d-perl commented Apr 23, 2024

Re: getting rid of composites, I would like it if it would support Unpack (https://peps.python.org/pep-0692/#specification) so that we could at least still do something like:

class PlanDevices(TypedDict):
    backlight: Backlight
    gonio: Gonio
    detector: Detector
    zebra: Zebra

def inner_plan(parameters, **kwargs: Unpack[PlanDevices]):
    yield from do_some_stuff(kwargs["zebra"])
    yield from do_some_other_stuff(kwargs["gonio"])

def outer_plan(parameters, **kwargs: Unpack[PlanDevices]):
    yield from move_backlight(kwargs["backlight"])
    yield from inner_plan(parameters, **kwargs)
    yield from do_some_detector(kwargs["detector"])

to stop having enormous parameter lists when there are a lot of devices and nested plan sections using them

@coretl
Copy link
Collaborator Author

coretl commented Apr 23, 2024

Would you actually use outer_plan as an entrypoint though? If would mean calling outer_plan(parameters, backlight=backlight, gonio=gonio, ...) rather than just outer_plan(parameters). Would this work instead?

class PlanDevices(TypedDict):
    backlight: Backlight
    gonio: Gonio
    detector: Detector
    zebra: Zebra

def inner_plan(parameters, **kwargs: Unpack[PlanDevices]):
    yield from do_some_stuff(kwargs["zebra"])
    yield from do_some_other_stuff(kwargs["gonio"])

def outer_plan(parameters):
    kwargs = ensure_connected(
        backlight=ixx.backlight(),
        gonio=ixx.gonio(),
        detector=ixx.eiger(),
        zebra=ixx.zebra(),
    )
    yield from move_backlight(kwargs["backlight"])
    yield from inner_plan(parameters, **kwargs)
    yield from do_some_detector(kwargs["detector"]) 

For this to work with typing we'd have to make sure type checkers are happy with this:

def ensure_connected(**kwargs: Unpack[T]) -> T: ...

@d-perl
Copy link
Contributor

d-perl commented Apr 23, 2024

the idea is that it is nice shorthand for

def outer_plan(parameters, *, backlight: Backlight, zebra: Zebra, gonio: Gonio, detector: Detector):
    ...

so that blueAPI can give them to us - in the case where we explicitly call ensure_connected I would rather use a dataclass (for dot access) like we already do - but yes, we can definitely work with that suggestion.

@coretl
Copy link
Collaborator Author

coretl commented Apr 23, 2024

Or we could keep the composite:

@dataclass
class PlanDevices:
    backlight: Backlight = ixx.backlight()
    gonio: Gonio = ixx.gonio()
    detector: Detector = ixx.eiger()
    zebra: Zebra = ixx.zebra()

def inner_plan(parameters, devices: PlanDevices):
    yield from do_some_stuff(devices.zebra)
    yield from do_some_other_stuff(devices.gonio)

def outer_plan(parameters):
    devices = PlanDevices()
    yield from ensure_connected(*asdict(devices).values())
    yield from move_backlight(devices.backlight)
    yield from inner_plan(parameters, devices)
    yield from do_some_detector(devices.detector) 

@d-perl
Copy link
Contributor

d-perl commented Apr 23, 2024

yeah, that's very close to what we do now, seems fine

@coretl
Copy link
Collaborator Author

coretl commented Apr 26, 2024

More conversations with @callumforrester and other DAQ members led us to this proposal:

  • ophyd-async: make Device.connect(mock, timeout, force=False) be idempotent
  • ophyd-async: make ensure_connected(*devices) plan stub
  • dodal: make device_factory(startup_connect=False) decorator that makes, names, caches and optionally connects a device
  • dodal: make get_device_factories() that returns all device factories and whether they should be connected at startup
  • blueapi: call get_device_factories(), make all the Devices, connect the ones that should be connected at startup in parallel and log those that fail
  • blueapi: when plan is called, run ensure_connected on all plan args and defaults that are Devices

We can then iterate on this if the parallel connect causes a broadcast storm. We could also in future add a monitor to a heartbeat PV per device in Device.connect so that it would reconnect next time it was called.

@DominicOram are you happy with this approach? If so I'll go ahead and make the tickets in the 3 repos for these items and write and ADR in dodal.

@coretl
Copy link
Collaborator Author

coretl commented Apr 29, 2024

I've made 3 more targeted issues so closing this one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants