Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

group_id in romancal #1259

Closed
braingram opened this issue May 31, 2024 · 16 comments · Fixed by #1448
Closed

group_id in romancal #1259

braingram opened this issue May 31, 2024 · 16 comments · Fixed by #1448

Comments

@braingram
Copy link
Collaborator

For some operations (tweakreg, skymatch, resample) models are grouped based on a group_id.

This is currently defined in ModelContainer.models_grouped:

try:
group_id = "roman" + "_".join(
["".join(params[:3]), "".join(params[3:6]), params[6]]
)
model.meta["group_id"] = group_id
except TypeError:
model.meta["group_id"] = f"exposure{i + 1:04d}"

which assigns a computed value to model.meta.group_id

models_grouped is used in:

Currently group id is always defined in models_grouped ignoring any group_id in the model metadata (and ignoring any group_id in the association, which is used for jwst).

Most of the above steps are part of the mosaic pipeline with the only exception being tweakreg which is part of the exposure level pipeline.

There is no test for the exposure level pipeline that doesn't skip tweakreg so it is difficult to guess what should be done here.

For the mosaic pipeline it is possible the mosaic will contain models from different groups which will result in:

  • outlier detection using the median across groups to detect outlier (however in practice this isn't occurring as outlier detection it taking a median with a N=1)
  • skymatch can group models with the same group id into SkyGroup instances (which have their sky adjusted together). However this appears to be partially supported where skygroups are never created and instead the sky for each model is adjusted separately
  • resample will call many_to_one (it appears the single argument is not passed to ResampleData so this is likely always the case). During many_to_one each model is drizzled separately however the grouping is used for getting the number of pointings and for updating the exposure times.

Some questions I have:

  • Should outlier detection be using many to many and using the groups?
  • Should skymatch be adjusting the sky of models from the same group together (like is done for jwst)?
  • Should resample be using the groupings to determine the exposure time and number of pointings?
@nden
Copy link
Collaborator

nden commented May 31, 2024

group_id is cheaper to compute from the pool file and add to the association than to compute it when it's needed in the L3 pipeline. All the information needed to compute it should be in the pool file. Adding it will remove a lot of opening and closing of files in the Level 3 pipeline.
The fact that its use hasn't been formalized is a problem in both pipelines. JWST is moving towards formalizing its use and an example jwst association is here

https://github.com/spacetelescope/jwst/blob/master/jwst/datamodels/container.py#L102-L129

Is there a reason not to take advantage of this?

@nden
Copy link
Collaborator

nden commented May 31, 2024

@stscijgbot-rstdms

@stscijgbot-rstdms
Copy link
Collaborator

This issue is tracked on JIRA as RCAL-852.

@schlafly
Copy link
Collaborator

schlafly commented Jun 1, 2024

Thanks Brett. Broadly, I suspect that we should use this, with exposures defining the group; it would sound like that would mean adding group_id = L2 file name minus the suffix and the SCA name.

Some more specific comments:

  • in outlier detection, I suspect that in terms of the final output this doesn't matter, but practically using the groups will result in denser resampled images contributing to the median filtering (with the caveat, as you say, that the current median filtering isn't right)
  • Our goal for the sky matching was to match SCAs and not to match the whole array, so that "sounds right" to me for the moment. i.e., we wouldn't want to make SkyGroups that merge the whole array together. But I think I understand you to be saying that if we set these group ids that presently undesirable behavior wouldn't happen automatically.
  • For resample, it sounds like the exposure times will be fine either way---since the exposure times for the different SCAs of the same exposure are the same, we don't really care whether those max's and min's are grouped or ungrouped. The number of pointings would be more natural if we set the groups.

@braingram
Copy link
Collaborator Author

Thanks! To check that I'm following the discussion so far:

  • outlier detection could likely use grouping based on a pattern similar to the L2 file name
  • grouping is not needed in skymatch as the skys for SCAs will be matched separately. As it appears the group ids are already being ignored here I think this matches what is currently being done.
  • resample might be the other reason (aside from outlier detection) to keep the grouping API in the container (as it's currently used in this step)

Since there are 2 uses for grouping (outlier detection and resample) and the code currently expects the grouping to be done in the container it sound like it makes sense to keep the grouping API on the container. This also has the benefit that if another step wants to used the groupings it's already part of the common code (as it is now with ModelContainer).

One additional question is do we want to support having the group id in the association? As @nden mentioned this is much more efficient (currently models_grouped will open every model as that's the only way to determine the grouping). Allowing this would also make it easier to make the container part of the common jwst/romancal code (perhaps in stpipe or stcal) as jwst allows setting the group id in the association. Allowing group id to be defined in the association would mean that a user could modify the group id by changing the association (although the user could currently modify the group id by changing the files). I don't see this last point as a negative and given the other benefits do you think it's worth allowing the group id to be defined in the association?

@braingram
Copy link
Collaborator Author

Looking at the group id calculation the current ModelContainer generates a group id that doesn't quite match the l2 filenames (ignoring the SCA name and suffix).

in l2 filename in group id
program id ✔️
execution plan
pass
segment
visit number ✔️
visit group ✔️
sequence id ✔️
activity ✔️
exposure ✔️

Looking at the RAD schemas, would using meta.observation.obs_id make sense:
https://github.com/spacetelescope/rad/blob/5ccb720777ed90faf13897ef92adf738693d8024/src/rad/resources/schemas/observation-1.0.0.yaml#L9
Is this used and kept up-to-date? If so it looks to be the same as the l2 filename minus the SCA and suffix.

@schlafly
Copy link
Collaborator

schlafly commented Jun 3, 2024

  • outlier detection could likely use grouping based on a pattern similar to the L2 file name
  • grouping is not needed in skymatch as the skys for SCAs will be matched separately. As it appears the group ids are already being ignored here I think this matches what is currently being done.
  • resample might be the other reason (aside from outlier detection) to keep the grouping API in the container (as it's currently used in this step)

Yes. Re sky-matching, we're not intending to fit a single offset per exposure at present, but there is a lot of uncertainty about what level of flexibility the sky matching will end up needing to have. So if it's very easy to have an option to either group by exposure or do individual SCAs, that would be nice. But it sounds like we're following our intended baseline.

One additional question is do we want to support having the group id in the association? As @nden mentioned this is much more efficient (currently models_grouped will open every model as that's the only way to determine the grouping). Allowing this would also make it easier to make the container part of the common jwst/romancal code (perhaps in stpipe or stcal) as jwst allows setting the group id in the association. Allowing group id to be defined in the association would mean that a user could modify the group id by changing the association (although the user could currently modify the group id by changing the files). I don't see this last point as a negative and given the other benefits do you think it's worth allowing the group id to be defined in the association?

That's fine; I don't feel strongly here. The only vague awkwardness about defining it in the file is that I really don't see many cases where we want the group to be something other than an exposure. So there's a little redundancy, but if that redundancy leads us to match Webb it's probably worth it.

I can confirm that recently all L1 files produced by romanisim started including obs_id, and that it looks to me to include all we need to identify an exposure. A potential implementation concern is that it would be easy to try to make some fake exposures by copying real exposures and changing the file names, believing that one was generating exposures that would be treated as separate exposures. But if we fail to update obs_id you could be surprised when they're all part of the same exposure. Or, relatedly, we keep the same information in the actual file name, in meta.filename, and in obsid, and it can be hard to keep these in sync. So I might prefer a default group based on the first N characters of the file name. But I don't feel strongly here.

@braingram
Copy link
Collaborator Author

Keeping things in sync is definitely a concern. Perhaps this is off topic but it looks like obs_id is not used in romancal:
https://github.com/search?q=repo%3Aspacetelescope%2Fromancal+obs_id&type=code
Since this is a combination of other metadata is there a reason it's a separate metadata entry (and not a dynamic property of the datamodel)? The same is true for visit_id and in some sense meta.filename could also be dynamic (perhaps a filename property of the datamodel). This might help to avoid issues like #818

@schlafly
Copy link
Collaborator

schlafly commented Jun 3, 2024

Right, obs_id is set in romanisim during the creation of the L1 files and then not changed by romancal. Yes, I would be sympathetic to removing meta.filename, but I'm told stpipe cares about it and so that would be a bigger lift. I do think there's a use to having an overall exposure id (obs_id presently?) in the downstream archive, etc., so I'm not really militating for getting rid of obs_id.

@braingram
Copy link
Collaborator Author

Thanks! Good point about stpipe and meta.filename. Let's leave any changes to that to a separate issue. It's certainly confusing to me.

@mairanteodoro
Copy link
Collaborator

@braingram I supposed this issue has been resolved by ModelLibrary, correct?

@braingram
Copy link
Collaborator Author

Thanks! Yeah. I'll close the issue.

@schlafly
Copy link
Collaborator

schlafly commented Oct 9, 2024

Part of this closed issue was what Roman wants to do for grouping. I think conceptually the answer is:

  • group by exposure / obs_id for outlier detection, tweakreg, and resample
  • group by SCA for sky matching
    Did ModelLibrary solve that for us naturally? Sorry, I forgot about this detail.

@braingram braingram reopened this Oct 10, 2024
@braingram
Copy link
Collaborator Author

I reopened this as I think what you described doesn't match what's currently used in romancal.

ModelLibrary matched the previous grouping by ModelContainer and now generates a group_id using:

"roman{program}{observation}{visit}"
"_{visit_file_group}{visit_file_sequence}{visit_file_activity}"
"_{exposure}"

which is used for all steps that use grouping. The steps that use grouping are:

  • tweakreg
  • outlier detection
  • resample

This doesn't quite match obs_id (see #1259 (comment)).

There is also no grouping in sky match. Is that the same as "group by SCA"?

@schlafly
Copy link
Collaborator

Thank you! It looks to me like obs_id is more correct; i.e., the current approach will group exposures that have different execution plans, passes, and segments, but the same other keywords.

The way that I think about this, Roman data is a stream of exposures, each of which has 18 SCAs. If I controlled the Roman file names, they would just be r{EXPID}_wfi{SCA}.asdf, where EXPID just incremented exposure by exposure. We've chosen a more complicated structure for EXPID that has a lot of different pieces that we have endowed with some meaning, but it's rather artificial and we don't want the grouping to know about it. I expect library.py should just use obsid.

No grouping in sky match is the same as grouping by SCA, since each WfiImage is one SCA, so no changes there, thanks.

@braingram
Copy link
Collaborator Author

Switching the grouping to obs_id should be a straightforward edit of the above linked group id calculation. There are issues that will pop up with fake data in unit tests since obs_id might be a dummy value etc. It's also possible that obs_id will be equal to some value that isn't the same as the combined metadata that makes up the string. I'm not very worried about that for real files and if romanisim fills in the value for simulated data perhaps there's no benefit to trying to make roman_datamodels compute obs_id by combining the values.

I started a draft PR here:
#1448
and a regression tests run:
https://github.com/spacetelescope/RegressionTests/actions/runs/11281421723

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants