better handling of invalid files in open_mfdataset #6736

Open
vnoel opened this issue Jun 29, 2022 · 4 comments · May be fixed by #9955

Comments

@vnoel
Contributor

vnoel commented Jun 29, 2022

Is your feature request related to a problem?

Suppose I'm trying to read a large number of netCDF files with open_mfdataset.

Now suppose that one of those files is for some reason incorrect -- for instance there was a problem during the creation of that particular file, and its file size is zero, or it is not valid netCDF. The file exists, but it is invalid.

Currently open_mfdataset will raise an exception with the message
ValueError: did not find a match in any of xarray's currently installed IO backends

As far as I can tell, there is currently no way to identify which one(s) of the files being read are the source of the problem. If there are several hundred files, finding the problematic ones is a task in itself, even though xarray probably already knows which they are.
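For the record, the manual workaround amounts to something like the loop below (the file pattern is hypothetical; this is just a sketch of the hunt one has to do by hand):

import glob

import xarray as xr

paths = sorted(glob.glob("data/*.nc"))  # hypothetical input files
bad = []
for path in paths:
    try:
        # Open and immediately close each file just to see if it is readable
        xr.open_dataset(path).close()
    except (ValueError, OSError) as err:
        bad.append((path, err))
print("invalid files:", bad)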

Describe the solution you'd like

It would be most useful to this particular user if the error message could somehow identify the file(s) responsible for the exception.

Apart from better reporting, I would find it very useful if I could pass open_mfdataset some kind of argument that makes it ignore invalid files altogether (an ignore_invalid keyword, defaulting to False, comes to mind).
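Sketched as a call, the proposal would look something like this (ignore_invalid does not exist today; the name is only this proposal):

import xarray as xr

# Proposed, not an existing xarray keyword: skip unreadable inputs
# instead of raising, ideally warning with the name of each skipped file.
ds = xr.open_mfdataset("data/*.nc", ignore_invalid=True)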

Describe alternatives you've considered

No response

Additional context

No response

@dcherian
Contributor

It would be most useful to this particular user if the error message could somehow identify the file(s) responsible for the exception.

+1. You could make this change in open_dataset, and it would then surface through open_mfdataset too. Attempting to read a bad netCDF file is a common source of trouble, so an error saying something like

Reading file XXX failed. The file is possibly corrupted, or the file path is wrong.

would be quite helpful!
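A minimal sketch of such a change, wrapping xarray's internal engine guessing (plugins.guess_engine is internal API, so the details may differ):

from xarray.backends import plugins

def guess_engine_with_path(filename):
    # Sketch: chain the generic backend error with the offending path.
    try:
        return plugins.guess_engine(filename)
    except ValueError as err:
        raise ValueError(
            f"Reading file {filename!r} failed. The file is possibly "
            "corrupted, or the file path is wrong."
        ) from err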

I would find it very useful if I could pass to open_mfdataset some kind of argument that would make it ignore invalid files altogether (ignore_invalid=False comes to mind).

This I'm not sure about because a user wouldn't know if they were missing some data in the middle...

@yt87

yt87 commented Jul 9, 2023

My vote is to have both: a warning, and an option to fill missing data with NaNs. My use case:

I have an archive of 15 years of monthly forecasts. For one month, one of the ensemble members is missing. I am converting the binary format to zarr. The code is:

import numpy as np
import xarray as xr

# BinaryBackend is my custom backend class; paths and the *_ix
# dimension names are defined earlier in the conversion script.
ds = xr.open_mfdataset(
    paths,
    engine=BinaryBackend,
    dtype=np.float32,  # passed through to the custom backend
    combine="nested",
    concat_dim=(ensmem_ix, fcsttime_ix, reftime_ix),
    parallel=False,
).rename_vars(foo="sic")

Currently, my only option is to remove the remaining ensemble members' data files before processing. Since I have to use a custom backend (based on https://github.com/aurghs/xarray-backend-tutorial), I tried adding code that returns an array filled with NaNs when np.fromfile() fails. That, however, is not enough: the missing file is also accessed in _chunk_ds() in xarray/backends/api.py, to create a token for dask. That could easily be handled by adding a try ... except block.
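The read-side fallback I tried looks roughly like this (the shape and dtype arguments are hypothetical; the real backend follows the tutorial linked above):

import numpy as np

def read_raw(path, shape, dtype=np.float32):
    # If the binary read fails (missing or truncated file), return a
    # NaN-filled array of the expected shape instead of raising.
    try:
        return np.fromfile(path, dtype=dtype).reshape(shape)
    except (OSError, ValueError):
        return np.full(shape, np.nan, dtype=dtype)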


@max-sixty
Collaborator

Contributions welcome!

@pratiman-91 linked a pull request Jan 16, 2025 that will close this issue