Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scalar coords vs. concat #1151

Open
crusaderky opened this issue Dec 3, 2016 · 11 comments
Open

Scalar coords vs. concat #1151

crusaderky opened this issue Dec 3, 2016 · 11 comments
Labels
topic-combine combine/concat/merge

Comments

@crusaderky
Copy link
Contributor

Why does this work:

>> import xarray
>> a = xarray.DataArray([1, 2, 3], dims=['x'], coords={'y': 10})
>> b = xarray.DataArray([4, 5, 6], dims=['x'])
>> a + b
<xarray.DataArray (x: 3)>
array([5, 7, 9])
Coordinates:
    y        int64 10

But this doesn't?

>> xarray.concat([a, b], dim='x')
KeyError: 'y'

It doesn't seem coherent to me...

@shoyer
Copy link
Member

shoyer commented Dec 5, 2016

I agree -- we really should do an outer join of variables/coordinates in concat. I think here is another GitHub issue tracking this (this has certainly been raised before).

@crusaderky
Copy link
Contributor Author

crusaderky commented Dec 5, 2016

I wrote down this.

  1. should I simply add it to xarray.broadcast()?
  2. can you spot any logical flaws?
  3. suggestions for cleaner code?
    [edit] removed obsolete code

@crusaderky
Copy link
Contributor Author

I updated the code and tested it in my project - it works fine.
Any suggestions before I start integrating it inside xarray.broadcast()?

def broadcast_scalar_coords(*args, exclude=()):
    """Broadcast scalar coords whenever they're unambiguous,
    and remove them when they are ambiguous.
    This allows replicating the behaviour of addition
    or moltiplication of xarrays in concatenation.
    """
    unambiguous = {}
    ambiguous = set()
    for arg in args:
        for k, v in arg.coords.items():
            if k in exclude:
                continue
            # non-scalar coords are automatically ambiguous
            if v.shape == ():
                if k not in ambiguous:
                    v = v.values.tolist()
                    if unambiguous.get(k, v) != v:
                        del unambiguous[k]
                        ambiguous.add(k)
                    else:
                        unambiguous[k] = v
            else:
                ambiguous.add(k)
                unambiguous.pop(k, None)

    args = tuple(arg.copy(deep=False) for arg in args)
    for arg in args:
        for k, v in unambiguous.items():
            arg.coords[k] = v
        for k in ambiguous:
            if k in arg.coords and arg.coords[k].shape == ():
                del arg.coords[k]
    return args

@shoyer
Copy link
Member

shoyer commented Dec 27, 2016

@crusaderky I don't understand the distinction between ambiguous/unambiguous coords -- can you clarify? Also, why treat scalar coordinates differently? Most xarray machinery doesn't special case arrays of a certain dimensionality.

@crusaderky
Copy link
Contributor Author

@shoyer the end goal is to replicate in concat() the same behaviour you already hav in __add__ and __mul__.
So take for example

a = xarray.DataArray([1, 2, 3], dims=['x'], coords={'y': 10})
b = xarray.DataArray([4, 5, 6], dims=['x'])
c = xarray.DataArray([7, 8, 9], dims=['x'], coords={'y': 20})

a+b -> y is propagated to the result (unambiguous)
a+c -> y is discarded (ambiguous)

I suppose I could look at how this happens in __add__ and move the logic up into broadcast() (which __add__ should already invoke)?

@shoyer
Copy link
Member

shoyer commented Dec 28, 2016

@crusaderky Ah, I understand now. I agree that this makes sense for concatenating along an existing dimension (e.g., xarray.concat([a, b], dim='x') or xarray.concat([a, c], dim='x')) if the variables do not have the dimension to be concatenated. The existing logic to handle merging coordinates with possible dropping is the merge_variables function in merge.py (if compat='minimal') -- note that none of the logic is specific to scalar coordinates.

For concatenating along a new dimension (e.g., xarray.concat([a, b], dim='z') or xarray.concat([a, c], dim='z')), I think we would want to default to a scalar coordinate of the appropriate missing value (e.g., coords={'y': np.nan}).

@crusaderky
Copy link
Contributor Author

crusaderky commented Dec 29, 2016

I just realised that xarray today already implements meaningful logic when concatenating between a and c in my example above - both on an existing dimension as well as on a new one. It's just a and b that don't work. I agree that defaulting to NaN would be a good option (regardless if the dimension is new or existing).

So to recap, when concatenating [a, b, c], where b does not have the y coord:

  • if the dtype of the y coord in both a and c is float, numpy.float32, or numpy.float64, fill in with NaN
  • if it's any datetime format, fill in with NaT
  • if it's strings, fill in with empty string??
  • in any other case, convert everything to object and fill in with None

Correct?
Is there any helper function to facilitate this?

@shoyer
Copy link
Member

shoyer commented Dec 29, 2016

Is there any helper function to facilitate this?

Yes, indeed. _maybe_promote returns dtype and fill_value for inserting missing values, based on an original dtype. I guess it makes sense to start with calling one of the numpy utilities such as numpy.result_type for finding the common dtype to insert into _maybe_promote.

@stale
Copy link

stale bot commented Jan 24, 2019

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity
If this issue remains relevant, please comment here; otherwise it will be marked as closed automatically

@stale stale bot added the stale label Jan 24, 2019
@gimperiale
Copy link
Contributor

This issue is still valid as of xarray 0.11.0

@stale stale bot closed this as completed Mar 3, 2019
@shoyer shoyer reopened this Mar 3, 2019
@stale stale bot removed the stale label Mar 3, 2019
@dcherian
Copy link
Contributor

xr.concat([a, b], dim="x") was fixed by #3769

xr.concat([a, b, c], dim="x" still gives an error

ValueError: 'y' not present in all datasets and coords='different'. Either add 'y' to datasets where it is missing or specify coords='minimal'.

@dcherian dcherian added the topic-combine combine/concat/merge label Jul 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-combine combine/concat/merge
Projects
None yet
Development

No branches or pull requests

4 participants