
Implement groupby for continuous dimensions #1018

Open
jbednar opened this issue Dec 13, 2016 · 13 comments
Labels
type: feature A major new feature

Comments

jbednar (Member) commented Dec 13, 2016

HoloViews supports a groupby operation for discrete/categorical dimensions, but as far as I can see there is no support for grouping over a continuous dimension, which requires a specified bin width. The xarray interface might provide this already (#804), but for pandas the separate pd.cut function would seem to be needed. Philipp suggests adding a method:

import numpy as np
import pandas as pd
import holoviews as hv

def groupby_bin(dataset, dimension, bins=10):
    """Group a Dataset over a continuous dimension by binning it."""
    dimension = dataset.get_dimension(dimension)
    values = dataset.dimension_values(dimension)
    other_dims = [d for d in dataset.kdims if d is not dimension]
    # Compute the bin edges for the continuous dimension
    cats, bins = pd.cut(values, bins, retbins=True)
    hmap = hv.HoloMap(kdims=[dimension])
    # Key each group by the center of its bin
    for i in range(1, len(bins)):
        start, end = bins[i-1], bins[i]
        mid = np.mean([start, end])
        hmap[mid] = dataset.select(**{dimension.name: (start, end)}).reindex(other_dims)
    return hmap

but I have not tested this.

philippjfr (Member) commented Jan 28, 2017

So I've been rethinking this. Rather than implementing another groupby_bin method or folding this into the existing groupby method, I think it would be useful to make the binning a separate operation. I'd suggest calling it resample, allowing you to resample the data along any key dimensions by supplying the number of bins or an explicit list of bins. The method would then replace the coordinates that fall into a particular bin with the supplied label or bin coordinate. Note that in order to resample a gridded dataset you will also have to supply an aggregation function. I believe this would be useful independently of groupby, and the groupby_bin example above could be achieved by running something like dataset.resample(x=10).groupby('x').
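
As a rough, untested sketch of what that binning step could amount to today for a pandas-backed dataset (the column names and bin count below are purely illustrative), the coordinates can be snapped to bin centres with pd.cut before grouping:

import numpy as np
import pandas as pd
import holoviews as hv

# Illustrative data; 'x' is the continuous dimension to bin.
df = pd.DataFrame({'x': np.random.rand(1000),
                   'y': np.random.rand(1000),
                   'z': np.random.randn(1000)})

# Snap each x sample to the centre of the bin it falls into.
cats, edges = pd.cut(df['x'], 10, retbins=True)
centers = (edges[:-1] + edges[1:]) / 2
df['x'] = centers[cats.cat.codes]

# Grouping on the snapped coordinate now behaves like grouping over bins.
binned = hv.Dataset(df, kdims=['x', 'y'], vdims=['z'])
grouped = binned.groupby('x')  # HoloMap of Datasets keyed on the ten bin centres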

philippjfr (Member) commented:

Prototyped a rebinning function, which becomes a poor man's datashader when combined with a HeatMap:

https://anaconda.org/philippjfr/rebinning/notebook

jbednar (Member, Author) commented Jan 28, 2017

Sounds like a good idea, but "resample" doesn't convey what it does to me; isn't it rasterize?

philippjfr (Member) commented:
No, the aggregation steps and the HeatMap constructor are doing the rasterization in those examples. The rebinning step just replaces existing samples with the center coordinate (or label) of each bin. To really support this properly we'll have to add support for bin edges to the grid-based interfaces, which will also unify Histogram and QuadMesh with the other types (as outlined in #547).

philippjfr (Member) commented:
That said, I'm also not sure about resample; pandas uses it for largely the same purpose, but there it only works for datetime indexes. It really just bins the values along one or more dimensions into the supplied bins, along with optional labels for each bin. What it doesn't do is actually aggregate each n-dimensional bin, i.e. it will retain multiple samples per bin. Rasterization is the combined operation of binning and aggregating 2D coordinates with associated value dimensions.
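
To make the distinction concrete in plain pandas terms (an illustrative, untested sketch, not the proposed HoloViews API; all names below are made up): binning only snaps the coordinates and keeps every sample, while rasterization additionally aggregates down to one value per 2D bin:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.rand(1000),
                   'y': np.random.rand(1000),
                   'z': np.random.randn(1000)})

# Binning: snap x and y to bin centres; multiple samples per (x, y) bin remain.
for col in ['x', 'y']:
    cats, edges = pd.cut(df[col], 10, retbins=True)
    df[col] = ((edges[:-1] + edges[1:]) / 2)[cats.cat.codes]

# Rasterization: additionally aggregate, leaving a single z value per 2D bin.
raster = df.groupby(['x', 'y'])['z'].mean().unstack()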

jbednar (Member, Author) commented Jan 29, 2017

Ah, true; I didn't look closely enough; "rebin" here is definitely not rasterizing. But it's not "rebinning", either, because there aren't any "bins" (in the histogram sense) at all, whether before or after that particular operation. All this actually does is to snap continuous-valued samples onto a grid, right?

I don't remember why I posted this originally, so I can't remember what is useful about snapping things onto a grid if not also aggregating them to get a single value per grid cell. E.g. I don't see how that makes constructing a heatmap any easier (since one still needs to select values in the bin's range before aggregating them). I'm obviously missing/misremembering something.

jbednar (Member, Author) commented Jan 29, 2017

I also don't think "resample" is a valid description. For something of that name, I'd imagine reconstructing some underlying continuous function (e.g. a probability density) by interpolation, and then resampling it. E.g. one could resample it on a grid, leaving one sample per grid point, which would amount to rasterization, or one could resample it at arbitrary points, which would justify the more general "resample" description. But this operation requires a grid (unlike resampling), leaves multiple samples per grid point (which doesn't fit the idea of a continuous underlying function), and does no interpolation (which seems unlike resampling). So, unless I'm confused again, this doesn't sound like resampling to me.

philippjfr (Member) commented Jan 29, 2017

> I don't remember why I posted this originally, so I can't remember what is useful about snapping things onto a grid if not also aggregating them to get a single value per grid cell.

You might have a bunch of flight coordinates along with altitude, but want to split them into a number of categories, so you bin the altitudes into three groups, 'low', 'medium' and 'high', and then group by altitude. This would be expressed as something like:

dataset.bin(altitude=dict(bins=3, label=['low', 'medium', 'high'])).groupby('altitude')
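
For reference, something close to this hypothetical bin method can already be approximated with pd.cut labels before constructing the Dataset (an untested sketch; the dimension names and sizes below are invented for illustration):

import numpy as np
import pandas as pd
import holoviews as hv

flights = pd.DataFrame({'lon': np.random.uniform(-10, 10, 500),
                        'lat': np.random.uniform(40, 60, 500),
                        'altitude': np.random.uniform(0, 12000, 500)})

# Bin the continuous altitude into three labelled categories, then group on them.
flights['altitude'] = pd.cut(flights['altitude'], 3, labels=['low', 'medium', 'high'])
ds = hv.Dataset(flights, kdims=['lon', 'lat', 'altitude'])
grouped = ds.groupby('altitude')  # HoloMap with one Dataset per altitude band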

> I don't see how that makes constructing a heatmap any easier (since one still needs to select values in the bin's range before aggregating them). I'm obviously missing/misremembering something.

Don't quite know what you mean.

> So, unless I'm confused again, this doesn't sound like resampling to me.

True, it's really just binning the coordinates along one or more columns/dimensions. The main issue here is that our datasets do not yet support storing any actual bins; instead we have to assign a label or simply take the center of the bin, thereby "snapping" to a grid.

philippjfr (Member) commented:
Maybe it's best to put this on hold until we've made more progress on #547, which will let us represent actual bins.

jbednar (Member, Author) commented Jan 29, 2017

Sounds like a good idea.

> Don't quite know what you mean.

I haven't been able to work out what the actual code you wrote does, so I'm trying to approach this at a fundamental mathematical level, and probably failing. Perhaps that's because all the examples I saw already include aggregation, so I can't see what the actual "rebin" operation does before aggregation.

I'm imagining that what it does is to rewrite the coordinates to snap them to a grid, leaving a flat data structure with points that happen to line up but are otherwise in the same type of data structure that they started in. But maybe that's not true, and it's now hierarchical, with actual container objects for each bin, containing a bunch of points per bin. If there are such containers, then this operation would make complete sense to me as a groupby, i.e. the first part of making a histogram. If there are not, i.e. the resulting data structure is flat but snapped, then I don't see what the utility is. Probably best explained verbally. :-)

philippjfr (Member) commented:
#547 is now merged, so we can store bins properly; we could now revisit this.

philippjfr modified the milestones: v1.10, v2.0 on Feb 10, 2018
TylerTCF commented May 1, 2018

Would this resample allow users to take a datetime kdim and aggregate to a different sampling frequency, e.g. hourly measurements being aggregated into daily, monthly, or yearly intervals?

philippjfr (Member) commented:
This can now technically be done with the new transform method:

import numpy as np
import holoviews as hv
ds = hv.Dataset(np.random.randn(1000, 3), ['x', 'y'], 'z')
ds.transform(x=hv.dim('x').bin(np.linspace(-1, 1, 11))).groupby('x').apply(hv.Scatter)
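
For the datetime question above, a hedged workaround (untested, and a plain pandas-level sketch rather than a HoloViews API; the column names and frequency are illustrative) is to aggregate to the coarser frequency before constructing the element:

import numpy as np
import pandas as pd
import holoviews as hv

# Hourly measurements over 30 days (illustrative data).
times = pd.date_range('2018-01-01', periods=24 * 30, freq='H')
df = pd.DataFrame({'time': times, 'value': np.random.randn(len(times))})

# Resample the hourly samples down to daily means, then hand the result to HoloViews.
daily = df.resample('D', on='time').mean().reset_index()
curve = hv.Curve(daily, 'time', 'value')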
