
Implement groupby for continuous dimensions #1018

Open
jbednar opened this issue Dec 13, 2016 · 13 comments
Labels
type: feature A major new feature

Comments

jbednar (Member) commented Dec 13, 2016

HoloViews supports a groupby operation for discrete/categorical dimensions, but as far as I can see there is no support for grouping over a continuous dimension, which requires a specified bin width. The xarray interface might provide this already (#804), but for pandas the separate pd.cut function would seem to be needed. Philipp suggests adding a method:

import numpy as np
import pandas as pd
import holoviews as hv

def groupby_bin(dataset, dimension, bins=10):
    """Group a Dataset over a continuous dimension by binning it."""
    dimension = dataset.get_dimension(dimension)
    values = dataset.dimension_values(dimension)
    other_dims = [d for d in dataset.kdims if d is not dimension]
    # Compute the bin edges for the continuous dimension
    cats, bins = pd.cut(values, bins, retbins=True)
    hmap = hv.HoloMap(kdims=[dimension])
    # Key each group by the center of its bin
    for i in range(1, len(bins)):
        start, end = bins[i-1], bins[i]
        mid = np.mean([start, end])
        hmap[mid] = dataset.select(**{dimension.name: (start, end)}).reindex(other_dims)
    return hmap

but I have not tested this.

philippjfr (Member) commented Jan 28, 2017

So I've been rethinking this. Rather than implementing another groupby_bin method or folding this into the existing groupby method, I think it would be useful to make the binning a separate operation. I'd suggest calling it resample, allowing you to resample the data along any key dimensions by supplying the number of bins or an explicit list of bins. The method would then replace the coordinates that fall into a particular bin with the supplied label or bin coordinate. Note that in order to resample a gridded dataset you will also have to supply an aggregation function. I believe this would be useful independently of groupby, and the groupby_bin example above could be achieved by running something like dataset.resample(x=10).groupby('x').
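
As a rough, untested sketch of what that binning step could amount to today for a pandas-backed dataset (the column names and bin count below are purely illustrative), the coordinates can be snapped to bin centres with pd.cut before grouping:

import numpy as np
import pandas as pd
import holoviews as hv

# Illustrative data; 'x' is the continuous dimension to bin.
df = pd.DataFrame({'x': np.random.rand(1000),
                   'y': np.random.rand(1000),
                   'z': np.random.randn(1000)})

# Snap each x sample to the centre of the bin it falls into.
cats, edges = pd.cut(df['x'], 10, retbins=True)
centers = (edges[:-1] + edges[1:]) / 2
df['x'] = centers[cats.cat.codes]

# Grouping on the snapped coordinate now behaves like grouping over bins.
binned = hv.Dataset(df, kdims=['x', 'y'], vdims=['z'])
grouped = binned.groupby('x')  # HoloMap of Datasets keyed on the ten bin centres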

philippjfr (Member) commented:

Prototyped a rebinning function, which becomes a poor man's datashader when combined with a HeatMap:

https://anaconda.org/philippjfr/rebinning/notebook

jbednar (Member, Author) commented Jan 28, 2017

Sounds like a good idea, but "resample" doesn't convey what it does to me; isn't it rasterize?

philippjfr (Member) commented:
No, the aggregation steps and the HeatMap constructor are doing the rasterization in those examples. The rebinning step just replaces existing samples with the center coordinate (or label) of each bin. To really support this properly we'll have to add support for bin edges to the grid-based interfaces, which will also unify Histogram and QuadMesh with the other types (as outlined in #547).

philippjfr (Member) commented:
That said, I'm also not sure about resample; pandas uses it for largely the same purpose, but there it only works for datetime indexes. It really just bins the values along one or more dimensions into the supplied bins, along with optional labels for each bin. What it doesn't do is actually aggregate each n-dimensional bin, i.e. it will retain multiple samples per bin. Rasterization is the combined operation of binning and aggregating 2D coordinates with associated value dimensions.
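
To make the distinction concrete in plain pandas terms (an illustrative, untested sketch, not the proposed HoloViews API; all names below are made up): binning only snaps the coordinates and keeps every sample, while rasterization additionally aggregates down to one value per 2D bin:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.rand(1000),
                   'y': np.random.rand(1000),
                   'z': np.random.randn(1000)})

# Binning: snap x and y to bin centres; multiple samples per (x, y) bin remain.
for col in ['x', 'y']:
    cats, edges = pd.cut(df[col], 10, retbins=True)
    df[col] = ((edges[:-1] + edges[1:]) / 2)[cats.cat.codes]

# Rasterization: additionally aggregate, leaving a single z value per 2D bin.
raster = df.groupby(['x', 'y'])['z'].mean().unstack()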

jbednar (Member, Author) commented Jan 29, 2017

Ah, true; I didn't look closely enough; "rebin" here is definitely not rasterizing. But it's not "rebinning", either, because there aren't any "bins" (in the histogram sense) at all, whether before or after that particular operation. All this actually does is to snap continuous-valued samples onto a grid, right?

I don't remember why I posted this originally, so I can't remember what is useful about snapping things onto a grid if not also aggregating them to get a single value per grid cell. E.g. I don't see how that makes constructing a heatmap any easier (since one still needs to select values in the bin's range before aggregating them). I'm obviously missing/misremembering something.

jbednar (Member, Author) commented Jan 29, 2017

I also don't think "resample" is a valid description. For something of that name, I'd imagine reconstructing some underlying continuous function (e.g. a probability density) by interpolation, and then resampling it. E.g. one could resample it on a grid, leaving one sample per grid point, which would amount to rasterization, or one could resample it at arbitrary points, which would justify the more general "resample" description. But this operation requires a grid (unlike resampling), leaves multiple samples per grid point (which doesn't fit the idea of a continuous underlying function), and does no interpolation (which seems unlike resampling). So, unless I'm confused again, this doesn't sound like resampling to me.

philippjfr (Member) commented Jan 29, 2017

> I don't remember why I posted this originally, so I can't remember what is useful about snapping things onto a grid if not also aggregating them to get a single value per grid cell.

You might have a bunch of flight coordinates along with altitude, but want to split them into a number of categories, so you bin the altitudes into three groups, 'low', 'medium' and 'high', and then group by altitude. This would be expressed as something like:

dataset.bin(altitude=dict(bins=3, label=['low', 'medium', 'high'])).groupby('altitude')
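
For reference, something close to this hypothetical bin method can already be approximated with pd.cut labels before constructing the Dataset (an untested sketch; the dimension names and sizes below are invented for illustration):

import numpy as np
import pandas as pd
import holoviews as hv

flights = pd.DataFrame({'lon': np.random.uniform(-10, 10, 500),
                        'lat': np.random.uniform(40, 60, 500),
                        'altitude': np.random.uniform(0, 12000, 500)})

# Bin the continuous altitude into three labelled categories, then group on them.
flights['altitude'] = pd.cut(flights['altitude'], 3, labels=['low', 'medium', 'high'])
ds = hv.Dataset(flights, kdims=['lon', 'lat', 'altitude'])
grouped = ds.groupby('altitude')  # HoloMap with one Dataset per altitude band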

> I don't see how that makes constructing a heatmap any easier (since one still needs to select values in the bin's range before aggregating them). I'm obviously missing/misremembering something.

Don't quite know what you mean.

> So, unless I'm confused again, this doesn't sound like resampling to me.

True, it's really just binning the coordinates along one or more columns/dimensions. The main issue here is that our datasets do not yet support storing any actual bins; instead we have to assign a label or simply take the center of the bin, thereby "snapping" to a grid.

philippjfr (Member) commented:
Maybe it's best to put this on hold until we've made more progress on #547, which will let us represent actual bins.

jbednar (Member, Author) commented Jan 29, 2017

Sounds like a good idea.

> Don't quite know what you mean.

I haven't been able to work out what the actual code you wrote does, so I'm trying to approach this at a fundamental mathematical level, and probably failing. Perhaps that's because all the examples I saw already include aggregation, so I can't see what the actual "rebin" operation does before aggregation.

I'm imagining that what it does is to rewrite the coordinates to snap them to a grid, leaving a flat data structure with points that happen to line up but are otherwise in the same type of data structure that they started in. But maybe that's not true, and it's now hierarchical, with actual container objects for each bin, containing a bunch of points per bin. If there are such containers, then this operation would make complete sense to me as a groupby, i.e. the first part of making a histogram. If there are not, i.e. the resulting data structure is flat but snapped, then I don't see what the utility is. Probably best explained verbally. :-)

philippjfr (Member) commented:
#547 is now merged, so we can store bins properly; we could now revisit this.

philippjfr modified the milestones: v1.10, v2.0 on Feb 10, 2018
TylerTCF commented May 1, 2018

Would this resample allow users to take a datetime kdim and aggregate to a different sampling frequency, e.g. hourly measurements being aggregated into daily, monthly, or yearly intervals?

philippjfr (Member) commented:
This can now technically be done with the new transform method:

import numpy as np
import holoviews as hv
ds = hv.Dataset(np.random.randn(1000, 3), ['x', 'y'], 'z')
ds.transform(x=hv.dim('x').bin(np.linspace(-1, 1, 11))).groupby('x').apply(hv.Scatter)
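
For the datetime question above, a hedged workaround (untested, and a plain pandas-level sketch rather than a HoloViews API; the column names and frequency are illustrative) is to aggregate to the coarser frequency before constructing the element:

import numpy as np
import pandas as pd
import holoviews as hv

# Hourly measurements over 30 days (illustrative data).
times = pd.date_range('2018-01-01', periods=24 * 30, freq='H')
df = pd.DataFrame({'time': times, 'value': np.random.randn(len(times))})

# Resample the hourly samples down to daily means, then hand the result to HoloViews.
daily = df.resample('D', on='time').mean().reset_index()
curve = hv.Curve(daily, 'time', 'value')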
