Implement groupby for continuous dimensions #1018
So I've been rethinking this. I think rather than implementing another
Prototyped a rebinning function, which becomes a poor man's datashader when combined with a HeatMap:
Sounds like a good idea, but "resample" doesn't convey what it does to me; isn't it rasterize?
No, the aggregation steps and the HeatMap constructor are doing the rasterization in those examples. The rebinning step just replaces existing samples with the center coordinate (or label) of each bin. To really support this properly we'll have to add support for bin edges to grid-based interfaces, which will also unify Histogram and QuadMesh with the other types (as outlined in #547).
That said I'm also not sure about
Ah, true; I didn't look closely enough; "rebin" here is definitely not rasterizing. But it's not "rebinning", either, because there aren't any "bins" (in the histogram sense) at all, whether before or after that particular operation. All this actually does is snap continuous-valued samples onto a grid, right? I don't remember why I posted this originally, so I can't remember what is useful about snapping things onto a grid if not also aggregating them to get a single value per grid cell. E.g. I don't see how that makes constructing a heatmap any easier (since one still needs to select values in the bin's range before aggregating them). I'm obviously missing/misremembering something.
I also don't think "resample" is a valid description. For something of that name, I'd imagine reconstructing some underlying continuous function (e.g. a probability density) by interpolation, and then resampling it. E.g. one could resample it on a grid, leaving one sample per grid point, which would amount to rasterization, or one could resample it at arbitrary points, which would justify the more general "resample" description. But this operation requires a grid (unlike resampling), leaves multiple samples per grid point (which doesn't fit the idea of a continuous underlying function), and does no interpolation (which seems unlike resampling). So, unless I'm confused again, this doesn't sound like resampling to me.
You might have a bunch of flight coordinates along with altitude, but want to split them into a number of categories, so you bin the altitudes into three groups, 'low', 'medium' and 'high', and then groupby altitude. This would be expressed as something like:
    dataset.bin(altitude=dict(bins=3, label=['low', 'medium', 'high'])).groupby('altitude')
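To make the "bin, then groupby" idea concrete, here is a minimal dependency-free sketch. It is not the proposed `dataset.bin` API and does not use HoloViews at all; the helper `bin_labels`, the equal-width binning rule, and the flight samples are all assumptions made up for illustration.

```python
from collections import defaultdict

def bin_labels(values, labels):
    """Assign each value to one of len(labels) equal-width bins over the data range.

    Hypothetical helper for illustration only; not part of HoloViews.
    """
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 if all values are identical (avoids dividing by zero).
    width = (hi - lo) / len(labels) or 1.0
    out = []
    for v in values:
        # Clamp so the maximum value lands in the last bin rather than past it.
        idx = min(int((v - lo) / width), len(labels) - 1)
        out.append(labels[idx])
    return out

# Made-up flight samples: (longitude, latitude, altitude in metres).
flights = [(-3.2, 55.9, 500), (-0.1, 51.5, 4000), (2.3, 48.9, 9500),
           (13.4, 52.5, 11000), (-74.0, 40.7, 1200), (139.7, 35.7, 6000)]

# Step 1: replace the continuous altitude with a categorical label.
altitude_labels = bin_labels([f[2] for f in flights], ['low', 'medium', 'high'])

# Step 2: group the remaining coordinates by that label.
groups = defaultdict(list)
for (lon, lat, _), label in zip(flights, altitude_labels):
    groups[label].append((lon, lat))
```

After the first step the altitude column is effectively categorical, so an ordinary groupby over discrete values is all that is needed for the second step.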
Don't quite know what you mean.
True, it's really just binning the coordinates along one or more columns/dimensions. The main issue here is that our datasets do not yet support storing any actual bins; instead we have to assign a label or simply take the center of the bin, thereby "snapping" to a grid.
Maybe it's best to put this on hold until we've made more progress on #547, which will let us represent actual bins.
Sounds like a good idea.
I haven't been able to work out what the actual code you wrote does, so I'm trying to approach this at a fundamental mathematical level, and probably failing, perhaps because all the examples I saw already have aggregation, so I can't see what the actual "rebin" operation does before aggregation. I'm imagining that what it does is to rewrite the coordinates to snap them to a grid, leaving a flat data structure with points that happen to line up but are otherwise in the same type of data structure that they started in. But maybe that's not true, and it's now hierarchical, with actual container objects for each bin, containing a bunch of points per bin. If there are such containers, then this operation would make complete sense to me as a groupby, i.e. the first part of making a histogram. If there are not, i.e. the resulting data structure is flat but snapped, then I don't see what the utility is. Probably best explained verbally. :-)
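The distinction between a flat-but-snapped structure and per-bin containers can be sketched in a few lines. This is an illustration under stated assumptions, not the prototype discussed above: `snap_to_centers`, the bin edges, and the sample values are all hypothetical, and out-of-range values are simply clamped to the outermost bins.

```python
from bisect import bisect_right
from collections import defaultdict

def snap_to_centers(values, edges):
    """Replace each value with the center of the bin it falls in.

    Hypothetical helper: bins are half-open [a, b), and values outside
    the edges are clamped to the first or last bin.
    """
    centers = [(a + b) / 2 for a, b in zip(edges, edges[1:])]
    snapped = []
    for v in values:
        idx = min(max(bisect_right(edges, v) - 1, 0), len(centers) - 1)
        snapped.append(centers[idx])
    return snapped

edges = [0.0, 1.0, 2.0, 3.0]
samples = [(0.2, 10.0), (0.9, 12.0), (1.5, 7.0), (2.1, 3.0), (2.8, 5.0)]  # (x, z)

# Step 1: snapping. The result is still flat, one row per original sample,
# but several rows now share the same x coordinate.
xs = snap_to_centers([x for x, _ in samples], edges)
flat = [(sx, z) for sx, (_, z) in zip(xs, samples)]

# Step 2: grouping. Only now is there one container of samples per bin.
grouped = defaultdict(list)
for sx, z in flat:
    grouped[sx].append(z)

# Step 3: aggregating each container gives a single value per bin.
means = {center: sum(zs) / len(zs) for center, zs in grouped.items()}
```

In this sketch snapping on its own changes nothing structurally; its only use is that it makes the continuous coordinate discrete, so an existing categorical groupby or aggregate can then be applied.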
#547 is now merged, so we can store bins properly and could now revisit this.
Would this resample allow users to take a datetime kdim and aggregate to a different sampling frequency, like hourly measurements being aggregated into daily, monthly, or yearly intervals?
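For reference, the kind of datetime aggregation being asked about looks like the following in plain Python. This is only a sketch of the operation itself; it says nothing about whether or how HoloViews supports it, and the `aggregate` helper and the synthetic hourly data are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

start = datetime(2017, 1, 1)
# Synthetic hourly measurements over two days; each value equals its hour index.
hourly = [(start + timedelta(hours=h), float(h)) for h in range(48)]

def aggregate(samples, key):
    """Group (timestamp, value) pairs by key(timestamp) and take the mean per group."""
    groups = defaultdict(list)
    for ts, value in samples:
        groups[key(ts)].append(value)
    return {k: sum(v) / len(v) for k, v in groups.items()}

# The "bin" for each sample is just a coarser truncation of its timestamp.
daily = aggregate(hourly, lambda ts: ts.date())
monthly = aggregate(hourly, lambda ts: (ts.year, ts.month))
```

Here the binning step is the key function that truncates each timestamp to a day or month, which is the datetime analogue of snapping a continuous value to a bin before grouping.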
This can now technically be done with the new transform method:
    import numpy as np
    import holoviews as hv
    ds = hv.Dataset(np.random.randn(1000, 3), ['x', 'y'], 'z')
    ds.transform(x=hv.dim('x').bin(np.linspace(-1, 1, 11))).groupby('x').apply(hv.Scatter)
HoloViews supports a groupby operation for discrete/categorical dimensions, but as far as I can see there is no support for grouping over a continuous dimension, which requires a specified bin width. The xarray interface might provide this already (#804), but for pandas the separate cut method would seem to be needed. Philipp suggests adding a method:
but I have not tested this.