Add Julia #30

mattdowle · 2018-09-11T18:07:14Z

ZJ (@dzj_evalparse) commented here with a pointer to a Julia repo that has already reproduced these 5 tests in Julia, I assume after seeing the 2014 benchmark. Which is great.
https://twitter.com/dzj_evalparse/status/1039271981286187008

It looks like it has everything needed to add Julia:
https://discourse.julialang.org/t/group-by-performance-benchmarks-and-recommendations/9313

mattdowle · 2018-09-22T02:49:25Z

Now that we have help from ZJ, the first paragraph on the results page should be updated from
"We hope to add Julia too and are looking for help to do so."
maybe to:
"We are working on adding Julia (link to this #30)."

Although, are we waiting on Julia for something? ZJ wrote (here): "Hopefully all packages will work on v1.0 soon so the benchmarks can be incorporated."

jangorecki · 2018-09-22T05:23:56Z

We are not waiting for anything from the Julia side; some initial work on that started in 15e4235. There are a few different ways of doing grouping in Julia. We cannot use the fastest one, as it is for factors, which is a different question from the 5 questions currently in scope. There is another grouping method for "small" strings and another for regular strings. Code written in Julia has to scale well, including when the strings are bigger, so we will probably need to use the slower method.
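To illustrate why grouping on factors is a different (and usually cheaper) problem than grouping on strings, here is a minimal sketch in Python. It is not Julia and not how any of the packages mentioned in this thread are implemented; the function names and the sample data are made up for illustration. The assumption is only that a factor stores each value as a small integer code plus a table of levels.

```python
from collections import defaultdict


def sum_by_string(keys, values):
    """Group-by sum keyed on raw strings: every row hashes its string key."""
    totals = defaultdict(int)
    for key, value in zip(keys, values):
        totals[key] += value
    return dict(totals)


def to_factor(keys):
    """Encode strings as integer codes plus a sorted table of levels."""
    levels = sorted(set(keys))
    index = {level: i for i, level in enumerate(levels)}
    return [index[key] for key in keys], levels


def sum_by_factor(codes, levels, values):
    """Group-by sum keyed on integer codes: each row is a direct list index."""
    totals = [0] * len(levels)
    for code, value in zip(codes, values):
        totals[code] += value
    return {levels[i]: total for i, total in enumerate(totals)}


# Hypothetical sample data, not taken from the benchmark.
keys = ["id2", "id1", "id2", "id3", "id1"]
values = [10, 20, 30, 40, 50]

codes, levels = to_factor(keys)
by_string = sum_by_string(keys, values)
by_factor = sum_by_factor(codes, levels, values)
```

Both paths give the same totals, but the factor path pays the string-hashing cost once, at encoding time, rather than on every query, which is why a factor benchmark would answer a different question from the current 5.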

mattdowle · 2018-09-22T05:38:42Z

What did ZJ mean about all packages being available in v1.0 then?
I'm uncomfortable not using the fastest method in Julia. We could add a 6th, 7th test etc to show those fast features, e.g. by adding a factor column to the test data.

jangorecki · 2018-09-22T05:45:26Z

No idea, I don't know the Julia ecosystem at all. It would be useful to know. @JeffBezanson

jangorecki · 2018-09-30T06:11:20Z

It doesn't seem that all packages are available. https://github.com/xiaodaigh/FastGroupBy.jl seems not to be "published" yet.
Additionally:

"The fastby works on String type as well but is still slower than countmap and uses MUCH more RAM and therefore is NOT recommended (at this stage)."

Looking at the summary table of groupby implementations in Julia at https://discourse.julialang.org/t/group-by-performance-benchmarks-and-recommendations/9313
it seems that it makes sense to wait for fastby to be improved and published; for now it is noted that "This hasnt been uodated for Julia v1.0.".
Other available options:

IndexedTables.jl - not possible, as it uses an index
DataFramesMeta.jl - hashtable, recommended only for a small number of groups
Query.jl - "Do NOT use if performance matters"

So the only reasonable solution is to use DataFramesMeta.jl.

edit:
As suggested in xiaodaigh/FastGroupBy.jl#7, I will use DataFrames, which in the future will use FastGroupBy once it is mature enough.

jangorecki · 2018-10-05T09:03:39Z

Hi @bkamins, could you please take a look at the code in this file? I followed your split_apply_combine.md instructions. I am afraid that by(...); do df; DataFrame(...); end; incurs a significant performance penalty. Are you able to suggest a better way to use DataFrames here? Adjusting names could probably be done at zero cost after using aggregate, but running two different aggregate functions on two different columns doesn't seem to be supported in aggregate?
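For readers unfamiliar with the operation being asked about, here is a minimal Python sketch of "two different aggregate functions on two different columns" computed in a single pass over the groups. It does not use or imitate the DataFrames.jl API; the column names `v1` and `v2`, the output names, and the helper function are hypothetical.

```python
def sum_and_mean_by(keys, v1, v2):
    """Per group, compute sum(v1) and mean(v2) in one pass over the rows."""
    acc = {}  # key -> [running sum of v1, running sum of v2, row count]
    for key, a, b in zip(keys, v1, v2):
        state = acc.setdefault(key, [0, 0.0, 0])
        state[0] += a
        state[1] += b
        state[2] += 1
    # Output names are chosen here, so renaming costs nothing extra.
    return {
        key: {"v1_sum": s[0], "v2_mean": s[1] / s[2]}
        for key, s in acc.items()
    }


# Hypothetical sample data, not taken from the benchmark.
result = sum_and_mean_by(
    keys=["a", "b", "a"],
    v1=[1, 2, 3],
    v2=[2.0, 4.0, 6.0],
)
```

The point of the sketch is only that each column gets its own aggregate within one grouping step, rather than applying the same function to every column.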

bkamins · 2018-10-05T12:58:09Z

With the current state of the DataFrames.jl package, this is the performance you can expect.
A PR is underway to speed it up: JuliaData/DataFrames.jl#1520.

nalimilan · 2018-10-05T13:32:36Z

There's also a similar PR against DataFramesMeta: JuliaData/DataFramesMeta.jl#101. It has even more potential for performance since specialized code could be generated (not yet in the present state of the PR, but not very far either).

jangorecki · 2018-10-12T22:24:35Z

done

jangorecki · 2018-11-01T15:11:37Z

Just to let interested parties know: we will soon change the three character columns into categorical (#20). It is not fair to force some tools to use character columns when other solutions, like Spark, already treat character as categorical.

nalimilan · 2018-11-01T15:54:32Z

Glad to hear this. Indeed we've been thinking about this problem. I'm currently working on making grouping on categorical variables much faster. I'll let you know when it's released.

jangorecki self-assigned this Sep 20, 2018

jangorecki added a commit that referenced this issue Oct 2, 2018

working julia aggregation, #30

e37b8de

bkamins mentioned this issue Oct 5, 2018

Improve performance of by() using NamedTuples JuliaData/DataFrames.jl#1520

Merged

jangorecki closed this as completed Oct 12, 2018

bkamins mentioned this issue Nov 1, 2018

Use faster hashing approach for first CategoricalVector grouping key JuliaData/DataFrames.jl#1565

Merged

nalimilan mentioned this issue Nov 16, 2018

Add fast grouping method for CategoricalArray keys JuliaData/DataFrames.jl#1600

Merged

jangorecki added the juliadf and new solution labels Oct 16, 2019
