Add Julia #30

Closed
mattdowle opened this issue Sep 11, 2018 · 11 comments

Comments

@mattdowle
Contributor

ZJ (@dzj_evalparse) commented here with a pointer to a Julia repo that has already reproduced these 5 tests in Julia, I assume after seeing the 2014 benchmark. Which is great.
https://twitter.com/dzj_evalparse/status/1039271981286187008

It looks like it has everything needed to add Julia:
https://discourse.julialang.org/t/group-by-performance-benchmarks-and-recommendations/9313

@jangorecki jangorecki self-assigned this Sep 20, 2018
@mattdowle
Contributor Author

mattdowle commented Sep 22, 2018

Now that we have help from ZJ, the first paragraph on the results page should be updated
from:
"We hope to add Julia too and are looking for help to do so."
maybe to:
"We are working on adding Julia (link to this #30)."

Although, are we waiting on anything from Julia? Since ZJ wrote (here): "Hopefully all packages will work on v1.0 soon so the benchmarks can be incorporated."

@jangorecki
Contributor

jangorecki commented Sep 22, 2018

We are not waiting for anything from the Julia side; some initial work on that started in 15e4235. There are a few different ways of doing grouping in Julia. We cannot use the fastest one, as it is for factors, which is a different question from the 5 questions currently in scope. There is another grouping method for "small" strings and another for regular strings. Code written in Julia has to scale well, including when the strings are bigger, so we will probably need to use the slower method.
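
To illustrate why grouping on factors is a different question from grouping on strings, here is a minimal Python sketch. It is not the Julia code discussed in this thread, and the function names and sample data are hypothetical. With strings, every row has to hash its key; with a factor, every row already carries a small integer code, so sums can be accumulated by direct indexing.

```python
from collections import defaultdict

def group_sum_strings(keys, values):
    """Group-sum keyed on raw strings: every row hashes its string key."""
    sums = defaultdict(int)
    for k, v in zip(keys, values):
        sums[k] += v
    return dict(sums)

def group_sum_codes(codes, levels, values):
    """Group-sum keyed on factor codes: every row indexes into a list."""
    sums = [0] * len(levels)
    for c, v in zip(codes, values):
        sums[c] += v
    return {levels[i]: s for i, s in enumerate(sums)}

# Hypothetical sample data: the same column as strings and as a factor.
keys = ["id2", "id1", "id2", "id3", "id1"]
levels = ["id1", "id2", "id3"]
codes = [1, 0, 1, 2, 0]  # position of each key in `levels`
values = [10, 20, 30, 40, 50]

by_string = group_sum_strings(keys, values)
by_code = group_sum_codes(codes, levels, values)
```

Both functions return the same totals per group; only the lookup cost per row differs, which is why a benchmark on string columns and one on factor columns measure different things.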

@mattdowle
Contributor Author

mattdowle commented Sep 22, 2018

What did ZJ mean about all packages being available in v1.0 then?
I'm uncomfortable not using the fastest method in Julia. We could add a 6th, 7th test etc to show those fast features, e.g. by adding a factor column to the test data.

@jangorecki
Contributor

No idea, I don't know the Julia ecosystem at all. It would be useful to know. @JeffBezanson

@jangorecki
Contributor

jangorecki commented Sep 30, 2018

It doesn't seem that all packages are available. https://github.com/xiaodaigh/FastGroupBy.jl seems not to be "published" yet.
Additionally:

The fastby works on String type as well but is still slower than countmap and uses MUCH more RAM and therefore is NOT recommended (at this stage).

Looking at the summary table of groupby implementations in Julia at https://discourse.julialang.org/t/group-by-performance-benchmarks-and-recommendations/9313
it seems that it makes sense to wait for fastby to be improved and published; it is currently marked "This hasnt been uodated for Julia v1.0.".
Other available options:

  • IndexedTables.jl - not possible, as it uses an index
  • DataFramesMeta.jl - hash table, recommended only for a small number of groups
  • Query.jl - "Do NOT use if performance matters"

So the only reasonable solution is to use DataFramesMeta.jl.

edit:
As suggested in xiaodaigh/FastGroupBy.jl#7, I will use DataFrames, which in the future will use FastGroupBy once it is mature enough.

jangorecki added a commit that referenced this issue Oct 2, 2018
@jangorecki
Contributor

jangorecki commented Oct 5, 2018

Hi @bkamins, could you please take a look at the code in this file? I followed your split_apply_combine.md instructions. I am afraid that by(...) do df; DataFrame(...); end incurs a significant performance penalty. Are you able to suggest a better way to use DataFrames here? Adjusting names could probably be done at zero cost after using aggregate, but running two different aggregate functions on two different columns doesn't seem to be supported in aggregate?
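
For readers unfamiliar with the operation in question, the following Python sketch shows what "two different aggregate functions on two different columns" means: a sum of one column and a mean of another, computed per group in a single pass. It is only an illustration of the intended result, not DataFrames.jl code; the helper name and the column names id1, v1 and v3 are hypothetical.

```python
from collections import defaultdict

def sum_and_mean_by(rows, key, sum_col, mean_col):
    """Per group, return the sum of `sum_col` and the mean of `mean_col`."""
    # Per-group accumulator: [sum of sum_col, total of mean_col, row count]
    acc = defaultdict(lambda: [0, 0.0, 0])
    for row in rows:
        a = acc[row[key]]
        a[0] += row[sum_col]
        a[1] += row[mean_col]
        a[2] += 1
    return {
        group: {f"{sum_col}_sum": s, f"{mean_col}_mean": total / n}
        for group, (s, total, n) in acc.items()
    }

# Hypothetical sample rows.
rows = [
    {"id1": "a", "v1": 1, "v3": 2.0},
    {"id1": "b", "v1": 2, "v3": 4.0},
    {"id1": "a", "v1": 3, "v3": 6.0},
]
result = sum_and_mean_by(rows, "id1", "v1", "v3")
```

The output names (v1_sum, v3_mean) are chosen while the result is built, which is the "adjusting names at zero cost" part of the question.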

@bkamins
Contributor

bkamins commented Oct 5, 2018

With the current state of the DataFrames.jl package, this is the performance you can expect.
A PR is underway to speed it up: JuliaData/DataFrames.jl#1520.

@nalimilan
Contributor

There's also a similar PR against DataFramesMeta: JuliaData/DataFramesMeta.jl#101. It has even more potential for performance since specialized code could be generated (not yet in the present state of the PR, but not very far either).

@jangorecki
Contributor

done

@jangorecki
Contributor

jangorecki commented Nov 1, 2018

Just to let interested parties know: we will soon change the three character columns into categorical (#20). It is not fair to force some tools to use character columns when other solutions like Spark already treat character columns as categorical.
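
As a rough picture of what changing a character column into a categorical one involves, here is a small Python sketch. It assumes a categorical column is stored as a list of distinct levels plus one integer code per row; the function name and sample values are hypothetical, and real tools may order levels or store codes differently.

```python
def to_categorical(column):
    """Encode a list of strings as (levels, codes)."""
    levels = sorted(set(column))  # distinct values, in sorted order
    index = {level: i for i, level in enumerate(levels)}
    codes = [index[x] for x in column]  # one integer code per row
    return levels, codes

# Hypothetical sample column.
column = ["id2", "id1", "id2", "id3"]
levels, codes = to_categorical(column)

# Decoding the codes through the levels gives back the original strings.
decoded = [levels[c] for c in codes]
```

The strings are stored once in the levels, and each row keeps only an integer, so grouping can work on the codes.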

@nalimilan
Contributor

Glad to hear this. Indeed we've been thinking about this problem. I'm currently working on making grouping on categorical variables much faster. I'll let you know when it's released.
