Add Julia #30
Comments
Now that we have help from ZJ, the first paragraph on the results page should be updated. Are we still waiting on anything from Julia, though? ZJ wrote (here): "Hopefully all packages will work on v1.0 soon so the benchmarks can be incorporated."
We are no longer waiting for anything from the Julia side; some initial work on that started in 15e4235. There are a few different ways of doing grouping in Julia. We cannot use the fastest one, as it is for factors, which is a different question from the 5 questions currently in scope. There is another grouping method for "small" strings and another for regular strings. Code written in Julia has to scale well, including when the strings are bigger, so we will probably need to use the slower method.
What did ZJ mean about all packages being available in v1.0 then?
No idea, I don't know the Julia ecosystem at all. It would be useful to know. @JeffBezanson
It doesn't seem that all packages are available. https://github.com/xiaodaigh/FastGroupBy.jl seems not to be "published" yet.
Looking at the summary table of groupby implementations in Julia at https://discourse.julialang.org/t/group-by-performance-benchmarks-and-recommendations/9313
So the only reasonable solution is to use DataFramesMeta.jl. edit:
Hi @bkamins, could you please take a look at the code in this file? I followed your split_apply_combine.md instructions. I am afraid that
With the current state of the DataFrames.jl package, this is the performance you can expect.
There's also a similar PR against DataFramesMeta: JuliaData/DataFramesMeta.jl#101. It has even more potential for performance since specialized code could be generated (not yet in the present state of the PR, but not very far either).
done
Just to let interested parties know: we will soon change (#20) the three character columns into categorical. It is not fair to force some tools to use character columns when other solutions, like Spark, already treat character columns as categorical.
Glad to hear this. Indeed we've been thinking about this problem. I'm currently working on making grouping on categorical variables much faster. I'll let you know when it's released.
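For readers following the character-versus-categorical point, here is a minimal sketch in plain Python of why grouping on categorical (factor) codes can be cheaper than grouping on raw strings. It is only an illustration under stated assumptions: the function names and the toy data are hypothetical, and this is not how DataFrames.jl, Spark, or any benchmarked tool actually implements grouping.

```python
from collections import defaultdict


def group_sum_by_string(keys, values):
    """Group-by sum keyed on raw strings: every row hashes a string."""
    totals = defaultdict(int)
    for k, v in zip(keys, values):
        totals[k] += v
    return dict(totals)


def encode_categorical(keys):
    """One-time encoding: map each distinct string to a small integer code."""
    levels = sorted(set(keys))
    index = {level: code for code, level in enumerate(levels)}
    codes = [index[k] for k in keys]
    return levels, codes


def group_sum_by_codes(levels, codes, values):
    """Group-by sum on integer codes: each row is a plain list index."""
    totals = [0] * len(levels)
    for c, v in zip(codes, values):
        totals[c] += v
    return dict(zip(levels, totals))


# Hypothetical toy data, standing in for one id column and one value column.
keys = ["id3", "id1", "id2", "id1", "id3", "id1"]
values = [10, 20, 30, 40, 50, 60]

levels, codes = encode_categorical(keys)
by_string = group_sum_by_string(keys, values)
by_codes = group_sum_by_codes(levels, codes, values)
```

Both paths return the same sums. The difference is that the categorical path pays the string-to-code cost once at encoding time, so a tool that stores the column as categorical from the start skips that work in every query, which is the fairness concern raised above.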
Comment from ZJ (@dzj_evalparse) here, with a pointer to a Julia repo that has already reproduced these 5 tests in Julia, I assume after seeing the 2014 benchmark. Which is great.
https://twitter.com/dzj_evalparse/status/1039271981286187008
It looks like it has everything needed to add Julia:
https://discourse.julialang.org/t/group-by-performance-benchmarks-and-recommendations/9313