Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for PooledArray back into the package. #81

Open
davidanthoff opened this issue Nov 1, 2018 · 3 comments
Open

Add support for PooledArray back into the package. #81

davidanthoff opened this issue Nov 1, 2018 · 3 comments

Comments

@davidanthoff
Copy link
Member

This commit removed support for it entirely, not clear to me why. There are some very legit usecases where I think it would be great to get PooledArrays, so I would be in favor of just adding that feature back in. Perfectly fine to have it behind an optional switch.

@shashi
Copy link
Collaborator

shashi commented Nov 1, 2018

The thinking was promotion to PooledArrays of different Ref types (UInt8 .. UInt64) was too expensive in terms of real csv read performance (you only want to read a CSV first, if you're compiling the function 4 times for every string column, that's bad).

In that context it only makes sense to convert string columns to PooledArrays after the column has been read as a StringVector. At which point you might as well do it outside TextParse (e.g. in JuliaDB)

@nalimilan
Copy link

Why not default to UInt32 like CategoricalArrays? That's always a net win in terms of storage, and that suits 99% of use cases. Anyway, that's just an option, people can do something else if they need that.

The advantage of parsing directly to PooledArray is that you don't need to store so many identical strings in memory until you convert the column.

@davidanthoff
Copy link
Member Author

Another thing that would be neat is to be able to specify this separately for different columns.

We are planning/thinking a lot about whole query optimization in Query.jl right now, and I can pretty easily see a scenario where for example something like load("foo.csv") |> @groupby(_.a) |> ... ends up automatically only loading the a column as a PooledArray, but not anything else, based on the analysis of a given query (as a whole, not just an individual query operation). I don't think any of this will appear in the next couple of weeks (or months) in Query.jl, but we have a team here now that is hopefully going to tackle this (pretty big) issue, and so starting to add features like that here might get another good payoff down the road.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants