Add support for PooledArray back into the package. #81

davidanthoff · 2018-11-01T16:04:06Z

This commit removed support for it entirely, not clear to me why. There are some very legit usecases where I think it would be great to get PooledArrays, so I would be in favor of just adding that feature back in. Perfectly fine to have it behind an optional switch.

The text was updated successfully, but these errors were encountered:

shashi · 2018-11-01T18:03:33Z

The thinking was promotion to PooledArrays of different Ref types (UInt8 .. UInt64) was too expensive in terms of real csv read performance (you only want to read a CSV first, if you're compiling the function 4 times for every string column, that's bad).

In that context it only makes sense to convert string columns to PooledArrays after the column has been read as a StringVector. At which point you might as well do it outside TextParse (e.g. in JuliaDB)

nalimilan · 2018-11-01T18:17:04Z

Why not default to UInt32 like CategoricalArrays? That's always a net win in terms of storage, and that suits 99% of use cases. Anyway, that's just an option, people can do something else if they need that.

The advantage of parsing directly to PooledArray is that you don't need to store so many identical strings in memory until you convert the column.

davidanthoff · 2018-11-01T19:51:48Z

Another thing that would be neat is to be able to specify this separately for different columns.

We are planning/thinking a lot about whole query optimization in Query.jl right now, and I can pretty easily see a scenario where for example something like load("foo.csv") |> @groupby(_.a) |> ... ends up automatically only loading the a column as a PooledArray, but not anything else, based on the analysis of a given query (as a whole, not just an individual query operation). I don't think any of this will appear in the next couple of weeks (or months) in Query.jl, but we have a team here now that is hopefully going to tackle this (pretty big) issue, and so starting to add features like that here might get another good payoff down the road.

davidanthoff added the enhancement label Nov 1, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for PooledArray back into the package. #81

Add support for PooledArray back into the package. #81

davidanthoff commented Nov 1, 2018

shashi commented Nov 1, 2018 •

edited

Loading

nalimilan commented Nov 1, 2018

davidanthoff commented Nov 1, 2018

Add support for PooledArray back into the package. #81

Add support for PooledArray back into the package. #81

Comments

davidanthoff commented Nov 1, 2018

shashi commented Nov 1, 2018 • edited Loading

nalimilan commented Nov 1, 2018

davidanthoff commented Nov 1, 2018

shashi commented Nov 1, 2018 •

edited

Loading