You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This commit removed support for it entirely, not clear to me why. There are some very legit usecases where I think it would be great to get PooledArrays, so I would be in favor of just adding that feature back in. Perfectly fine to have it behind an optional switch.
The text was updated successfully, but these errors were encountered:
The thinking was promotion to PooledArrays of different Ref types (UInt8 .. UInt64) was too expensive in terms of real csv read performance (you only want to read a CSV first, if you're compiling the function 4 times for every string column, that's bad).
In that context it only makes sense to convert string columns to PooledArrays after the column has been read as a StringVector. At which point you might as well do it outside TextParse (e.g. in JuliaDB)
Why not default to UInt32 like CategoricalArrays? That's always a net win in terms of storage, and that suits 99% of use cases. Anyway, that's just an option, people can do something else if they need that.
The advantage of parsing directly to PooledArray is that you don't need to store so many identical strings in memory until you convert the column.
Another thing that would be neat is to be able to specify this separately for different columns.
We are planning/thinking a lot about whole query optimization in Query.jl right now, and I can pretty easily see a scenario where for example something like load("foo.csv") |> @groupby(_.a) |> ... ends up automatically only loading the a column as a PooledArray, but not anything else, based on the analysis of a given query (as a whole, not just an individual query operation). I don't think any of this will appear in the next couple of weeks (or months) in Query.jl, but we have a team here now that is hopefully going to tackle this (pretty big) issue, and so starting to add features like that here might get another good payoff down the road.
This commit removed support for it entirely, not clear to me why. There are some very legit usecases where I think it would be great to get
PooledArray
s, so I would be in favor of just adding that feature back in. Perfectly fine to have it behind an optional switch.The text was updated successfully, but these errors were encountered: