-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Are bloom filters supported on LIST types? #98
Comments
Hi @ljwagerfield ! Been looking into this. It might work, but I think there might also be a bug in the column naming causing issues. So your setup of how you think it might work, is approximately how I think it likely should work (or close to it). So I think this is a bug. The library's bloom filter does not currently handle nested fields at all. (Although most of the pieces are in place for it to do so). For someone who wants to work on making this possible here are some notes:
Here is a simple setup for what "should" work, but doesn't due at least in part to the note ^. const main = async () => {
const file = "parquet-testing/issue-98.parquet";
const schema = new parquet.ParquetSchema({
querystring: {
type: "LIST",
fields: {
list: {
repeated: true,
fields: {
element: {
fields: {
key: { type: "UTF8" },
value: { type: "UTF8" }
}
}
}
}
}
}
});
try {
const writer = await parquet.ParquetWriter.openFile(schema, file, {
bloomFilters: [
{
column: "querystring,list,element,key",
},
],
});
await writer.appendRow({ querystring: { list: [ { element: { key: "foo", value: "bar", }, }, { element: { key: "foo2", value: "bar2", } } ] } });
await writer.close();
} catch (error: any) {
console.log("I'm in the write catch!", error)
}
try {
const reader = await parquet.ParquetReader.openFile(file);
const cursor = reader.getCursor();
console.log("row", await cursor.next());
const metadata = reader.getMetadata();
console.log("metadata", metadata);
const bloomFilters = await reader.getBloomFiltersFor(["querystring,list,element,key"])
console.log("bloomFilters", bloomFilters);
} catch (error: any) {
console.log("I'm in the read catch!", file, error)
}
}
main() |
Aha, interesting! So the Parquet specification does support bloom filters on lists (I wasn't even sure of this), and this library is close to supporting an implementation for that. That's awesome! I don't have any spare cycles at present (or the Parquet knowledge!) to contribute, unfortunately, so am happy if you want to close. Great to know, though, and thanks again! 👍 👍 👍 |
I'll leave it open for now. With the quick test I wrote, I might be able to get a fix in if I can get the time. Others might want to pick it up as well. |
For bloom filters, treat column names as the full path and not just the first part of the path in the schema. Closes #98 with @rlaferla Co-authored-by: Wil Wade <[email protected]>
@ljwagerfield This release is going out with support! https://github.com/LibertyDSNP/parquetjs/releases/tag/v1.3.5 |
Awesome, thanks so much! 🚀 |
Hi there 👋
Firstly, thank you for this amazing library!
I'm curious to know how to add bloom filters to
LIST
types.For example, given this schema:
How do you add a bloom filter for the
querystring.list.element.key
field?I assume the above won't work? (Sorry in advance if that literally is how you do it!)
Thanks in advance!
The text was updated successfully, but these errors were encountered: