Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Are bloom filters supported on LIST types? #98

Closed
ljwagerfield opened this issue Nov 1, 2023 · 5 comments · Fixed by #105
Closed

Are bloom filters supported on LIST types? #98

ljwagerfield opened this issue Nov 1, 2023 · 5 comments · Fixed by #105
Labels
bug Something isn't working good first issue Good for newcomers

Comments

@ljwagerfield
Copy link

ljwagerfield commented Nov 1, 2023

Hi there 👋

Firstly, thank you for this amazing library!

I'm curious to know how to add bloom filters to LIST types.

For example, given this schema:

{
  querystring: {
    type: "LIST",
    fields: {
      list: {
        repeated: true,
        fields: {
          element: {
            fields: {
              key: { type: "UTF8" },
              value: { type: "UTF8" }
            }
          }
        }
      }
    }
  }
}

How do you add a bloom filter for the querystring.list.element.key field?

[
  {
    column: "querystring.list.element.key",
    numDistinct: 100
  }
]

I assume the above won't work? (Sorry in advance if that literally is how you do it!)

Thanks in advance!

@wilwade
Copy link
Member

wilwade commented Nov 3, 2023

Hi @ljwagerfield !

Been looking into this. It might work, but I think there might also be a bug in the column naming causing issues.

So your setup of how you think it might work, is approximately how I think it likely should work (or close to it).

So I think this is a bug. The library's bloom filter does not currently handle nested fields at all. (Although most of the pieces are in place for it to do so).

For someone who wants to work on making this possible here are some notes:

  • writeBloomFilters needs to handle all the possible nested "columns" not just the top level ones. getColumn from the reader.ts file is a good example of going down into the various groups.
  • Currently the writeBloomFilters will write one if the opts.bloomFilters has a column with name key or value instead of respecting querystring,list,element,key

Here is a simple setup for what "should" work, but doesn't due at least in part to the note ^.

const main = async () => {

  const file = "parquet-testing/issue-98.parquet";

  const schema = new parquet.ParquetSchema({
    querystring: {
      type: "LIST",
      fields: {
        list: {
          repeated: true,
          fields: {
            element: {
              fields: {
                key: { type: "UTF8" },
                value: { type: "UTF8" }
              }
            }
          }
        }
      }
    }
  });

  try {
    const writer = await parquet.ParquetWriter.openFile(schema, file, {
      bloomFilters: [
        {
          column: "querystring,list,element,key",
        },
      ],
    });

    await writer.appendRow({ querystring: { list: [ { element: { key: "foo", value: "bar", }, }, { element: { key: "foo2", value: "bar2", } } ] } });
    await writer.close();
  } catch (error: any) {
    console.log("I'm in the write catch!", error)
  }

  try {
    const reader = await parquet.ParquetReader.openFile(file);
    const cursor = reader.getCursor();
    console.log("row", await cursor.next());
    const metadata = reader.getMetadata();
    console.log("metadata", metadata);
    const bloomFilters = await reader.getBloomFiltersFor(["querystring,list,element,key"])
      console.log("bloomFilters", bloomFilters);
  } catch (error: any) {
    console.log("I'm in the read catch!", file, error)
  }
}

main()

@ljwagerfield
Copy link
Author

Aha, interesting!

So the Parquet specification does support bloom filters on lists (I wasn't even sure of this), and this library is close to supporting an implementation for that.

That's awesome!

I don't have any spare cycles at present (or the Parquet knowledge!) to contribute, unfortunately, so am happy if you want to close.

Great to know, though, and thanks again! 👍 👍 👍

@wilwade
Copy link
Member

wilwade commented Nov 3, 2023

I'll leave it open for now. With the quick test I wrote, I might be able to get a fix in if I can get the time. Others might want to pick it up as well.

@wilwade wilwade added good first issue Good for newcomers bug Something isn't working labels Nov 3, 2023
shannonwells added a commit that referenced this issue Nov 28, 2023
For bloom filters, treat column names as the full path and not just the first part of the path in the schema.

Closes #98 
with @rlaferla 

Co-authored-by: Wil Wade <[email protected]>
@wilwade
Copy link
Member

wilwade commented Nov 28, 2023

@ljwagerfield This release is going out with support!

https://github.com/LibertyDSNP/parquetjs/releases/tag/v1.3.5

@ljwagerfield
Copy link
Author

Awesome, thanks so much! 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants