Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Prefix, Suffix and Ngram UDFs #12392

Merged
merged 6 commits into from
Apr 23, 2024

Conversation

deemoliu
Copy link
Contributor

@deemoliu deemoliu commented Feb 9, 2024

feature: Adding ngram, prefix, postfix UDFs

Context:

We are onboarding a use case and trying the inrease query throughput. We tested the QPS cannot further improved with the existing REGEXP_LIKE queries or text_match queries. The queries as follows

select col1, col2 from table where REPEXP_LIKE(col3, '^data*')
select col1, col2 from table where REGEXP_LIKE(col3, 'data$')
select col1, col2 from table where REGEXP_LIKE(col3, '*data*')
select col1, col2 from table where TEXT_MATCH(col3, '/data*/')
...

The plan is to generated the derived columns that persisted prefix, postfix, and ngram to use inverted indexes to filter the result fast, and add the text match indexes to do validation after filtering to avoid false positive result.

This patch is created to generate prefix, postfix, and ngrams for a field.
it can be used by the following transformation config

       {
          "columnName": "col_prefix",
          "transformFunction": "prefixes(col, 3, null)"
        },
       {
          "columnName": "col_suffix",
          "transformFunction": "suffixes(col, 3, null)"
        },
       {
          "columnName": "col_3gram",
          "transformFunction": "ngrams(col, 3)"
        },
       {
          "columnName": "col_1gram_2gram_3gram",
          "transformFunction": "ngrams(col, 1, 3)"
        },

@codecov-commenter
Copy link

codecov-commenter commented Feb 9, 2024

Codecov Report

Attention: Patch coverage is 89.18919% with 4 lines in your changes missing coverage. Please review.

Project coverage is 62.19%. Comparing base (59551e4) to head (179d70a).
Report is 892 commits behind head on master.

Files Patch % Lines
.../pinot/common/function/scalar/StringFunctions.java 89.18% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #12392      +/-   ##
============================================
+ Coverage     61.75%   62.19%   +0.43%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2502      +66     
  Lines        133233   136584    +3351     
  Branches      20636    21145     +509     
============================================
+ Hits          82274    84942    +2668     
- Misses        44911    45364     +453     
- Partials       6048     6278     +230     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 62.15% <89.18%> (+0.44%) ⬆️
java-21 62.05% <89.18%> (+0.43%) ⬆️
skip-bytebuffers-false 62.17% <89.18%> (+0.42%) ⬆️
skip-bytebuffers-true 35.02% <89.18%> (+7.30%) ⬆️
temurin 62.19% <89.18%> (+0.43%) ⬆️
unittests 62.18% <89.18%> (+0.43%) ⬆️
unittests1 46.73% <89.18%> (-0.16%) ⬇️
unittests2 27.95% <0.00%> (+0.22%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there equivalent/similar functions in other commonly used DBs (e.g. PostgreSQL)? We should try to match the behavior

@deemoliu
Copy link
Contributor Author

deemoliu commented Mar 6, 2024

Are there equivalent/similar functions in other commonly used DBs (e.g. PostgreSQL)? We should try to match the behavior

i think postgresql provide a 3-gram module called pg_trgm where pg means PostgreSQL.
it including the following functions
show_trgm (text)

Do you think i should rename the function similarly?

* @param maxlength the max length of the prefix strings for the string.
* @return generate an array of prefix strings of the string that are shorter than the specified length.
*/
@ScalarFunction
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You want to add alias unique_prefixes, same for other functions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need unique though? The prefixes will always be unique because they all have different length

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sg. i think the reason of unique_prefixes is to reserve prefixes for other purpose or implementations. if no objection, let me use prefixes() then.

*/
@ScalarFunction
public static String[] uniquePrefixes(String input, int maxlength) {
ObjectSet<String> prefixSet = new ObjectLinkedOpenHashSet<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need a set since all prefixes have different length. Same for other functions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, thanks

*/
@ScalarFunction
public static String[] uniquePrefixesWithPrefix(String input, int maxlength, String prefix) {
if (prefix == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to accept null, you want to annotate it as nullableParameters. Please also annotate the parameter to be @Nullable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not addressed ^^
Take a look at ScalarFunction.class. You need to annotate it as nullableParameters

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @Jackie-Jiang for pointer. updated

@deemoliu deemoliu force-pushed the qiaochu/ngram-udf branch from 9fa29e7 to 9859704 Compare April 15, 2024 22:01
*/
@ScalarFunction
public static String[] prefixes(String input, int maxlength) {
ObjectList<String> prefixList = new ObjectArrayList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ObjectArrayList is not really buying us anything here.
Given we know the number of prefixes upfront, we can directly allocate the array

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replaced the objectArrayList with a fixSizeArr. thanks

*/
@ScalarFunction
public static String[] uniquePrefixesWithPrefix(String input, int maxlength, String prefix) {
if (prefix == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not addressed ^^
Take a look at ScalarFunction.class. You need to annotate it as nullableParameters

@@ -18,6 +18,10 @@
*/
package org.apache.pinot.common.function.scalar;

import it.unimi.dsi.fastutil.objects.ObjectArrayList;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to use fastutil here. We can use the java default ones

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly good

* @param prefix the prefix to be prepended to prefix strings generated. e.g. '^' for regex matching
* @return generate an array of prefix matchers of the string that are shorter than the specified length.
*/
@ScalarFunction(nullableParameters = true, names = {"prefix"})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to alias it to prefix? I don't think this is really prefix

* @param suffix the suffix string to be appended for suffix strings generated. e.g. '$' for regex matching.
* @return generate an array of suffix matchers of the string that are shorter than the specified length.
*/
@ScalarFunction(nullableParameters = true, names = {"suffix"})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, thanks @Jackie-Jiang

*/
@ScalarFunction
public static String[] uniqueNgrams(String input, int minGram, int maxGram) {
ObjectSet<String> ngramSet = new ObjectLinkedOpenHashSet<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @Jackie-Jiang ngrams doesn't guarantee to be unique, right? so the usage of Set is to dedup and avoid duplicates.

@deemoliu deemoliu force-pushed the qiaochu/ngram-udf branch from 0af362a to 179d70a Compare April 23, 2024 17:35
@Jackie-Jiang Jackie-Jiang merged commit 36c4b9a into apache:master Apr 23, 2024
19 of 20 checks passed
@npawar
Copy link
Contributor

npawar commented May 22, 2024

Has this been added to the docs @deemoliu ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants