-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Prefix, Suffix and Ngram UDFs #12392
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #12392 +/- ##
============================================
+ Coverage 61.75% 62.19% +0.43%
+ Complexity 207 198 -9
============================================
Files 2436 2502 +66
Lines 133233 136584 +3351
Branches 20636 21145 +509
============================================
+ Hits 82274 84942 +2668
- Misses 44911 45364 +453
- Partials 6048 6278 +230
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java
Outdated
Show resolved
Hide resolved
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java
Outdated
Show resolved
Hide resolved
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java
Outdated
Show resolved
Hide resolved
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java
Outdated
Show resolved
Hide resolved
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java
Outdated
Show resolved
Hide resolved
pinot-common/src/main/java/org/apache/pinot/common/function/scalar/StringFunctions.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there equivalent/similar functions in other commonly used DBs (e.g. PostgreSQL)? We should try to match the behavior
i think postgresql provide a 3-gram module called Do you think i should rename the function similarly? |
* @param maxlength the max length of the prefix strings for the string. | ||
* @return generate an array of prefix strings of the string that are shorter than the specified length. | ||
*/ | ||
@ScalarFunction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You want to add alias unique_prefixes
, same for other functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need unique
though? The prefixes will always be unique because they all have different length
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sg. i think the reason of unique_prefixes is to reserve prefixes
for other purpose or implementations. if no objection, let me use prefixes()
then.
*/ | ||
@ScalarFunction | ||
public static String[] uniquePrefixes(String input, int maxlength) { | ||
ObjectSet<String> prefixSet = new ObjectLinkedOpenHashSet<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need a set since all prefixes have different length. Same for other functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, thanks
*/ | ||
@ScalarFunction | ||
public static String[] uniquePrefixesWithPrefix(String input, int maxlength, String prefix) { | ||
if (prefix == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to accept null
, you want to annotate it as nullableParameters
. Please also annotate the parameter to be @Nullable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not addressed ^^
Take a look at ScalarFunction.class
. You need to annotate it as nullableParameters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @Jackie-Jiang for pointer. updated
9fa29e7
to
9859704
Compare
*/ | ||
@ScalarFunction | ||
public static String[] prefixes(String input, int maxlength) { | ||
ObjectList<String> prefixList = new ObjectArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ObjectArrayList
is not really buying us anything here.
Given we know the number of prefixes upfront, we can directly allocate the array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replaced the objectArrayList with a fixSizeArr. thanks
*/ | ||
@ScalarFunction | ||
public static String[] uniquePrefixesWithPrefix(String input, int maxlength, String prefix) { | ||
if (prefix == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not addressed ^^
Take a look at ScalarFunction.class
. You need to annotate it as nullableParameters
@@ -18,6 +18,10 @@ | |||
*/ | |||
package org.apache.pinot.common.function.scalar; | |||
|
|||
import it.unimi.dsi.fastutil.objects.ObjectArrayList; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to use fastutil here. We can use the java default ones
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly good
* @param prefix the prefix to be prepended to prefix strings generated. e.g. '^' for regex matching | ||
* @return generate an array of prefix matchers of the string that are shorter than the specified length. | ||
*/ | ||
@ScalarFunction(nullableParameters = true, names = {"prefix"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to alias it to prefix
? I don't think this is really prefix
* @param suffix the suffix string to be appended for suffix strings generated. e.g. '$' for regex matching. | ||
* @return generate an array of suffix matchers of the string that are shorter than the specified length. | ||
*/ | ||
@ScalarFunction(nullableParameters = true, names = {"suffix"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, thanks @Jackie-Jiang
*/ | ||
@ScalarFunction | ||
public static String[] uniqueNgrams(String input, int minGram, int maxGram) { | ||
ObjectSet<String> ngramSet = new ObjectLinkedOpenHashSet<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @Jackie-Jiang ngrams doesn't guarantee to be unique, right? so the usage of Set is to dedup and avoid duplicates.
0af362a
to
179d70a
Compare
Has this been added to the docs @deemoliu ? |
feature
: Adding ngram, prefix, postfix UDFsContext:
We are onboarding a use case and trying the inrease query throughput. We tested the QPS cannot further improved with the existing REGEXP_LIKE queries or text_match queries. The queries as follows
The plan is to generated the derived columns that persisted prefix, postfix, and ngram to use inverted indexes to filter the result fast, and add the text match indexes to do validation after filtering to avoid false positive result.
This patch is created to generate prefix, postfix, and ngrams for a field.
it can be used by the following transformation config