-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve regexp_match performance by avoiding cloning Regex #8631
Conversation
/// TODO: Remove this once it is included in arrow-rs new release. | ||
fn _regexp_match<OffsetSize: OffsetSizeTrait>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will go to submit this fix and benchmark to arrow-rs later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @viirya that awesome. Are you planning to move this change to arrow-rs kernel? |
Yea, as mentioned #8631 (comment), I will submit this fix with a benchmark on the kernel to arrow-rs later. |
regex_array: &GenericStringArray<OffsetSize>, | ||
flags_array: Option<&GenericStringArray<OffsetSize>>, | ||
) -> std::result::Result<ArrayRef, ArrowError> { | ||
let mut patterns: std::collections::HashMap<String, Regex> = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome -- I think we can do better still (this will still recompile each regex per batch I think -- we can do it once per plan)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, it is per batch (not per row). Definitely if we do the regex compilation per query plan, it will be better. 🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, to clarify it if the long PR description is overlooked by reviewers. This hash map is already in the arrow-rs kernel. The actual fix here is to avoid expensive cloning of Regex
per row. I described the reason in the description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb are you thinking to have some static shared memory pool per query to make static values to be calculated only once per query? we can prob do that with lazy_static
and std::sync::Once
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @viirya -- I will review this PR later today
(Some(value), Some(pattern)) => { | ||
let existing_pattern = patterns.get(&pattern); | ||
let re = match existing_pattern { | ||
Some(re) => re, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actual fix. Getting rid of cloning Regex
per row.
)) | ||
})?; | ||
patterns.insert(pattern.clone(), re); | ||
patterns.get(&pattern).unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here too.
@@ -58,7 +59,7 @@ pub fn regexp_match<T: OffsetSizeTrait>(args: &[ArrayRef]) -> Result<ArrayRef> { | |||
2 => { | |||
let values = as_generic_string_array::<T>(&args[0])?; | |||
let regex = as_generic_string_array::<T>(&args[1])?; | |||
compute::regexp_match(values, regex, None).map_err(|e| arrow_datafusion_err!(e)) | |||
_regexp_match(values, regex, None).map_err(|e| arrow_datafusion_err!(e)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems we're also are not supporting the scalar version (I.e. regex being a literatal value)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you did some testing already apache/arrow-rs#5235 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have bigger "plans" -- #8051 to specialize the constant arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, for the ScalarUDF in DataFusion, we probably can rely on further optimization like #8051. Per kernel perspective, I will probably submit a Datum version later as mentioned apache/arrow-rs#5235 (comment).
I also verified that this PR improves the entire query from #8492, which also uses
I double checked at it seems like File Edit Options Buffers Tools SQL Help
SELECT
month,
ext,
COUNT(DISTINCT project_name) AS project_count
FROM (
SELECT
project_name,
DATE_TRUNC('month', uploaded_on) AS month,
NULLIF(
REPLACE(
REPLACE(
REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_MATCH(path, CONCAT('(', '.([a-z0-9]+)$', ')'))[2],
'cxx|cpp|cc|c|hpp|h',
'C/C++',
'g'
),
'^f.*$',
'Fortran',
'g'
),
'rs',
'Rust'
),
'go',
'Go'
),
'asm',
'Assembly'
),
''
) AS ext
FROM pypi
WHERE COALESCE(
ARRAY_LENGTH(
REGEXP_MATCH(path, '.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
) > 0,
FALSE
)
AND NOT COALESCE(ARRAY_LENGTH(REGEXP_MATCH(path, '(^|/)test(|s|ing)')) > 0, FALSE)
AND NOT STRPOS(path, '/site-packages/') > 0
)
WHERE ext IS NOT NULL
GROUP BY month, ext
ORDER BY month DESC, project_count DESC Here is the explain
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw that apache/arrow-rs#5235 also changes regexp_is_match_utf8
which is called here:
https://github.com/apache/arrow-datafusion/blob/03c2ef46f2d88fb015ee305ab67df6d930b780e2/datafusion/physical-expr/src/expressions/binary.rs#L36-L37
(maybe via LIKE
/ ILIKE
?)
We can probably
Headline:
1 row in set. Query took 829.932 seconds.
vs
1 row in set. Query took 33.935 seconds.
(aka 37x faster 🚀 )
Thanks @viirya
Co-authored-by: Andrew Lamb <[email protected]>
I also ran the query under the profiler with this branch, and it still spends a significant amount of time compiling the regex, so I think #8051 can still get us even more |
Hmm, impl Regex {
pub fn new(re: &str) -> Result<Regex, Error> {
RegexBuilder::new(re).build()
}
}
impl RegexBuilder {
/// Compiles the pattern given to `RegexBuilder::new` with the
/// ...
pub fn build(&self) -> Result<Regex, Error> {
...
}
} Note that this is where we catch compilation error: let re = Regex::new(pattern.as_str()).map_err(|e| {
ArrowError::ComputeError(format!(
"Regular expression did not compile: {e:?}"
))
})?; |
It looks like |
Ah, that makes sense (that captures_at is the actual matching). Maybe pre-compiling won't get us as much afterall |
Yea, just cleaned |
Thanks @alamb @comphead @Dandandan |
This is great progress. How do we get from once-per-batch to once-per-plan? |
One way would be #8051 |
Which issue does this PR close?
Closes #8492.
Rationale for this change
regexp_match
scalar expression is quite slow in DataFusion now. Using https://github.com/pypi-data/data/releases data to do test on M1 Mac laptop:After this change:
It is not too much related to padding constant regex to an array and recompiling regex per row as discussed in (#8492). Because arrow-rs
regexp_match
kernel internally maintains a hash map on duplicate regex patterns, so same regex pattern is only compiled once per batch. To verify it, I also did a factoring to use single regex if the regex is a constant, and the result confirms that because the performance didn't get better.After playing it around, the bad performance is due to cloning
Regex
per row.Regex
has aCachePool
. CloningRegex
will create a freshCachePool
(it is designed to avoid sharing the cache among multiple threads) and it is somehow expensive to re-creating cache space per row instead of reusing the cache.What changes are included in this PR?
Removing unnecessary and expensive cloning on
Regex
.Are these changes tested?
Are there any user-facing changes?