Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lift expensive Regex construction from DateFormat method body. #43647

Merged
merged 1 commit into from
Jan 6, 2022

Conversation

Sacha0
Copy link
Member

@Sacha0 Sacha0 commented Jan 3, 2022

Constructing the Regex touched in this commit can represent a
significant fraction (e.g. half or better) of the runtime of
the DateFormat method touched in this commit. To make this
DateFormat method more efficient, let's lift that Regex
construction out of that method body.

Data:

julia> using Dates

julia> using BenchmarkTools

julia> format = "y-m-dTH:M:S.s"
"y-m-dTH:M:S.s"

On master

julia> @benchmark DateFormat($format)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  12.448 μs …  7.292 ms  ┊ GC (min … max): 0.00% … 45.46%
 Time  (median):     15.276 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.787 μs ± 73.010 μs  ┊ GC (mean ± σ):  1.97% ±  0.45%

   ▄▃ █▄
  ▄█████▄▇▇▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▁▁▂▂▁▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  12.4 μs         Histogram: frequency by time        48.4 μs <

 Memory estimate: 3.15 KiB, allocs estimate: 57.

and on this branch

julia> @benchmark DateFormat($format)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.451 μs … 561.158 μs  ┊ GC (min … max): 0.00% … 98.43%
 Time  (median):     5.955 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.551 μs ±   9.486 μs  ┊ GC (mean ± σ):  2.46% ±  1.71%

  ▇█     ▃
  ██▆▆▃▂▄█▇▇▆▄▃▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  5.45 μs         Histogram: frequency by time        13.8 μs <

 Memory estimate: 2.92 KiB, allocs estimate: 53.

Best!

@Sacha0 Sacha0 added the performance Must go faster label Jan 3, 2022
KristofferC
KristofferC approved these changes Jan 3, 2022
@KristofferC KristofferC added the dates Dates, times, and the Dates stdlib module label Jan 3, 2022
@KristofferC
Copy link
Member

KristofferC commented Jan 3, 2022

Actually, does CONVERSION_SPECIFIERS get mutated somewhere? In that case, maybe check that that hadn't change and invalidate the regex on that case.

@Sacha0
Copy link
Member Author

Sacha0 commented Jan 3, 2022

Actually, does CONVERSION_SPECIFIERS get mutated somewhere? In that case, maybe check that that hadn't change and invalidate the regex on that case.

Good catch! It doesn't seem to be mutated anywhere in Dates, but this comment (Dates/src/io.jl:291) indicates that packages may mutate it:

# Map conversion specifiers or character codes to tokens.
# Note: Allow addition of new character codes added by packages

Hm 🤔

@Sacha0
Copy link
Member Author

Sacha0 commented Jan 3, 2022

It looks like the failing test captures precisely that possibility:

    try
        # Make 'Z' into a specifier
        Dates.CONVERSION_SPECIFIERS['Z'] = Zulu
        Dates.CONVERSION_DEFAULTS[Zulu] = ""

        @test Dates.parse_components(str, Dates.DateFormat(format)) == [parsed; Zulu("Z")]
        @test Dates.parse_components(str, Dates.DateFormat(escaped_format)) == parsed
    finally
        delete!(Dates.CONVERSION_SPECIFIERS, 'Z')
        delete!(Dates.CONVERSION_DEFAULTS, Zulu)
    end

@Sacha0
Copy link
Member Author

Sacha0 commented Jan 3, 2022

maybe check that that hadn't change and invalidate the regex on that case.

I'm sketching an implementation now, but along similar lines: It seems that updating CONVERSION_SPECIFIERS requires updating CONVERSION_DEFAULTS, and this mechanism is internal (i.e. documented only in the comments adjacent to those definitions). Perhaps introducing a little internal interface for changing these would be worthwhile, e.g. an internal method that accepts a specifier and default, that would update both the existing constants and also the const regex? That would promote consistency of those objects, and allow us to avoid the cost of checking CONVERSION_SPECIFIERS for changes on each DateFormat call. Something to consider :).

@Sacha0
Copy link
Member Author

Sacha0 commented Jan 4, 2022

Reworked to check for changes to CONVERSION_SPECIFIERS and update the cached Regex accordingly. The check seems to consume 50-80ns locally, which seems to be a couple orders below the runtime of DateFormat(format, locale), which is to say it's more or less negligible absent additional substantial optimization of the rest of DateFormat(format, locale).

@Sacha0 Sacha0 requested a review from KristofferC January 4, 2022 00:33
@JeffBezanson
Copy link
Member

Looks good. I wonder if it would be enough just to check the length? Hopefully nobody deletes format characters...? 😄

@Sacha0
Copy link
Member Author

Sacha0 commented Jan 4, 2022

Hopefully nobody deletes format characters...? 😄

I sure hope not 😄

I wonder if it would be enough just to check the length?

Perhaps so 🤔. The cost of checking the hash of the keys seems negligible relative to the overall cost of the operation, though, so perhaps let's defer that (potentially risky) optimization till the cost of the check becomes relevant? :)

CI seems happy, Kristoffer seems happy (modulo the now-addressed concern), and Jeff seems happy. So absent objections or requests for time, I plan to merge this patch later today :).

@Sacha0 Sacha0 added backport 1.6 Change should be backported to release-1.6 backport 1.7 labels Jan 4, 2022
@Sacha0
Copy link
Member Author

Sacha0 commented Jan 4, 2022

Marked for backport to 1.6 as purely a performance optimization, and for backport to 1.7 as a partial mitigation of #43610.

# To understand this block, please see the comments attached to the
# definitions of DATEFORMAT_REGEX_CACHE and CONVERSION_SPECIFIERS_KEYS_HASH above.
conversion_specifiers_keys_hash = hash(keys(CONVERSION_SPECIFIERS))
if conversion_specifiers_keys_hash != CONVERSION_SPECIFIERS_KEYS_HASH[]
Copy link
Member

@KristofferC KristofferC Jan 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any concurrency concerns here? Maybe a lock is needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hypothetically all of these objects need locks I think, i.e. CONVERSION_SPECIFIERS, CONVERSION_DEFAULTS, CONVERSION_TRANSLATIONS, DATEFORMAT_REGEX_CACHE, and CONVERSION_SPECIFIERS_KEY_HASH. But addressing that strikes me as beyond the scope of this PR: To do so properly, we would need to introduce an interface that maintains consistency between all of these objects, and introduce appropriate reader/writer locks.

Copy link
Member

@KristofferC KristofferC Jan 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was that it used to be thread safe to create date formats but now it no longer is. I would have thought slapping a lock around the part that reads and updates the cache would make this version as safe as the original one. The cost of the lock should be insignificant compared to the work of the function. Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. It also occurs to me that this would do the wrong thing in case of a hash collision, though of course that is very unlikely.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To check my understanding, your concern is specifically about consistency of and between DATEFORMAT_REGEX_CACHE and CONVERSION_SPECIFIERS_KEYS_HASH across threads constructing DateFormats, when other threads are simultaneously modifying CONVERSION_SPECIFIERS? And not about the existing potential for races when some threads modify CONVERSION_SPECIFIERS (and friends) while others construct DateFormats (or perform other operations that consume CONVERSION_SPECIFIERS and friends)?

Sure, the former we could mitigate pretty simply via an exclusive lock on CONVERSION_SPECIFIERS_HASH / DATEFORMAT_REGEX_CACHE (at risk of introducing unnecessary contention) or less simply via a combination of shared and exclusive locks (to avoid introducing unnecessary contention). The latter seems like a broader, preexisting issue.

It also occurs to me that this would do the wrong thing in case of a hash collision, though of course that is very unlikely.

Both of these concerns seem pretty unlikely to bite in practice :).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed a new commit that introduces an exclusive lock (DATEFORMAT_REGEX_LOCK) guarding access to DATEFORMAT_REGEX_CACHE and DATEFORMAT_REGEX_HASH (née CONVERSION_SPECIFIERS_KEYS_HASH, which seemed a bit wordy if more descriptive); hopefully the potential contention isn't so bad. Does that address your concern, or are you also concerned about the preexisting issues?

Constructing the Regex touched in this commit can represent a
significant fraction (e.g. half or better) of the runtime of
the DateFormat method touched in this commit. To make this
DateFormat method more efficient, let's lift that Regex
construction out of that method body.
@Sacha0 Sacha0 force-pushed the sv-faster-dateformat branch from bc84cdb to 7fb5bb8 Compare January 4, 2022 23:19
@KristofferC KristofferC mentioned this pull request Jan 5, 2022
23 tasks
@Sacha0
Copy link
Member Author

Sacha0 commented Jan 5, 2022

CI seems happy. Any remaining concerns? :) Absent other concerns or requests for time, I plan to merge this later today or tomorrow morning.

@Sacha0 Sacha0 merged commit f5a5149 into master Jan 6, 2022
@Sacha0 Sacha0 deleted the sv-faster-dateformat branch January 6, 2022 19:10
@Sacha0
Copy link
Member Author

Sacha0 commented Jan 6, 2022

Thanks gentlemen! :)

@KristofferC KristofferC mentioned this pull request Jan 10, 2022
50 tasks
@KristofferC KristofferC removed the backport 1.6 Change should be backported to release-1.6 label May 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dates Dates, times, and the Dates stdlib module performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants