Optimise ChainId epoch format parsing #725

mina86 · 2023-06-21T14:06:32Z

from_string checks if argument is in epoch format and then calls
chain_version which does it again. chain_version and with_version use
regex to check if string is in epoch format and then not only parse
the string again but unnecessarily collect dash-separated parts into
a new vector.

Optimise it by providing split_chain_id method which checks the format
and returns individual name and version parts of the id if it is in
epoch format. Furthermore, don’t use regex which is a heavy machinery
for simply checking if string ends in a positive integer.

PR author checklist:

Added changelog entry, using unclog.
Added tests.
Linked to GitHub issue.
Updated code comments and documentation (e.g., docs/).
Tagged one reviewer who will be the one responsible for shepherding this PR.

Reviewer checklist:

Reviewed Files changed in the GitHub PR explorer.
Manually tested (in case integration/unit/mock tests are absent).

codecov · 2023-07-03T11:33:24Z

Codecov Report

Patch coverage: 59.09% and project coverage change: -0.01% ⚠️

Comparison is base (1be6699) 71.22% compared to head (e809fd2) 71.22%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #725      +/-   ##
==========================================
- Coverage   71.22%   71.22%   -0.01%     
==========================================
  Files         124      124              
  Lines       14790    14782       -8     
==========================================
- Hits        10534    10528       -6     
+ Misses       4256     4254       -2

Files Changed	Coverage Δ
crates/ibc/src/core/ics24_host/identifier.rs	`57.28% <59.09%> (-0.69%)`	⬇️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

mina86 · 2023-07-08T23:21:43Z

@plafer, friendly ping.

Farhad-Shabani · 2023-07-13T14:16:39Z

@mina86 apologies for the delay in getting back to you. We have been looking for an appropriate time to carefully re-examine the details. Specifically, Under ics24_host/identifier/validate.rs there are some functions that could be modified and used for ChainId as well. Wanted to ensure we're on the right track. Though, we still may proceed in a timely manner. Hope you won't be blocked :)

Also, I’ve summarized your points that haven’t been tackled yet, plus some others we came across in issue #761 for our reference and the unclog. Since the new changes would greatly differ from the existing PRs, I'll open a new one. I hope that's okay with you. Let us know if there's anything we may have overlooked.
Again, much appreciated for bringing these up 🙏

mina86 · 2023-07-13T14:43:05Z

Want me to drop this and #698 PRs? I think this PR might still be valid as it addresses some of the issues and refactors code to form where further changes will be easier to make.

Farhad-Shabani · 2023-07-13T15:23:01Z

Want me to drop this and #698 PRs? I think this PR might still be valid as it addresses some of the issues and refactors code to form where further changes will be easier to make.

I've opened #762. Though still is a work in progress, you can take a look at changes there, which are not in line with the mentioned PRs. Btw, I suggest keep them open till we make sure everything is sorted in #762. If not, we can come back later.

Farhad-Shabani

@mina86 Let’s get this PR done first :)

Left some comments.
Also, can you add a changelog entry for this under .changelog/unreleased/improvements?

crates/ibc/src/core/ics24_host/identifier.rs

Farhad-Shabani · 2023-07-17T18:45:57Z

crates/ibc/src/core/ics24_host/identifier.rs

+    /// non-empty and version is a positive integer starting with a non-zero
+    /// digit.  If `chain_id` is in that format, returns `(name, version)`;
+    /// otherwise returns an error.
+    fn split_chain_id(chain_id: &str) -> Result<(&str, &str), ()> {


Here would be ideal to return of a Result<(&str, u64), IdentifierError> type, so then:

Can utilize parse() method and avoid manually parsing versions.

Access to a u64 wherever is_split_chain called without additional version conversion, even though it adds a bit of overhead.

Returns the IdentifierError as a known type to the system.

I’d rather not do this in this PR to be honest. Having this function parse the version changes the behaviour of is_epoch_format which I’d rather do in separate commit so this one can go in without having to worry about change in behaviour.

(1) Regarding the error type, let's stick to using IdentifierError anywhere that would return an identifier error. So, handling/propagating errors would remain consistent across ibc-rs.

(2) About the version parsing thing, mainly meant to keep things simple which got me thinking more... Are we sure changing is_epoch_format is worth all this added complexity?
Ran a quick local benchmark to compare performance between regex and non-regex approaches. This is on the order of ~1 microsecond! Honestly, we're not gaining much b/c of that. It's much better to prioritize readability/maintainability instead of optimization. Particularly, I noticed it still considers e.g. . -1 and comsos hub-1 (with whitespace) valid! Let's keep things simple and as is for is_epoch_format in this PR.

Btw, still through this PR, the redundant call to is_epoch_format in from_string would be removed

To me 1µs to parse handful of bytes does sound like a lot of time especially as this code is going to be used in smart contracts when execution time and code size (I have another change which removes safe-regex dependency completely) are more expensive than on bare metal.

Regarding other error checking that can happen in follow up changes. Your #762 seems like it going into that direction and if there are other existing functions checking if identifiers consist of valid characters those will be able to be reused.

To me 1µs to parse handful of bytes does sound like a lot of time especially as this code is going to be used in smart contracts when execution time and code size (I have another change which removes safe-regex dependency completely) are more expensive than on bare metal.

Totally see your point! Though, given the costs the entire IBC messaging process already has, this one is not honestly a major one, particularly when it comes to (1)correctness and (2)maintainability, I still strongly value them above (3)optimization here. Btw, feel free to share more context if this really matters to your project.

Regarding other error checking that can happen in follow up changes. Your #762 seems like it going into that direction and if there are other existing functions checking if identifiers consist of valid characters those will be able to be reused.

Hmm, note that #762 vs this PR has different views and even implementation. Here, we're dropping regex b/c of optimization vs there, a part of regex might be discarded b/c of fixing the two bugs noted in #761 (correctness) and delegating these checks to an existing function (maintainability) so all identifiers get checked consistently too.

Anyhow, to not mess with the mentioned checks right after this PR and see a desire to leave with rsplit_once :)

I suggest we port here part of 762 modifications related to checks in split_chain_id:
Basically, the validate_identifier should be broken down. One for checking the characters say validate_identifier_chars, and the other would handle length verification say validate_identifier_length (like here). Then, we can use validate_identifier_chars to ensure the chain name meets ICS-24 spec and maintain identifier checks consistent across IBC-rs.

Though, for the version string check, still believe regex and a parse() afterward suffice.

And, much appreciated to include a unclog 😊

I suggest we port here part of imp(ics24)!: enhancements and fixes to ChainId impl and validation #762 modifications related to checks in split_chain_id: Basically, the validate_identifier

Though, for the version string check, still believe regex and a parse() afterward suffice.

But notice that at that point regular expressions won’t simplify things. rsplit_once can split identifier into name and version, validate_identifier can validate the name, parse can validate the version and we only need check that name doesn’t end with a dash¹ and version doesn’t start with a zero:

fn split_chain_id(chain_id: &str) -> Result<(&str, u64), _> { let (name, version) = chain_id.rsplit_once('-').ok_or(NotEpochFormat)?; validate_identifier(name, 6, 43)?; if name.last() == Some(&b'-') || version.first() == Some(&b'0') { return Err(NotEpochFormat); } let version = NonZeroU64::from_str(version).map_err(|_| NotEpochFormat)?.get(); Ok((name, version)) }

And because safe-regex doesn’t work on strings, the same function using regex becomes longer due to additional &[u8] → str conversion:

fn split_chain_id(chain_id: &str) -> Result<(&str, u64), _> { let (name, version) = regex!(br"(.*[^-])-([1-9][0-9]*)") .match_slices(chain_id.as_bytes()) .ok_or_else(|| NotEpochFormat)?; let name = unsafe { core::str::from_utf8_unchecked(name) }; validate_identifier(name, 6, 43)?; let version = unsafe { core::str::from_utf8_unchecked(version) }; let version = NonZeroU64::from_str(version).map_err(|_| NotEpochFormat)?; Ok((name, version.get())) }

¹ To be honest, since validate_identifier allows identifiers ending with a dash, I’d even argue that chain--1 should be considered valid chain identifier.

And because safe-regex doesn’t work on strings, the same function using regex becomes longer due to additional &[u8] → str conversion:
let (name, version) = regex!(br"(.[^-])-([1-9][0-9])")

I meant using regex only for checking the version after extracting its string by rsplit_once.

fn split_chain_id(chain_id: &str) -> Result<(&str, u64), _> { let (name, version) = chain_id.rsplit_once('-').ok_or(NotEpochFormat)?; validate_identifier(name, 6, 43)?; if name.last() == Some(&b'-') || version.first() == Some(&b'0') { return Err(NotEpochFormat); } let version = NonZeroU64::from_str(version).map_err(|_| NotEpochFormat)?.get(); Ok((name, version)) }

Nevertheless, I think the above function works with a bit of adjustment:

¹ To be honest, since validate_identifier allows identifiers ending with a dash, I’d even argue that chain--1 should be considered valid chain identifier.

On the same page with you. name.last() == Some(&b'-') not needed. 👍🏻

version doesn’t start with a zero

True. should have this check. Though, note that version == 0 is valid but others starting with zero no.

Err(NotEpochFormat)

For the errors, the IdentifierError type should be used

validate_identifier(name, 6, 43)?;

Min length == 3. Right?

On the same page with you. name.last() == Some(&b'-') not needed.

Note though that the check is in the current code:

pub fn is_epoch_format(chain_id: &str) -> bool { let re = safe_regex::regex!(br".*[^-]-[1-9][0-9]*"); // ^^^^ re.is_match(chain_id.as_bytes()) }

version doesn’t start with a zero

True. should have this check. Though, note that version == 0 is valid but others starting with zero no.

Current code rejects zero:

let re = safe_regex::regex!(br".*[^-]-[1-9][0-9]*"); // ^^^^ re.is_match(chain_id.as_bytes())

It is true that ChainId::new will happily create such identifiers though.

validate_identifier(name, 6, 43)?;

Min length == 3. Right?

~~I’ve looked at ICS 24 and it suggests default range for channel identifier to be [8, 64]. Adjusting it for a dash and version which together take from 2 to 21 characters, I came up with [6, 43].~~ [Never mind, it’s channel identifier not chain identifier so perhaps it doesn’t apply?] If we look at the current code it’s more like minimum = 1. I don’t have an opinion. [6, 43] was just for demonstration.

Anyway, if you’re ok moving away from regex, I would suggest getting this PR merged and than I can pick up #698 again with more validation. There we can discuss all the rules that identifiers need to follow and how the API should behave when encountering invalid identifiers. There is a lot of questions to consider (which is why I want to do this cleanup first before dealing validation).

If we look at the current code it’s more like minimum = 1. I don’t have an opinion. [6, 43] was just for demonstration.

3 as the min acceptable length (a letter for the chain name + - + a number-letter) is ok.

Note though that the check is in the current code:

Current code rejects zero:

Exactly, These are the reasons If regex needs to be refactored, it should be done because of correctness, not optimization!

Anyway, if you’re ok moving away from regex, I would suggest getting this PR merged and than I can pick up #698 again with more validation.

Sorry, but regex is preferred over the current split_chain_id. It does not make sense to refactor things again right after this PR.
If we really want to move away from regex here, I am ok with the discussed function above, which is primarily defined to handle the validation stuff.

crates/ibc/src/core/ics24_host/identifier.rs

from_string checks if argument is in epoch format and then calls chain_version which does it again. chain_version and with_version use regex to check if string is in epoch format and then not only parse the string again but unnecessarily collect dash-separated parts into a new vector. Optimise it by providing split_chain_id method which checks the format and returns individual name and version parts of the id if it is in epoch format. Furthermore, don’t use regex which is a heavy machinery for simply checking if string ends in a positive integer.

Farhad-Shabani · 2023-07-28T18:41:43Z

Closing this, as we ended up fixing the ChainId issues all at once in #762, given our current bandwidth and priorities. Really appreciate your contribution 🙏

mina86 force-pushed the b branch from 9ef6ba0 to 7f64901 Compare July 3, 2023 11:28

Farhad-Shabani mentioned this pull request Jul 13, 2023

[ICS24] Enhancements and fixes for ChainId handling and validation #761

Closed

Farhad-Shabani requested changes Jul 17, 2023

View reviewed changes

mina86 force-pushed the b branch 2 times, most recently from 7968887 to 5fe7923 Compare July 17, 2023 19:27

mina86 requested a review from Farhad-Shabani July 17, 2023 23:03

mina86 force-pushed the b branch from 5fe7923 to e809fd2 Compare July 26, 2023 11:31

Farhad-Shabani closed this Jul 28, 2023

mina86 deleted the b branch July 28, 2023 18:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimise ChainId epoch format parsing #725

Optimise ChainId epoch format parsing #725

mina86 commented Jun 21, 2023

codecov bot commented Jul 3, 2023 •

edited

Loading

mina86 commented Jul 8, 2023

Farhad-Shabani commented Jul 13, 2023

mina86 commented Jul 13, 2023

Farhad-Shabani commented Jul 13, 2023

Farhad-Shabani left a comment

Farhad-Shabani Jul 17, 2023

mina86 Jul 17, 2023

Farhad-Shabani Jul 18, 2023 •

edited

Loading

mina86 Jul 26, 2023

Farhad-Shabani Jul 26, 2023

mina86 Jul 26, 2023

Farhad-Shabani Jul 26, 2023

mina86 Jul 26, 2023 •

edited

Loading

Farhad-Shabani Jul 26, 2023 •

edited

Loading

Farhad-Shabani commented Jul 28, 2023

Optimise ChainId epoch format parsing #725

Optimise ChainId epoch format parsing #725

Conversation

mina86 commented Jun 21, 2023

PR author checklist:

Reviewer checklist:

codecov bot commented Jul 3, 2023 • edited Loading

Codecov Report

mina86 commented Jul 8, 2023

Farhad-Shabani commented Jul 13, 2023

mina86 commented Jul 13, 2023

Farhad-Shabani commented Jul 13, 2023

Farhad-Shabani left a comment

Choose a reason for hiding this comment

Farhad-Shabani Jul 17, 2023

Choose a reason for hiding this comment

mina86 Jul 17, 2023

Choose a reason for hiding this comment

Farhad-Shabani Jul 18, 2023 • edited Loading

Choose a reason for hiding this comment

mina86 Jul 26, 2023

Choose a reason for hiding this comment

Farhad-Shabani Jul 26, 2023

Choose a reason for hiding this comment

mina86 Jul 26, 2023

Choose a reason for hiding this comment

Farhad-Shabani Jul 26, 2023

Choose a reason for hiding this comment

mina86 Jul 26, 2023 • edited Loading

Choose a reason for hiding this comment

Farhad-Shabani Jul 26, 2023 • edited Loading

Choose a reason for hiding this comment

Farhad-Shabani commented Jul 28, 2023

codecov bot commented Jul 3, 2023 •

edited

Loading

Farhad-Shabani Jul 18, 2023 •

edited

Loading

mina86 Jul 26, 2023 •

edited

Loading

Farhad-Shabani Jul 26, 2023 •

edited

Loading