Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimise ChainId epoch format parsing #725

Closed
wants to merge 1 commit into from
Closed

Conversation

mina86
Copy link
Contributor

@mina86 mina86 commented Jun 21, 2023

from_string checks if argument is in epoch format and then calls
chain_version which does it again. chain_version and with_version use
regex to check if string is in epoch format and then not only parse
the string again but unnecessarily collect dash-separated parts into
a new vector.

Optimise it by providing split_chain_id method which checks the format
and returns individual name and version parts of the id if it is in
epoch format. Furthermore, don’t use regex which is a heavy machinery
for simply checking if string ends in a positive integer.


PR author checklist:

  • Added changelog entry, using unclog.
  • Added tests.
  • Linked to GitHub issue.
  • Updated code comments and documentation (e.g., docs/).
  • Tagged one reviewer who will be the one responsible for shepherding this PR.

Reviewer checklist:

  • Reviewed Files changed in the GitHub PR explorer.
  • Manually tested (in case integration/unit/mock tests are absent).

@codecov
Copy link

codecov bot commented Jul 3, 2023

Codecov Report

Patch coverage: 59.09% and project coverage change: -0.01% ⚠️

Comparison is base (1be6699) 71.22% compared to head (e809fd2) 71.22%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #725      +/-   ##
==========================================
- Coverage   71.22%   71.22%   -0.01%     
==========================================
  Files         124      124              
  Lines       14790    14782       -8     
==========================================
- Hits        10534    10528       -6     
+ Misses       4256     4254       -2     
Files Changed Coverage Δ
crates/ibc/src/core/ics24_host/identifier.rs 57.28% <59.09%> (-0.69%) ⬇️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mina86
Copy link
Contributor Author

mina86 commented Jul 8, 2023

@plafer, friendly ping.

@Farhad-Shabani
Copy link
Member

@mina86 apologies for the delay in getting back to you. We have been looking for an appropriate time to carefully re-examine the details. Specifically, Under ics24_host/identifier/validate.rs there are some functions that could be modified and used for ChainId as well. Wanted to ensure we're on the right track. Though, we still may proceed in a timely manner. Hope you won't be blocked :)

Also, I’ve summarized your points that haven’t been tackled yet, plus some others we came across in issue #761 for our reference and the unclog. Since the new changes would greatly differ from the existing PRs, I'll open a new one. I hope that's okay with you. Let us know if there's anything we may have overlooked.
Again, much appreciated for bringing these up 🙏

@mina86
Copy link
Contributor Author

mina86 commented Jul 13, 2023

Want me to drop this and #698 PRs? I think this PR might still be valid as it addresses some of the issues and refactors code to form where further changes will be easier to make.

@Farhad-Shabani
Copy link
Member

Want me to drop this and #698 PRs? I think this PR might still be valid as it addresses some of the issues and refactors code to form where further changes will be easier to make.

I've opened #762. Though still is a work in progress, you can take a look at changes there, which are not in line with the mentioned PRs. Btw, I suggest keep them open till we make sure everything is sorted in #762. If not, we can come back later.

Copy link
Member

@Farhad-Shabani Farhad-Shabani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mina86 Let’s get this PR done first :)

Left some comments.
Also, can you add a changelog entry for this under .changelog/unreleased/improvements?

crates/ibc/src/core/ics24_host/identifier.rs Outdated Show resolved Hide resolved
crates/ibc/src/core/ics24_host/identifier.rs Outdated Show resolved Hide resolved
/// non-empty and version is a positive integer starting with a non-zero
/// digit. If `chain_id` is in that format, returns `(name, version)`;
/// otherwise returns an error.
fn split_chain_id(chain_id: &str) -> Result<(&str, &str), ()> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here would be ideal to return of a Result<(&str, u64), IdentifierError> type, so then:

  • Can utilize parse() method and avoid manually parsing versions.
  • Access to a u64 wherever is_split_chain called without additional version conversion, even though it adds a bit of overhead.
  • Returns the IdentifierError as a known type to the system.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’d rather not do this in this PR to be honest. Having this function parse the version changes the behaviour of is_epoch_format which I’d rather do in separate commit so this one can go in without having to worry about change in behaviour.

Copy link
Member

@Farhad-Shabani Farhad-Shabani Jul 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1) Regarding the error type, let's stick to using IdentifierError anywhere that would return an identifier error. So, handling/propagating errors would remain consistent across ibc-rs.

(2) About the version parsing thing, mainly meant to keep things simple which got me thinking more... Are we sure changing is_epoch_format is worth all this added complexity?
Ran a quick local benchmark to compare performance between regex and non-regex approaches. This is on the order of ~1 microsecond! Honestly, we're not gaining much b/c of that. It's much better to prioritize readability/maintainability instead of optimization. Particularly, I noticed it still considers e.g. . -1 and comsos hub-1 (with whitespace) valid! Let's keep things simple and as is for is_epoch_format in this PR.

Btw, still through this PR, the redundant call to is_epoch_format in from_string would be removed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me 1µs to parse handful of bytes does sound like a lot of time especially as this code is going to be used in smart contracts when execution time and code size (I have another change which removes safe-regex dependency completely) are more expensive than on bare metal.

Regarding other error checking that can happen in follow up changes. Your #762 seems like it going into that direction and if there are other existing functions checking if identifiers consist of valid characters those will be able to be reused.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me 1µs to parse handful of bytes does sound like a lot of time especially as this code is going to be used in smart contracts when execution time and code size (I have another change which removes safe-regex dependency completely) are more expensive than on bare metal.

Totally see your point! Though, given the costs the entire IBC messaging process already has, this one is not honestly a major one, particularly when it comes to (1)correctness and (2)maintainability, I still strongly value them above (3)optimization here. Btw, feel free to share more context if this really matters to your project.

Regarding other error checking that can happen in follow up changes. Your #762 seems like it going into that direction and if there are other existing functions checking if identifiers consist of valid characters those will be able to be reused.

Hmm, note that #762 vs this PR has different views and even implementation. Here, we're dropping regex b/c of optimization vs there, a part of regex might be discarded b/c of fixing the two bugs noted in #761 (correctness) and delegating these checks to an existing function (maintainability) so all identifiers get checked consistently too.

Anyhow, to not mess with the mentioned checks right after this PR and see a desire to leave with rsplit_once :)

  1. I suggest we port here part of 762 modifications related to checks in split_chain_id:
    Basically, the validate_identifier should be broken down. One for checking the characters say validate_identifier_chars, and the other would handle length verification say validate_identifier_length (like here). Then, we can use validate_identifier_chars to ensure the chain name meets ICS-24 spec and maintain identifier checks consistent across IBC-rs.

  2. Though, for the version string check, still believe regex and a parse() afterward suffice.

  3. And, much appreciated to include a unclog 😊

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I suggest we port here part of imp(ics24)!: enhancements and fixes to ChainId impl and validation #762 modifications related to checks in split_chain_id: Basically, the validate_identifier
  2. Though, for the version string check, still believe regex and a parse() afterward suffice.

But notice that at that point regular expressions won’t simplify things. rsplit_once can split identifier into name and version, validate_identifier can validate the name, parse can validate the version and we only need check that name doesn’t end with a dash¹ and version doesn’t start with a zero:

fn split_chain_id(chain_id: &str) -> Result<(&str, u64), _> {
    let (name, version) = chain_id.rsplit_once('-').ok_or(NotEpochFormat)?;
    validate_identifier(name, 6, 43)?;
    if name.last() == Some(&b'-') || version.first() == Some(&b'0') {
        return Err(NotEpochFormat);
    }
    let version = NonZeroU64::from_str(version).map_err(|_| NotEpochFormat)?.get();
    Ok((name, version))
}

And because safe-regex doesn’t work on strings, the same function using regex becomes longer due to additional &[u8]str conversion:

fn split_chain_id(chain_id: &str) -> Result<(&str, u64), _> {
    let (name, version) = regex!(br"(.*[^-])-([1-9][0-9]*)")
        .match_slices(chain_id.as_bytes())
        .ok_or_else(|| NotEpochFormat)?;
    let name = unsafe { core::str::from_utf8_unchecked(name) };
    validate_identifier(name, 6, 43)?;
    let version = unsafe { core::str::from_utf8_unchecked(version) };
    let version = NonZeroU64::from_str(version).map_err(|_| NotEpochFormat)?;
    Ok((name, version.get()))
}

¹ To be honest, since validate_identifier allows identifiers ending with a dash, I’d even argue that chain--1 should be considered valid chain identifier.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And because safe-regex doesn’t work on strings, the same function using regex becomes longer due to additional &[u8] → str conversion:
let (name, version) = regex!(br"(.[^-])-([1-9][0-9])")

I meant using regex only for checking the version after extracting its string by rsplit_once.

fn split_chain_id(chain_id: &str) -> Result<(&str, u64), _> {
    let (name, version) = chain_id.rsplit_once('-').ok_or(NotEpochFormat)?;
    validate_identifier(name, 6, 43)?;
    if name.last() == Some(&b'-') || version.first() == Some(&b'0') {
        return Err(NotEpochFormat);
    }
    let version = NonZeroU64::from_str(version).map_err(|_| NotEpochFormat)?.get();
    Ok((name, version))
}

Nevertheless, I think the above function works with a bit of adjustment:

¹ To be honest, since validate_identifier allows identifiers ending with a dash, I’d even argue that chain--1 should be considered valid chain identifier.

On the same page with you. name.last() == Some(&b'-') not needed. 👍🏻

version doesn’t start with a zero

True. should have this check. Though, note that version == 0 is valid but others starting with zero no.

Err(NotEpochFormat)

For the errors, the IdentifierError type should be used

validate_identifier(name, 6, 43)?;

Min length == 3. Right?

Copy link
Contributor Author

@mina86 mina86 Jul 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the same page with you. name.last() == Some(&b'-') not needed.

Note though that the check is in the current code:

pub fn is_epoch_format(chain_id: &str) -> bool {
    let re = safe_regex::regex!(br".*[^-]-[1-9][0-9]*");
//                                   ^^^^
    re.is_match(chain_id.as_bytes())
}

version doesn’t start with a zero

True. should have this check. Though, note that version == 0 is valid but others starting with zero no.

Current code rejects zero:

    let re = safe_regex::regex!(br".*[^-]-[1-9][0-9]*");
//                                         ^^^^
    re.is_match(chain_id.as_bytes())

It is true that ChainId::new will happily create such identifiers though.

validate_identifier(name, 6, 43)?;

Min length == 3. Right?

I’ve looked at ICS 24 and it suggests default range for channel identifier to be [8, 64]. Adjusting it for a dash and version which together take from 2 to 21 characters, I came up with [6, 43]. [Never mind, it’s channel identifier not chain identifier so perhaps it doesn’t apply?] If we look at the current code it’s more like minimum = 1. I don’t have an opinion. [6, 43] was just for demonstration.


Anyway, if you’re ok moving away from regex, I would suggest getting this PR merged and than I can pick up #698 again with more validation. There we can discuss all the rules that identifiers need to follow and how the API should behave when encountering invalid identifiers. There is a lot of questions to consider (which is why I want to do this cleanup first before dealing validation).

Copy link
Member

@Farhad-Shabani Farhad-Shabani Jul 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we look at the current code it’s more like minimum = 1. I don’t have an opinion. [6, 43] was just for demonstration.

3 as the min acceptable length (a letter for the chain name + - + a number-letter) is ok.

Note though that the check is in the current code:

Current code rejects zero:

Exactly, These are the reasons If regex needs to be refactored, it should be done because of correctness, not optimization!

Anyway, if you’re ok moving away from regex, I would suggest getting this PR merged and than I can pick up #698 again with more validation.

Sorry, but regex is preferred over the current split_chain_id. It does not make sense to refactor things again right after this PR.
If we really want to move away from regex here, I am ok with the discussed function above, which is primarily defined to handle the validation stuff.

crates/ibc/src/core/ics24_host/identifier.rs Outdated Show resolved Hide resolved
@mina86 mina86 force-pushed the b branch 2 times, most recently from 7968887 to 5fe7923 Compare July 17, 2023 19:27
@mina86 mina86 requested a review from Farhad-Shabani July 17, 2023 23:03
from_string checks if argument is in epoch format and then calls
chain_version which does it again.  chain_version and with_version use
regex to check if string is in epoch format and then not only parse
the string again but unnecessarily collect dash-separated parts into
a new vector.

Optimise it by providing split_chain_id method which checks the format
and returns individual name and version parts of the id if it is in
epoch format.  Furthermore, don’t use regex which is a heavy machinery
for simply checking if string ends in a positive integer.
@Farhad-Shabani
Copy link
Member

Closing this, as we ended up fixing the ChainId issues all at once in #762, given our current bandwidth and priorities. Really appreciate your contribution 🙏

@mina86 mina86 deleted the b branch July 28, 2023 18:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants