-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimise ChainId epoch format parsing #725
Conversation
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #725 +/- ##
==========================================
- Coverage 71.22% 71.22% -0.01%
==========================================
Files 124 124
Lines 14790 14782 -8
==========================================
- Hits 10534 10528 -6
+ Misses 4256 4254 -2
☔ View full report in Codecov by Sentry. |
@plafer, friendly ping. |
@mina86 apologies for the delay in getting back to you. We have been looking for an appropriate time to carefully re-examine the details. Specifically, Under Also, I’ve summarized your points that haven’t been tackled yet, plus some others we came across in issue #761 for our reference and the |
Want me to drop this and #698 PRs? I think this PR might still be valid as it addresses some of the issues and refactors code to form where further changes will be easier to make. |
I've opened #762. Though still is a work in progress, you can take a look at changes there, which are not in line with the mentioned PRs. Btw, I suggest keep them open till we make sure everything is sorted in #762. If not, we can come back later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mina86 Let’s get this PR done first :)
Left some comments.
Also, can you add a changelog entry for this under .changelog/unreleased/improvements?
/// non-empty and version is a positive integer starting with a non-zero | ||
/// digit. If `chain_id` is in that format, returns `(name, version)`; | ||
/// otherwise returns an error. | ||
fn split_chain_id(chain_id: &str) -> Result<(&str, &str), ()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here would be ideal to return of a Result<(&str, u64), IdentifierError>
type, so then:
- Can utilize
parse()
method and avoid manually parsing versions. - Access to a
u64
whereveris_split_chain
called without additional version conversion, even though it adds a bit of overhead. - Returns the
IdentifierError
as a known type to the system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’d rather not do this in this PR to be honest. Having this function parse the version changes the behaviour of is_epoch_format
which I’d rather do in separate commit so this one can go in without having to worry about change in behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(1) Regarding the error type, let's stick to using IdentifierError
anywhere that would return an identifier error. So, handling/propagating errors would remain consistent across ibc-rs
.
(2) About the version parsing thing, mainly meant to keep things simple which got me thinking more... Are we sure changing is_epoch_format
is worth all this added complexity?
Ran a quick local benchmark to compare performance between regex and non-regex approaches. This is on the order of ~1 microsecond! Honestly, we're not gaining much b/c of that. It's much better to prioritize readability/maintainability instead of optimization. Particularly, I noticed it still considers e.g. . -1
and comsos hub-1
(with whitespace) valid! Let's keep things simple and as is for is_epoch_format
in this PR.
Btw, still through this PR, the redundant call to is_epoch_format
in from_string
would be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me 1µs to parse handful of bytes does sound like a lot of time especially as this code is going to be used in smart contracts when execution time and code size (I have another change which removes safe-regex dependency completely) are more expensive than on bare metal.
Regarding other error checking that can happen in follow up changes. Your #762 seems like it going into that direction and if there are other existing functions checking if identifiers consist of valid characters those will be able to be reused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me 1µs to parse handful of bytes does sound like a lot of time especially as this code is going to be used in smart contracts when execution time and code size (I have another change which removes safe-regex dependency completely) are more expensive than on bare metal.
Totally see your point! Though, given the costs the entire IBC messaging process already has, this one is not honestly a major one, particularly when it comes to (1)correctness and (2)maintainability, I still strongly value them above (3)optimization here. Btw, feel free to share more context if this really matters to your project.
Regarding other error checking that can happen in follow up changes. Your #762 seems like it going into that direction and if there are other existing functions checking if identifiers consist of valid characters those will be able to be reused.
Hmm, note that #762 vs this PR has different views and even implementation. Here, we're dropping regex b/c of optimization vs there, a part of regex might be discarded b/c of fixing the two bugs noted in #761 (correctness) and delegating these checks to an existing function (maintainability) so all identifiers get checked consistently too.
Anyhow, to not mess with the mentioned checks right after this PR and see a desire to leave with rsplit_once
:)
-
I suggest we port here part of 762 modifications related to checks in
split_chain_id
:
Basically, thevalidate_identifier
should be broken down. One for checking the characters sayvalidate_identifier_chars
, and the other would handle length verification sayvalidate_identifier_length
(like here). Then, we can usevalidate_identifier_chars
to ensure the chain name meets ICS-24 spec and maintain identifier checks consistent acrossIBC-rs
. -
Though, for the version string check, still believe regex and a
parse()
afterward suffice. -
And, much appreciated to include a
unclog
😊
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I suggest we port here part of imp(ics24)!: enhancements and fixes to
ChainId
impl and validation #762 modifications related to checks insplit_chain_id
: Basically, thevalidate_identifier
- Though, for the version string check, still believe regex and a
parse()
afterward suffice.
But notice that at that point regular expressions won’t simplify things. rsplit_once can split identifier into name and version, validate_identifier can validate the name, parse can validate the version and we only need check that name doesn’t end with a dash¹ and version doesn’t start with a zero:
fn split_chain_id(chain_id: &str) -> Result<(&str, u64), _> {
let (name, version) = chain_id.rsplit_once('-').ok_or(NotEpochFormat)?;
validate_identifier(name, 6, 43)?;
if name.last() == Some(&b'-') || version.first() == Some(&b'0') {
return Err(NotEpochFormat);
}
let version = NonZeroU64::from_str(version).map_err(|_| NotEpochFormat)?.get();
Ok((name, version))
}
And because safe-regex doesn’t work on strings, the same function using regex becomes longer due to additional &[u8]
→ str
conversion:
fn split_chain_id(chain_id: &str) -> Result<(&str, u64), _> {
let (name, version) = regex!(br"(.*[^-])-([1-9][0-9]*)")
.match_slices(chain_id.as_bytes())
.ok_or_else(|| NotEpochFormat)?;
let name = unsafe { core::str::from_utf8_unchecked(name) };
validate_identifier(name, 6, 43)?;
let version = unsafe { core::str::from_utf8_unchecked(version) };
let version = NonZeroU64::from_str(version).map_err(|_| NotEpochFormat)?;
Ok((name, version.get()))
}
¹ To be honest, since validate_identifier allows identifiers ending with a dash, I’d even argue that chain--1
should be considered valid chain identifier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And because safe-regex doesn’t work on strings, the same function using regex becomes longer due to additional &[u8] → str conversion:
let (name, version) = regex!(br"(.[^-])-([1-9][0-9])")
I meant using regex only for checking the version after extracting its string by rsplit_once
.
fn split_chain_id(chain_id: &str) -> Result<(&str, u64), _> { let (name, version) = chain_id.rsplit_once('-').ok_or(NotEpochFormat)?; validate_identifier(name, 6, 43)?; if name.last() == Some(&b'-') || version.first() == Some(&b'0') { return Err(NotEpochFormat); } let version = NonZeroU64::from_str(version).map_err(|_| NotEpochFormat)?.get(); Ok((name, version)) }
Nevertheless, I think the above function works with a bit of adjustment:
¹ To be honest, since validate_identifier allows identifiers ending with a dash, I’d even argue that chain--1 should be considered valid chain identifier.
On the same page with you. name.last() == Some(&b'-')
not needed. 👍🏻
version doesn’t start with a zero
True. should have this check. Though, note that version == 0
is valid but others starting with zero no.
Err(NotEpochFormat)
For the errors, the IdentifierError
type should be used
validate_identifier(name, 6, 43)?;
Min length == 3. Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the same page with you.
name.last() == Some(&b'-')
not needed.
Note though that the check is in the current code:
pub fn is_epoch_format(chain_id: &str) -> bool {
let re = safe_regex::regex!(br".*[^-]-[1-9][0-9]*");
// ^^^^
re.is_match(chain_id.as_bytes())
}
version doesn’t start with a zero
True. should have this check. Though, note that
version == 0
is valid but others starting with zero no.
Current code rejects zero:
let re = safe_regex::regex!(br".*[^-]-[1-9][0-9]*");
// ^^^^
re.is_match(chain_id.as_bytes())
It is true that ChainId::new will happily create such identifiers though.
validate_identifier(name, 6, 43)?;
Min length == 3. Right?
I’ve looked at ICS 24 and it suggests default range for channel identifier to be [8, 64]. Adjusting it for a dash and version which together take from 2 to 21 characters, I came up with [6, 43]. [Never mind, it’s channel identifier not chain identifier so perhaps it doesn’t apply?] If we look at the current code it’s more like minimum = 1. I don’t have an opinion. [6, 43] was just for demonstration.
Anyway, if you’re ok moving away from regex, I would suggest getting this PR merged and than I can pick up #698 again with more validation. There we can discuss all the rules that identifiers need to follow and how the API should behave when encountering invalid identifiers. There is a lot of questions to consider (which is why I want to do this cleanup first before dealing validation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we look at the current code it’s more like minimum = 1. I don’t have an opinion. [6, 43] was just for demonstration.
3 as the min acceptable length (a letter for the chain name + -
+ a number-letter) is ok.
Note though that the check is in the current code:
Current code rejects zero:
Exactly, These are the reasons If regex needs to be refactored, it should be done because of correctness, not optimization!
Anyway, if you’re ok moving away from regex, I would suggest getting this PR merged and than I can pick up #698 again with more validation.
Sorry, but regex is preferred over the current split_chain_id
. It does not make sense to refactor things again right after this PR.
If we really want to move away from regex here, I am ok with the discussed function above, which is primarily defined to handle the validation stuff.
7968887
to
5fe7923
Compare
from_string checks if argument is in epoch format and then calls chain_version which does it again. chain_version and with_version use regex to check if string is in epoch format and then not only parse the string again but unnecessarily collect dash-separated parts into a new vector. Optimise it by providing split_chain_id method which checks the format and returns individual name and version parts of the id if it is in epoch format. Furthermore, don’t use regex which is a heavy machinery for simply checking if string ends in a positive integer.
Closing this, as we ended up fixing the |
from_string checks if argument is in epoch format and then calls
chain_version which does it again. chain_version and with_version use
regex to check if string is in epoch format and then not only parse
the string again but unnecessarily collect dash-separated parts into
a new vector.
Optimise it by providing split_chain_id method which checks the format
and returns individual name and version parts of the id if it is in
epoch format. Furthermore, don’t use regex which is a heavy machinery
for simply checking if string ends in a positive integer.
PR author checklist:
unclog
.docs/
).Reviewer checklist:
Files changed
in the GitHub PR explorer.