-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the handling of multibyte characters #2051
Fix the handling of multibyte characters #2051
Conversation
When a Ruby file contains multibyte characters (like Japanese, Chinese, emoji, etc), the go to definition and hover features do not work correctly. Because the document referencing logic does not properly handle multibyte characters when calculating offsets. This commit fixes the issue by: * Modifying document referencing to properly handle multibyte characters when mapping between positions and offsets * Adding test cases to verify go to definition work with multibyte characters
cc @kddnewton in case you have any views on the Prism usage here. |
sig { params(location: Prism::Location, position: T.untyped).returns(T::Boolean) } | ||
def cover?(location, position) | ||
sig { params(location: Prism::Location, position: T.untyped, encoding: Encoding).returns(T::Boolean) } | ||
def cover?(location, position, encoding) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we pass the encoding to the Request
initializer, and save it as an instance variable, so that we don't need to pass it around in methods as much?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your quickly review.
For now, the requests have no common attributes. We only need encoding for Requests::Hover
and Requests::Definition
, perhaps. So, I think it's a bit too much for the request initializer to have an encoding attribute.
However, if it’s global_state, it might be okay since it’s already used by multiple requests. Alternatively, we can pass the global_state to the request initializer instead of encoding. If it is global_state, it wouldn't be odd to maintain the state within the request, in my opinion. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andyw8 @vinistock
Sorry for the delay. Upon reconsideration, I believe it might be better not to have the Request initializer hold encoding/global_state in this PR. Only two subclasses currently need these attributes. Making this change would lead to a large scope of modifications. What do you think?
@@ -142,16 +142,20 @@ def locate(node, char_position, node_types: []) | |||
|
|||
# Skip if the current node doesn't cover the desired position | |||
loc = candidate.location | |||
next unless (loc.start_offset...loc.end_offset).cover?(char_position) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
start_offset
and end_offset
are a lot less costly to calculate than start_code_units_offset
and end_code_units_offset
. Could we instead convert char_position
into a byte offset and keep the existing code the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comments.
In my understanding, the char_position is provided by the editor, so we can't convert it into a byte offset :(
Alternatively, we can memorize the loc.start_code_units_offset and others to reduce the calculation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have the variable though, so we can convert it using our own transformation right? I think the general transformation would be:
- UTF-8:
source.slice(0, code_units_offset).length
- UTF-16:
code_units_offset / 2
- UTF-32:
code_units_offset / 4
I'm not 100% sure, but I think that's it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kddnewton
Apologies for the delayed response. I have fixed the issue to convert byte offsets in this commit: be648a6. Please review it.
start_covered = | ||
location.start_line - 1 < position[:line] || | ||
( | ||
location.start_line - 1 == position[:line] && | ||
location.start_column <= position[:character] | ||
location.start_code_units_column(encoding) <= position[:character] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question as above here, could we convert position[:character]
into a byte offset before we check cover?
here?
It appears that from 'be648a6' onwards, it no longer works correctly on VSCode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea to convert the editor's position to a byte offset is interesting from a performance perspective, but I'm not sure it's the right approach. It seems reversed in my opinion.
For example, if we fix the locations in the indexer to use code units (which is required for us to return selection ranges properly), then implementing features like find references and rename could prove a bit weird because we would need to compare a byte offset with a code unit.
Or we would need to not apply the same conversion only in those cases.
I think I'd rather favour consistency of using the code unit APIs everywhere for now and if we find performance issues we can try to address those.
It's also relevant to mention that the code unit APIs are only less performant when there are unicode characters, since they require more complicated handling. For ASCII only sources, the performance should be pretty much the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes look style related only. Can we remove them to make it easier to review?
@vinistock
It sounds good to me. I'll reset or revert the commits related to byte offsets (be648a6..ff4d196). |
ff4d196
to
052d674
Compare
This pull request is being marked as stale because there was no activity in the last 2 months |
@vinistock |
@NotFounds Hi! I'm sorry, I got caught up with other priorities and dropped the ball on this one. Let's move forward with this PR before the resource URI changes since this fixes actual issues and the resource URI approach is mostly a correctness refactor. Can you please re-open this PR (or a new one, whatever is easier) and I'll help push it over the finish line? Let's take the approach of saving the code units with the new Prism API in the index. So essentially, we need to:
|
@vinistock I have a few questions:
|
Yes, that's correct. Let's pass the entire configuration object to the declaration listener. That way, if more configurations are added, it will already be able to access all of them.
It's unfortunately a trade off of performance and correctness. If you have as much as a single multibyte character in a document, the entire source must be considered multibyte and computing the locations is significantly more expensive. If you don't have any multibyte characters, then Prism uses an ASCII-only optimization which is much much faster. However, from a correctness standpoint, we need to be able to index declarations made using multibyte characters, especially for languages that need these characters like Japanese. Also, having the correct code unit locations means that all features that depend on index entries will just work (hover, definition, completion, signature help, workspace symbol). |
Motivation
refs: #2021 (comment)
When a Ruby file contains multibyte characters (like Japanese, Chinese, emoji, etc), the go to definition and hover features do not work correctly. Because the document referencing logic does not properly handle multibyte characters when calculating offsets.
This commit fixes the issue by:
Implementation
Automated Tests
I added some tests for the definition and
RubyLsp::RubyDocument
.Manual Tests
The definition jump for the following code snippet from #1347 works in this branch, but does not function correctly in the main branch.