Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the handling of multibyte characters #2051

Closed

Conversation

NotFounds
Copy link
Contributor

Motivation

refs: #2021 (comment)

When a Ruby file contains multibyte characters (like Japanese, Chinese, emoji, etc), the go to definition and hover features do not work correctly. Because the document referencing logic does not properly handle multibyte characters when calculating offsets.

This commit fixes the issue by:

  • Modifying document referencing to properly handle multibyte characters when mapping between positions and offsets
  • Adding test cases to verify go to definition work with multibyte characters

Implementation

  1. Modified RubyLsp::Document and RubyLsp::Requests::Request to calculate locations considering multibyte characters. This change utilizes the API implemented in Prism by Add code unit APIs to location ruby/prism#2406.
  2. By passing the encoding when creating the Index, the index will be built taking multibyte characters into account.

Automated Tests

I added some tests for the definition and RubyLsp::RubyDocument.

Manual Tests

The definition jump for the following code snippet from #1347 works in this branch, but does not function correctly in the main branch.

class Test
  TEST = 'test'

  def method1
    # テスト
    pp TEST
  end
end

When a Ruby file contains multibyte characters (like Japanese, Chinese,
emoji, etc), the go to definition and hover features do not work
correctly. Because the document referencing logic does not properly
handle multibyte characters when calculating offsets.

This commit fixes the issue by:
* Modifying document referencing to properly handle multibyte characters
when mapping between positions and offsets
* Adding test cases to verify go to definition work with multibyte
characters
@andyw8
Copy link
Contributor

andyw8 commented May 15, 2024

cc @kddnewton in case you have any views on the Prism usage here.

sig { params(location: Prism::Location, position: T.untyped).returns(T::Boolean) }
def cover?(location, position)
sig { params(location: Prism::Location, position: T.untyped, encoding: Encoding).returns(T::Boolean) }
def cover?(location, position, encoding)
Copy link
Contributor

@andyw8 andyw8 May 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we pass the encoding to the Request initializer, and save it as an instance variable, so that we don't need to pass it around in methods as much?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your quickly review.
For now, the requests have no common attributes. We only need encoding for Requests::Hover and Requests::Definition, perhaps. So, I think it's a bit too much for the request initializer to have an encoding attribute.

However, if it’s global_state, it might be okay since it’s already used by multiple requests. Alternatively, we can pass the global_state to the request initializer instead of encoding. If it is global_state, it wouldn't be odd to maintain the state within the request, in my opinion. What do you think?

Copy link
Contributor Author

@NotFounds NotFounds May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andyw8 Since the position is now pre-calculated using byte offsets, there is no longer a need to pass the encoding to each method.

ref: be648a6

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andyw8 @vinistock
Sorry for the delay. Upon reconsideration, I believe it might be better not to have the Request initializer hold encoding/global_state in this PR. Only two subclasses currently need these attributes. Making this change would lead to a large scope of modifications. What do you think?

@andyw8 andyw8 added bugfix This PR will fix an existing bug server This pull request should be included in the server gem's release notes labels May 15, 2024
@@ -142,16 +142,20 @@ def locate(node, char_position, node_types: [])

# Skip if the current node doesn't cover the desired position
loc = candidate.location
next unless (loc.start_offset...loc.end_offset).cover?(char_position)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start_offset and end_offset are a lot less costly to calculate than start_code_units_offset and end_code_units_offset. Could we instead convert char_position into a byte offset and keep the existing code the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments.
In my understanding, the char_position is provided by the editor, so we can't convert it into a byte offset :(
Alternatively, we can memorize the loc.start_code_units_offset and others to reduce the calculation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have the variable though, so we can convert it using our own transformation right? I think the general transformation would be:

  • UTF-8: source.slice(0, code_units_offset).length
  • UTF-16: code_units_offset / 2
  • UTF-32: code_units_offset / 4

I'm not 100% sure, but I think that's it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kddnewton
Apologies for the delayed response. I have fixed the issue to convert byte offsets in this commit: be648a6. Please review it.

start_covered =
location.start_line - 1 < position[:line] ||
(
location.start_line - 1 == position[:line] &&
location.start_column <= position[:character]
location.start_code_units_column(encoding) <= position[:character]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as above here, could we convert position[:character] into a byte offset before we check cover? here?

lib/ruby_lsp/document.rb Outdated Show resolved Hide resolved
lib/ruby_lsp/test_helper.rb Outdated Show resolved Hide resolved
lib/ruby_lsp/utils.rb Outdated Show resolved Hide resolved
@NotFounds NotFounds requested review from kddnewton and andyw8 June 4, 2024 16:06
@NotFounds NotFounds marked this pull request as draft June 4, 2024 16:26
@NotFounds
Copy link
Contributor Author

It appears that from 'be648a6' onwards, it no longer works correctly on VSCode.
I'll try to fix it. I would appreciate any advice you can provide.

Copy link
Member

@vinistock vinistock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea to convert the editor's position to a byte offset is interesting from a performance perspective, but I'm not sure it's the right approach. It seems reversed in my opinion.

For example, if we fix the locations in the indexer to use code units (which is required for us to return selection ranges properly), then implementing features like find references and rename could prove a bit weird because we would need to compare a byte offset with a code unit.

Or we would need to not apply the same conversion only in those cases.

I think I'd rather favour consistency of using the code unit APIs everywhere for now and if we find performance issues we can try to address those.

It's also relevant to mention that the code unit APIs are only less performant when there are unicode characters, since they require more complicated handling. For ASCII only sources, the performance should be pretty much the same.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look style related only. Can we remove them to make it easier to review?

lib/ruby_lsp/test_helper.rb Outdated Show resolved Hide resolved
@NotFounds
Copy link
Contributor Author

@vinistock
Thanks for your comment!

I think I'd rather favour consistency of using the code unit APIs everywhere for now and if we find performance issues we can try to address those.

It sounds good to me. I'll reset or revert the commits related to byte offsets (be648a6..ff4d196).
Additionally, I'll look into refactoring the code based on this review comment (#2051 (comment)).

@NotFounds NotFounds force-pushed the fix-handling-of-multibyte-characters branch from ff4d196 to 052d674 Compare June 5, 2024 14:18
@NotFounds NotFounds marked this pull request as ready for review June 18, 2024 08:15
Copy link
Contributor

github-actions bot commented Sep 1, 2024

This pull request is being marked as stale because there was no activity in the last 2 months

@github-actions github-actions bot added the Stale label Sep 1, 2024
@github-actions github-actions bot closed this Sep 15, 2024
@NotFounds
Copy link
Contributor Author

@vinistock
I would appreciate it if you could kindly provide your comments on this PR. Should we wait for the implementation of ResourceUri?

@vinistock
Copy link
Member

vinistock commented Sep 23, 2024

@NotFounds Hi! I'm sorry, I got caught up with other priorities and dropped the ball on this one. Let's move forward with this PR before the resource URI changes since this fixes actual issues and the resource URI approach is mostly a correctness refactor.

Can you please re-open this PR (or a new one, whatever is easier) and I'll help push it over the finish line?

Let's take the approach of saving the code units with the new Prism API in the index. So essentially, we need to:

  1. Configure the index with the encoding that was negotiated between editor and server. After setting the global state encoding here, you want to grab the @index.configuration object and set the encoding there, so that we can check what is the encoding being used during indexing
  2. You then need to pass the encoding (or maybe the entire config object?) to the declaration listener, where we will use the encoding to invoke the Prism location API location.start_code_units_column(encoding) to get the proper locations for multibyte characters
  3. Finally, we should add a few tests to ensure that we don't accidentally regress. One test per entity type should be okay. These would be:

@NotFounds
Copy link
Contributor Author

@vinistock
Thank you for your response. I'm glad we can move forward with the implementation of this feature! Since the changes seem significant, I'll create a new Pull Request.

I have a few questions:

  • In the declaration listener, should I retrieve the encoding using something like @index.configuration.encoding? I planned to include the encoding in the Configuration in step 1.
  • Is considering multibyte code units during indexing primarily for performance reasons? (Just curious)

@vinistock
Copy link
Member

In the declaration listener, should I retrieve the encoding using something like @index.configuration.encoding? I planned to include the encoding in the Configuration in step 1.

Yes, that's correct. Let's pass the entire configuration object to the declaration listener. That way, if more configurations are added, it will already be able to access all of them.

Is considering multibyte code units during indexing primarily for performance reasons? (Just curious)

It's unfortunately a trade off of performance and correctness. If you have as much as a single multibyte character in a document, the entire source must be considered multibyte and computing the locations is significantly more expensive. If you don't have any multibyte characters, then Prism uses an ASCII-only optimization which is much much faster.

However, from a correctness standpoint, we need to be able to index declarations made using multibyte characters, especially for languages that need these characters like Japanese.

Also, having the correct code unit locations means that all features that depend on index entries will just work (hover, definition, completion, signature help, workspace symbol).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix This PR will fix an existing bug server This pull request should be included in the server gem's release notes Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants