Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoding issue with jdt.ls javadoc #524

Open
tolusha opened this issue Jan 19, 2018 · 15 comments
Open

Encoding issue with jdt.ls javadoc #524

tolusha opened this issue Jan 19, 2018 · 15 comments
Assignees

Comments

@tolusha
Copy link
Contributor

tolusha commented Jan 19, 2018

It is turned out that the original source isn't utf-8 encoded.
I am not sure if it is possible to handle that.

vs code:
35158006-693ee2f0-fd3e-11e7-9fb5-7512cb1fc8a1

Che:
33475895-4a2825fa-d680-11e7-887b-492aff5c38f0

eclipse-che/che#7674

@snjeza snjeza self-assigned this Jan 19, 2018
@fbricon fbricon added the bug label Jan 19, 2018
@snjeza
Copy link
Contributor

snjeza commented Jan 21, 2018

This is an m2e bug. junit-3.8.1-sources.jar uses the ISO-8859-1 encoding and m2e doesn't allow changing source attachment encoding.
The issue can't be reproduced when using junit >= 3.8.2.
See https://bugs.eclipse.org/bugs/show_bug.cgi?id=385391

@fbricon Would you like me to try to fix this issue in m2e?

@fbricon
Copy link
Contributor

fbricon commented Jan 22, 2018

@snjeza before you try to fix m2e, do you have a general idea on how to do it?

@snjeza
Copy link
Contributor

snjeza commented Jan 22, 2018

We could add a maven property as the following:

<m2e.source.attachment.encoding>groupid:artifactId:version:ISO-8859-1</m2e.source.attachment.encoding>

@fbricon
Copy link
Contributor

fbricon commented Jan 22, 2018

Mmm that would not work OOTB. And you probably need to handle javadoc and sources differently.

An alternative might be to, before attaching the file, inspect the 1st (java) file in there and try to detect the encoding via ICU4J (see https://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream/4013565#4013565), for instance.

@ifedorenko WDYT?

@ifedorenko
Copy link

Agree with Fred, the encoding has to come from somewhere and encoding detection seems like a reasonable thing to do. Although I guess it won't be 100% reliable, for things like mixed-encoding sources, not using all sources during autodetection, etc.

Other options you may want to try

  • look inside artifact pom.xml file for presence of source encoding property. Which may or may not be set correctly.
  • have hand-crafted and/or crowd-sourced artifact SHA -> encoding mapping.

good luck ;-)

@fbricon
Copy link
Contributor

fbricon commented Jan 22, 2018

@snjeza If going with ICU4J, make sure you're not depending on the icu4j bundle, but the packages, since jdt.ls only embeds the base icu4j jar

@snjeza
Copy link
Contributor

snjeza commented Jan 22, 2018

@fbricon @ifedorenko CharsetDetector returns UTF-8 for junit-3.8.1-sources.jar!/junit/framework/TestSuite.java.

@ifedorenko
Copy link

Autodetection will never be 100% reliable, especially if it does not consider all source files.

@snjeza
Copy link
Contributor

snjeza commented Jan 22, 2018

CharsetDetector doesn't work even when detecting TestSuite.java

A solution for a gradle project is:

apply plugin: 'java'
apply plugin: 'eclipse'

 eclipse {
    classpath {
        file {
            whenMerged {
                def source = entries.find { it.path.contains('junit-3.8.1-sources.jar') }
                source.entryAttributes['source_encoding'] = 'ISO-8859-1'
            }
        }
    }
}

@fbricon
Copy link
Contributor

fbricon commented Jan 22, 2018

@snjeza I'm curious, does icu4j detect anything other than utf-8 if you scan the entire junit-3.8.1-sources.jar?

@snjeza
Copy link
Contributor

snjeza commented Jan 22, 2018

I'm curious, does icu4j detect anything other than utf-8 if you scan the entire junit-3.8.1-sources.jar?

icu4j detects encoding correctly when scanning junit-3.8.1-sources.jar, icu4j-60.2-sources.jar, commons-io-2.6-sources.jar ...

If going with ICU4J, make sure you're not depending on the icu4j bundle, but the packages, since jdt.ls only embeds the base icu4j jar

@fbricon we should add the icu4j dependency to the org.eclipse.m2e.jdt bundle. jdt.ls would have to include the icu4j bundle.
com.ibm.icu.base doesn't include CharsetDetector.

@fbricon
Copy link
Contributor

fbricon commented Jan 22, 2018

  • how long does it take to full-scan a jar?
  • I'm interested to know whether we can accurately detect encoding after scanning say 5, 10, 100 files in a jar. I'd rather sample the scanning than perform a complete archive scan
  • can you see if other frameworks produce similar results? I'm not thrilled by the idea of adding another 12MB to jdt.ls simply to be able to detect source code encoding

@fbricon fbricon added this to the End January 2018 milestone Jan 24, 2018
snjeza added a commit to snjeza/m2e-core-tests that referenced this issue Jan 31, 2018
Test for eclipse-jdtls/eclipse.jdt.ls#524

Signed-off-by: Snjezana Peco <[email protected]>
@snjeza
Copy link
Contributor

snjeza commented Jan 31, 2018

@fbricon
Copy link
Contributor

fbricon commented Mar 27, 2018

Since encoding detection gives worse results than the default UTF-8 encoding,
we're gonna need a new command to explicitly set encoding to a URI (now that m2e persists that information).

@fbricon fbricon removed this from the End March 2018 milestone Mar 27, 2018
@fbricon
Copy link
Contributor

fbricon commented Apr 12, 2018

WIP to expose a setSourceEncoding command:

set-encoding

@fbricon fbricon assigned fbricon and unassigned snjeza Jun 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants