Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: Use tika to detect content type more accurately #303

Merged
merged 2 commits into from
Aug 1, 2022

Conversation

topikachu
Copy link

@topikachu topikachu commented Jul 21, 2022

Use tika to detect content type more accurately.
I perform a manual test that the content type in s3 is expected.
image

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests - that demonstrates feature works or fixes the issue

Sorry, something went wrong.

@topikachu
Copy link
Author

@jglick
Could you take a look at this change?

@jglick jglick added the enhancement New feature or request label Jul 25, 2022
Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

652K (no transitive deps) so will slow down

Map<String, String> contentTypes = workspace.act(new ContentTypeGuesser(new ArrayList<>(artifacts.keySet()), listener));
a bit when first called on a given agent (typically every build, when using one-shot Clouds). Not sure it is worth the cost to get more correct content types in some cases; would prefer for Files.probeContentType to be improved in the JDK. Did you have a particular content type that the plugin currently fails to detect properly?

@topikachu
Copy link
Author

topikachu commented Jul 28, 2022

652K (no transitive deps) so will slow down

Map<String, String> contentTypes = workspace.act(new ContentTypeGuesser(new ArrayList<>(artifacts.keySet()), listener));

a bit when first called on a given agent (typically every build, when using one-shot Clouds). Not sure it is worth the cost to get more correct content types in some cases; would prefer for Files.probeContentType to be improved in the JDK. Did you have a particular content type that the plugin currently fails to detect properly?

My ".log" file and ".json" files are all in "binary/octet-stream" type.
So when I click the file, the browser pops up a download dialog.

I noticed that it's MasterToSlaveFileCallable so the classes are sent to the agent.
I'm not sure if Files.probeContentType can be improved because it's at the agent side and use the system-wide SPI.

One solution is to create another ContentTypeGuesser using tika implementation and make this an option on the UI side

What do you think about it?

@jglick
Copy link
Member

jglick commented Jul 28, 2022

Ouch, *.json seems like a pretty basic failure.

Prefer to avoid cluttering the UI with obscure options.

A possible compromise: adjust ContentTypeGuesser to first try the Java Platform’s built-in techniques, as now. If these fail (binary/octet-stream fallback), then and only then use Tika. Might need to include the static references to Tika in a distinct method, to make sure the JVM does not try to load that JAR unless and until the method is called. (It is possible to use the support-core plugin to verify this interactively. There is probably some way to verify this in a functional test too, but that sounds like overkill.)

@topikachu
Copy link
Author

Ouch, *.json seems like a pretty basic failure.

Prefer to avoid cluttering the UI with obscure options.

A possible compromise: adjust ContentTypeGuesser to first try the Java Platform’s built-in techniques, as now. If these fail (binary/octet-stream fallback), then and only then use Tika. Might need to include the static references to Tika in a distinct method, to make sure the JVM does not try to load that JAR unless and until the method is called. (It is possible to use the support-core plugin to verify this interactively. There is probably some way to verify this in a functional test too, but that sounds like overkill.)

I change the code that only uses tika when JRE can't detect the content type.
I run the test NetworkTest with -Dorg.jvnet.hudson.test.HudsonTestCase.slaveDebugPort=5005 so I can debug the agent side.
The hudson.remoting.RemoteClassLoader only downloads the 'tika' jar until the method is called.

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Would you mind adding something to

p.setDefinition(new CpsFlowDefinition("node('remote') {writeFile file: 'f.txt', text: '" + text + "'; writeFile file: 'f.html', text: '" + html + "'; writeFile file: 'f', text: '\\u0000'; archiveArtifacts 'f*'}", true));
j.buildAndAssertSuccess(p);
WebResponse response = j.createWebClient().goTo("job/p/1/artifact/f.txt", null).getWebResponse();
assertThat(response.getContentAsString(), equalTo(text));
assertThat(response.getContentType(), equalTo("text/plain"));
response = j.createWebClient().goTo("job/p/1/artifact/f.html", null).getWebResponse();
assertThat(response.getContentAsString(), equalTo(html));
assertThat(response.getContentType(), equalTo("text/html"));
response = j.createWebClient().goTo("job/p/1/artifact/f", null).getWebResponse();
assertThat(response.getContentLength(), equalTo(1L));
assertThat(response.getContentType(), containsString("/octet-stream"));
showing that *.json or whatever now gets detected properly?

@jglick
Copy link
Member

jglick commented Jul 29, 2022

(BTW for purposes of review it is nicer to push follow-on commits, as rebasing breaks incremental review. Does not matter much in this case since the diff is so short to begin with.)

Move Tika to a separate util class to avoid repeatedly initializing.
Add test to check the correct json content type
@topikachu
Copy link
Author

Nice! Would you mind adding something to

p.setDefinition(new CpsFlowDefinition("node('remote') {writeFile file: 'f.txt', text: '" + text + "'; writeFile file: 'f.html', text: '" + html + "'; writeFile file: 'f', text: '\\u0000'; archiveArtifacts 'f*'}", true));
j.buildAndAssertSuccess(p);
WebResponse response = j.createWebClient().goTo("job/p/1/artifact/f.txt", null).getWebResponse();
assertThat(response.getContentAsString(), equalTo(text));
assertThat(response.getContentType(), equalTo("text/plain"));
response = j.createWebClient().goTo("job/p/1/artifact/f.html", null).getWebResponse();
assertThat(response.getContentAsString(), equalTo(html));
assertThat(response.getContentType(), equalTo("text/html"));
response = j.createWebClient().goTo("job/p/1/artifact/f", null).getWebResponse();
assertThat(response.getContentLength(), equalTo(1L));
assertThat(response.getContentType(), containsString("/octet-stream"));

showing that *.json or whatever now gets detected properly?

Done!

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants