Feat: Use tika to detect content type more accurately #303

topikachu · 2022-07-21T04:21:17Z

Use Tika to detect the content type more accurately.
I performed a manual test and confirmed that the content type in S3 is as expected.

topikachu · 2022-07-23T08:19:55Z

@jglick
Could you take a look at this change?

jglick

652K (no transitive deps) so will slow down

artifact-manager-s3-plugin/src/main/java/io/jenkins/plugins/artifact_manager_jclouds/JCloudsArtifactManager.java, line 123 in 71934b1:

    Map<String, String> contentTypes = workspace.act(new ContentTypeGuesser(new ArrayList<>(artifacts.keySet()), listener));

a bit when first called on a given agent (typically every build, when using one-shot Clouds). Not sure it is worth the cost to get more correct content types in some cases; would prefer for Files.probeContentType to be improved in the JDK. Did you have a particular content type that the plugin currently fails to detect properly?

topikachu · 2022-07-28T02:29:17Z

My ".log" and ".json" files all end up with the "binary/octet-stream" content type.
So when I click a file, the browser pops up a download dialog instead of displaying it.

I noticed that ContentTypeGuesser is a MasterToSlaveFileCallable, so its classes are sent to the agent.
I'm not sure Files.probeContentType can be improved, because it runs on the agent side and uses the system-wide SPI.

One solution is to create another ContentTypeGuesser backed by a Tika implementation and make it an option in the UI.

What do you think?
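To see what the JDK reports on a particular agent, a small probe along the following lines can help. It is only a sketch: the class name `ProbeContentTypeCheck` and the file names are made up for illustration, and the result depends on the operating system and on which `FileTypeDetector` implementations are installed, so no particular output is claimed here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the plugin.
class ProbeContentTypeCheck {

    /**
     * Writes a tiny temporary file for each name and records what
     * Files.probeContentType reports for it. A null value means the
     * installed detectors could not determine a type.
     */
    static Map<String, String> probeAll(String... names) {
        Map<String, String> result = new LinkedHashMap<>();
        try {
            Path dir = Files.createTempDirectory("probe");
            for (String name : names) {
                Path file = Files.write(dir.resolve(name), "{}".getBytes(StandardCharsets.UTF_8));
                result.put(name, Files.probeContentType(file));
                Files.delete(file);
            }
            Files.delete(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

Calling `ProbeContentTypeCheck.probeAll("build.log", "report.json")` on the agent and printing the map shows which names come back as null, which are the cases where the plugin falls back to its generic type.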

jglick · 2022-07-28T02:39:57Z

Ouch, *.json seems like a pretty basic failure.

Prefer to avoid cluttering the UI with obscure options.

A possible compromise: adjust ContentTypeGuesser to first try the Java Platform’s built-in techniques, as now. If these fail (binary/octet-stream fallback), then and only then use Tika. Might need to include the static references to Tika in a distinct method, to make sure the JVM does not try to load that JAR unless and until the method is called. (It is possible to use the support-core plugin to verify this interactively. There is probably some way to verify this in a functional test too, but that sounds like overkill.)
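The compromise described above could look roughly like the sketch below. It is not the plugin's code: `ContentTypeFallbackSketch`, `viaFallback`, and `ExtensionFallback` are hypothetical names, the use of `URLConnection.guessContentTypeFromName` as a second built-in step is an assumption, and a two-entry extension table stands in for Tika so that the example runs on a plain JDK. The point is the ordering (built-in detection first, fallback only on failure) and keeping every reference to the fallback inside one distinct method.

```java
import java.io.IOException;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; ExtensionFallback is a stand-in for a Tika-backed detector.
class ContentTypeFallbackSketch {

    static final String UNKNOWN = "binary/octet-stream";

    static String guess(Path file) {
        String type = null;
        try {
            // Step 1: the platform's installed FileTypeDetectors (may return null).
            type = Files.probeContentType(file);
        } catch (IOException e) {
            // Treat an I/O failure the same as "not detected".
        }
        if (type == null) {
            // Step 2 (assumed): the JDK's built-in extension table.
            type = URLConnection.guessContentTypeFromName(file.getFileName().toString());
        }
        if (type == null) {
            // Step 3: only now touch the fallback detector.
            type = viaFallback(file);
        }
        return type != null ? type : UNKNOWN;
    }

    // All static references to the fallback live in this one method.
    private static String viaFallback(Path file) {
        return ExtensionFallback.detect(file.getFileName().toString());
    }

    static final class ExtensionFallback {
        private static final Map<String, String> TYPES = new HashMap<>();
        static {
            TYPES.put("json", "application/json");
            TYPES.put("log", "text/plain");
        }

        /** Returns a type for known extensions, or null when the name has none or it is unknown. */
        static String detect(String name) {
            int dot = name.lastIndexOf('.');
            if (dot < 0) {
                return null;
            }
            return TYPES.get(name.substring(dot + 1).toLowerCase(Locale.ROOT));
        }
    }
}
```

With this shape, a file the JDK already recognizes never reaches `viaFallback`, so an agent that only archives such files would not need the extra JAR at all.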

topikachu · 2022-07-29T04:20:24Z

I changed the code so that it only uses Tika when the JRE can't detect the content type.
I ran the test NetworkTest with -Dorg.jvnet.hudson.test.HudsonTestCase.slaveDebugPort=5005 so that I could debug the agent side.
The hudson.remoting.RemoteClassLoader does not download the 'tika' jar until the method is called.
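The same deferred behavior can be observed in miniature without a debugger. The sketch below uses hypothetical names (`LazyLoadSketch`, `HeavyDetector`) and only demonstrates JVM class initialization, which happens on first use of a static member; it does not involve `RemoteClassLoader` or the real Tika JAR, so it illustrates the principle rather than reproducing the check described above.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; HeavyDetector stands in for a class that pulls in a large JAR.
class LazyLoadSketch {

    // Flipped by HeavyDetector's static initializer so callers can see when it ran.
    static final AtomicBoolean heavyInitialized = new AtomicBoolean(false);

    static final class HeavyDetector {
        static {
            heavyInitialized.set(true);
        }

        static String detect(String name) {
            return name.endsWith(".json") ? "application/json" : null;
        }
    }

    /** Returns the built-in result when present; otherwise asks the heavy detector. */
    static String detectWithFallback(String builtInResult, String name) {
        if (builtInResult != null) {
            return builtInResult;
        }
        return viaHeavyDetector(name);
    }

    // The only method that references HeavyDetector.
    private static String viaHeavyDetector(String name) {
        return HeavyDetector.detect(name);
    }
}
```

Calling `detectWithFallback("text/plain", "f.txt")` leaves `heavyInitialized` false, while a later `detectWithFallback(null, "f.json")` sets it to true, mirroring the observation that the JAR is fetched only once the fallback method actually runs.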

jglick

Nice! Would you mind adding something to

artifact-manager-s3-plugin/src/test/java/io/jenkins/plugins/artifact_manager_jclouds/s3/JCloudsArtifactManagerTest.java, lines 292 to 303 in 728e7db:

    p.setDefinition(new CpsFlowDefinition("node('remote') {writeFile file: 'f.txt', text: '" + text + "'; writeFile file: 'f.html', text: '" + html + "'; writeFile file: 'f', text: '\\u0000'; archiveArtifacts 'f*'}", true));
    j.buildAndAssertSuccess(p);
    WebResponse response = j.createWebClient().goTo("job/p/1/artifact/f.txt", null).getWebResponse();
    assertThat(response.getContentAsString(), equalTo(text));
    assertThat(response.getContentType(), equalTo("text/plain"));
    response = j.createWebClient().goTo("job/p/1/artifact/f.html", null).getWebResponse();
    assertThat(response.getContentAsString(), equalTo(html));
    assertThat(response.getContentType(), equalTo("text/html"));
    response = j.createWebClient().goTo("job/p/1/artifact/f", null).getWebResponse();
    assertThat(response.getContentLength(), equalTo(1L));
    assertThat(response.getContentType(), containsString("/octet-stream"));

showing that *.json or whatever now gets detected properly?

src/main/java/io/jenkins/plugins/artifact_manager_jclouds/JCloudsArtifactManager.java

jglick · 2022-07-29T15:27:26Z

(BTW for purposes of review it is nicer to push follow-on commits, as rebasing breaks incremental review. Does not matter much in this case since the diff is so short to begin with.)

topikachu · 2022-07-30T07:45:50Z

Done!

jglick

Looks nice, thanks!

jglick added the enhancement label Jul 25, 2022

jglick reviewed Jul 26, 2022

Feat: Use tika to detect content type more accurately

e421cc7

topikachu force-pushed the feature/tika branch from 71934b1 to e421cc7 Compare July 29, 2022 01:32

jglick reviewed Jul 29, 2022

[Feat] Improve the Tika implementation

9627797

Move Tika to a separate util class to avoid repeatedly initializing. Add test to check the correct json content type

jglick approved these changes Aug 1, 2022

jglick enabled auto-merge August 1, 2022 13:56

jglick merged commit 6c461c3 into jenkinsci:master Aug 1, 2022

jglick mentioned this pull request Aug 8, 2022

Bump tika-core from 1.22 to 2.4.1 #309

Closed

jglick mentioned this pull request Jan 20, 2023

JCloudsArtifactManager fails to archive artifacts when their workspace path is not identical to their archive path #349

Merged
