Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-50082][CORE][ML] Upgrade pmml-model to 1.7.1 and migrate jaxb-api to jakarta.xml.bind-api, activation to jakarta.activation-api to remove jersey-related warning logs #48611

Closed
wants to merge 2 commits into from

Conversation

wayneguow
Copy link
Contributor

@wayneguow wayneguow commented Oct 23, 2024

What changes were proposed in this pull request?

This PR aims to:

  1. Upgrade pmml-model to 1.7.1;
  2. Migrate javax.activation:jaxb-api to jakarta.xml.bind:jakarta.xml.bind-api, javax.activation:activation to jakarta.activation:jakarta.activation-api to remove some jersey-related warning logs when ApiRootResource or PrometheusResource api was called.
    When we start spark-shell with the latest master code, open the Spark UI, click the executor tab, you can see relevant warning logs appearing in the spark-shell.
    image
    image

Why are the changes needed?

  1. Upgrade pmml-model to the latest version;
  2. Migrate javax.* to jakarta.*;
  3. Eliminate useless warning logs;

Does this PR introduce any user-facing change?

No.

How was this patch tested?

  1. Passed GA.
  2. Manual check. After this pr, the warning logs are removed.

The configuration of metrics.properties is as follows:

*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus

./bin/spark-shell --conf spark.ui.prometheus.enabled=true --conf spark.metrics.conf=conf/metrics.properties
curl http://192.168.124.7:4040/api/v1/version
curl http://192.168.124.7:4040/metrics/executors/prometheus

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the CORE label Oct 23, 2024
@LuciferYang
Copy link
Contributor

LuciferYang commented Oct 23, 2024

cc @pan3793 and @dongjoon-hyun FYI

@LuciferYang
Copy link
Contributor

also cc @panbingkun

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much, @wayneguow . I also observed the issue.

Since this is a log-related thing, it seems a little hard to validate.

Could you revise your PR description about how to validate the contributions on the followings from their logs?

  • All Spark shell environments (spark-shell, spark-sql, pyspark, sparkR) are fixed?
  • Spark Deamons (Spark Master/Worker/HistoryServer) are fixed?
  • Spark Applications on Spark Standalone Cluster and K9s Cluster (at least)?

@LuciferYang
Copy link
Contributor

Are there any negative impacts of disabling these features? Is it possible to retain these features through upgrading dependencies or code changes?

@dongjoon-hyun
Copy link
Member

Gentle ping, @wayneguow . Could you answer the above questions?

@wayneguow
Copy link
Contributor Author

Gentle ping, @wayneguow . Could you answer the above questions?

@dongjoon-hyun Sorry, I forgot about this, I will confirm the details and give the final solution and reason in the next few days.

@dongjoon-hyun
Copy link
Member

Thank you so much!

@wayneguow
Copy link
Contributor Author

wayneguow commented Jan 19, 2025

After doing further research, I found out why there are these two warning logs in Spark 4.0 version, but not in Spark 3.x and earlier versions:

  1. When Spark 4.0 upgraded jetty 10 to 11, it also upgraded jesery 2.41 to 3.0.12, related PR: [SPARK-47118][BUILD][CORE][SQL][UI] Migrate from Jetty 10 to Jetty 11 #45154

  2. Relevant changes in jesery 3.x version(eclipse-ee4j/jersey@c522d34):

  • MessagingBinders.java:
    DATASOURCE("jakarta.activation.DataSource") has been added to EnabledProvidersBinder.
  • WadlFeature.java:
    Added some check logic and warning logs.

If the following two dependencies are in the class path, there will be no corresponding warning logs, but we excluded it in this PR: #25481
jakarta.activation:jakarta.activation-api
jakarta.xml.bind:jakarta.xml.bind-api

@wayneguow
Copy link
Contributor Author

wayneguow commented Jan 19, 2025

Are there any negative impacts of disabling these features? Is it possible to retain these features through upgrading dependencies or code changes?

For the first question:

For the second question:
Upgrading to the current jersey version does not work, related code and logic still exists in the latest version. The logic of printing relevant warning logs is to find relevant classes in the class path. What we can do is to put the corresponding jars in the class path, but there seems to be a greater risk in implementing this, that is the conflict between javax and jakarta.

@dongjoon-hyun
Copy link
Member

Thank you for sharing that, @wayneguow .

@dongjoon-hyun
Copy link
Member

Given that, this PR is the best and safe way to handle these, right, @wayneguow ?

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

@wayneguow
Copy link
Contributor Author

Given that, this PR is the best and safe way to handle these, right, @wayneguow ?

@dongjoon-hyun In my opinion, it's like this. We can also wait for @LuciferYang 's opinion.

@dongjoon-hyun
Copy link
Member

Ack! Sure, let's wait for @LuciferYang 's opinion.

@wayneguow
Copy link
Contributor Author

Thank you so much, @wayneguow . I also observed the issue.

Since this is a log-related thing, it seems a little hard to validate.

Could you revise your PR description about how to validate the contributions on the followings from their logs?

  • All Spark shell environments (spark-shell, spark-sql, pyspark, sparkR) are fixed?
  • Spark Deamons (Spark Master/Worker/HistoryServer) are fixed?
  • Spark Applications on Spark Standalone Cluster and K9s Cluster (at least)?

Other than that, I'm checking the items listed here one by one and will update the PR description if it's all finished.

@LuciferYang
Copy link
Contributor

LuciferYang commented Jan 20, 2025

After doing further research, I found out why there are these two warning logs in Spark 4.0 version, but not in Spark 3.x and earlier versions:

  1. When Spark 4.0 upgraded jetty 10 to 11, it also upgraded jesery 2.41 to 3.0.12, related PR: [SPARK-47118][BUILD][CORE][SQL][UI] Migrate from Jetty 10 to Jetty 11 #45154
  2. Relevant changes in jesery 3.x version(eclipse-ee4j/jersey@c522d34):
  • MessagingBinders.java:
    DATASOURCE("jakarta.activation.DataSource") has been added to EnabledProvidersBinder.
  • WadlFeature.java:
    Added some check logic and warning logs.

If the following two dependencies are in the class path, there will be no corresponding warning logs, but we excluded it in this PR: #25481 jakarta.activation:jakarta.activation-api jakarta.xml.bind:jakarta.xml.bind-api

We upgraded from 2.41 to 3.0.x. Actually, there was a similar registration in 2.41 as well, with just a slight difference in the namespace between javax and jakarta

https://github.com/eclipse-ee4j/jersey/blob/3eac07f0b10ada9ba4828785ad2cc0ff5cf19f9f/core-common/src/main/java/org/glassfish/jersey/message/internal/MessagingBinders.java#L172

image

Why didn't we need to make settings similar to the current pr before? I want to know if we revert the changes made by #25481, will the issue described in #25481 still exist? And could it also resolve the current problem?

@pan3793
Copy link
Member

pan3793 commented Jan 21, 2025

If the following two dependencies are in the class path, there will be no corresponding warning logs, but we excluded it in this PR: #25481

  • jakarta.activation:jakarta.activation-api
  • jakarta.xml.bind:jakarta.xml.bind-api

A little bit of further investigation:

Spark already pulls jakarta.xml.bind-api into the runtime classpath, in mllib module via jpmml-model 1.4

jakarta.xml.bind-api/2.3.2//jakarta.xml.bind-api-2.3.2.jar

while this version actually contains javax.xml.bind. classes.

jpmml migrated to jakarta.xml.bind. since 1.5, see jpmml/jpmml-model#41.

I think Spark eventually needs to follow the Jakarta EE specification to migrate to jakarta namespace, I did a quick search on Spark's codebase, seems it has no remaining direct reference for javax. that needs to migrate, but 3rd libs like jpmml might have the transitive reference, it might be a good time to evaluate upgrading jpmml to 1.5 or a newer version before Spark 4.0

find . -type f \( -name "*.scala" -o -name "*.java" \) -exec grep -h "^import javax\." {} + | awk '{ sub(/;$/, ""); print }' | sort | uniq

Back to the issue specific to this PR, I think either the current approach or @LuciferYang's suggestion has no real runtime differences because Spark only uses a few set features of Jersey(which integrates with a lot of Jakarta EE APIs), but I lean toward @LuciferYang's suggestion because it simplifies both pom.xml and code -we don't need to carefully analysis the runtime code path and explicitly exclude/disable everything that we don't need, as long as it does not have side-effects.

@LuciferYang
Copy link
Contributor

LuciferYang commented Jan 22, 2025

If the following two dependencies are in the class path, there will be no corresponding warning logs, but we excluded it in this PR: #25481

  • jakarta.activation:jakarta.activation-api
  • jakarta.xml.bind:jakarta.xml.bind-api

A little bit of further investigation:

Spark already pulls jakarta.xml.bind-api into the runtime classpath, in mllib module via jpmml-model 1.4

jakarta.xml.bind-api/2.3.2//jakarta.xml.bind-api-2.3.2.jar

while this version actually contains javax.xml.bind. classes.

jpmml migrated to jakarta.xml.bind. since 1.5, see jpmml/jpmml-model#41.

I think Spark eventually needs to follow the Jakarta EE specification to migrate to jakarta namespace, I did a quick search on Spark's codebase, seems it has no remaining direct reference for javax. that needs to migrate, but 3rd libs like jpmml might have the transitive reference, it might be a good time to evaluate upgrading jpmml to 1.5 or a newer version before Spark 4.0

find . -type f \( -name "*.scala" -o -name "*.java" \) -exec grep -h "^import javax\." {} + | awk '{ sub(/;$/, ""); print }' | sort | uniq

Back to the issue specific to this PR, I think either the current approach or @LuciferYang's suggestion has no real runtime differences because Spark only uses a few set features of Jersey(which integrates with a lot of Jakarta EE APIs), but I lean toward @LuciferYang's suggestion because it simplifies both pom.xml and code -we don't need to carefully analysis the runtime code path and explicitly exclude/disable everything that we don't need, as long as it does not have side-effects.

If time permits, I suggest we try to migrate them to jakarta before Spark 4.0 release

@wayneguow wayneguow changed the title [SPARK-50082][CORE] Remove some unnecessary Jersey-related warning logs [SPARK-50082][CORE][ML] Upgrade pmml-model to 1.7.1 and migrate to jaxb-api to jakarta.xml.bind-api, activation to jakarta.activation-api to avoid unnecessary jersey-related warning logs Feb 1, 2025
@wayneguow wayneguow changed the title [SPARK-50082][CORE][ML] Upgrade pmml-model to 1.7.1 and migrate to jaxb-api to jakarta.xml.bind-api, activation to jakarta.activation-api to avoid unnecessary jersey-related warning logs [SPARK-50082][CORE][ML] Upgrade pmml-model to 1.7.1 and migrate jaxb-api to jakarta.xml.bind-api, activation to jakarta.activation-api to avoid unnecessary jersey-related warning logs Feb 1, 2025
@wayneguow wayneguow changed the title [SPARK-50082][CORE][ML] Upgrade pmml-model to 1.7.1 and migrate jaxb-api to jakarta.xml.bind-api, activation to jakarta.activation-api to avoid unnecessary jersey-related warning logs [SPARK-50082][CORE][ML] Upgrade pmml-model to 1.7.1 and migrate jaxb-api to jakarta.xml.bind-api, activation to jakarta.activation-api to remove jersey-related warning logs Feb 1, 2025
@wayneguow
Copy link
Contributor Author

@dongjoon-hyun @LuciferYang @pan3793 Late on updating the PR. If you all get a chance, I'd appreciate another look!

@wayneguow
Copy link
Contributor Author

Also cc @zhengruifeng for the ML part.

@zhengruifeng
Copy link
Contributor

zhengruifeng commented Feb 6, 2025

@LuciferYang and @wayneguow

It seems we need to change the code for pmml-model 1.4.8 -> 1.7.1 upgrade, so I am a bit worry about the compatibility, I suggest:
1, splitting this PR, can we make a separate PR for pmml-model upgrade first?
2, adding additional tests to make sure pmml-model 1.7.1 can successfully load previous models. We can save 1~2 pmml models into mllib/src/test/resources/ml-models with Spark 3.5, and then load them in the tests. you can refer to 6b7527e

also cc @WeichenXu123

@wayneguow
Copy link
Contributor Author

@LuciferYang and @wayneguow

It seems we need to change the code for pmml-model 1.4.8 -> 1.7.1 upgrade, so I am a bit worry about the compatibility, I suggest: 1, splitting this PR, can we make a separate PR for pmml-model upgrade first? 2, adding additional tests to make sure pmml-model 1.7.1 can successfully load previous models. We can save 1~2 pmml models into mllib/src/test/resources/ml-models with Spark 3.5, and then load them in the tests. you can refer to 6b7527e

also cc @WeichenXu123

Okay, I agree with your suggestion, let me first ensure that pmml-model can be upgraded without loss of compatibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants