[SPARK-48238][BUILD][YARN] Replace YARN AmIpFilter with a forked implementation #46611
Conversation
+CC @tgravescs
The patch makes sense to me, and supposedly the workaround will be removed when the Yarn side upgrades their J2EE?
I suppose not. According to the discussion in #31642
import org.apache.spark.internal.Logging
// This class is inspired by org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
Can you clarify this: is it just a copy converted to Scala, or did you add/change functionality?
If the original code is Java, I'd prefer we also add a Java file in Spark, so that it's easier to keep the code in sync.
This is a functionally equivalent rewrite. It simplifies the code by converting it to Scala and merging code from several Hadoop Java classes.
I considered porting the code as-is (also keeping the package and class names), as we did in the thriftserver module for hive-service, but that would cause class conflicts because hadoop-client-minicluster (present in test scope) ships those classes too. Meanwhile, I found an existing YarnProxyRedirectFilter, which is much simpler and not coupled to Hadoop classes. So I renamed the class to YarnAMIpFilter, rewrote it in Scala, and put it alongside YarnProxyRedirectFilter. The UT YarnAMIpFilterSuite is ported too, to verify the correctness of the rewrite. A rough sketch of the shape of such a filter follows.
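A minimal sketch, assuming simplified init-parameter names and jakarta.servlet APIs; this is an illustration of the technique, not the actual YarnAMIpFilter code (the real filter resolves proxy host IPs and handles more cases):

import jakarta.servlet.{Filter, FilterChain, FilterConfig, ServletRequest, ServletResponse}
import jakarta.servlet.http.{HttpServletRequest, HttpServletResponse}

// Hypothetical, simplified outline: parameter names and matching logic
// are assumptions for illustration only.
class YarnAMIpFilterSketch extends Filter {
  private var proxyHosts: Set[String] = Set.empty
  private var proxyUriBase: String = _

  override def init(conf: FilterConfig): Unit = {
    proxyHosts = conf.getInitParameter("PROXY_HOSTS").split(",").toSet
    proxyUriBase = conf.getInitParameter("PROXY_URI_BASE")
  }

  override def doFilter(
      req: ServletRequest, resp: ServletResponse, chain: FilterChain): Unit = {
    val httpReq = req.asInstanceOf[HttpServletRequest]
    val httpResp = resp.asInstanceOf[HttpServletResponse]
    if (proxyHosts.contains(httpReq.getRemoteAddr)) {
      // Request came through the RM proxy: let it pass.
      chain.doFilter(req, resp)
    } else {
      // Direct access: bounce the client to the RM proxy.
      httpResp.sendRedirect(proxyUriBase + httpReq.getRequestURI)
    }
  }
}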
@cloud-fan if we pursue syncing with Hadoop upstream, I can convert it back to Java classes and reuse the Hadoop code as much as possible. For example, use org.apache.hadoop.yarn.webapp.hamlet2.Hamlet to construct the HTML page in sendRedirect instead of writing it by hand as we did in YarnProxyRedirectFilter (the hand-written variant is sketched below).
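For reference, hand-writing the redirect page looks roughly like this — a sketch in the spirit of YarnProxyRedirectFilter, not its actual code:

import jakarta.servlet.http.HttpServletResponse

object RedirectSketch {
  // Hypothetical helper: emit a 302 plus a small fallback HTML page by hand.
  def sendRedirect(target: String, resp: HttpServletResponse): Unit = {
    resp.setStatus(HttpServletResponse.SC_FOUND)
    resp.setHeader("Location", target)
    resp.setContentType("text/html")
    resp.getWriter.write(
      s"""<html>
         |<head><title>Redirecting...</title></head>
         |<body>You are being redirected to <a href="$target">$target</a>.</body>
         |</html>""".stripMargin)
  }
}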
This looks like a good idea to get rid of the conflicting dependency. We should clarify #46611 (comment) though.
@cloud-fan I updated the code to preserve the original code as much as possible, and also left comments to clarify the modifications.
@@ -696,7 +696,7 @@ private[spark] class ApplicationMaster(

   /** Add the Yarn IP filter that is required for properly securing the UI. */
   private def addAmIpFilter(driver: Option[RpcEndpointRef], proxyBase: String) = {
-    val amFilter = "org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter"
+    val amFilter = "org.apache.spark.deploy.yarn.AmIpFilter"
nit: we can use classOf to avoid hardcoding the class name here.
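For example (assuming the filter class is visible on the compile classpath of this module):

val amFilter = classOf[org.apache.spark.deploy.yarn.AmIpFilter].getName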
updated
@@ -86,7 +86,7 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
   }

   // Add Yarn proxy filter specific configurations to the recovered SparkConf
-  val filter = "org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter"
+  val filter = "org.apache.spark.deploy.yarn.AmIpFilter"
ditto
The class defined in the yarn module is not visible in the streaming module.
Yes, we can't use classOf[AmIpFilter].getName here due to dependency issues.
  proxyUriBases.put("dummy", conf.getInitParameter(PROXY_URI_BASE));
} else {
  proxyHosts = conf.getInitParameter(PROXY_HOSTS)
      .split(PROXY_HOSTS_DELIMITER);
nit: Indentation, please fix it globally.
fixed
import org.apache.spark.internal.SparkLogger;
import org.apache.spark.internal.SparkLoggerFactory;

// This class is copied from Hadoop 3.4.0
So, do we need to re-copy these files every time we upgrade Hadoop if there are changes to them?
No. We don't need to always sync it with the Hadoop version.
+1, LGTM
Merged into master for Spark 4.0. Thanks @pan3793 and @cloud-fan
Thank you, @pan3793 and all!
What changes were proposed in this pull request?
This PR replaces AmIpFilter with a forked implementation, and removes the hadoop-yarn-server-web-proxy dependency.
Why are the changes needed?
SPARK-47118 upgraded the Spark built-in Jetty from 10 to 11 and migrated from javax.servlet to jakarta.servlet, which breaks Spark on YARN. During the investigation, I found a comment here: #31642 (comment)

This should be a simple and clean way to address the exact issue; then we don't need to wait for Hadoop's jakarta.servlet migration, and it also strips a Hadoop dependency.

Does this PR introduce any user-facing change?
No, this recovers the bootstrap of Spark applications in YARN mode, keeping the same behavior as Spark 3.5 and earlier versions.
How was this patch tested?
UTs are added (refer to org.apache.hadoop.yarn.server.webproxy.amfilter.TestAmFilter). I tested it in a YARN cluster.
Spark started successfully. When accessing http://hadoop-master1.orb.local:4040, it redirects to http://hadoop-master1.orb.local:8088/proxy/redirect/application_1716005503866_0001/ and the UI looks correct. A quick check of the redirect is sketched below.
Was this patch authored or co-authored using generative AI tooling?
No