Quiet Spark Logging #5

Open
alope107 opened this issue Jul 20, 2015 · 4 comments

@alope107
Contributor

By default Spark produces a huge amount of logging, which clutters the terminal and confuses new users. Findspark should cut down on this logging. @freeman-lab recommended using the following to change the logging level at runtime:

# `sc` is an existing SparkContext; reach through the Py4J gateway to the JVM's log4j
log4j = sc._jvm.org.apache.log4j
log4j.LogManager.getRootLogger().setLevel(log4j.Level.ERROR)

This could be implemented in Findspark by monkey-patching the SparkContext like so:

import pyspark

# Wrap the original constructor so every new SparkContext quiets its root logger
old_init = pyspark.SparkContext.__init__

def new_init(self, *args, **kwargs):
    old_init(self, *args, **kwargs)
    # Once the JVM is up, drop the root log4j level to ERROR
    log4j = self._jvm.org.apache.log4j
    log4j.LogManager.getRootLogger().setLevel(log4j.Level.ERROR)

pyspark.SparkContext.__init__ = new_init
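
For illustration, with a patch like that applied before the context is created, usage would look roughly like this (a sketch; the "local" master and app name are placeholders):

import findspark
findspark.init()  # make pyspark importable
# ...apply the monkey-patch shown above...
from pyspark import SparkContext
sc = SparkContext("local", "example")  # root log4j level drops to ERROR right after __init__ returns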

This monkey-patching approach feels fragile to me, however. We could instead modify the logger properties file at $SPARK_HOME/conf/log4j.properties, but that changes the logging for all uses of Spark and may be too heavyweight a solution.
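
For reference, the properties change would look something like this (a sketch; Spark ships a conf/log4j.properties.template, and the exact contents can vary by version):

# $SPARK_HOME/conf/log4j.properties (copied from log4j.properties.template)
# Lower the root logger from INFO to ERROR to quiet console output
log4j.rootCategory=ERROR, console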

@minrk
Owner

minrk commented Jul 20, 2015

I wouldn't monkeypatch SparkContext. We should be talking to the pyspark folks about how to expose options like the log level. Presumably you should be able to do

SparkContext(..., log_level=ERROR)

I don't think we need to deviate too much from PySpark's default behavior here, but if the defaults don't make sense, we should bring it up with PySpark rather than forcibly overriding it.

@alope107
Contributor Author

That makes sense. Looking into it further, there is a setLogLevel method on the SparkContext, but I don't see any way to set the level in the constructor. That would be a nice-to-have that we can bring up with the PySpark developers.
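
For the record, the runtime call looks something like this (a sketch assuming Spark 1.4+, where setLogLevel is available; the "local" master and app name are just placeholders):

import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext("local", "example")
sc.setLogLevel("ERROR")  # only affects messages logged after this call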

It also appears that Spark checks whether it is running in a Scala REPL and adjusts the logging accordingly. The ideal fix would be to make Spark recognize that it is in a REPL in the Python or IPython shell as well. I'll look into what that would entail.

@freeman-lab
Contributor

Agree with @minrk that the monkey-patching, while clever, isn't really optimal. I like the idea of having it as an option in the constructor. It looks like it's not in the Scala version either (see here) and generally they like to aim for parity, so the nicest patch might be adding it as an argument to both versions.

I'd definitely suggest opening a JIRA ticket over on https://issues.apache.org/jira/browse/spark/ about adding it (and explaining the use case) and seeing what the other Spark devs think. If they're on board, we can put a patch together!

@alope107
Contributor Author

I've created two issues on Spark's JIRA:
https://issues.apache.org/jira/browse/SPARK-9227
https://issues.apache.org/jira/browse/SPARK-9226

One is to add an option to the SparkContext constructor for setting the logging level.

The other is to have Spark use a different logging properties file when it detects that it is in the Python REPL. This already happens for the Scala REPL, so it would just bring parity.
