Quiet Spark Logging #5

Open
alope107 opened this issue Jul 20, 2015 · 4 comments

@alope107
Contributor

By default Spark produces a huge amount of logging, which clutters the terminal and confuses new users. Findspark should cut down on this logging. @freeman-lab recommended using the following to change the logging level at runtime:

# `sc` is an existing SparkContext; reach through the Py4J gateway to the JVM's log4j
log4j = sc._jvm.org.apache.log4j
log4j.LogManager.getRootLogger().setLevel(log4j.Level.ERROR)

This could be implemented in Findspark by monkey-patching the SparkContext like so:

import pyspark

# Wrap the original constructor so every new SparkContext quiets its root logger
old_init = pyspark.SparkContext.__init__

def new_init(self, *args, **kwargs):
    old_init(self, *args, **kwargs)
    # Once the JVM is up, drop the root log4j level to ERROR
    log4j = self._jvm.org.apache.log4j
    log4j.LogManager.getRootLogger().setLevel(log4j.Level.ERROR)

pyspark.SparkContext.__init__ = new_init
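
For illustration, with a patch like that applied before the context is created, usage would look roughly like this (a sketch; the "local" master and app name are placeholders):

import findspark
findspark.init()  # make pyspark importable
# ...apply the monkey-patch shown above...
from pyspark import SparkContext
sc = SparkContext("local", "example")  # root log4j level drops to ERROR right after __init__ returns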

This monkey-patching approach feels fragile to me, however. We could instead modify the logger properties file at $SPARK_HOME/conf/log4j.properties, but that changes the logging for all uses of Spark and may be too heavyweight a solution.
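
For reference, the properties change would look something like this (a sketch; Spark ships a conf/log4j.properties.template, and the exact contents can vary by version):

# $SPARK_HOME/conf/log4j.properties (copied from log4j.properties.template)
# Lower the root logger from INFO to ERROR to quiet console output
log4j.rootCategory=ERROR, console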

@minrk
Owner

minrk commented Jul 20, 2015

I wouldn't monkeypatch SparkContext. We should be talking to the pyspark folks about how to expose options like the log level. Presumably you should be able to do

SparkContext(..., log_level=ERROR)

I don't think we need to deviate too much from PySpark's default behavior here, but if the defaults don't make sense, we should bring it up with PySpark rather than forcibly overriding it.

@alope107
Contributor Author

That makes sense. Looking into it further, there is a setLogLevel method on the SparkContext, but I don't see any way to set the level in the constructor. That would be a nice-to-have that we can bring up with the PySpark developers.
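
For the record, the runtime call looks something like this (a sketch assuming Spark 1.4+, where setLogLevel is available; the "local" master and app name are just placeholders):

import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext("local", "example")
sc.setLogLevel("ERROR")  # only affects messages logged after this call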

It also appears that Spark checks whether it is running in a Scala REPL and adjusts the logging accordingly. The ideal fix would be to make Spark recognize that it is in a REPL in the Python or IPython shell as well. I'll look into what that would entail.

@freeman-lab
Contributor

Agree with @minrk that the monkey-patching, while clever, isn't really optimal. I like the idea of having it as an option in the constructor. It looks like it's not in the Scala version either (see here) and generally they like to aim for parity, so the nicest patch might be adding it as an argument to both versions.

I'd definitely suggest opening a JIRA ticket over on https://issues.apache.org/jira/browse/spark/ about adding it (and explaining the use case) and seeing what the other Spark devs think. If they're on board, we can put a patch together!

@alope107
Contributor Author

I've created two issues on Spark's JIRA:
https://issues.apache.org/jira/browse/SPARK-9227
https://issues.apache.org/jira/browse/SPARK-9226

One is to add an option to the SparkContext constructor for setting the logging level.

The other is to have Spark use a different logging properties file when it detects that it is in the Python REPL. This already happens for the Scala REPL, so it would just bring parity.
