-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-4103][SQL] Clean up SessionState in HiveContext by using ThreadLocal varaible in Hive #2967
Conversation
…HiveContext is lazy evaluated, and to start the session, SessionState.start() has to be invoked to initialize it. Using ThreadLocal variable natively in hive to track the availability of SessionState and initialize it if required. It avoid multiple thread sharing the same SessionState or single thread invoking SessionState.start with the same SessionState multiple times, which seems to cause unexpected behavior.
This PR may duplicate with Spark-4037 partially. @liancheng Please take a look at the code change |
Can one of the admins verify this patch? |
SessionState.start(ss) | ||
ss.err = new PrintStream(outputBuffer, true, "UTF-8") | ||
ss.out = new PrintStream(outputBuffer, true, "UTF-8") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use lazy eval, we have to relies on SessionState.start(sessionState) to initialize it. It may be invoked multiple times in various locations. Here we relies on checking ThreadLocal variable and initialize it at the first call to avoid this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Several comments here:
HiveContext.hiveconf
should be retrieved from the currentSessionState
ifSessionState.get()
is non-null. That's whyhiveconf
andsessionState
are always initialized together in #2887- IO redirection and Hive properties propagation also need to be performed when
SessionState.get()
returns a non-nullSessionState
started elsewhere. This is required by Spark SQL CLI sinceSparkSQLCLIDriver
creates aCliSessionState
beforeHiveContext
initialization. - I guess you're trying to make
HiveContext
play well with multi-session scenarios by introducinggetSessionState()
here? I had also considered this issue earlier, but unfortunately IMO proper multiple Hive session support may require major refactoring overHiveContext
(e.g. make it absolutely thread safe). Also, Spark SQL Hive support will probably be reimplemented against the up coming foreign data source API. So I'd rather defer this major task later.
(As a globally used thread local object, Hive SessionState
is really annoying...)
Currently the sessionState in HiveContext is lazy evaluated, and to start the session, SessionState.start() has to be invoked to initialize it. Using ThreadLocal variable natively in hive to track the availability of SessionState and initialize it if required. It avoid multiple thread sharing the same SessionState or single thread invoke SessionState.start with the same SessionState multiple times, which seems to cause unexpected behavior, and break the contract defined in hive. Refer below comments form SessionState in hive.
/**
*/
public static SessionState start(SessionState startSs) {
...
}