[SPARK-19223][SQL][PYSPARK] Fix InputFileBlockHolder for datasources which are based on HadoopRDD or NewHadoopRDD

## What changes were proposed in this pull request?

For some datasources that are based on `HadoopRDD` or `NewHadoopRDD`, such as spark-xml, `InputFileBlockHolder` doesn't work with Python UDFs.

To reproduce it, run the following code with `bin/pyspark --packages com.databricks:spark-xml_2.11:0.4.1`:

    from pyspark.sql.functions import udf, input_file_name
    from pyspark.sql.types import StringType
    from pyspark.sql import SparkSession

    def filename(path):
        return path

    session = SparkSession.builder.appName('APP').getOrCreate()

    session.udf.register('sameText', filename)
    sameText = udf(filename, StringType())

    df = session.read.format('xml').load('a.xml', rowTag='root').select('*', input_file_name().alias('file'))
    df.select('file').show() # works
    df.select(sameText(df['file'])).show()   # returns empty content

The issue is that `HadoopRDD` and `NewHadoopRDD` set the file block's info in `InputFileBlockHolder` before the returned iterator begins to be consumed. `InputFileBlockHolder` records this info in a thread-local variable. When running a Python UDF in batch, we spawn a separate thread to consume the iterator from the child plan's output RDD, so the info can't be read back from that other thread.
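As a minimal, self-contained illustration of the JVM behavior involved (demo code, not part of Spark): a value stored in a plain `ThreadLocal` by one thread is invisible to a newly created consumer thread, whereas an `InheritableThreadLocal` snapshots the parent's current value when the child thread is created.

    object ThreadLocalDemo {
      // Plain ThreadLocal: a fresh thread only ever sees initialValue.
      private val plain = new ThreadLocal[String] {
        override protected def initialValue(): String = "unknown"
      }
      // InheritableThreadLocal: a thread created after set() inherits the parent's value.
      private val inheritable = new InheritableThreadLocal[String] {
        override protected def initialValue(): String = "unknown"
      }

      def main(args: Array[String]): Unit = {
        // Simulate the task thread recording the file info.
        plain.set("a.xml")
        inheritable.set("a.xml")

        // Simulate the thread that consumes the iterator for the Python UDF.
        val consumer = new Thread(new Runnable {
          override def run(): Unit = {
            println("plain:       " + plain.get)        // prints "unknown"
            println("inheritable: " + inheritable.get)  // prints "a.xml"
          }
        })
        consumer.start()
        consumer.join()
      }
    }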

To fix this, we have to set the info in `InputFileBlockHolder` after the iterator begins to be consumed, so that the info can be read from the correct thread.
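One way to realize this lazy-set idea, sketched with hypothetical names (the diff below instead switches `InputFileBlockHolder` itself to an `InheritableThreadLocal`): wrap the record iterator so the holder update runs inside `hasNext`/`next`, i.e. on whichever thread actually consumes the records. The `InputFileBlockHolder.set(filePath, startOffset, length)` call mirrors the Spark-internal API of this era; treat the exact signature as an assumption.

    // Sketch only: `recordsWithFileInfo` is a hypothetical helper, and
    // InputFileBlockHolder is Spark-internal (org.apache.spark.rdd).
    def recordsWithFileInfo[T](path: String, offset: Long, length: Long,
                               records: Iterator[T]): Iterator[T] = new Iterator[T] {
      private var infoSet = false
      private def ensureInfoSet(): Unit = if (!infoSet) {
        // Runs lazily on first consumption, i.e. in the consuming thread,
        // rather than when the iterator is constructed on the task thread.
        InputFileBlockHolder.set(path, offset, length)
        infoSet = true
      }
      override def hasNext: Boolean = { ensureInfoSet(); records.hasNext }
      override def next(): T = { ensureInfoSet(); records.next() }
    }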

## How was this patch tested?

Manually tested with the example code above against the spark-xml package on pyspark: `bin/pyspark --packages com.databricks:spark-xml_2.11:0.4.1`.

Added a pyspark test.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <[email protected]>

Closes apache#16585 from viirya/fix-inputfileblock-hadooprdd.
viirya authored and cmonkey committed Feb 15, 2017
1 parent 41a7895 commit f098e84
Showing 2 changed files with 28 additions and 3 deletions.
7 changes: 4 additions & 3 deletions core/src/main/scala/org/apache/spark/rdd/InputFileBlockHolder.scala

    @@ -41,9 +41,10 @@ private[spark] object InputFileBlockHolder {
        * The thread variable for the name of the current file being read. This is used by
        * the InputFileName function in Spark SQL.
        */
    -  private[this] val inputBlock: ThreadLocal[FileBlock] = new ThreadLocal[FileBlock] {
    -    override protected def initialValue(): FileBlock = new FileBlock
    -  }
    +  private[this] val inputBlock: InheritableThreadLocal[FileBlock] =
    +    new InheritableThreadLocal[FileBlock] {
    +      override protected def initialValue(): FileBlock = new FileBlock
    +    }
     
       /**
        * Returns the holding file name or empty string if it is unknown.
24 changes: 24 additions & 0 deletions python/pyspark/sql/tests.py
    @@ -435,6 +435,30 @@ def test_udf_with_input_file_name(self):
             row = self.spark.read.json(filePath).select(sourceFile(input_file_name())).first()
             self.assertTrue(row[0].find("people1.json") != -1)
     
    +    def test_udf_with_input_file_name_for_hadooprdd(self):
    +        from pyspark.sql.functions import udf, input_file_name
    +        from pyspark.sql.types import StringType
    +
    +        def filename(path):
    +            return path
    +
    +        sameText = udf(filename, StringType())
    +
    +        rdd = self.sc.textFile('python/test_support/sql/people.json')
    +        df = self.spark.read.json(rdd).select(input_file_name().alias('file'))
    +        row = df.select(sameText(df['file'])).first()
    +        self.assertTrue(row[0].find("people.json") != -1)
    +
    +        rdd2 = self.sc.newAPIHadoopFile(
    +            'python/test_support/sql/people.json',
    +            'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    +            'org.apache.hadoop.io.LongWritable',
    +            'org.apache.hadoop.io.Text')
    +
    +        df2 = self.spark.read.json(rdd2).select(input_file_name().alias('file'))
    +        row2 = df2.select(sameText(df2['file'])).first()
    +        self.assertTrue(row2[0].find("people.json") != -1)
    +
         def test_basic_functions(self):
             rdd = self.sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}'])
             df = self.spark.read.json(rdd)
