
UDF to parse a column with an XML string #322

Closed
stevenmanton opened this issue Aug 6, 2018 · 4 comments

Comments

@stevenmanton

I love this package, but I have often run into a scenario where I have a DataFrame with several columns, one of which contains an XML string that I would like to parse. Since this package only works with files, in order to parse the XML column we have to select the XML column, save it to disk, then read it using this library.

I'd love a UDF that I could call that would parse the column in place. For example, a new function parseXML that parses the XML string and returns a struct that you could reference in the normal way. Maybe something along the lines of the following.

(
    df
    .withColumn("parsed_XML", parseXML('xml_column'))
    .withColumn("field1", col("parsed_XML.field1"))
    .withColumn("array0", col("parsed_XML.array").getItem(0))
)

I'm happy to try to implement this, but I'm hoping the core devs can provide some early feedback. Is this doable? Worthwhile? Any suggestions on the right approach?
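To make the idea concrete, here is a minimal sketch of what the core of the hypothetical `parseXML` function might do, using only Python's standard-library `xml.etree.ElementTree` (the field names mirror the example above; none of this is an existing spark-xml API):

```python
import xml.etree.ElementTree as ET

def parse_xml(xml_string):
    """Hypothetical core of the proposed parseXML: return a dict with
    a scalar field and a repeated element collected into a list."""
    root = ET.fromstring(xml_string)
    return {
        "field1": root.findtext("field1"),
        "array": [e.text for e in root.findall("array")],
    }

doc = "<row><field1>a</field1><array>x</array><array>y</array></row>"
parse_xml(doc)  # {'field1': 'a', 'array': ['x', 'y']}
```

A real implementation would of course reuse spark-xml's existing schema inference and row conversion rather than a hand-rolled parser; this only illustrates the string-to-struct shape of the proposal.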

@HyukjinKwon
Member

If you are willing to work on an implementation, it sounds like a good idea (similar to the from_json and to_json function implementations on the Spark side).

It should be helpful to refer to apache/spark#21838.

@SpaceRangerWes

@stevenmanton @HyukjinKwon I've attempted an initial implementation of this enhancement. I'd love to work with you to fulfill any requirements.

@stevenmanton
Author

Hey @SpaceRangerWes, thanks for taking a look at this! I'm definitely no Spark expert so I can't comment specifically on the code. We ended up just hard-coding a few XML-to-case-class parsers as UDFs. Using case classes has the nice property that Spark can automatically infer the schema.
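The case-class workaround described above is Scala-specific, but the same pattern can be sketched in PySpark with a plain parsing function plus an explicit schema (the `Book` record, its fields, and the sample tags here are all hypothetical):

```python
import xml.etree.ElementTree as ET

# In Scala, a case class like
#   case class Book(title: String, author: String)
# lets Spark infer the struct schema automatically. The Python analogue
# is a plain function returning a tuple, paired with an explicit schema.
def parse_book(xml_string):
    """Parse one XML document into a (title, author) tuple."""
    root = ET.fromstring(xml_string)
    return (root.findtext("title"), root.findtext("author"))

# Registering it as a UDF would look roughly like this (sketch only;
# Spark is not imported here):
#   from pyspark.sql.functions import udf, col
#   from pyspark.sql.types import StructType, StructField, StringType
#   schema = StructType([StructField("title", StringType()),
#                        StructField("author", StringType())])
#   parse_book_udf = udf(parse_book, schema)
#   df = df.withColumn("parsed", parse_book_udf(col("xml_column")))
```

The trade-off versus the proposed `parseXML` is that the schema must be hard-coded per document shape, which is exactly what a built-in function with schema inference would avoid.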

@srowen
Collaborator

srowen commented Dec 22, 2018

Yes, this is a duplicate of a few other issues, like #334.
