
UDF to parse a column with an XML string #322

Closed
stevenmanton opened this issue Aug 6, 2018 · 4 comments

Comments

@stevenmanton

I love this package, but I have often run into a scenario where I have a DataFrame with several columns, one of which contains an XML string that I would like to parse. Since this package only works with files, in order to parse the XML column we have to select the XML column, save it to disk, then read it using this library.

I'd love a UDF that I could call that would parse the column in place. For example, a new function parseXML that parses the XML string and returns a struct that you could reference in the normal way. Maybe something along the lines of the following.

(
    df
    .withColumn("parsed_XML", parseXML('xml_column'))
    .withColumn("field1", col("parsed_XML.field1"))
    .withColumn("array0", col("parsed_XML.array").getItem(0))
)

I'm happy to try to implement this, but I'm hoping the core devs can provide some early feedback. Is this doable? Worthwhile? Any suggestions on the right approach?
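To make the idea concrete, here is a minimal sketch of what the core of the hypothetical `parseXML` function might do, using only Python's standard-library `xml.etree.ElementTree` (the field names mirror the example above; none of this is an existing spark-xml API):

```python
import xml.etree.ElementTree as ET

def parse_xml(xml_string):
    """Hypothetical core of the proposed parseXML: return a dict with
    a scalar field and a repeated element collected into a list."""
    root = ET.fromstring(xml_string)
    return {
        "field1": root.findtext("field1"),
        "array": [e.text for e in root.findall("array")],
    }

doc = "<row><field1>a</field1><array>x</array><array>y</array></row>"
parse_xml(doc)  # {'field1': 'a', 'array': ['x', 'y']}
```

A real implementation would of course reuse spark-xml's existing schema inference and row conversion rather than a hand-rolled parser; this only illustrates the string-to-struct shape of the proposal.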

@HyukjinKwon
Member

If you are willing to work on an implementation, it sounds like a good idea (similar to the from_json and to_json function implementations on the Spark side).

It should be helpful to refer to apache/spark#21838.

@SpaceRangerWes

@stevenmanton @HyukjinKwon I've attempted an initial implementation of this enhancement. I'd love to work with you to fulfill any requirements.

@stevenmanton
Author

Hey @SpaceRangerWes, thanks for taking a look at this! I'm definitely no Spark expert so I can't comment specifically on the code. We ended up just hard-coding a few XML-to-case-class parsers as UDFs. Using case classes has the nice property that Spark can automatically infer the schema.
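The case-class workaround described above is Scala-specific, but the same pattern can be sketched in PySpark with a plain parsing function plus an explicit schema (the `Book` record, its fields, and the sample tags here are all hypothetical):

```python
import xml.etree.ElementTree as ET

# In Scala, a case class like
#   case class Book(title: String, author: String)
# lets Spark infer the struct schema automatically. The Python analogue
# is a plain function returning a tuple, paired with an explicit schema.
def parse_book(xml_string):
    """Parse one XML document into a (title, author) tuple."""
    root = ET.fromstring(xml_string)
    return (root.findtext("title"), root.findtext("author"))

# Registering it as a UDF would look roughly like this (sketch only;
# Spark is not imported here):
#   from pyspark.sql.functions import udf, col
#   from pyspark.sql.types import StructType, StructField, StringType
#   schema = StructType([StructField("title", StringType()),
#                        StructField("author", StringType())])
#   parse_book_udf = udf(parse_book, schema)
#   df = df.withColumn("parsed", parse_book_udf(col("xml_column")))
```

The trade-off versus the proposed `parseXML` is that the schema must be hard-coded per document shape, which is exactly what a built-in function with schema inference would avoid.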

@srowen
Collaborator

srowen commented Dec 22, 2018

Yes, this is a duplicate of a few other issues, like #334.
