UDF to parse a column with an XML string #322
Comments
If you are willing to write an implementation, this sounds like a good idea (like the from_json and to_json function implementations on the Spark side). It would be good to refer to apache/spark#21838.
@stevenmanton @HyukjinKwon I attempted an initial implementation of this enhancement. I'd love to work with you to fulfill any requirements.
Hey @SpaceRangerWes, thanks for taking a look at this! I'm definitely no Spark expert so I can't comment specifically on the code. We ended up just hard-coding a few XML-to-case-class parsers as UDFs. Using case classes has the nice property that Spark can automatically infer the schema.
Yes, this is a duplicate of a few other issues, like #334.
I love this package, but I have often run into a scenario where I have a DataFrame with several columns, one of which contains an XML string that I would like to parse. Since this package only works with files, in order to parse the XML column we have to select the XML column, save it to disk, then read it using this library.
I'd love a UDF that I could call that would parse the column in place. For example, a new function parseXML that parses the XML string and returns a struct that you could reference in the normal way. Maybe something along the lines of the following.
I'm happy to try to implement this, but I'm hoping the core devs can provide some early feedback. Is this doable? Worthwhile? Any suggestions on the right approach?
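To make the request concrete, here is a minimal Python sketch of the per-row parsing step such a UDF would wrap. It is not this package's API: the name parse_xml, the flat dict output, and the sample document are all assumptions made for illustration, and it uses only the standard-library xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def parse_xml(xml_string: Optional[str]) -> Optional[dict]:
    """Hypothetical helper: parse one XML string into a flat dict.

    Root attributes and the text of each direct child element become
    entries. Nested elements and repeated tags are not handled here.
    """
    if xml_string is None:
        return None
    try:
        root = ET.fromstring(xml_string)
    except ET.ParseError:
        # Malformed input yields None instead of failing the whole job.
        return None
    record = dict(root.attrib)
    for child in root:
        record[child.tag] = child.text
    return record


# Made-up sample document, standing in for one cell of the XML column.
row = parse_xml('<book id="7"><title>Dune</title><year>1965</year></book>')
print(row)
```

In PySpark, a function like this could be wrapped with pyspark.sql.functions.udf and an explicit return type such as MapType(StringType(), StringType()). Unlike the case-class approach mentioned in the comments, the schema would then have to be declared by hand rather than inferred.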