Implementation of from_xml function #334
Conversation
Note on the build failure: this is occurring for Scala 2.10.5 / JDK openjdk7 on a preexisting dependency. I don't know of any of my code that would use or break this dependency, so any dependency tree management is out of scope for this task.
Hi, I have a similar requirement, where I need to do analysis on a column containing an XML string. It would be great if you could provide an example of how to use this.
@hina1234 I added a unit test that shows an example of using what I have written. Looking at it now, the test is poorly named, but it should be what you are looking for: https://github.com/SpaceRangerWes/spark-xml/blob/master/src/test/scala/com/databricks/spark/xml/XmlSuite.scala#L919-L941
Sounds good to have it. Can you rebase and make the tests pass?
Codecov Report

```diff
@@            Coverage Diff             @@
##           master     #334      +/-  ##
==========================================
- Coverage   87.51%   86.54%    -0.98%
==========================================
  Files          14       15        +1
  Lines         737      773       +36
  Branches       96       86       -10
==========================================
+ Hits          645      669       +24
- Misses         92      104       +12
```

Continue to review the full report at Codecov.
Codecov Report

```diff
@@            Coverage Diff             @@
##           master     #334      +/-  ##
==========================================
- Coverage   87.58%   87.46%    -0.12%
==========================================
  Files          14       15        +1
  Lines         741      766       +25
  Branches       91       85        -6
==========================================
+ Hits          649      670       +21
- Misses         92       96        +4
```

Continue to review the full report at Codecov.
@HyukjinKwon finally rebased. Sorry that it took so long. The holidays were the only time I had available.
Resolved review comment on src/main/scala/com/databricks/spark/xml/XmlDataToCatalyst.scala (outdated).
```scala
case class XmlDataToCatalyst(child: Expression,
                             schema: DataType,
                             options: XmlOptions)
```
nit:

```scala
case class XmlDataToCatalyst(
    child: Expression,
    schema: DataType,
    options: XmlOptions)
  extends ...
```

(per the guide I pointed out above)
@HyukjinKwon I'm having a challenging time getting IntelliJ to enforce this. I'm going to try swinging back around to it after I commit the more sizable change requests.
Yeah, it is a bit of a grind to manually fix the style, I admit. This suggestion wouldn't block the PR, but it would be good to follow the style guide (https://github.com/databricks/scala-style-guide), since Databricks repositories comply with it. (BTW, I'm not a Databricks guy, so I would suggest following this guide to be safe.)
```scala
@transient
lazy val rowSchema: StructType = schema match {
  case st: StructType => st
  case ArrayType(st: StructType, _) => st
```
I think the XML implementation does not support an array of structs. This case can be removed.
Would you like me to have a `case _ => throw SomeExceptionClass('string')` statement? Otherwise `rowSchema` must be of type `DataType`.
Ah, I think we can change `DataType` to `StructType`.
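To illustrate the trade-off being discussed, here is a minimal, self-contained sketch. The `DataType`/`StructType`/`ArrayType` definitions below are simplified stand-ins invented for illustration, not Spark's real classes: keeping the parameter as a general `DataType` forces a catch-all case that throws, while tightening the declared type to `StructType` would remove that case entirely.

```scala
object SchemaNarrowing {
  // Simplified stand-ins for Spark SQL's type hierarchy (for illustration only).
  sealed trait DataType
  case class StructType(fieldNames: Seq[String]) extends DataType
  case class ArrayType(elementType: DataType) extends DataType
  case object StringType extends DataType

  // If the schema parameter stays a DataType, a catch-all case is needed to
  // make the match total; declaring the parameter as StructType avoids it.
  def rowSchema(schema: DataType): StructType = schema match {
    case st: StructType => st
    case other =>
      throw new IllegalArgumentException(s"Expected a StructType schema, got: $other")
  }
}
```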
```scala
def parseColumn(xml: String,
                schema: StructType,
                options: XmlOptions): Row = {
  val factory: XMLInputFactory = XMLInputFactory.newInstance()
```
Let's make a private function for:

```scala
val factory: XMLInputFactory = XMLInputFactory.newInstance()
factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, false)
factory.setProperty(XMLInputFactory.IS_COALESCING, true)
val filter = new EventFilter {
  override def accept(event: XMLEvent): Boolean =
    // Ignore comments. This library does not treat comments.
    event.getEventType != XMLStreamConstants.COMMENT
}
```

and deduplicate the logic with `parse` above.

Also, I would make a function to deduplicate

```scala
val eventReader = factory.createXMLEventReader(new StringReader(xml))
val parser = factory.createFilteredReader(eventReader, filter)
val rootEvent =
  StaxXmlParserUtils.skipUntil(parser, XMLStreamConstants.START_ELEMENT)
val rootAttributes =
  rootEvent.asStartElement.getAttributes.asScala.map(_.asInstanceOf[Attribute]).toArray
convertObject(parser, schema, options, rootAttributes)
```

as well.
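The extraction being suggested can be sketched with plain JDK StAX classes alone; the following is a hedged illustration (the `StaxReaderFactory` object and `filteredReader` name are made up for this sketch, not part of the project), showing a single shared factory and comment filter behind one helper:

```scala
import java.io.StringReader
import javax.xml.stream.{EventFilter, XMLEventReader, XMLInputFactory, XMLStreamConstants}
import javax.xml.stream.events.XMLEvent

object StaxReaderFactory {
  // Configured once, shared by all callers.
  private val factory: XMLInputFactory = {
    val f = XMLInputFactory.newInstance()
    f.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, false)
    f.setProperty(XMLInputFactory.IS_COALESCING, true)
    f
  }

  // Drop comment events, since the parser does not handle comments.
  private val commentFilter = new EventFilter {
    override def accept(event: XMLEvent): Boolean =
      event.getEventType != XMLStreamConstants.COMMENT
  }

  // One entry point both parse paths could call instead of repeating setup.
  def filteredReader(xml: String): XMLEventReader =
    factory.createFilteredReader(
      factory.createXMLEventReader(new StringReader(xml)), commentFilter)
}
```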
In `private[xml] object StaxXmlParser extends Serializable`:

```scala
def parseColumn(xml: String,
                schema: StructType,
                options: XmlOptions): Row = {
```
nit:

```scala
def parseColumn(
    xml: String,
    ...): Row = {
```
Resolved review comment on src/main/scala/com/databricks/spark/xml/parsers/StaxXmlParserUtils.scala (outdated).
```scala
  from_xml(df.col("payload"), xmlSchema))

assert(expectedSchema == result.schema)
assert(result.where(col("decoded").isNotNull).count() > 0)
```
I think the actual output should also be asserted and checked for completeness.
Thanks, @SpaceRangerWes, for working on this! I took a quick pass. It should be good to refer to https://github.com/databricks/scala-style-guide for code style.
…ry unit tests, and several quick styling fixes
Resolved review comment on src/main/scala/com/databricks/spark/xml/parsers/StaxXmlParser.scala (outdated).
```scala
def parseColumn(xml: String,
                schema: StructType,
                options: XmlOptions): Row = {
  val factory: XMLInputFactory = XMLInputFactory.newInstance()
```
Currently, for every input:

- an XML string comes in
- an `XMLInputFactory` instance is created
- the XML is parsed to a `Row`

I think this can be avoided by:

```scala
class XmlDataToCatalyst(..) {
  @transient
  lazy val factory: XMLInputFactory = XMLInputFactory.newInstance()

  def nullSafeEval(...) = {
    factory.createXMLEventReader(new StringReader(xml))
  }
}
```

In this way, we will create one factory first, and then reuse it for every string input. (BTW, it's quite critical to fix this.)
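The reuse pattern described here can be sketched without Spark at all. In the sketch below, `XmlColumnParser` and `firstTagName` are hypothetical names invented for illustration; the point is only that the `@transient lazy val` factory is built once (lazily, e.g. after deserialization on an executor) and then reused across every per-row call, instead of being constructed per input string:

```scala
import java.io.StringReader
import javax.xml.stream.XMLInputFactory

// Hypothetical stand-in for the catalyst expression class.
class XmlColumnParser extends Serializable {
  // Built once on first use, shared by all subsequent calls;
  // @transient keeps it out of the serialized closure.
  @transient lazy val factory: XMLInputFactory = XMLInputFactory.newInstance()

  // Toy per-row evaluation: return the root element's local name.
  def firstTagName(xml: String): String = {
    val reader = factory.createXMLEventReader(new StringReader(xml))
    var name = ""
    while (reader.hasNext && name.isEmpty) {
      val event = reader.nextEvent()
      if (event.isStartElement) name = event.asStartElement.getName.getLocalPart
    }
    reader.close()
    name
  }
}
```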
BTW, @SpaceRangerWes, why don't you address #334 (comment), #334 (comment) and #334 (comment) while addressing the other comments? Strictly speaking, some of the code doesn't comply with the style this project follows (https://github.com/databricks/scala-style-guide).
Can I do something to help? I really need this feature.
@mantovani you can open a new PR with these changes and address the comments, maybe?
@srowen @mantovani I really apologize. I just don't have the capacity to continue work on this and have had to abandon it. My team's needs shifted and I could no longer work on it.
This is a continuation of #334 This implements a `from_xml` expression that can turn an XML string column into a parsed structured column. It also opens up `inferSchema` for a `Dataset[String]` as a necessary support function. The rest is really mild refactoring.
Issue #322 asked for a commonly requested feature. The purpose of this PR is to provide a new function `from_xml` that leverages as much of the current code as possible to parse a string-encoded XML column into a complex DataType when provided a schema.

Both fortunately and unfortunately, my team at work is pivoting away from nested XML objects, so I was not allowed more time to provide a larger set of enhancements. I'd gladly make any small changes requested, but I will not have the capacity to provide further enhancements.