
Implementation of from_xml function #334

Closed · wants to merge 9 commits

Conversation


@SpaceRangerWes SpaceRangerWes commented Oct 29, 2018

Issue #322 asked for a commonly requested feature. The purpose of this PR is to provide a new function, from_xml, that leverages as much of the current code as possible to parse a string-encoded XML column into a complex DataType when provided a schema.

Both fortunately and unfortunately, my team at work is pivoting away from nested XML objects, so I was not given more time to provide a larger set of enhancements. I'd gladly make any small changes requested, but I will not have the capacity to provide further enhancements.
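For readers asking how to use this: below is a minimal sketch of the intended usage, inferred from the PR's unit test. The signature `from_xml(col, schema)` and the example data are assumptions, not the PR's exact code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import com.databricks.spark.xml.from_xml // function added by this PR

object FromXmlDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("from_xml demo").getOrCreate()
  import spark.implicits._

  // A DataFrame whose "payload" column holds string-encoded XML.
  val df = Seq(
    "<person><name>Ada</name><age>36</age></person>"
  ).toDF("payload")

  // The schema each XML string should be parsed into.
  val xmlSchema = StructType(Seq(
    StructField("name", StringType),
    StructField("age", LongType)))

  // Parse the string column into a struct column named "decoded".
  val result = df.withColumn("decoded", from_xml(df.col("payload"), xmlSchema))
  result.select("decoded.name", "decoded.age").show()
}
```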

@SpaceRangerWes
Author

Note on the build fail:

This occurs for Scala 2.10.5 / JDK openjdk7 on a preexisting dependency. As far as I know, none of my code uses or breaks this dependency, so any dependency-tree management is out of scope for this task.

@hinawatts

Hi,

I have a similar requirement, where I need to do analysis on a column containing an XML string. It would be great if you could provide an example of how to use this.

@SpaceRangerWes
Author

@hina1234 I added a unit test that shows an example of using what I have written. Looking at it now, the test is poorly named, but it should be what you are looking for: https://github.com/SpaceRangerWes/spark-xml/blob/master/src/test/scala/com/databricks/spark/xml/XmlSuite.scala#L919-L941

@HyukjinKwon
Member

Sounds good to have it. Can you rebase and make the tests pass?

@codecov-io

Codecov Report

Merging #334 into master will decrease coverage by 0.97%.
The diff coverage is 66.66%.


@@            Coverage Diff             @@
##           master     #334      +/-   ##
==========================================
- Coverage   87.51%   86.54%   -0.98%     
==========================================
  Files          14       15       +1     
  Lines         737      773      +36     
  Branches       96       86      -10     
==========================================
+ Hits          645      669      +24     
- Misses         92      104      +12
Impacted Files Coverage Δ
...abricks/spark/xml/parsers/StaxXmlParserUtils.scala 81.81% <0%> (-13.93%) ⬇️
.../main/scala/com/databricks/spark/xml/package.scala 28.57% <100%> (+19.48%) ⬆️
...in/scala/com/databricks/spark/xml/XmlOptions.scala 100% <100%> (ø) ⬆️
...m/databricks/spark/xml/parsers/StaxXmlParser.scala 96.87% <100%> (+0.26%) ⬆️
...a/com/databricks/spark/xml/XmlDataToCatalyst.scala 63.63% <63.63%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a0a8b15...ca711d7. Read the comment docs.

@codecov-io

codecov-io commented Jan 1, 2019

Codecov Report

Merging #334 into master will decrease coverage by 0.11%.
The diff coverage is 88.23%.


@@            Coverage Diff             @@
##           master     #334      +/-   ##
==========================================
- Coverage   87.58%   87.46%   -0.12%     
==========================================
  Files          14       15       +1     
  Lines         741      766      +25     
  Branches       91       85       -6     
==========================================
+ Hits          649      670      +21     
- Misses         92       96       +4
Impacted Files Coverage Δ
...abricks/spark/xml/parsers/StaxXmlParserUtils.scala 95.74% <ø> (ø) ⬆️
.../main/scala/com/databricks/spark/xml/package.scala 28.57% <100%> (+19.48%) ⬆️
...m/databricks/spark/xml/parsers/StaxXmlParser.scala 96.82% <100%> (+0.21%) ⬆️
...a/com/databricks/spark/xml/XmlDataToCatalyst.scala 63.63% <63.63%> (ø)


@SpaceRangerWes
Author

@HyukjinKwon finally rebased. Sorry that it took so long. The holidays were the only time I had available.


case class XmlDataToCatalyst(child: Expression,
                             schema: DataType,
                             options: XmlOptions)
Member


nit:

case class XmlDataToCatalyst(
    child: Expression,
    schema: DataType,
    options: XmlOptions)
  extends ...

(per the guide I pointed out above)

Author


@HyukjinKwon I'm having a challenging time getting IntelliJ to enforce this. I'm going to try swinging back around to it after I commit the more sizable change requests.

Member


Yeah, it is a bit of a grind to fix the style manually, I admit. This suggestion wouldn't block the PR, but it would be good to follow the style guide (https://github.com/databricks/scala-style-guide), since Databricks repositories comply with it. (BTW, I'm not a Databricks guy, so I'd suggest following the guide to be safe.)

@transient
lazy val rowSchema: StructType = schema match {
  case st: StructType => st
  case ArrayType(st: StructType, _) => st
Member


I think XML implementation does not support an array of struct. Can be removed.

Author


Would you like me to add a case _ => throw SomeExceptionClass("string") statement? Otherwise, rowSchema must be of type DataType.

Member


Ah, I think we can change DataType to StructType.
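A minimal sketch of what that change might look like (illustrative, not the PR's exact code; Expression and XmlOptions are the types already used in the PR, and this assumes the class lives in a package where XmlOptions is visible):

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.types.StructType

// With the constructor typed as StructType rather than DataType,
// the pattern match (and any catch-all error case) becomes unnecessary.
case class XmlDataToCatalyst(
    child: Expression,
    schema: StructType,
    options: XmlOptions) {

  @transient
  lazy val rowSchema: StructType = schema
}
```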

def parseColumn(xml: String,
                schema: StructType,
                options: XmlOptions): Row = {
  val factory: XMLInputFactory = XMLInputFactory.newInstance()
Member


Let's make a private function for:

    val factory: XMLInputFactory = XMLInputFactory.newInstance()
    factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, false)
    factory.setProperty(XMLInputFactory.IS_COALESCING, true)

    val filter = new EventFilter {
      override def accept(event: XMLEvent): Boolean =
      // Ignore comments. This library does not treat comments.
        event.getEventType != XMLStreamConstants.COMMENT
    }

and deduplicate the logic with parse above.

Also, I would make a function to deduplicate

    val eventReader = factory.createXMLEventReader(new StringReader(xml))
    val parser = factory.createFilteredReader(eventReader, filter)

    val rootEvent =
      StaxXmlParserUtils.skipUntil(parser, XMLStreamConstants.START_ELEMENT)
    val rootAttributes =
      rootEvent.asStartElement.getAttributes.asScala.map(_.asInstanceOf[Attribute]).toArray
    convertObject(parser, schema, options, rootAttributes)

as well.
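The deduplication suggested above could be sketched as two private helpers plus a shared filter; the names below (buildFactory, filteredReader) are illustrative, not from the PR:

```scala
import java.io.StringReader
import javax.xml.stream.{EventFilter, XMLEventReader, XMLInputFactory, XMLStreamConstants}
import javax.xml.stream.events.XMLEvent

object StaxHelpers {
  // Builds a factory with the settings shared by parse and parseColumn.
  private def buildFactory(): XMLInputFactory = {
    val factory = XMLInputFactory.newInstance()
    factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, false)
    factory.setProperty(XMLInputFactory.IS_COALESCING, true)
    factory
  }

  // Ignore comments. This library does not treat comments.
  private val filter = new EventFilter {
    override def accept(event: XMLEvent): Boolean =
      event.getEventType != XMLStreamConstants.COMMENT
  }

  // Turns an XML string into a comment-filtered event reader.
  private def filteredReader(factory: XMLInputFactory, xml: String): XMLEventReader = {
    val eventReader = factory.createXMLEventReader(new StringReader(xml))
    factory.createFilteredReader(eventReader, filter)
  }
}
```

Both parse and parseColumn could then call these helpers instead of repeating the setup inline.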

@@ -111,6 +113,30 @@ private[xml] object StaxXmlParser extends Serializable {
}
}

def parseColumn(xml: String,
                schema: StructType,
                options: XmlOptions): Row = {
Member

@HyukjinKwon HyukjinKwon Jan 2, 2019


nit:

def parseColumn(
    xml: String,
    ...): Row = {

from_xml(df.col("payload"), xmlSchema))

assert(expectedSchema == result.schema)
assert(result.where(col("decoded").isNotNull).count() > 0)
Member


I think the actual output should also be asserted and checked, for completeness.

@HyukjinKwon
Member

Thanks, @SpaceRangerWes for working on this!

I took a quick pass. It should be good to refer to https://github.com/databricks/scala-style-guide for code style.

…ry unit tests, and several quick styling fixes
def parseColumn(xml: String,
                schema: StructType,
                options: XmlOptions): Row = {
  val factory: XMLInputFactory = XMLInputFactory.newInstance()
Member

@HyukjinKwon HyukjinKwon Feb 15, 2019


Currently, for each input XML string:

  1. an XMLInputFactory instance is created
  2. the XML is parsed to a Row

I think this can be avoided by:

class XmlDataToCatalyst(..) {
  @transient
  lazy val factory: XMLInputFactory = XMLInputFactory.newInstance()

  def nullSafeEval(...) = {
    factory.createXMLEventReader(new StringReader(xml))
  }
}

In this way, we create one factory first and then reuse it for every string input. (BTW, it's quite critical to fix this.)
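Expanded into a self-contained sketch (the class and method shapes below are illustrative stand-ins, not the PR's exact Spark expression code), the idea is to hoist the factory into a transient lazy val so it is built once per deserialized instance rather than once per row. This matters because XMLInputFactory.newInstance() performs a service-provider lookup, which is costly to repeat per record:

```scala
import java.io.StringReader
import javax.xml.stream.{XMLEventReader, XMLInputFactory}

class XmlParserHolder extends Serializable {
  // Built lazily on each executor after deserialization, then reused
  // for every input string instead of being rebuilt per row.
  @transient private lazy val factory: XMLInputFactory = {
    val f = XMLInputFactory.newInstance()
    f.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, false)
    f.setProperty(XMLInputFactory.IS_COALESCING, true)
    f
  }

  // Per-row work only creates the lightweight reader.
  def reader(xml: String): XMLEventReader =
    factory.createXMLEventReader(new StringReader(xml))
}
```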

@HyukjinKwon
Member

BTW, @SpaceRangerWes, why don't you address #334 (comment), #334 (comment) and #334 (comment) while addressing other comments?

Strictly speaking, some of the code does not comply with the style this project follows (https://github.com/databricks/scala-style-guide).

@mantovani

Can I do something to help? I really need this feature.

@srowen
Collaborator

srowen commented Mar 12, 2019

@mantovani you can open a new PR with these changes and address the comments maybe?

@SpaceRangerWes
Author

@srowen @mantovani I really apologize. I just don't have the capacity to continue this work and have had to abandon it; my team's needs migrated and I could no longer work on it.

HyukjinKwon pushed a commit that referenced this pull request Jan 7, 2020
This is a continuation of #334

This implements a `from_xml` expression that can turn an XML string column into a parsed structured column. It also opens up `inferSchema` for a `Dataset[String]` as a necessary support function.

The rest is really mild refactoring.
beluisterql added a commit to beluisterql/spark-xml that referenced this pull request Aug 4, 2024
This is a continuation of databricks/spark-xml#334
