
Implementation of from_xml function #334

Closed · wants to merge 9 commits

Conversation


@SpaceRangerWes SpaceRangerWes commented Oct 29, 2018

Issue #322 asked for a commonly requested feature. The purpose of this PR is to provide a new function, from_xml, that leverages as much of the current code as possible to parse a string-encoded XML column into a complex DataType when provided a schema.

Both fortunately and unfortunately, my team at work is pivoting away from nested XML objects, so I was not given more time to provide a larger set of enhancements. I'd gladly make any small changes requested, but I will not have the capacity to provide further enhancements.
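For readers asking how to use this: below is a minimal sketch of the intended usage, inferred from the PR's unit test. The signature `from_xml(col, schema)` and the example data are assumptions, not the PR's exact code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import com.databricks.spark.xml.from_xml // function added by this PR

object FromXmlDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("from_xml demo").getOrCreate()
  import spark.implicits._

  // A DataFrame whose "payload" column holds string-encoded XML.
  val df = Seq(
    "<person><name>Ada</name><age>36</age></person>"
  ).toDF("payload")

  // The schema each XML string should be parsed into.
  val xmlSchema = StructType(Seq(
    StructField("name", StringType),
    StructField("age", LongType)))

  // Parse the string column into a struct column named "decoded".
  val result = df.withColumn("decoded", from_xml(df.col("payload"), xmlSchema))
  result.select("decoded.name", "decoded.age").show()
}
```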

@SpaceRangerWes
Author

Note on the build fail:

This occurs for Scala 2.10.5 / JDK openjdk7 on a preexisting dependency. As far as I know, none of my code uses or breaks this dependency, so any dependency-tree management is out of scope for this task.

@hinawatts

Hi,

I have a similar requirement, where I need to do analysis on a column containing an XML string. It would be great if you could provide an example of how to use this.

@SpaceRangerWes
Author

@hina1234 I added a unit test that shows an example of using what I have written. Looking at it now, the test is poorly named, but it should be what you are looking for: https://github.com/SpaceRangerWes/spark-xml/blob/master/src/test/scala/com/databricks/spark/xml/XmlSuite.scala#L919-L941

@HyukjinKwon
Member

Sounds good to have it. Can you rebase and make the tests pass?

@codecov-io

Codecov Report

Merging #334 into master will decrease coverage by 0.97%.
The diff coverage is 66.66%.


@@            Coverage Diff             @@
##           master     #334      +/-   ##
==========================================
- Coverage   87.51%   86.54%   -0.98%     
==========================================
  Files          14       15       +1     
  Lines         737      773      +36     
  Branches       96       86      -10     
==========================================
+ Hits          645      669      +24     
- Misses         92      104      +12
Impacted Files Coverage Δ
...abricks/spark/xml/parsers/StaxXmlParserUtils.scala 81.81% <0%> (-13.93%) ⬇️
.../main/scala/com/databricks/spark/xml/package.scala 28.57% <100%> (+19.48%) ⬆️
...in/scala/com/databricks/spark/xml/XmlOptions.scala 100% <100%> (ø) ⬆️
...m/databricks/spark/xml/parsers/StaxXmlParser.scala 96.87% <100%> (+0.26%) ⬆️
...a/com/databricks/spark/xml/XmlDataToCatalyst.scala 63.63% <63.63%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a0a8b15...ca711d7. Read the comment docs.

@codecov-io

codecov-io commented Jan 1, 2019

Codecov Report

Merging #334 into master will decrease coverage by 0.11%.
The diff coverage is 88.23%.


@@            Coverage Diff             @@
##           master     #334      +/-   ##
==========================================
- Coverage   87.58%   87.46%   -0.12%     
==========================================
  Files          14       15       +1     
  Lines         741      766      +25     
  Branches       91       85       -6     
==========================================
+ Hits          649      670      +21     
- Misses         92       96       +4
Impacted Files Coverage Δ
...abricks/spark/xml/parsers/StaxXmlParserUtils.scala 95.74% <ø> (ø) ⬆️
.../main/scala/com/databricks/spark/xml/package.scala 28.57% <100%> (+19.48%) ⬆️
...m/databricks/spark/xml/parsers/StaxXmlParser.scala 96.82% <100%> (+0.21%) ⬆️
...a/com/databricks/spark/xml/XmlDataToCatalyst.scala 63.63% <63.63%> (ø)


@SpaceRangerWes
Author

@HyukjinKwon finally rebased. Sorry that it took so long. The holidays were the only time I had available.


case class XmlDataToCatalyst(child: Expression,
                             schema: DataType,
                             options: XmlOptions)
Member


nit:

case class XmlDataToCatalyst(
    child: Expression,
    schema: DataType,
    options: XmlOptions)
  extends ...

(per the guide I pointed out above)

Author


@HyukjinKwon I'm having a challenging time getting IntelliJ to enforce this. I'm going to try swinging back around to it after I commit the more sizable change requests.

Member


Yeah, it is a bit of a grind to fix the style manually, I admit. This suggestion wouldn't block the PR, but it would be good to follow the style guide (https://github.com/databricks/scala-style-guide), since Databricks repositories comply with it. (BTW, I'm not a Databricks guy, so I'd suggest following the guide to be safe.)

@transient
lazy val rowSchema: StructType = schema match {
  case st: StructType => st
  case ArrayType(st: StructType, _) => st
Member


I think XML implementation does not support an array of struct. Can be removed.

Author


Would you like me to add a case _ => throw SomeExceptionClass("string") statement? Otherwise, rowSchema must be of type DataType.

Member


Ah, I think we can change DataType to StructType.
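A minimal sketch of what that change might look like (illustrative, not the PR's exact code; Expression and XmlOptions are the types already used in the PR, and this assumes the class lives in a package where XmlOptions is visible):

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.types.StructType

// With the constructor typed as StructType rather than DataType,
// the pattern match (and any catch-all error case) becomes unnecessary.
case class XmlDataToCatalyst(
    child: Expression,
    schema: StructType,
    options: XmlOptions) {

  @transient
  lazy val rowSchema: StructType = schema
}
```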

def parseColumn(xml: String,
                schema: StructType,
                options: XmlOptions): Row = {
  val factory: XMLInputFactory = XMLInputFactory.newInstance()
Member


Let's make a private function for:

    val factory: XMLInputFactory = XMLInputFactory.newInstance()
    factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, false)
    factory.setProperty(XMLInputFactory.IS_COALESCING, true)

    val filter = new EventFilter {
      override def accept(event: XMLEvent): Boolean =
      // Ignore comments. This library does not treat comments.
        event.getEventType != XMLStreamConstants.COMMENT
    }

and deduplicate the logic with parse above.

Also, I would make a function to deduplicate

    val eventReader = factory.createXMLEventReader(new StringReader(xml))
    val parser = factory.createFilteredReader(eventReader, filter)

    val rootEvent =
      StaxXmlParserUtils.skipUntil(parser, XMLStreamConstants.START_ELEMENT)
    val rootAttributes =
      rootEvent.asStartElement.getAttributes.asScala.map(_.asInstanceOf[Attribute]).toArray
    convertObject(parser, schema, options, rootAttributes)

as well.
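The deduplication suggested above could be sketched as two private helpers plus a shared filter; the names below (buildFactory, filteredReader) are illustrative, not from the PR:

```scala
import java.io.StringReader
import javax.xml.stream.{EventFilter, XMLEventReader, XMLInputFactory, XMLStreamConstants}
import javax.xml.stream.events.XMLEvent

object StaxHelpers {
  // Builds a factory with the settings shared by parse and parseColumn.
  private def buildFactory(): XMLInputFactory = {
    val factory = XMLInputFactory.newInstance()
    factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, false)
    factory.setProperty(XMLInputFactory.IS_COALESCING, true)
    factory
  }

  // Ignore comments. This library does not treat comments.
  private val filter = new EventFilter {
    override def accept(event: XMLEvent): Boolean =
      event.getEventType != XMLStreamConstants.COMMENT
  }

  // Turns an XML string into a comment-filtered event reader.
  private def filteredReader(factory: XMLInputFactory, xml: String): XMLEventReader = {
    val eventReader = factory.createXMLEventReader(new StringReader(xml))
    factory.createFilteredReader(eventReader, filter)
  }
}
```

Both parse and parseColumn could then call these helpers instead of repeating the setup inline.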

@@ -111,6 +113,30 @@ private[xml] object StaxXmlParser extends Serializable {
}
}

def parseColumn(xml: String,
                schema: StructType,
                options: XmlOptions): Row = {
Member

@HyukjinKwon HyukjinKwon Jan 2, 2019


nit:

def parseColumn(
    xml: String,
    ...): Row = {

from_xml(df.col("payload"), xmlSchema))

assert(expectedSchema == result.schema)
assert(result.where(col("decoded").isNotNull).count() > 0)
Member


I think the actual output should also be asserted and checked, for completeness.

@HyukjinKwon
Member

Thanks, @SpaceRangerWes for working on this!

I took a quick pass. It should be good to refer to https://github.com/databricks/scala-style-guide for code style.

…ry unit tests, and several quick styling fixes
def parseColumn(xml: String,
                schema: StructType,
                options: XmlOptions): Row = {
  val factory: XMLInputFactory = XMLInputFactory.newInstance()
Member

@HyukjinKwon HyukjinKwon Feb 15, 2019


Currently, for each input XML string:

  1. an XMLInputFactory instance is created
  2. the XML is parsed to a Row

I think this can be avoided by:

class XmlDataToCatalyst(..) {
  @transient
  lazy val factory: XMLInputFactory = XMLInputFactory.newInstance()

  def nullSafeEval(...) = {
    factory.createXMLEventReader(new StringReader(xml))
  }
}

In this way, we create one factory first and then reuse it for every string input. (BTW, it's quite critical to fix this.)
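Expanded into a self-contained sketch (the class and method shapes below are illustrative stand-ins, not the PR's exact Spark expression code), the idea is to hoist the factory into a transient lazy val so it is built once per deserialized instance rather than once per row. This matters because XMLInputFactory.newInstance() performs a service-provider lookup, which is costly to repeat per record:

```scala
import java.io.StringReader
import javax.xml.stream.{XMLEventReader, XMLInputFactory}

class XmlParserHolder extends Serializable {
  // Built lazily on each executor after deserialization, then reused
  // for every input string instead of being rebuilt per row.
  @transient private lazy val factory: XMLInputFactory = {
    val f = XMLInputFactory.newInstance()
    f.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, false)
    f.setProperty(XMLInputFactory.IS_COALESCING, true)
    f
  }

  // Per-row work only creates the lightweight reader.
  def reader(xml: String): XMLEventReader =
    factory.createXMLEventReader(new StringReader(xml))
}
```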

@HyukjinKwon
Member

BTW, @SpaceRangerWes, why don't you address #334 (comment), #334 (comment) and #334 (comment) while addressing other comments?

Strictly speaking, some of the code does not comply with the style this project follows (https://github.com/databricks/scala-style-guide).

@mantovani

Can I do something to help? I really need this feature.

@srowen
Collaborator

srowen commented Mar 12, 2019

@mantovani you can open a new PR with these changes and address the comments maybe?

@SpaceRangerWes
Author

@srowen @mantovani I really apologize. I just don't have the capacity to continue this work and have had to abandon it; my team's needs migrated and I could no longer work on it.

HyukjinKwon pushed a commit that referenced this pull request Jan 7, 2020
This is a continuation of #334

This implements a `from_xml` expression that can turn an XML string column into a parsed structured column. It also opens up `inferSchema` for a `Dataset[String]` as a necessary support function.

The rest is really mild refactoring.
beluisterql added a commit to beluisterql/spark-xml that referenced this pull request Aug 4, 2024
This is a continuation of databricks/spark-xml#334
