Commit

Add notes about file extensions and _corrupt_record to documentation (#…
…674)

* Add notes about file extensions and _corrupt_record to documentation

* Update README.md

* Update README.md
dolfinus authored Dec 22, 2023
1 parent 6969262 commit 3b40ef4
Showing 1 changed file with 9 additions and 0 deletions.
9 changes: 9 additions & 0 deletions README.md
@@ -46,6 +46,13 @@ When reading files the API accepts several options:
* `FAILFAST` : throws an exception when it meets corrupted records.
* `inferSchema`: if `true`, attempts to infer an appropriate type for each resulting DataFrame column, like a boolean, numeric or date type. If `false`, all resulting columns are of string type. Default is `true`.
* `columnNameOfCorruptRecord`: The name of new field where malformed strings are stored. Default is `_corrupt_record`.

Note: if the schema is passed explicitly, it must include this field, like this:
```python
from pyspark.sql.types import StringType, StructField, StructType, TimestampType
schema = StructType([StructField("my_field", TimestampType()), StructField("_corrupt_record", StringType())])
spark.read.format("xml").options(rowTag='item').schema(schema).load("file.xml")
```
If the schema is inferred, this field is added automatically.
* `attributePrefix`: The prefix for attributes so that we can differentiate attributes and elements. This will be the prefix for field names. Default is `_`. Can be empty, but only for reading XML.
* `valueTag`: The tag used for the value when there are attributes in the element having no child. Default is `_VALUE`.
* `charset`: Defaults to 'UTF-8' but can be set to other valid charset names
@@ -94,6 +101,8 @@ Defaults to [ISO_DATE](https://docs.oracle.com/javase/8/docs/api/java/time/forma

Currently it supports the shortened name usage. You can use just `xml` instead of `com.databricks.spark.xml`.

NOTE: created files have no `.xml` extension.
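
If downstream tools expect an `.xml` suffix, the written files can be renamed after the job finishes. The following is a minimal sketch, not part of this library: `add_xml_extension` is a hypothetical helper, and it assumes the output was written to a local directory and that the data files follow the usual `part-*` naming. For HDFS or object storage a filesystem-specific API would be needed instead.

```python
from pathlib import Path
from typing import List


def add_xml_extension(output_dir: str) -> List[str]:
    """Append ".xml" to extension-less part files in a local output directory.

    Hypothetical helper; assumes data files are named "part-*".
    Returns the new file names in sorted order.
    """
    renamed = []
    for part in sorted(Path(output_dir).glob("part-*")):
        # Leave directories and files that already carry a suffix
        # (for example compressed output such as "part-00000.gz") untouched.
        if not part.is_file() or part.suffix:
            continue
        target = part.with_name(part.name + ".xml")
        part.rename(target)
        renamed.append(target.name)
    return renamed
```

For example, `add_xml_extension("out/items")` (a hypothetical path) would rename `part-00000` to `part-00000.xml` and leave marker files such as `_SUCCESS` alone.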

### XSD Support

Per above, the XML for individual rows can be validated against an XSD using `rowValidationXSDPath`.