XML file metadata #125

Merged
merged 11 commits into from
Apr 10, 2024
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -1,6 +1,8 @@
## [Unreleased]
=======
* no unreleased changes *

### Added
* XML file/table metadata storage

## 11.1.0 / 2024-03-07
### Added
50 changes: 50 additions & 0 deletions docs/xml-file-metadata.md
@@ -0,0 +1,50 @@
---
layout: page
title: XML File Metadata
permalink: /xml-file-metadata/
---

### Introduction
XML files can contain file-level data. `NdrImport::Xml::Table` now supports retrieving and storing that data.

### `xml_file_metadata`
* `NdrImport::Xml::Table` can optionally store `xml_file_metadata`, a hash of { attribute name => xpath }.
* The `NdrImport::File::Xml` handler uses `xml_file_metadata` to locate the metadata within the file, then sets its `file_metadata` attribute to a hash of { attribute name => value at the given xpath }.
* The `UniversalImporterHelper` then assigns the handler's `file_metadata` to the `NdrImport::Table` attribute `table_metadata` (an empty hash when the handler has none), which can then be accessed downstream.
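
The three steps above can be sketched in plain Ruby. This is a minimal illustration rather than the gem's implementation: `FakeTable`, `FakeHandler` and the `LOOKUP` hash (which stands in for real XPath evaluation) are hypothetical names introduced only for this sketch.

```ruby
# Hypothetical stand-ins; none of these exist in ndr_import.
FakeTable = Struct.new(:xml_file_metadata, :table_metadata)

# Pretend XPath engine: maps an xpath string to the text found there.
LOOKUP = {
  '//root/metatadata_one/@extension' => 'hello',
  '//root/metatadata_two/@value' => 'world'
}.freeze

class FakeHandler
  attr_accessor :file_metadata

  def initialize(xml_file_metadata)
    return unless xml_file_metadata.is_a?(Hash)

    # { attribute name => xpath } becomes { attribute name => value }
    self.file_metadata = xml_file_metadata.transform_values do |xpath|
      LOOKUP.fetch(xpath, '')
    end
  end
end

table = FakeTable.new(
  { 'metatadata_one' => '//root/metatadata_one/@extension',
    'metatadata_two' => '//root/metatadata_two/@value' }
)

handler = FakeHandler.new(table.xml_file_metadata)

# The helper copies the handler's metadata onto the table, defaulting to {}.
table.table_metadata = handler.file_metadata || {}

puts table.table_metadata.inspect
```

Because the mapping keys pass through `transform_values` untouched, the keys of `table_metadata` are whatever the mapping used (strings, in this sketch).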


### Example:
Given the below example data:

```xml
<root>
<metatadata_one extension="hello"/>
<metatadata_two value="world"/>
<record>
<some_data>DOUGLAS</some_data>
</record>
<record>
<some_data>DORA</some_data>
</record>
</root>
```

The `NdrImport::Xml::Table` mapping might look like:

```yaml
- !ruby/object:NdrImport::Xml::Table
filename_pattern: !ruby/regexp //
format: xml_table
xml_record_xpath: 'record'
yield_xml_record: false
xml_file_metadata:
metatadata_one: '//root/metatadata_one/@extension'
metatadata_two: '//root/metatadata_two/@value'
columns:
...
```

This would result in a `table_metadata` value of:
```
{ 'metatadata_one' => 'hello', 'metatadata_two' => 'world' }
```
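
To make the lookup concrete, here is a rough, self-contained approximation of the slurped case. It is only a sketch: it swaps Nokogiri for REXML (assumed to be available, as it is with standard Ruby installs) so that it runs without extra gems, and it returns an empty string when an xpath matches nothing, which is a choice made for this sketch.

```ruby
require 'rexml/document'

xml = <<~XML
  <root>
    <metatadata_one extension="hello"/>
    <metatadata_two value="world"/>
    <record><some_data>DOUGLAS</some_data></record>
    <record><some_data>DORA</some_data></record>
  </root>
XML

# Same shape as the `xml_file_metadata` entry in the mapping above.
xml_file_metadata = {
  'metatadata_one' => '//root/metatadata_one/@extension',
  'metatadata_two' => '//root/metatadata_two/@value'
}

doc = REXML::Document.new(xml)

# Resolve each xpath against the whole document, keeping the mapping's keys.
table_metadata = xml_file_metadata.transform_values do |xpath|
  attribute = REXML::XPath.first(doc, xpath)
  attribute ? attribute.value : ''
end

puts table_metadata.inspect
```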
2 changes: 1 addition & 1 deletion docs/xml-mappings.md
@@ -22,7 +22,7 @@ The logic covers all current use cases; additional features may be needed if mor
* `pattern_match_record_xpath` - setting this `true` treats the `xml_record_xpath` as a regular expression; the default is to treat it as a string
* `slurp` - setting this to `true` will ensure the data is slurped; the default is to stream the XML
* `yield_xml_record` - setting this to true will yield all "klasses" created from a single XML record (identified by `xml_record_xpath`); the default is to yield per klass

* `xml_file_metadata` - a hash of { attribute name => xpath } for extracting file-level metadata; see [XML file metadata](xml-file-metadata.md)

### `NdrImport::Xml::Table` example:
Given the below example data:
12 changes: 7 additions & 5 deletions lib/ndr_import/file/base.rb
@@ -9,6 +9,8 @@ module NdrImport
module File
# All common base file handler logic is defined here.
class Base
attr_accessor :file_metadata

def initialize(filename, format, options = {})
@filename = filename
@format = format
@@ -32,10 +34,10 @@ def files
yield @filename
end

# This method iterates over the tables in the given file and yields with two arguments:
# a tablename and a row enumerator (for that table). For a spreadsheet it may yield for
# every worksheet in the file and for a CSV file it will only yield once (the entire
# file is one table).
# This method iterates over the tables in the given file and yields with three arguments:
# a tablename, a row enumerator (for that table) and any file metadata.
# For a spreadsheet it may yield for every worksheet in the file and for a CSV file it
# will only yield once (the entire file is one table).
#
# As single table files are in the majority, the Base implementation is defined for
# single table handlers and you will only need to implement the rows iterator. If your
@@ -45,7 +47,7 @@ def files
def tables
return enum_for(:tables) unless block_given?

yield nil, rows
yield nil, rows, file_metadata
end

private
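
A note on the new three-value `yield` in `tables`: the sketch below, using a hypothetical `SketchHandler` (not part of the gem) and a made-up `'provider'` key, illustrates how callers might consume it. It assumes only standard Ruby block semantics, under which a block declaring two parameters simply discards the third value.

```ruby
# Hypothetical handler mimicking the `tables` contract shown above.
class SketchHandler
  attr_accessor :file_metadata

  def initialize(rows, file_metadata = nil)
    @rows = rows
    self.file_metadata = file_metadata
  end

  def tables
    return enum_for(:tables) unless block_given?

    yield nil, @rows.each, file_metadata
  end
end

handler = SketchHandler.new([%w[DOUGLAS], %w[DORA]], { 'provider' => 'LT4' })

# New-style callers destructure all three values.
handler.tables do |tablename, rows, metadata|
  puts "table=#{tablename.inspect} rows=#{rows.count} metadata=#{metadata.inspect}"
end

# Two-parameter blocks keep working: the extra value is discarded.
two_arg_rows = nil
handler.tables { |_tablename, rows| two_arg_rows = rows.to_a }

# Without a block, each enumerator element is a three-element array,
# so the metadata is the last item of the first table.
first_table = handler.tables.to_a.first
puts first_table.last.inspect
```

This is also why the tests in this PR can read the metadata with `tables.first.last`.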
65 changes: 60 additions & 5 deletions lib/ndr_import/file/xml.rb
@@ -16,32 +16,87 @@ def initialize(*)
super

@pattern_match_xpath = @options['pattern_match_record_xpath']
@xml_file_metadata = @options['xml_file_metadata']
@options['slurp'] ? prepare_slurped_file : prepare_streamed_file
end

private

def prepare_slurped_file
@doc = read_xml_file(@filename)
slurp_metadata_values
end

def prepare_streamed_file
with_encoding_check(@filename) do |stream, encoding|
@stream = stream
@encoding = encoding
end
stream_metadata_values
end

def slurp_metadata_values
return unless @xml_file_metadata.is_a?(Hash)

self.file_metadata = @xml_file_metadata.transform_values do |xpath|
@doc.xpath(xpath).inner_text
end
end

def stream_metadata_values
return unless @xml_file_metadata.is_a?(Hash)

self.file_metadata = @xml_file_metadata.transform_values.with_index do |xpath, index|
# Ensure we're at the start of the stream each time
@stream.rewind unless index.zero?

metadata_from_stream(xpath)
end
end

def metadata_from_stream(xpath)
cursor = Cursor.new(xpath, false)

# If markup isn't well-formed, try to work around it:
options = Nokogiri::XML::ParseOptions::RECOVER
reader = Nokogiri::XML::Reader(@stream, nil, @encoding, options)

reader.each do |node|
case node.node_type
when Nokogiri::XML::Reader::TYPE_ELEMENT # "opening tag"
raise NestingError, node if cursor.in?(node)

cursor.enter(node)
return cursor.inner_text if cursor.send(:current_stack_match?)
when Nokogiri::XML::Reader::TYPE_END_ELEMENT # "closing tag"
cursor.leave(node)
end
end
end

# Iterate through the file, yielding each 'xml_record_xpath' element in turn.
def rows(&block)
return enum_for(:rows) unless block

if @options['slurp']
record_elements(read_xml_file(@filename)).each(&block)
record_elements.each(&block)
else
each_node(@filename, xml_record_xpath, @pattern_match_xpath, &block)
@stream.rewind
each_node(@stream, @encoding, xml_record_xpath, @pattern_match_xpath, &block)
end
end

def xml_record_xpath
@pattern_match_xpath ? @options['xml_record_xpath'] : "*/#{@options['xml_record_xpath']}"
end

def record_elements(doc)
def record_elements
if @pattern_match_xpath
doc.root.children.find_all do |element|
@doc.root.children.find_all do |element|
element.name =~ Regexp.new(@options['xml_record_xpath'])
end
else
doc.root.xpath(@options['xml_record_xpath'])
@doc.root.xpath(@options['xml_record_xpath'])
end
end
end
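
The rewind in `stream_metadata_values` matters because each metadata lookup reads the stream from its current position. The following sketch illustrates the idea with a `StringIO` and a hypothetical `scan_for` helper that stands in for the streaming parser; it is not how Nokogiri's reader works internally, only a demonstration of the `transform_values.with_index` pattern and of what happens when the rewind is skipped.

```ruby
require 'stringio'

stream = StringIO.new('<root><a v="1"/><b v="2"/></root>')

# Hypothetical stand-in for a streaming parser: consumes the stream from
# its current position and returns the `v` attribute of the given tag.
def scan_for(stream, tag)
  stream.read[/<#{tag} v="([^"]*)"/, 1]
end

wanted = { 'first' => 'a', 'second' => 'b' }

# Without a block, transform_values returns an Enumerator, so with_index
# can supply the position of each entry alongside its value.
with_rewind = wanted.transform_values.with_index do |tag, index|
  stream.rewind unless index.zero?
  scan_for(stream, tag)
end

# Skipping the rewind leaves the second lookup with an exhausted stream.
stream.rewind
without_rewind = wanted.transform_values { |tag| scan_for(stream, tag) }

puts with_rewind.inspect
puts without_rewind.inspect
```

In the sketch, the second lookup finds nothing once the stream has been consumed, which is the situation the `@stream.rewind` calls above guard against.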
15 changes: 8 additions & 7 deletions lib/ndr_import/helpers/file/xml_streaming.rb
@@ -78,6 +78,10 @@ def matches?
match
end

def inner_text
dom_stubs[@stack].xpath(@xpath)&.inner_text
end

private

def in_empty_element?
@@ -134,21 +138,18 @@ def add_items_to_dom(dom, items)

include UTF8Encoding

# Streams the contents of the given `safe_path`, and yields
# each element matching `xpath` as they're found.
# Yields each element matching `xpath` from `stream` as they're found.
#
# In the case of dodgy encoding, may fall back to slurping the
# file, but will still use stream parsing for XML.
#
# Optionally pattern match the xpath
def each_node(safe_path, xpath, pattern_match_xpath = nil, &block)
return enum_for(:each_node, safe_path, xpath, pattern_match_xpath) unless block
def each_node(stream, encoding, xpath, pattern_match_xpath = nil, &block)
return enum_for(:each_node, stream, encoding, xpath, pattern_match_xpath) unless block

require 'nokogiri'

with_encoding_check(safe_path) do |stream, encoding|
stream_xml_nodes(stream, xpath, pattern_match_xpath, encoding, &block)
end
stream_xml_nodes(stream, xpath, pattern_match_xpath, encoding, &block)
end

private
2 changes: 1 addition & 1 deletion lib/ndr_import/table.rb
@@ -21,7 +21,7 @@ def all_valid_options
end

attr_reader(*all_valid_options)
attr_accessor :notifier
attr_accessor :notifier, :table_metadata

def initialize(options = {})
options.stringify_keys! if options.is_a?(Hash)
7 changes: 4 additions & 3 deletions lib/ndr_import/universal_importer_helper.rb
@@ -59,7 +59,8 @@ def extract(source_file, &block)
'xml_record_xpath' => table_mapping.try(:xml_record_xpath),
'slurp' => table_mapping.try(:slurp),
'yield_xml_record' => table_mapping.try(:yield_xml_record),
'pattern_match_record_xpath' => table_mapping.try(:pattern_match_record_xpath) }
'pattern_match_record_xpath' => table_mapping.try(:pattern_match_record_xpath),
'xml_file_metadata' => table_mapping.try(:xml_file_metadata) }

tables = NdrImport::File::Registry.tables(filename, table_mapping.try(:format), options)
yield_tables_and_their_content(filename, tables, &block)
@@ -71,12 +72,12 @@ def extract(source_file, &block)
def yield_tables_and_their_content(filename, tables, &block)
return enum_for(:yield_tables_and_their_content, filename, tables) unless block_given?

tables.each do |tablename, table_content|
tables.each do |tablename, table_content, file_metadata|
mapping = get_table_mapping(filename, tablename)
next if mapping.nil?

mapping.notifier = get_notifier(record_total(filename, table_content))

mapping.table_metadata = file_metadata || {}
yield(mapping, table_content)
end
end
3 changes: 2 additions & 1 deletion lib/ndr_import/xml/table.rb
@@ -10,7 +10,8 @@ class Table < ::NdrImport::Table
require 'ndr_import/xml/column_mapping'
require 'ndr_import/xml/masked_mappings'

XML_OPTIONS = %w[pattern_match_record_xpath xml_record_xpath yield_xml_record].freeze
XML_OPTIONS = %w[pattern_match_record_xpath xml_file_metadata xml_record_xpath
yield_xml_record].freeze

def self.all_valid_options
super - %w[delimiter header_lines footer_lines] + XML_OPTIONS
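
Why `xml_file_metadata` has to be listed in `XML_OPTIONS` can be shown with a trimmed-down sketch. `SketchTable` and `SketchXmlTable` are hypothetical, the base option list is invented for illustration (the real list is not shown in this diff), and the `initialize` here is a simplification rather than the gem's own.

```ruby
# Hypothetical, trimmed-down versions of the two table classes.
class SketchTable
  def self.all_valid_options
    %w[columns delimiter filename_pattern footer_lines format header_lines]
  end
end

class SketchXmlTable < SketchTable
  XML_OPTIONS = %w[pattern_match_record_xpath xml_file_metadata xml_record_xpath
                   yield_xml_record].freeze

  def self.all_valid_options
    # Array#- and Array#+ share precedence and associate left to right:
    # drop the delimited-file options first, then append the XML ones.
    super - %w[delimiter header_lines footer_lines] + XML_OPTIONS
  end

  # One reader per valid option, so `xml_file_metadata` becomes readable.
  attr_reader(*all_valid_options)

  # Simplified for the sketch: store only recognised options.
  def initialize(options = {})
    options.each do |key, value|
      instance_variable_set("@#{key}", value) if self.class.all_valid_options.include?(key)
    end
  end
end

table = SketchXmlTable.new('xml_file_metadata' => { 'record_count' => '//RecordCount/@value' })
puts SketchXmlTable.all_valid_options.inspect
puts table.xml_file_metadata.inspect
```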
52 changes: 52 additions & 0 deletions test/file/xml_test.rb
@@ -85,6 +85,58 @@ def setup
assert rows.is_a? Enumerator
assert_equal 0, rows.to_a.length
end

test 'should read file metadata while slurping xml' do
file_path = SafePath.new('permanent_test_files').join('complex_xml.xml')
options = {
'xml_record_xpath' => 'BreastRecord',
'slurp' => true,
'xml_file_metadata' => {
'submitting_providercode' => '//OrganisationIdentifierCodeOfSubmittingOrganisation/@extension'
}
}
handler = NdrImport::File::Xml.new(file_path, nil, options)
expected_metadata = { 'submitting_providercode' => 'LT4' }
assert_equal expected_metadata, handler.file_metadata

tables = handler.send(:tables).to_a
assert_equal expected_metadata, tables.first.last
end

test 'should read file metadata while streaming xml' do
file_path = SafePath.new('permanent_test_files').join('complex_xml.xml')
options = {
'xml_record_xpath' => 'BreastRecord',
'slurp' => false,
'xml_file_metadata' => {
'submitting_providercode' => '//COSD:OrganisationIdentifierCodeOfSubmittingOrganisation/@extension',
'record_count' => '//COSD:RecordCount/@value'
}
}
handler = NdrImport::File::Xml.new(file_path, nil, options)
expected_metadata = { 'submitting_providercode' => 'LT4', 'record_count' => '6349923' }
assert_equal expected_metadata, handler.file_metadata
tables = handler.send(:tables).to_a
assert_equal expected_metadata, tables.first.last
end

test 'should identify forced encoding when preparing file stream' do
handler = NdrImport::File::Xml.new(@file_path, nil, { slurp: false })
assert_nil handler.instance_variable_get('@encoding')

file_path = SafePath.new('permanent_test_files').join('utf-16be_xml_with_declaration.xml')
handler = NdrImport::File::Xml.new(file_path, nil, { slurp: false })
assert_equal 'UTF8', handler.instance_variable_get('@encoding')
end

test 'should not identify forced encoding when slurping file' do
handler = NdrImport::File::Xml.new(@file_path, nil, { slurp: true })
assert_nil handler.instance_variable_get('@encoding')

file_path = SafePath.new('permanent_test_files').join('utf-16be_xml_with_declaration.xml')
handler = NdrImport::File::Xml.new(file_path, nil, { slurp: true })
assert_nil handler.instance_variable_get('@encoding')
end
end
end
end
14 changes: 10 additions & 4 deletions test/helpers/file/xml_streaming_test.rb
@@ -27,7 +27,9 @@ def nodes(xpath, xml)
def nodes_from_file(xpath, file_name)
file_path = safe_path.join(file_name)
[].tap do |nodes|
each_node(file_path, xpath) { |node| nodes << node }
with_encoding_check(file_path) do |stream, encoding|
each_node(stream, encoding, xpath) { |node| nodes << node }
end
end
end
end
@@ -199,10 +201,10 @@ def setup
end
end

test 'each_node should reject non safe path arguments' do
test 'with_encoding_check should reject non safe path arguments' do
exception = assert_raises ArgumentError do
block_called = false
@importer.each_node('unsafe.xml', '//note') { block_called = true }
@importer.send(:with_encoding_check, 'unsafe.xml') { block_called = true }

refute block_called, 'should not have yielded'
end
@@ -211,7 +213,11 @@ def setup
end

test 'each_node should return an enumerable' do
enum = @importer.each_node(@importer.safe_path.join('utf-8_xml.xml'), '//note')
safe_path = @importer.safe_path.join('utf-8_xml.xml')
enum = nil
@importer.send(:with_encoding_check, safe_path) do |stream, encoding|
enum = @importer.each_node(stream, encoding, '//note')
end
assert_kind_of Enumerator, enum
assert_equal 2, enum.to_a.length
end
2 changes: 2 additions & 0 deletions test/resources/complex_xml.xml
@@ -1,4 +1,6 @@
<COSD:COSD xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:COSD="http://www.datadictionary.nhs.uk/messages/COSD-v10-0-1">
<OrganisationIdentifierCodeOfSubmittingOrganisation extension="LT4"/>
<RecordCount value="6349923"/>
<BreastRecord>
<Id root="AAD0ce33-d168-26c6-A4Df-7A0B4d2a5aeC"/>
<PatientIdentityDetails>