Bug: CSVParser.getData returns String and not Array #21

valentinoli · 2021-07-19T08:59:08Z

Hello, I noticed a bug in this line:

Line 28 in ca8c372

return this.data[index][selector];

It returns a String, not an Array.

Bit of context:
I'm trying to use the ignoreEmptyStrings: true parameter which results in an error at the above line due to this line:

RocketRML/src/input-parser/helper.js

Line 301 in 450f885

return values.filter((v) => v.trim() !== '');

However, even after I fix that bug and apply the parameter, the empty strings are not ignored: CSV records containing empty strings are still processed and triples are produced.
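
A minimal sketch of why a String return value would break the `filter` call quoted above. The `data` array and both `getData` variants are hypothetical stand-ins for illustration, not the library's actual code:

```javascript
// Hypothetical parsed CSV rows, keyed by column name.
const data = [
  { daddr: '20min.ch', 'Tracker Name': '' },
  { daddr: 'example.org', 'Tracker Name': 'Google' },
];

// Variant that returns the bare cell value (a String), as reported in this issue.
function getDataAsString(index, selector) {
  return data[index][selector];
}

// Variant that always wraps the cell value in an Array.
function getDataAsArray(index, selector) {
  const value = data[index][selector];
  return value === undefined ? [] : [value];
}

// Same shape as the filter quoted from helper.js.
const removeEmpty = (values) => values.filter((v) => v.trim() !== '');

let stringError = null;
try {
  removeEmpty(getDataAsString(1, 'Tracker Name'));
} catch (err) {
  // Strings have no filter method, so this call throws a TypeError.
  stringError = err;
}

const nonEmpty = removeEmpty(getDataAsArray(1, 'Tracker Name')); // ['Google']
const empty = removeEmpty(getDataAsArray(0, 'Tracker Name')); // []
console.log(stringError instanceof TypeError, nonEmpty, empty);
```

With the Array-returning variant the filter runs and simply yields an empty list for the empty cell.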

ThibaultGerrier · 2021-07-19T15:46:26Z

Oh yes, ignoreEmptyStrings not working with CSV is a bug.
Fixed with 95f863f & v1.10.2

However, even though I fix that bug and apply the parameter, the empty strings are not ignored and the CSV records with empty strings in them are still processed and triple is produced.

Could you provide a sample CSV? I tried one of my own with empty fields and it worked just fine.

valentinoli · 2021-07-19T18:25:10Z

Great! Thank you

Here is the CSV:

"daddr","time","Tracker Name","Tracker Category"
"020c0d61fbf479582f3978d6591124f8.safeframe.googlesyndication.com","1626039379622","Google","FingerprintingGeneral"
"031c65b448f24f13ef57a0b36248d50d.safeframe.googlesyndication.com","1626037774624","Google","FingerprintingGeneral"
"1842e48d40e3ebd9def3444c0b2b8cbd.safeframe.googlesyndication.com","1625968632859","Google","FingerprintingGeneral"
"192.168.210.1","1625990103523","",""
"20min.ch","1625968616583","",""
"20min.ch","1625968928420","",""

And here is the YARRRML:

base: http://sti2.at/
prefixes:
  ex: http://www.example.com/
  schema: http://schema.org/

sources:
  source: [trackercontrol.csv~csv]

mappings:
  organization:
    sources: source
    s: schema:Organization/$(Tracker Name)
    po:
      - [a, schema:Organization]
      - [schema:name, $(Tracker Name)]

  trace:
    sources: source
    po:
      - [ex:trace/destination, $(daddr)]
      - [ex:trace/time, $(time)]
      - [ex:trace/tracker/category, $(Tracker Category)]
      - p: ex:trace/tracker/organization
        o:
          mapping: organization
          condition:
            - function: equal
              parameters:
                - [str1, $(Tracker Name), s]
                - [str2, $(Tracker Name), o]

which I parsed into the following RML using yarrrml-parser:

@prefix rr: <http://www.w3.org/ns/r2rml#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix fnml: <http://semweb.mmlab.be/ns/fnml#>.
@prefix fno: <https://w3id.org/function/ontology#>.
@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#>.
@prefix void: <http://rdfs.org/ns/void#>.
@prefix dc: <http://purl.org/dc/terms/>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix rml: <http://semweb.mmlab.be/ns/rml#>.
@prefix ql: <http://semweb.mmlab.be/ns/ql#>.
@prefix : <http://sti2.at/>.
@prefix ex: <http://www.example.com/>.
@prefix schema: <http://schema.org/>.
@prefix idlab-fn: <http://example.com/idlab/function/>.
@prefix grel: <http://users.ugent.be/~bjdmeest/function/grel.ttl#>.

<http://mapping.example.com/rules_000> a void:Dataset.
:source_000 a rml:LogicalSource;
    rdfs:label "source";
    rml:source "./tracker-control/data/trackercontrol_firefox.csv";
    rml:referenceFormulation ql:CSV.
<http://mapping.example.com/rules_000> void:exampleResource :map_organization_000.
:map_organization_000 rml:logicalSource :source_000;
    a rr:TriplesMap;
    rdfs:label "organization".
:s_000 a rr:SubjectMap.
:map_organization_000 rr:subjectMap :s_000.
:s_000 rr:template "http://schema.org/Organization/{Tracker Name}".
:pom_000 a rr:PredicateObjectMap.
:map_organization_000 rr:predicateObjectMap :pom_000.
:pm_000 a rr:PredicateMap.
:pom_000 rr:predicateMap :pm_000.
:pm_000 rr:constant rdf:type.
:pom_000 rr:objectMap :om_000.
:om_000 a rr:ObjectMap;
    rr:constant "http://schema.org/Organization";
    rr:termType rr:IRI.
:pom_001 a rr:PredicateObjectMap.
:map_organization_000 rr:predicateObjectMap :pom_001.
:pm_001 a rr:PredicateMap.
:pom_001 rr:predicateMap :pm_001.
:pm_001 rr:constant schema:name.
:pom_001 rr:objectMap :om_001.
:om_001 a rr:ObjectMap;
    rml:reference "Tracker Name";
    rr:termType rr:Literal.
<http://mapping.example.com/rules_000> void:exampleResource :map_trace_000.
:map_trace_000 rml:logicalSource :source_000;
    a rr:TriplesMap;
    rdfs:label "trace".
:s_001 a rr:SubjectMap.
:map_trace_000 rr:subjectMap :s_001.
:s_001 rr:termType rr:BlankNode.
:pom_002 a rr:PredicateObjectMap.
:map_trace_000 rr:predicateObjectMap :pom_002.
:pm_002 a rr:PredicateMap.
:pom_002 rr:predicateMap :pm_002.
:pm_002 rr:constant <http://www.example.com/trace/destination>.
:pom_002 rr:objectMap :om_002.
:om_002 a rr:ObjectMap;
    rml:reference "daddr";
    rr:termType rr:Literal.
:pom_003 a rr:PredicateObjectMap.
:map_trace_000 rr:predicateObjectMap :pom_003.
:pm_003 a rr:PredicateMap.
:pom_003 rr:predicateMap :pm_003.
:pm_003 rr:constant <http://www.example.com/trace/time>.
:pom_003 rr:objectMap :om_003.
:om_003 a rr:ObjectMap;
    rml:reference "time";
    rr:termType rr:Literal.
:pom_004 a rr:PredicateObjectMap.
:map_trace_000 rr:predicateObjectMap :pom_004.
:pm_004 a rr:PredicateMap.
:pom_004 rr:predicateMap :pm_004.
:pm_004 rr:constant <http://www.example.com/trace/tracker/category>.
:pom_004 rr:objectMap :om_004.
:om_004 a rr:ObjectMap;
    rml:reference "Tracker Category";
    rr:termType rr:Literal.
:pom_005 a rr:PredicateObjectMap.
:map_trace_000 rr:predicateObjectMap :pom_005.
:pm_005 a rr:PredicateMap.
:pom_005 rr:predicateMap :pm_005.
:pm_005 rr:constant <http://www.example.com/trace/tracker/organization>.
:pom_005 rr:objectMap :om_005.
:om_005 a rr:ObjectMap;
    rr:parentTriplesMap :map_organization_000;
    rr:joinCondition :jc_000.
:jc_000 rr:child "Tracker Name";
    rr:parent "Tracker Name".

Then I used RocketRML with ignoreEmptyStrings: true to produce the following RDF:

<http://schema.org/Organization/Google> <http://schema.org/name> "Google" .
<http://schema.org/Organization/Google> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Organization> .
_:b0 <http://www.example.com/trace/destination> "020c0d61fbf479582f3978d6591124f8.safeframe.googlesyndication.com" .
_:b0 <http://www.example.com/trace/time> "1626039379622" .
_:b0 <http://www.example.com/trace/tracker/category> "FingerprintingGeneral" .
_:b0 <http://www.example.com/trace/tracker/organization> <http://schema.org/Organization/Google> .
_:b1 <http://www.example.com/trace/destination> "031c65b448f24f13ef57a0b36248d50d.safeframe.googlesyndication.com" .
_:b1 <http://www.example.com/trace/time> "1626037774624" .
_:b1 <http://www.example.com/trace/tracker/category> "FingerprintingGeneral" .
_:b1 <http://www.example.com/trace/tracker/organization> <http://schema.org/Organization/Google> .
_:b2 <http://www.example.com/trace/destination> "1842e48d40e3ebd9def3444c0b2b8cbd.safeframe.googlesyndication.com" .
_:b2 <http://www.example.com/trace/time> "1625968632859" .
_:b2 <http://www.example.com/trace/tracker/category> "FingerprintingGeneral" .
_:b2 <http://www.example.com/trace/tracker/organization> <http://schema.org/Organization/Google> .
_:b3 <http://www.example.com/trace/destination> "192.168.210.1" .
_:b3 <http://www.example.com/trace/time> "1625990103523" .
_:b4 <http://www.example.com/trace/destination> "20min.ch" .
_:b4 <http://www.example.com/trace/time> "1625968616583" .
_:b5 <http://www.example.com/trace/destination> "20min.ch" .
_:b5 <http://www.example.com/trace/time> "1625968928420" .

The nodes _:b3, _:b4 and _:b5 correspond to lines with an empty string in the Tracker Name field.

ThibaultGerrier · 2021-07-19T19:24:59Z

The RDF output looks as expected to me, what did you want the output to look like?

Did you want the RDF to not contain b3 - b5 because they do not contain the Tracker Name field? I'm afraid that is not so easy. With JSON or XML it would have been a matter of adding a filter to the iterator (e.g. with xpath /elements["Tracker Name"] or jsonpath $.*[?(@["Tracker Name"])]). But with CSV that is not possible, as CSV by default iterates over all lines of the document.
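
One possible workaround, which is not a RocketRML feature, is to drop such rows before the data reaches the mapper. A minimal sketch, assuming the CSV has already been parsed into plain objects; the `rows` array and the `hasValue` helper are hypothetical:

```javascript
// Hypothetical rows, as a CSV parser might produce them from the sample file.
const rows = [
  {
    daddr: '020c0d61fbf479582f3978d6591124f8.safeframe.googlesyndication.com',
    time: '1626039379622',
    'Tracker Name': 'Google',
    'Tracker Category': 'FingerprintingGeneral',
  },
  { daddr: '192.168.210.1', time: '1625990103523', 'Tracker Name': '', 'Tracker Category': '' },
  { daddr: '20min.ch', time: '1625968616583', 'Tracker Name': '', 'Tracker Category': '' },
];

// True when the column holds a non-empty, non-whitespace string.
function hasValue(row, column) {
  const value = row[column];
  return typeof value === 'string' && value.trim() !== '';
}

// Keep only the rows that name a tracker; the rest never reach the mapping step.
const tracked = rows.filter((row) => hasValue(row, 'Tracker Name'));
console.log(tracked.map((row) => row.daddr));
```

The filtered rows would then have to be written back to a CSV file (or passed in some other way) before running the mapping.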

As for the ignoreEmptyStrings feature, it's not documented very well, but what it does is ignore values in the input that are empty strings (or only whitespace) - as in, empty strings are treated the same way as if the value was not present.
E.g. with ignoreEmptyStrings:true the following are equal:

{
  "foo": "",
  "bar": "baz"
}

{
  "bar": "baz"
}

while with ignoreEmptyStrings:false (the default) they are not (using foo in the mapping will result in an empty string literal in the RDF).
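
The equivalence described above can be sketched as follows. This is only an illustration of the semantics; `dropEmptyStrings` is a hypothetical helper, not the actual implementation:

```javascript
// Hypothetical helper: remove every key whose value is an empty or
// whitespace-only string, so the key behaves as if it were absent.
function dropEmptyStrings(record) {
  return Object.fromEntries(
    Object.entries(record).filter(
      ([, value]) => typeof value !== 'string' || value.trim() !== '',
    ),
  );
}

const withEmptyFoo = { foo: '', bar: 'baz' };
const withoutFoo = { bar: 'baz' };

// After normalisation both inputs look the same to the mapper.
const normalised = dropEmptyStrings(withEmptyFoo);
console.log(JSON.stringify(normalised) === JSON.stringify(withoutFoo)); // true
```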

With CSV it makes less sense than with JSON/XML: does the CSV line ,,, mean 4 empty strings or 4 undefined values? So far the behavior was to treat them as undefined values, and thus ignoreEmptyStrings does not have any impact on CSV mappings (besides also ignoring extra whitespace " "," "," ").

It could make sense to actually treat ,,, as empty strings and so with ignoreEmptyStrings:false you'd get empty strings in the RDF. Or was this what you actually wanted? To get empty string literals in the RDF?

valentinoli · 2021-07-19T19:41:57Z

Did you want the RDF to not contain b3 - b5 because they do not contain the Tracker Name field?

Yes, exactly. Thank you for the pointers. I understand why it is not possible for CSV.

Thank you for explaining ignoreEmptyStrings, it is much clearer now. You also answered how it operates on CSV data, which I appreciate. As I understand it, empty values in CSV are treated as undefined, so they do not appear in the output even when a mapping is declared for them, regardless of ignoreEmptyStrings.
I appreciate your time and the prompt and clear responses!
