Bug: CSVParser.getData returns String and not Array #21

Open
valentinoli opened this issue Jul 19, 2021 · 4 comments

@valentinoli

valentinoli commented Jul 19, 2021

Hello, I noticed a bug in this line:

return this.data[index][selector];

It returns a String, not an Array.

Bit of context:
I'm trying to use the ignoreEmptyStrings: true parameter, which results in an error because the value returned by the line above is passed to this line, which expects an Array:

return values.filter((v) => v.trim() !== '');

However, even after I fix that bug and apply the parameter, the empty strings are not ignored: the CSV records with empty strings in them are still processed and triples are produced.
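
A minimal sketch of the failure described above, assuming the filter line receives a plain String instead of an Array. The helpers toArray and filterEmpty are hypothetical names used for illustration; they are not RocketRML's actual code, and the real fix may look different.

```javascript
// Hypothetical helpers for illustration; not RocketRML's actual implementation.

// Wrap a single value in an array so array methods are always available.
function toArray(value) {
  if (value === undefined || value === null) return [];
  return Array.isArray(value) ? value : [value];
}

// Drop empty or whitespace-only strings, mirroring the filter quoted above.
function filterEmpty(values) {
  return toArray(values).filter((v) => String(v).trim() !== '');
}

// A plain String has no filter method, so the unguarded call throws a TypeError.
let threw = false;
try {
  'Google'.filter((v) => v.trim() !== '');
} catch (err) {
  threw = err instanceof TypeError;
}

console.log(threw);
console.log(filterEmpty('Google'));
console.log(filterEmpty(['a', ' ', '']));
```

Normalizing the value to an Array before filtering is one way to make the filter safe for both single values and lists.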

@ThibaultGerrier
Collaborator

Oh yes, ignoreEmptyStrings not working with CSV is a bug.
Fixed with 95f863f & v1.10.2

However, even after I fix that bug and apply the parameter, the empty strings are not ignored: the CSV records with empty strings in them are still processed and triples are produced.

Could you provide a sample CSV? I tried one of my own with empty fields and it worked just fine.

@valentinoli
Author

Great! Thank you

Here is the CSV:

"daddr","time","Tracker Name","Tracker Category"
"020c0d61fbf479582f3978d6591124f8.safeframe.googlesyndication.com","1626039379622","Google","FingerprintingGeneral"
"031c65b448f24f13ef57a0b36248d50d.safeframe.googlesyndication.com","1626037774624","Google","FingerprintingGeneral"
"1842e48d40e3ebd9def3444c0b2b8cbd.safeframe.googlesyndication.com","1625968632859","Google","FingerprintingGeneral"
"192.168.210.1","1625990103523","",""
"20min.ch","1625968616583","",""
"20min.ch","1625968928420","",""

And here is the YARRRML:

base: http://sti2.at/
prefixes:
  ex: http://www.example.com/
  schema: http://schema.org/

sources:
  source: [trackercontrol.csv~csv]

mappings:
  organization:
    sources: source
    s: schema:Organization/$(Tracker Name)
    po:
      - [a, schema:Organization]
      - [schema:name, $(Tracker Name)]

  trace:
    sources: source
    po:
      - [ex:trace/destination, $(daddr)]
      - [ex:trace/time, $(time)]
      - [ex:trace/tracker/category, $(Tracker Category)]
      - p: ex:trace/tracker/organization
        o:
          mapping: organization
          condition:
            - function: equal
              parameters:
                - [str1, $(Tracker Name), s]
                - [str2, $(Tracker Name), o]

which I converted into the following RML using yarrrml-parser:

@prefix rr: <http://www.w3.org/ns/r2rml#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix fnml: <http://semweb.mmlab.be/ns/fnml#>.
@prefix fno: <https://w3id.org/function/ontology#>.
@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#>.
@prefix void: <http://rdfs.org/ns/void#>.
@prefix dc: <http://purl.org/dc/terms/>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix rml: <http://semweb.mmlab.be/ns/rml#>.
@prefix ql: <http://semweb.mmlab.be/ns/ql#>.
@prefix : <http://sti2.at/>.
@prefix ex: <http://www.example.com/>.
@prefix schema: <http://schema.org/>.
@prefix idlab-fn: <http://example.com/idlab/function/>.
@prefix grel: <http://users.ugent.be/~bjdmeest/function/grel.ttl#>.

<http://mapping.example.com/rules_000> a void:Dataset.
:source_000 a rml:LogicalSource;
    rdfs:label "source";
    rml:source "./tracker-control/data/trackercontrol_firefox.csv";
    rml:referenceFormulation ql:CSV.
<http://mapping.example.com/rules_000> void:exampleResource :map_organization_000.
:map_organization_000 rml:logicalSource :source_000;
    a rr:TriplesMap;
    rdfs:label "organization".
:s_000 a rr:SubjectMap.
:map_organization_000 rr:subjectMap :s_000.
:s_000 rr:template "http://schema.org/Organization/{Tracker Name}".
:pom_000 a rr:PredicateObjectMap.
:map_organization_000 rr:predicateObjectMap :pom_000.
:pm_000 a rr:PredicateMap.
:pom_000 rr:predicateMap :pm_000.
:pm_000 rr:constant rdf:type.
:pom_000 rr:objectMap :om_000.
:om_000 a rr:ObjectMap;
    rr:constant "http://schema.org/Organization";
    rr:termType rr:IRI.
:pom_001 a rr:PredicateObjectMap.
:map_organization_000 rr:predicateObjectMap :pom_001.
:pm_001 a rr:PredicateMap.
:pom_001 rr:predicateMap :pm_001.
:pm_001 rr:constant schema:name.
:pom_001 rr:objectMap :om_001.
:om_001 a rr:ObjectMap;
    rml:reference "Tracker Name";
    rr:termType rr:Literal.
<http://mapping.example.com/rules_000> void:exampleResource :map_trace_000.
:map_trace_000 rml:logicalSource :source_000;
    a rr:TriplesMap;
    rdfs:label "trace".
:s_001 a rr:SubjectMap.
:map_trace_000 rr:subjectMap :s_001.
:s_001 rr:termType rr:BlankNode.
:pom_002 a rr:PredicateObjectMap.
:map_trace_000 rr:predicateObjectMap :pom_002.
:pm_002 a rr:PredicateMap.
:pom_002 rr:predicateMap :pm_002.
:pm_002 rr:constant <http://www.example.com/trace/destination>.
:pom_002 rr:objectMap :om_002.
:om_002 a rr:ObjectMap;
    rml:reference "daddr";
    rr:termType rr:Literal.
:pom_003 a rr:PredicateObjectMap.
:map_trace_000 rr:predicateObjectMap :pom_003.
:pm_003 a rr:PredicateMap.
:pom_003 rr:predicateMap :pm_003.
:pm_003 rr:constant <http://www.example.com/trace/time>.
:pom_003 rr:objectMap :om_003.
:om_003 a rr:ObjectMap;
    rml:reference "time";
    rr:termType rr:Literal.
:pom_004 a rr:PredicateObjectMap.
:map_trace_000 rr:predicateObjectMap :pom_004.
:pm_004 a rr:PredicateMap.
:pom_004 rr:predicateMap :pm_004.
:pm_004 rr:constant <http://www.example.com/trace/tracker/category>.
:pom_004 rr:objectMap :om_004.
:om_004 a rr:ObjectMap;
    rml:reference "Tracker Category";
    rr:termType rr:Literal.
:pom_005 a rr:PredicateObjectMap.
:map_trace_000 rr:predicateObjectMap :pom_005.
:pm_005 a rr:PredicateMap.
:pom_005 rr:predicateMap :pm_005.
:pm_005 rr:constant <http://www.example.com/trace/tracker/organization>.
:pom_005 rr:objectMap :om_005.
:om_005 a rr:ObjectMap;
    rr:parentTriplesMap :map_organization_000;
    rr:joinCondition :jc_000.
:jc_000 rr:child "Tracker Name";
    rr:parent "Tracker Name".

Then I used RocketRML with ignoreEmptyStrings: true to produce the following RDF:

<http://schema.org/Organization/Google> <http://schema.org/name> "Google" .
<http://schema.org/Organization/Google> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Organization> .
_:b0 <http://www.example.com/trace/destination> "020c0d61fbf479582f3978d6591124f8.safeframe.googlesyndication.com" .
_:b0 <http://www.example.com/trace/time> "1626039379622" .
_:b0 <http://www.example.com/trace/tracker/category> "FingerprintingGeneral" .
_:b0 <http://www.example.com/trace/tracker/organization> <http://schema.org/Organization/Google> .
_:b1 <http://www.example.com/trace/destination> "031c65b448f24f13ef57a0b36248d50d.safeframe.googlesyndication.com" .
_:b1 <http://www.example.com/trace/time> "1626037774624" .
_:b1 <http://www.example.com/trace/tracker/category> "FingerprintingGeneral" .
_:b1 <http://www.example.com/trace/tracker/organization> <http://schema.org/Organization/Google> .
_:b2 <http://www.example.com/trace/destination> "1842e48d40e3ebd9def3444c0b2b8cbd.safeframe.googlesyndication.com" .
_:b2 <http://www.example.com/trace/time> "1625968632859" .
_:b2 <http://www.example.com/trace/tracker/category> "FingerprintingGeneral" .
_:b2 <http://www.example.com/trace/tracker/organization> <http://schema.org/Organization/Google> .
_:b3 <http://www.example.com/trace/destination> "192.168.210.1" .
_:b3 <http://www.example.com/trace/time> "1625990103523" .
_:b4 <http://www.example.com/trace/destination> "20min.ch" .
_:b4 <http://www.example.com/trace/time> "1625968616583" .
_:b5 <http://www.example.com/trace/destination> "20min.ch" .
_:b5 <http://www.example.com/trace/time> "1625968928420" .

The blank nodes _:b3, _:b4, and _:b5 correspond to lines with an empty string in the Tracker Name field.

@ThibaultGerrier
Collaborator

The RDF output looks as expected to me, what did you want the output to look like?

Did you want the RDF to not contain b3 - b5 because they do not contain the Tracker Name field? I'm afraid that is not so easy. With JSON or XML it would have been a matter of adding a filter to the iterator (e.g. with xpath /elements["Tracker Name"] or jsonpath $.*[?(@["Tracker Name"])]). With CSV I'm afraid that is not possible, as the CSV iterator by default goes over all lines of the document.
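
Since the CSV iterator cannot filter, one possible workaround (not suggested in this thread, so treat it as an assumption) is to drop the unwanted records before the mapping runs. The sketch below assumes the CSV has already been parsed into an array of objects keyed by column name; keepRowsWithValue is a hypothetical helper and RocketRML is not involved.

```javascript
// Hypothetical pre-processing step; RocketRML itself is not involved here.
// Keeps only records whose given column holds a non-blank string.
function keepRowsWithValue(rows, column) {
  return rows.filter((row) => {
    const value = row[column];
    return typeof value === 'string' && value.trim() !== '';
  });
}

// Three records taken from the sample CSV in this issue.
const rows = [
  {
    daddr: '020c0d61fbf479582f3978d6591124f8.safeframe.googlesyndication.com',
    time: '1626039379622',
    'Tracker Name': 'Google',
    'Tracker Category': 'FingerprintingGeneral',
  },
  { daddr: '192.168.210.1', time: '1625990103523', 'Tracker Name': '', 'Tracker Category': '' },
  { daddr: '20min.ch', time: '1625968616583', 'Tracker Name': '', 'Tracker Category': '' },
];

const kept = keepRowsWithValue(rows, 'Tracker Name');
console.log(kept.length);
console.log(kept[0].daddr);
```

The filtered records would then have to be written back to a CSV file that the mapping's logical source points at.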

As for the ignoreEmptyStrings feature, it's not documented very well, but what it does is ignore values in the input that are empty strings (or only whitespace) - as in, empty strings are treated the same way as if the value was not present.
E.g. with ignoreEmptyStrings:true the following are equal:

{
  "foo": "",
  "bar": "baz"
}

{
  "bar": "baz"
}

while with ignoreEmptyStrings:false (the default) they are not (using foo in the mapping will result in an empty string literal in the RDF).
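
The equivalence of the two JSON objects can be sketched roughly as follows. This only illustrates the described semantics on a flat record; dropEmptyStrings is a hypothetical helper, not the function RocketRML uses internally.

```javascript
// Hypothetical illustration of the described semantics; not RocketRML's code.
// Removes keys whose value is an empty or whitespace-only string.
function dropEmptyStrings(record) {
  return Object.fromEntries(
    Object.entries(record).filter(
      ([, value]) => !(typeof value === 'string' && value.trim() === '')
    )
  );
}

const withEmpty = { foo: '', bar: 'baz' };
const withoutFoo = { bar: 'baz' };

// With empty strings dropped, both inputs look the same to the mapping.
const sameAfterDrop =
  JSON.stringify(dropEmptyStrings(withEmpty)) === JSON.stringify(withoutFoo);
console.log(sameAfterDrop);
```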

With CSV it makes less sense than with JSON/XML: does the CSV line ,,, mean 4 empty strings or 4 undefined values? So far the behavior has been to treat them as undefined values, so ignoreEmptyStrings does not have any impact on CSV mappings (besides also ignoring whitespace-only values such as " "," "," ").

It could make sense to actually treat ,,, as empty strings and so with ignoreEmptyStrings:false you'd get empty strings in the RDF. Or was this what you actually wanted? To get empty string literals in the RDF?
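
The two readings of the line ,,, can be made concrete with a naive split. This is only an illustration of the ambiguity: it does not handle quoted fields and is not how RocketRML parses CSV.

```javascript
// Naive split for illustration only; it does not handle quoted fields.
const fields = ',,,'.split(',');

// Reading 1: every field is an empty string.
const asEmptyStrings = fields;

// Reading 2: every empty field is treated as a missing (undefined) value.
const asUndefined = fields.map((field) => (field === '' ? undefined : field));

console.log(asEmptyStrings.length);
console.log(asEmptyStrings.every((field) => field === ''));
console.log(asUndefined.every((field) => field === undefined));
```

Under the second reading, which matches the behavior described above, there are no values to map, so no triples are produced for those fields.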

@valentinoli
Author

valentinoli commented Jul 19, 2021

Did you want the RDF to not contain b3 - b5 because they do not contain the Tracker Name field?

Yes, exactly. Thank you for the pointers. I understand why it is not possible for CSV.

Thank you for better explaining ignoreEmptyStrings; it is much clearer now. You also answered how it operates on CSV data, which is appreciated. In my understanding, undefined/empty values in CSV are treated as undefined, so they do not appear in the output even though a mapping is declared, regardless of ignoreEmptyStrings.
I appreciate your time and the prompt and clear responses!
