Optional skolemize blank nodes on parse #2736

edmondchuc · 2024-03-18T02:24:28Z

I have a use case where I need to preserve the blank node identifiers when loading data into a Graph object. To do this, I'd like an option on the rdflib.Graph.parse method to either provide a custom format (like ntriples-skolem) or a flag on the parse method (skolemize=True) to skolemize blank nodes before adding the statements into the graph.

This is needed because RDF blank node identifiers are scoped to the local document. As soon as the document is read into a new system (like an RDFLib graph object), each blank node is assigned a new identifier, so there's no guarantee that the original identifiers are preserved.

Some pseudocode usage:

from rdflib import Graph
from rdflib.compare import isomorphic

skolem_graph = Graph().parse("data.nt", format="ntriples", skolemize=True)
graph = Graph().parse("data.nt", format="ntriples")

assert isomorphic(skolem_graph.de_skolemize(), graph)

# I can use skolem_graph across systems with the blank node identifiers preserved from the original data.nt file.
skolem_graph.serialize(format="ntriples")
...

WhiteGobo · 2024-03-19T10:24:39Z

I'll look into this, but it seems to me that we would have to work on both the store and the parser for that.

I haven't tried this and I'm sure there are some problems with it, but:
Have you tried other means to skolemize your graph? For example, creating a skolemized version of your graph by hand and reusing the resulting bnode_context?

Something like this:

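from typing import Dict, MutableMapping

from rdflib import BNode, Graph, URIRef
from rdflib.compare import isomorphic

# Filled by the parser: blank node label in the file -> BNode it created.
bnode_context_A: MutableMapping[str, BNode] = {}
in_graph = Graph().parse("data.nt", format="ntriples", bnode_context=bnode_context_A)

# BNode -> skolem IRI, built while copying the triples.
bnode_context_B: Dict[BNode, URIRef] = {}
skolem_graph = Graph()
for ax in in_graph:
    for x in ax:
        if isinstance(x, BNode) and x not in bnode_context_B:
            bnode_context_B[x] = x.skolemize()
    # Graph.add expects a tuple, not a generator.
    skolem_graph.add(tuple(bnode_context_B.get(x, x) for x in ax))

# Original label in the file -> skolem IRI.
bnode_context = {k: bnode_context_B[v] for k, v in bnode_context_A.items()}

graph = Graph().parse("data.nt", format="ntriples")

assert isomorphic(in_graph, graph)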

I haven't looked into how to get this to work yet:

# I can use skolem_graph across systems with the blank node identifiers preserved from the original data.nt file.
skolem_graph.serialize(format="ntriples")

But you should be able to load now with persistent skolemization:

# This should be the same graph as skolem_graph:
new_graph = Graph().parse("data.nt", format="ntriples", bnode_context=bnode_context)

edmondchuc · 2024-03-20T14:09:24Z

Perhaps this runnable example will explain it more clearly.

from rdflib import Graph
from rdflib.compare import isomorphic

data = """
    <urn:object> <urn:hasPart> _:internal-bnode-id-1 .
    _:internal-bnode-id-1 <urn:value> "..." .
"""

skolem_graph = Graph().parse(data=data, format="ntriples").skolemize()
graph = Graph().parse(data=data, format="ntriples")

assert isomorphic(skolem_graph.de_skolemize(), graph)

# The output should contain the skolem IRI
# <https://rdflib.github.io/.well-known/genid/rdflib/internal-bnode-id-1>
# but instead, we get something like:
#
#     <https://rdflib.github.io/.well-known/genid/rdflib/N19d54f84f7e84ba8a270ddb627e92cdb> <urn:value> "..." .
#     <urn:object> <urn:hasPart> <https://rdflib.github.io/.well-known/genid/rdflib/N19d54f84f7e84ba8a270ddb627e92cdb> .
#
# where N19d54f84f7e84ba8a270ddb627e92cdb is the remapped blank node id by RDFLib.
skolem_graph.print(format="ntriples")

If we are able to skolemize blank nodes at parse time, we should expect an output like this:

<urn:object> <urn:hasPart> <https://rdflib.github.io/.well-known/genid/rdflib/internal-bnode-id-1> .
<https://rdflib.github.io/.well-known/genid/rdflib/internal-bnode-id-1> <urn:value> "..." .

Essentially, without a change to the logic at parse time, it's impossible to skolemize blank nodes and preserve the identifiers in the original data.
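As an illustration of what skolemizing before the parser remaps anything would amount to, the sketch below rewrites blank node labels in the raw N-Triples text into skolem IRIs. This is not RDFLib code: `SKOLEM_BASE`, `TOKEN`, and `skolemize_ntriples` are hypothetical names, the base IRI is assumed to be the one shown in the output above, and the label pattern is a simplification of the N-Triples grammar that ignores comment lines.

```python
import re

# Assumption: the same skolem IRI prefix that appears in the output above.
SKOLEM_BASE = "https://rdflib.github.io/.well-known/genid/rdflib/"

# IRIs and string literals are matched first so that a "_:" inside them is
# left alone; the last alternative is a simplified blank node label.
TOKEN = re.compile(
    r'(?P<iri><[^>]*>)'
    r'|(?P<literal>"(?:[^"\\]|\\.)*")'
    r'|_:(?P<label>[A-Za-z0-9_](?:[A-Za-z0-9_.-]*[A-Za-z0-9_-])?)'
)


def skolemize_ntriples(text: str, base: str = SKOLEM_BASE) -> str:
    """Replace every blank node label in N-Triples text with a skolem IRI."""

    def swap(match: re.Match) -> str:
        label = match.group("label")
        if label is None:
            return match.group(0)  # IRI or literal: keep unchanged
        return f"<{base}{label}>"

    return TOKEN.sub(swap, text)


data = """<urn:object> <urn:hasPart> _:internal-bnode-id-1 .
_:internal-bnode-id-1 <urn:value> "_:not-a-bnode" .
"""
skolemized = skolemize_ntriples(data)
print(skolemized)
```

Because the labels become IRIs before any parser sees them, the original identifiers survive loading into any RDF system. Doing the same inside the parser avoids the extra text pass and the fragile regular expression.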

WhiteGobo · 2024-03-25T20:51:43Z

Would it be enough to use an identity mapping for bnode_context?

from typing import MutableMapping

from rdflib import Graph, BNode
from rdflib.compare import isomorphic

data = """
    <urn:object> <urn:hasPart> _:internal-bnode-id-1 .
    _:internal-bnode-id-1 <urn:value> "..." .
"""

class IdMap(MutableMapping[str, BNode]):
    def __init__(self, dct=None):
        self.dct = {} if dct is None else dct

    def __getitem__(self, key: str) -> BNode:
        # Return a BNode carrying the original label, creating it on first lookup.
        return self.dct.setdefault(key, BNode(key))

    def __setitem__(self, key: str, value: BNode):
        self.dct[key] = value

    def __delitem__(self, key: str):
        return self.dct.__delitem__(key)

    def __iter__(self):
        return iter(self.dct)

    def __len__(self) -> int:
        return len(self.dct)


skolem_graph = Graph().parse(data=data, format="ntriples", bnode_context=IdMap())
for x in skolem_graph:
    print(x)

I'm not sure how to make a transparent implementation of skolemization during parsing. I would rather invest time into the documentation of skolemization in rdflib and have a recipe for this somewhere.

edmondchuc · 2024-07-01T10:17:01Z

Thank you @WhiteGobo for your example. I dug into the code a bit and looked into the history of why the bnode_context was added and you're right, this allows us to preserve the blank node identifiers across multiple parses.

I agree with you, the current API is not very transparent, and yes, it would be nice to have recipes, but I still think a change to the API is beneficial here.

edmondchuc mentioned this issue Jul 2, 2024

Add skolemization support for ntriples, nquads, hextuples and json-ld support at parse time #2816

Merged

8 tasks

nicholascar closed this as completed in #2816 Jul 28, 2024