Make easy to load default datasets #269

Mec-iS · 2022-09-06T16:04:37Z

I'm submitting a

feature request.

Current Behaviour:

It is hard to load any of the default datasets.

Expected Behaviour:

There should be a straightforward way of loading existing datasets, for example:

kg = KnowledgeGraph()
load_dataset("wikings-families", kg=kg)

Every dataset should have a name that, when passed to load_dataset, automatically imports the dataset into a given graph, as the scikit-network load collection does, for example.


ceteri · 2022-09-06T23:50:12Z

That's a helpful feature.
It's specific to scikit-network and should be denoted as that in the method name.

Two concerns:

1. We need to keep our serialization methods following a similar pattern:
   - Loads get applied to a graph
   - The effects of loads are cumulative (although does this make sense for scikit-network datasets?)
2. File locators get passed as PathLike, to allow for working consistently with non-Posix systems, such as cloud storage buckets

Instead I would use a pattern such as:

kg = KnowledgeGraph()
path = pathlib.Path("wikings-families")
kg.load_scikit_dataset(path)
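
To illustrate the two properties named above (loads are applied to a graph and accumulate, and file locators are passed as PathLike), here is a small sketch that uses a stand-in class rather than kglab itself. `TinyGraph`, its `load` method, and the one-triple-per-line file format are all hypothetical and exist only for this illustration; the sketch handles local paths only.

```python
import os
import tempfile
from pathlib import Path
from typing import Set, Tuple

Triple = Tuple[str, str, str]


class TinyGraph:
    """Stand-in for a knowledge graph; this is NOT the kglab class."""

    def __init__(self) -> None:
        self.triples: Set[Triple] = set()

    def load(self, path: "os.PathLike[str]") -> "TinyGraph":
        # Hypothetical format: one "subject predicate object" triple per line.
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                parts = line.split()
                if len(parts) == 3:
                    # Adding to the existing set makes each load cumulative.
                    self.triples.add((parts[0], parts[1], parts[2]))
        return self


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "first.txt"
    second = Path(tmp) / "second.txt"
    first.write_text("ex:a ex:knows ex:b\n", encoding="utf-8")
    second.write_text("ex:b ex:knows ex:c\n", encoding="utf-8")

    kg = TinyGraph()
    # The second load adds to what the first one put in the graph.
    kg.load(first).load(second)
    count = len(kg.triples)
```

Because the loader only relies on the path being usable with `open`, the calling code does not need to know how the locator was built.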

BTW, this reminded me that the cloudpathlib library which our team uses elsewhere has become more general than the urlpath library which we used here in kglab, and we'll need to make that update throughout the serialization methods.

Mec-iS · 2022-09-07T09:23:04Z

> It's specific to scikit-network and should be denoted as that in the method name.

No, it is a common pattern used by all the popular libraries; PyTorch and TensorFlow also provide it, for example.

The idea is just to encapsulate all this logic:

from os.path import dirname
import kglab
import os

namespaces = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "gorm": "http://example.org/sagas#",
    "rel":  "http://purl.org/vocab/relationship/",
    }

kg = kglab.KnowledgeGraph(
    name = "Happy Vikings KG example for SKOS/OWL inference",
    namespaces=namespaces,
    )

kg.load_rdf(dirname(dirname(os.getcwd())) + "/dat/gorm.ttl")

into a method, so that the user can avoid knowing all these details.

Taking your notes into account, that could be:

kg = KnowledgeGraph()
load_dataset("wikings-families", kg=kg, path=None, title=None, namespaces=None)

So that parameters can be passed if needed.

Users will still be able to use kg.load_* explicitly if they need to. The new one is just a convenience method for newcomers to quickly load one of the default datasets for experimentation.
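
A minimal sketch of what such a convenience wrapper might look like. The registry `DEFAULT_DATASETS`, the `DatasetSpec` record, and the relative file location are assumptions made for illustration and are not part of kglab. The function accepts any object exposing `load_rdf` and `add_ns`, which are meant to correspond to the kglab methods of the same names, and it only imports kglab when no graph is passed in.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class DatasetSpec:
    """Hypothetical record describing one bundled dataset."""
    filename: str
    title: str
    namespaces: Dict[str, str] = field(default_factory=dict)


# Hypothetical registry; the entry mirrors the example in this thread.
DEFAULT_DATASETS: Dict[str, DatasetSpec] = {
    "wikings-families": DatasetSpec(
        filename="dat/gorm.ttl",  # assumed location relative to the repo root
        title="Happy Vikings KG example for SKOS/OWL inference",
        namespaces={
            "foaf": "http://xmlns.com/foaf/0.1/",
            "gorm": "http://example.org/sagas#",
            "rel": "http://purl.org/vocab/relationship/",
        },
    ),
}


def load_dataset(
    name: str,
    kg: Optional[Any] = None,
    path: Optional[str] = None,
    title: Optional[str] = None,
    namespaces: Optional[Dict[str, str]] = None,
) -> Any:
    """Load a named default dataset into `kg`, creating a graph if needed.

    `title` is only used when a new graph has to be created.
    """
    try:
        spec = DEFAULT_DATASETS[name]
    except KeyError:
        known = ", ".join(sorted(DEFAULT_DATASETS))
        raise ValueError(f"unknown dataset {name!r}; available: {known}") from None

    # Caller-supplied namespaces override the dataset defaults.
    ns = dict(spec.namespaces)
    ns.update(namespaces or {})

    if kg is None:
        import kglab  # lazy import, so the sketch can be tried without kglab
        kg = kglab.KnowledgeGraph(name=title or spec.title, namespaces=ns)
    else:
        for prefix, iri in ns.items():
            kg.add_ns(prefix, iri)

    # An explicit path overrides the registered file location.
    kg.load_rdf(Path(path) if path is not None else Path(spec.filename))
    return kg
```

With this shape, a newcomer would call `load_dataset("wikings-families", kg=kg)`, while the optional arguments remain available for overriding the defaults.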

ceteri · 2022-09-16T22:15:27Z

Thank you @Mec-iS, that helps me understand much better.

I see the point about the convenience method, although arguably this practice creates extra cognitive load, PyTorch being one of the examples cited.

For files used in our tutorials we want to emphasize examples of how to load or save files in storage, ideally as Posix files. The thinking is: this way there are fewer differences to overcome when people try to apply code from our examples to their own projects.

One problem we've encountered during Q&A is that there are namespaces which are difficult to understand, such as the RDF prefix namespace. Moving between different libraries (e.g., RDF vs. NetworkX) also introduces API namespaces to navigate. I'm apprehensive about adding a dataset namespace, since these datasets are only for tutorial examples and not part of the library usage in production.

FWIW, I found this exchange between the fsspec and cloudpathlib communities entertaining :) drivendataorg/cloudpathlib#96

Mec-iS added the good first issue label Sep 6, 2022

ceteri added this to the Machine Learning integration milestone Sep 6, 2022

ceteri removed the good first issue label Sep 16, 2022

Mec-iS closed this as not planned Sep 16, 2022
