Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make easy to load default datasets #269

Closed
1 task done
Mec-iS opened this issue Sep 6, 2022 · 3 comments
Closed
1 task done

Make easy to load default datasets #269

Mec-iS opened this issue Sep 6, 2022 · 3 comments

Comments

@Mec-iS
Copy link
Contributor

Mec-iS commented Sep 6, 2022

I'm submitting a

  • feature request.

Current Behaviour:

It is hard to load any of the default datasets.

Expected Behaviour:

there should be a straighforward way of loading existing datasets, for example:

kg = KnowledgeGraph
load_dataset("wikings-families", kg=kg)

Every dataset should have a name that if passed to load_dataset provides automatic imports of the dataset in a given graph; as for example provided by scikit-network load collection

@Mec-iS Mec-iS added the good first issue Good for newcomers label Sep 6, 2022
@ceteri ceteri added this to the Machine Learning integration milestone Sep 6, 2022
@ceteri
Copy link
Collaborator

ceteri commented Sep 6, 2022

That's a helpful feature.
It's specific to scikit-network and should be denoted as that in the method name.

Two concerns:

  1. We need to keep our serialization methods following a similar pattern:
    • Loads get applied to a graph
    • The effects of loads are cumulative (although does this make sense for scikit-network datasets ?)
  2. File locators get passed as PathLike, to allow for working consistently with non-Posix systems, such as cloud storage buckets

Instead I would use a pattern such as:

kg = KnowledgeGraph()
path = pathlib.Path("wikings-families")
kg.load_scikit_dataset(path)

BTW, this reminded me that the cloudpathlib library which our team uses elsewhere has become more general than the urlpath library which we used here in kglab, and we'll need to make that update throughout the serialization methods.

@Mec-iS
Copy link
Contributor Author

Mec-iS commented Sep 7, 2022

It's specific to scikit-network and should be denoted as that in the method name.

No, it is a common pattern used by all the popular libraries, also pytorch and tensorflow provides it for example

The idea is just to encapsulate all this logic:

from os.path import dirname
import kglab
import os

namespaces = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "gorm": "http://example.org/sagas#",
    "rel":  "http://purl.org/vocab/relationship/",
    }

kg = kglab.KnowledgeGraph(
    name = "Happy Vikings KG example for SKOS/OWL inference",
    namespaces=namespaces,
    )

kg.load_rdf(dirname(dirname(os.getcwd())) + "/dat/gorm.ttl")

into a method, so that the user can avoid knowing all these details.

Accepting your notes that could be:

kg = KnowledgeGraph()
load_dataset("wikings-families", kg=kg, path=None, title=None, namespaces=None)

So that parameters can be passed if needed.

Users will still be able to use kg.load_* explicitly if they need. The new one is just a convenience method for newcomers to quickly load one of the default dataset for experimentation.

@ceteri
Copy link
Collaborator

ceteri commented Sep 16, 2022

Thank you @Mec-iS , that helps me much understand better.

I see about the convenience method, although arguably this is a practice that create extra cognitive load, with PyTorch being an example cited.

For files used in our tutorials we want to emphasize examples of how to load or save files in storage, ideally as Posix files. The thinking is: this way there are less differences to overcome when people try to apply code from our examples for their own projects.

One problem we've encountered during Q&A is that there are namespaces which are difficult to understand, such as the RDF prefix namespace. Moving between different libraries (e.g., RDF vs. NetworkX) also introduces API namespaces to navigate. 'm apprehensive about adding a dataset namespace, since these are only for tutorial example sand not part of the library usage in production.

FWIW, I found this exchange between the fsspec and cloudpathlib communities entertaining :) drivendataorg/cloudpathlib#96

@ceteri ceteri removed the good first issue Good for newcomers label Sep 16, 2022
@Mec-iS Mec-iS closed this as not planned Won't fix, can't repro, duplicate, stale Sep 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants