List datasets #10

Merged
merged 8 commits into from
Jul 19, 2021
Changes from 4 commits
71 changes: 71 additions & 0 deletions src/openml.jl
@@ -257,6 +257,77 @@ function load_List_And_Filter(filters::String; api_key::String = "")
    return nothing
end

qualitynames(x) = haskey(x, "name") ? [x["name"]] : []

"""
    list_datasets(filter = ""; api_key = "", output_format = NamedTuple)

List OpenML datasets. See [`load_List_And_Filter`](@ref) for the format of
the filter. As an alternative `output_format` one can choose other table types,
like `DataFrame`, if the `DataFrames` package is loaded.

# Examples
```
julia> using DataFrames

julia> ds = MLJOpenML.list_datasets("/tag/OpenML100/", output_format = DataFrame)

julia> sort!(ds, :NumberOfFeatures)
```
"""
function list_datasets(filter = ""; api_key = "", output_format = NamedTuple)
    data = MLJOpenML.load_List_And_Filter(filter; api_key)
    datasets = data["data"]["dataset"]
    qualities = Symbol.(union(vcat([vcat(qualitynames.(entry["quality"])...) for entry in datasets]...)))
    result = merge((id = Int[], name = String[], status = String[]),
                   NamedTuple{tuple(qualities...)}(ntuple(i -> Union{Missing, Int}[], length(qualities))))
    for entry in datasets
        push!(result.id, entry["did"])
        push!(result.name, entry["name"])
        push!(result.status, entry["status"])
        for quality in entry["quality"]
            push!(getproperty(result, Symbol(quality["name"])),
                  Meta.parse(quality["value"]))
        end
        for quality in qualities
            if length(getproperty(result, quality)) < length(result.id)
                push!(getproperty(result, quality), missing)
            end
        end
    end
    output_format(result)
end

"""
    describe_dataset(id)

Load and show the OpenML description of the data set `id`.
Use [`list_datasets`](@ref) to browse available data sets.

# Examples
```
julia> MLJOpenML.describe_dataset(6)
**Author**: David J. Slate
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Letter+Recognition) - 01-01-1991
**Please cite**: P. W. Frey and D. J. Slate. "Letter Recognition Using Holland-style Adaptive Classifiers". Machine Learning 6(2), 1991

1. TITLE:
Letter Image Recognition Data

The objective is to identify each of a large number of black-and-white
rectangular pixel displays as one of the 26 capital letters in the English
alphabet. The character images were based on 20 different fonts and each
letter within these 20 fonts was randomly distorted to produce a file of
20,000 unique stimuli. Each stimulus was converted into 16 primitive
numerical attributes (statistical moments and edge counts) which were then
scaled to fit into a range of integer values from 0 through 15. We
typically train on the first 16000 items and then use the resulting model
to predict the letter category for the remaining 4000. See the article
cited above for more details.
```
"""
describe_dataset(id) = Text(load_Dataset_Description(id)["data_set_description"]["description"])

Member

Instead of wrapping the string (?) in `Text(...)`, how about applying `Markdown.parse(...)` to get proper markdown display? You will need to add `using Markdown`.

[Screenshot: Screen Shot 2021-07-14 at 11 02 57 AM]

Collaborator Author

Good point, thanks. I changed this in the latest commit.
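
For readers following the thread, the difference between the two wrappers can be sketched as below. This is a minimal illustration rather than the code from the later commit: the `description` string is a stand-in built from the first lines of the docstring example above, and the variable names `as_text` and `as_md` are made up for this sketch. Only `Text` (from Base) and `Markdown.parse` (from the `Markdown` standard library) are real calls.

```julia
using Markdown

# Stand-in for the string returned by the OpenML API; in the PR it comes from
# load_Dataset_Description(id)["data_set_description"]["description"].
description = """
**Author**: David J. Slate
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Letter+Recognition) - 01-01-1991
"""

# Text wraps the string so it is printed verbatim, markup characters included.
as_text = Text(description)

# Markdown.parse returns a Markdown.MD object, which the REPL renders as
# formatted markdown instead of showing the raw asterisks and brackets.
as_md = Markdown.parse(description)

display(as_text)
display(as_md)
```

With this approach `describe_dataset` would return a `Markdown.MD` value, so callers that need the raw string would have to fetch the description field themselves.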

# Flow API

# Task API