Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use releases information from PyPI, not GitHub #98

Closed
simonw opened this issue Apr 7, 2022 · 23 comments
Closed

Use releases information from PyPI, not GitHub #98

simonw opened this issue Apr 7, 2022 · 23 comments
Labels
enhancement New feature or request plugin directory

Comments

@simonw
Copy link
Owner

simonw commented Apr 7, 2022

I want to start listing more plugins by other people, for example https://github.com/rclement/datasette-dashboards

The site currently uses GitHub releases for release information - but not every project uses GitHub releases.

The PyPI JSON API provides releases in a way that is guaranteed to work for all packages released to PyPI: https://pypi.org/pypi/datasette-dashboards/json

Switching to this as the primary source of information (while still using GitHub releases data, if available, for changelog notes) would help the plugin directory work against a wider range of plugins.

It could even then work for plugins that don't have their source code hosted on GitHub!

@simonw simonw added enhancement New feature or request plugin directory labels Apr 7, 2022
@simonw
Copy link
Owner Author

simonw commented Apr 7, 2022

I'm going to create a separate git scraping repo that records the PyPI JSON for every one of the plugins.

@simonw
Copy link
Owner Author

simonw commented Apr 7, 2022

@simonw
Copy link
Owner Author

simonw commented Apr 7, 2022

simonw added a commit to simonw/pypi-datasette-packages that referenced this issue Apr 7, 2022
@simonw
Copy link
Owner Author

simonw commented Apr 7, 2022

https://github.com/simonw/pypi-datasette-packages/tree/main/packages now contains the JSON for each one of those.

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2022

I built pypi-to-sqlite to help with this.

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2022

My first attempt at integrating pypi-to-sqlite looked like this:

if [ ! -d /tmp/pypi-datasette-packages ]
then
  git clone https://github.com/simonw/pypi-datasette-packages /tmp/pypi-datasette-packages
else
  (cd /tmp/pypi-datasette-packages && git pull)
fi

args=$(sqlite-utils content.db "
  with packages as (
    select name from plugins union select name from tools
    where name not in ('datasette-app')
  )
  select
    group_concat('-f /tmp/pypi-datasette-packages/packages/' || name || '.json', ' ')
  from packages
" --raw)

eval "pypi-to-sqlite content.db ${args} --prefix pypi_"

But this failed with the following error:

Usage: pypi-to-sqlite [OPTIONS] DB_PATH [PACKAGE]...
Try 'pypi-to-sqlite --help' for help.

Error: Invalid value for '-f' / '--file': '/tmp/pypi-datasette-packages/packages/pypi-to-sqlite.json': No such file or directory

Because pypi-to-sqlite isn't tracked by the pypi-datasette-packages GitHub repo yet.

I don't want builds failing because I forgot to scrape a new package.

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2022

This works:

args=$(ls /tmp/pypi-datasette-packages/packages/*.json | awk '{print "-f "$0 " \\"}')
eval "pypi-to-sqlite content.db $args
--prefix pypi_"

If you run echo instead of eval you get (truncated):

pypi-to-sqlite content.db -f /tmp/pypi-datasette-packages/packages/airtable-export.json \
-f /tmp/pypi-datasette-packages/packages/csv-diff.json \
-f /tmp/pypi-datasette-packages/packages/csvs-to-sqlite.json \
-f /tmp/pypi-datasette-packages/packages/datasette-atom.json \
--prefix pypi_

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2022

simonw added a commit to simonw/pypi-datasette-packages that referenced this issue Apr 8, 2022
@simonw
Copy link
Owner Author

simonw commented Apr 8, 2022

I need a de-duped list of releases based on the PyPI data - which is hard because the versions table doesn't have a date in it, have to instead look at the earliest date for one of the releases attached to that version.

This works: https://datasette.io/content?sql=select+package%2C+version%2C+min(upload_time)+as+release_time+from+pypi_releases+group+by+package%2C+version%0D%0Aorder+by+release_time+desc

select
  package,
  version,
  min(upload_time) as release_time
from
  pypi_releases
group by
  package,
  version
order by
  release_time desc

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2022

Figured out a new query I can use on the homepage that includes releases that do not have release notes on GitHub:

https://datasette.io/content?sql=with+release_note_releases+as+%28%0D%0A++select%0D%0A++++repos.name+as+name%2C%0D%0A++++releases.html_url+as+release_url%2C%0D%0A++++releases.tag_name+as+version%2C%0D%0A++++substr%28releases.published_at%2C+0%2C+11%29+as+date%2C%0D%0A++++releases.body+as+body_markdown%2C%0D%0A++++releases.published_at+as+released_at%2C%0D%0A++++pypi_packages.summary+as+pypi_summary%0D%0A++from%0D%0A++++releases%0D%0A++++join+repos+on+repos.id+%3D+releases.repo%0D%0A++++left+join+pypi_packages+on+pypi_packages.name+%3D+repos.name%0D%0A%29%2C%0D%0Areleases_with_no_notes+as+%28%0D%0A++select%0D%0A++++pypi_releases.package+as+name%2C%0D%0A++++null+as+release_url%2C%0D%0A++++pypi_releases.version+as+version%2C%0D%0A++++substr%28min%28pypi_releases.upload_time%29%2C+0%2C+11%29+as+date%2C%0D%0A++++null+as+body_markdown%2C%0D%0A++++min%28pypi_releases.upload_time%29+as+released_at%2C%0D%0A++++pypi_packages.summary+as+pypi_summary%0D%0A++from%0D%0A++++pypi_releases%0D%0A++++join+pypi_packages+on+pypi_releases.package+%3D+pypi_packages.name%0D%0A++where%0D%0A++++package+not+in+%28%0D%0A++++++select%0D%0A++++++++name%0D%0A++++++from%0D%0A++++++++release_note_releases%0D%0A++++%29%0D%0A++group+by%0D%0A++++pypi_releases.package%2C%0D%0A++++pypi_releases.version%0D%0A%29%0D%0Aselect%0D%0A++*%0D%0Afrom%0D%0A++release_note_releases%0D%0Awhere%0D%0A++name+in+%28%0D%0A++++%27datasette-dashboards%27%2C%0D%0A++++%27pypi-to-sqlite%27%2C%0D%0A++++%27datasette-graphql%27%0D%0A++%29%0D%0A%0D%0Aunion%0D%0Aselect%0D%0A++*%0D%0Afrom%0D%0A++releases_with_no_notes%0D%0Awhere%0D%0A++name+in+%28%0D%0A++++%27datasette-dashboards%27%2C%0D%0A++++%27pypi-to-sqlite%27%2C%0D%0A++++%27datasette-graphql%27%0D%0A++%29%0D%0Aorder+by%0D%0A++released_at+desc%0D%0Alimit%0D%0A++100

with release_note_releases as (
  select
    repos.name as name,
    releases.html_url as release_url,
    releases.tag_name as version,
    substr(releases.published_at, 0, 11) as date,
    releases.body as body_markdown,
    releases.published_at as released_at,
    pypi_packages.summary as pypi_summary
  from
    releases
    join repos on repos.id = releases.repo
    left join pypi_packages on pypi_packages.name = repos.name
),
releases_with_no_notes as (
  select
    pypi_releases.package as name,
    null as release_url,
    pypi_releases.version as version,
    substr(min(pypi_releases.upload_time), 0, 11) as date,
    null as body_markdown,
    min(pypi_releases.upload_time) as released_at,
    pypi_packages.summary as pypi_summary
  from
    pypi_releases
    join pypi_packages on pypi_releases.package = pypi_packages.name
  where
    package not in (
      select
        name
      from
        release_note_releases
    )
  group by
    pypi_releases.package,
    pypi_releases.version
)
select
  *
from
  release_note_releases
where
  name in (
    'datasette-dashboards',
    'pypi-to-sqlite',
    'datasette-graphql'
  )
union
select
  *
from
  releases_with_no_notes
where
  name in (
    'datasette-dashboards',
    'pypi-to-sqlite',
    'datasette-graphql'
  )
order by
  released_at desc
limit
  100

The where in bits there help make the demo more useful.

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2022

I had to hack the data to get datasette-dashboards to show up at the top of the page but I got this:

image

It links to https://datasette.io/plugins/datasette-dashboard where the others link to the GitHub release.

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2022

Next challenge: get https://datasette.io/plugins/datasette-dashboard page working. It currently uses this SQL:

{% set plugins = sql("""
select
plugins.*,
repos.readme_html as readme_html
from
repos
join
plugins on repos.full_name = plugins.full_name
where
repos.name = :name
""", {"name": name}, database="content") %}

e.g. https://datasette.io/content?sql=select%0D%0A++plugins.*%2C%0D%0A++repos.readme_html+as+readme_html%0D%0Afrom%0D%0A++repos%0D%0A++join+plugins+on+repos.full_name+%3D+plugins.full_name%0D%0Awhere%0D%0A++repos.name+%3D+%3Aname&name=datasette-graphql

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2022

Even bigger challenge: https://datasette.io/plugins - it currently offers "sort by stars" but if I don't have an associated GitHub repository that doesn't make sense.

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2022

I think I need to redefine these two SQL views:

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2022

Those views are at the heart of how the site things about plugins and tools, and currently have a dependency on tool_repos.yml listing GitHub repos. I should change that to list package names instead, maybe rename it tool_packages.yml.

Relevant SQL snippet from the view definition:

from
  datasette_repos
  join repos on datasette_repos.id = repos.node_id
  join releases on repos.id = releases.repo
  join users on users.id = repos.owner
where
  datasette_repos.nameWithOwner in (
    select
      repo
    from
      tool_repos
  )

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2022

Maybe I should have a directory.yml file to replace tool_repos.yml and plugin_repos.yml which combines the data from both and provides me with a hook to add custom extras, like a list of tags or a URL to an image (if the repo doesn't provide one).

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2022

Possible design for that format:

- package: datasette-graphql
  github: simonw/datasette-graphql
  plugin: true
  tags:
  - graphql
  - api
  hooks:
  - database_actions
  - extra_template_vars
  - menu_links
  - register_routes
  - startup
  - table_actions

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2022

... but that's going to happen in a separate issue. For the moment the problem I want to solve is getting datasette-dashboards listed on the site. It's hosted on GitHub so everything will work, except for the releases - which I've now fixed.

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2022

The reason /plugins and /tools are omitting repos without releases at the moment is that they join against the releases table. Adding a left join helps:

https://datasette.io/content?sql=select%0D%0A++repos.name+as+name%2C%0D%0A++repos.full_name+as+full_name%2C%0D%0A++users.login+as+owner%2C%0D%0A++repos.description+as+description%2C%0D%0A++repos.stargazers_count%2C%0D%0A++releases.tag_name%2C%0D%0A++max(releases.created_at)+as+latest_release_at%2C%0D%0A++repos.created_at+as+created_at%2C%0D%0A++datasette_repos.openGraphImageUrl%2C%0D%0A++datasette_repos.usesCustomOpenGraphImage%2C%0D%0A++(%0D%0A++++select%0D%0A++++++sum(downloads)%0D%0A++++from%0D%0A++++++stats%0D%0A++++where%0D%0A++++++stats.package+%3D+repos.name%0D%0A++++++and+stats.date+%3E+date(%27now%27%2C+%27-7+days%27)%0D%0A++)+as+downloads_this_week%2C%0D%0A++(%0D%0A++++select%0D%0A++++++count(*)%0D%0A++++from%0D%0A++++++plugin_repos%0D%0A++++where%0D%0A++++++repo+%3D+repos.full_name%0D%0A++)+as+is_plugin%2C%0D%0A++(%0D%0A++++select%0D%0A++++++count(*)%0D%0A++++from%0D%0A++++++tool_repos%0D%0A++++where%0D%0A++++++repo+%3D+repos.full_name%0D%0A++)+as+is_tool%0D%0Afrom%0D%0A++datasette_repos%0D%0A++join+repos+on+datasette_repos.id+%3D+repos.node_id%0D%0A++left+join+releases+on+repos.id+%3D+releases.repo%0D%0A++join+users+on+users.id+%3D+repos.owner%0D%0Awhere%0D%0A++datasette_repos.nameWithOwner+in+(%0D%0A++++select%0D%0A++++++repo%0D%0A++++from%0D%0A++++++plugin_repos%0D%0A++)%0D%0Agroup+by%0D%0A++repos.id%0D%0Aorder+by%0D%0A++latest_release_at+desc%3B

Scroll to the bottom of the page to see this:

image

Those two null columns are for tag_name and latest_release_at.

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2022

I'm going to try joining against pypi_releases instead.

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2022

This new query seems to work:

select
  repos.name as name,
  repos.full_name as full_name,
  users.login as owner,
  repos.description as description,
  repos.stargazers_count,
  pypi_versions.name,
  max(pypi_releases.upload_time) as latest_release_at,
  repos.created_at as created_at,
  datasette_repos.openGraphImageUrl,
  datasette_repos.usesCustomOpenGraphImage,
  (
    select
      sum(downloads)
    from
      stats
    where
      stats.package = repos.name
      and stats.date > date('now', '-7 days')
  ) as downloads_this_week,
  (
    select
      count(*)
    from
      plugin_repos
    where
      repo = repos.full_name
  ) as is_plugin,
  (
    select
      count(*)
    from
      tool_repos
    where
      repo = repos.full_name
  ) as is_tool
from
  datasette_repos
  join repos on datasette_repos.id = repos.node_id
  left join pypi_releases on pypi_releases.package = repos.name
  left join pypi_versions on pypi_releases.version = pypi_versions.id
  join users on users.id = repos.owner
where
  datasette_repos.nameWithOwner in (
    select
      repo
    from
      plugin_repos
  )
group by
  repos.id
order by
  latest_release_at desc;

https://datasette.io/content?sql=select%0D%0A++repos.name+as+name%2C%0D%0A++repos.full_name+as+full_name%2C%0D%0A++users.login+as+owner%2C%0D%0A++repos.description+as+description%2C%0D%0A++repos.stargazers_count%2C%0D%0A++pypi_versions.name%2C%0D%0A++max%28pypi_releases.upload_time%29+as+latest_release_at%2C%0D%0A++repos.created_at+as+created_at%2C%0D%0A++datasette_repos.openGraphImageUrl%2C%0D%0A++datasette_repos.usesCustomOpenGraphImage%2C%0D%0A++%28%0D%0A++++select%0D%0A++++++sum%28downloads%29%0D%0A++++from%0D%0A++++++stats%0D%0A++++where%0D%0A++++++stats.package+%3D+repos.name%0D%0A++++++and+stats.date+%3E+date%28%27now%27%2C+%27-7+days%27%29%0D%0A++%29+as+downloads_this_week%2C%0D%0A++%28%0D%0A++++select%0D%0A++++++count%28*%29%0D%0A++++from%0D%0A++++++plugin_repos%0D%0A++++where%0D%0A++++++repo+%3D+repos.full_name%0D%0A++%29+as+is_plugin%2C%0D%0A++%28%0D%0A++++select%0D%0A++++++count%28*%29%0D%0A++++from%0D%0A++++++tool_repos%0D%0A++++where%0D%0A++++++repo+%3D+repos.full_name%0D%0A++%29+as+is_tool%0D%0Afrom%0D%0A++datasette_repos%0D%0A++join+repos+on+datasette_repos.id+%3D+repos.node_id%0D%0A++left+join+pypi_releases+on+pypi_releases.package+%3D+repos.name%0D%0A++left+join+pypi_versions+on+pypi_releases.version+%3D+pypi_versions.id%0D%0A++join+users+on+users.id+%3D+repos.owner%0D%0Awhere%0D%0A++datasette_repos.nameWithOwner+in+%28%0D%0A++++select%0D%0A++++++repo%0D%0A++++from%0D%0A++++++plugin_repos%0D%0A++%29%0D%0Agroup+by%0D%0A++repos.id%0D%0Aorder+by%0D%0A++latest_release_at+desc%3B

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2022

I broke /tools because it lists datasette-app which isn't a PyPI package and hence has no record there.

@simonw
Copy link
Owner Author

simonw commented Apr 8, 2022

https://datasette.io/tools works fine.

So does https://datasette.io/plugins?q=dash and https://datasette.io/plugins/datasette-dashboards

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request plugin directory
Projects
None yet
Development

No branches or pull requests

1 participant