feat(event schema) Basic infrastructure to load typecheck processor code out of the jsonschema definition. #1192

Closed
wants to merge 24 commits

Conversation

Member

@untitaker untitaker commented Jul 31, 2020

[Edit by filippo]
This sets up the basic infrastructure to import the shared jsonschema event schema from https://github.com/getsentry/sentry-data-schemas. It uses jsonschema-typed-v2 to parse the jsonschema and turn it into a TypedDict that mypy uses to typecheck the event structure.
This requires running make develop once to download the schema (which is already the procedure to set up the development environment); otherwise mypy will not find anything from which to generate the TypedDict. It fails loudly in that case, though, so it is impossible to miss.
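
For reference, the intended usage pattern looks roughly like the sketch below, assuming jsonschema-typed-v2 keeps the upstream jsonschema-typed convention of a JSONSchema annotation resolved by a mypy plugin; the plugin path, the schema location, and the field access are illustrative assumptions, not snuba's actual code.

    # Sketch only: assumes the jsonschema_typed module and its mypy plugin
    # (enabled via `plugins = jsonschema_typed.plugin` in the mypy config),
    # plus a schema downloaded to schema/event.schema.json by `make develop`.
    from __future__ import annotations

    from jsonschema_typed import JSONSchema


    def process(event: JSONSchema["schema/event.schema.json"]) -> None:
        # mypy expands the annotation into a TypedDict built from the schema,
        # so unknown keys and wrong value types are reported at check time.
        event_id = event.get("event_id")
        print(event_id)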

This is only a first step to have a proper shared schema. Further steps beyond this PR are:

  • well, actually fix all the type errors in the consumer
  • define a release cycle for https://github.com/getsentry/sentry-data-schemas so we can decide which version snuba should rely on: whether to always track master or to take the latest release.
  • revisit whether we should keep using jsonschema-typed-v2 in the long term or switch to an alternative.
  • provide a schema for all event types.

older description

this is just committing everything I have there for the jsonschema hackweek project. the schema is just verbatim checked into source and snuba now depends on a forked version of a forked version of an unmaintained library (lol)

https://hackweek.getsentry.net/projects/-M5M-fq5BvwqFVewKEE8/document-event-schema-as-json-schema

untitaker and others added 3 commits July 31, 2020 11:32
The jsonschema produced by relay only covers event["data"] since that is the public facing event.
Using event["data"] when possible.
@fpacifici fpacifici changed the title from "wip on jsonschema project for snuba" to "hackweek: wip on jsonschema project for snuba" Aug 4, 2020
fpacifici and others added 8 commits August 4, 2020 13:54
The previous PR #1197 broke two tests.
I am investigating what the issue is (I suspect the tests are wrong), but this is needed to get a green build
…#1198)

* Added Makefile entry to fetch schema; eliminate type errors
* Added package name to custom requirement for setuptools
* Add 3.7-safe import
…se. (#1201)

When I merged master I forgot to remove the duplicate Event type. Now there is only one and it has the right structure.
… properly. (#1202)

Removes the initially committed jsonschema, since we download the schema during make develop; this avoids accidentally merging a schema downloaded during make.

Also runs mypy right after downloading the schema. If the schema is missing, malformed, or in the wrong place, mypy fails with exit code 2, so make fails immediately instead of silently succeeding until we run mypy later.
@fpacifici fpacifici requested a review from a team September 10, 2020 04:43
@fpacifici fpacifici changed the title from "hackweek: wip on jsonschema project for snuba" to "feat(event schema) Basic infrastructure to load tpyecheck processor code out of the jsonschema cefinition." Sep 10, 2020
@fpacifici fpacifici marked this pull request as ready for review September 10, 2020 04:46
@fpacifici fpacifici changed the title from "feat(event schema) Basic infrastructure to load tpyecheck processor code out of the jsonschema cefinition." to "feat(event schema) Basic infrastructure to load tpyecheck processor code out of the jsonschema definition." Sep 10, 2020
Contributor

@tkaemming tkaemming left a comment


This is pretty cool.

The question about .gitignore is the only thing I'd like more information about before feeling comfortable approving.

.gitignore Outdated
@@ -7,3 +7,4 @@
 .venv
 htmlcov/
 snuba.egg-info/
+schema/
Contributor


I am not sure that we should exclude this file from the repository, since that will make it more challenging to go back in the repository history while still being able to reference the correct schema version (rather than whatever the latest is, or having to cross-reference commit history between two independent repositories).

I think that we should check these files in, and ensure that the checked in version is up to date as part of the CI process.

This folder will also presumably include more than one schema eventually — shouldn't it be plural (schemas or schemata?)

Member Author


git submodules! we use them in relay for various things and nobody complained yet. We drive updates through the makefile.

Contributor


I think whether to commit the schema or not is not the real issue.
To me, the schema should be imported like all other dependencies, that is, as a pypi package. That way:

  • we do not need to build custom automation to download the right version of the schema and to maintain it locally
  • we do not have to reproduce this same automation for each python repo that needs to load the schema
  • it is not up to snuba to decide how to turn the jsonschema into a type manageable by mypy. It is something we can generalize and reuse for all the python repos that need access to the schemas
  • we can still track which version we were relying on at a given point in time.
  • we can still build automation to keep requirements.txt up to date if we wanted.

Now, I cannot move to a pypi package in this PR because that package does not exist yet; creating it would be my next step. In the meantime I added the update to the post-merge hook so that, at least, we are sure the local schema stays up to date.

What do you think about this approach? If we are on board with the pypi dependency, then I have no strong opinion about checking the schema in until the package is ready.

Contributor


So I temporarily committed the schema while I work on the pypi package, instead of using a submodule, because it will be easier to revert and the automation to keep it in sync is simpler, given that this is a throwaway solution that will not be needed once the schema is packaged properly.

Contributor


No strong preference on separately packaging the schema data. One way or another we're going to have to be able to download and build the schemas, and keep them up to date with a canonical source. I'm not sure it makes much of a difference whether both of those steps happen in this package or are split across two packages until there is another project that actually makes use of that shared library. (I'm not sure who else needs this, maybe someone on the data side? Sentry is getting closer to Python 3 but I'm not sure of their perspective on typing yet.)

Contributor


Regarding others needing this:

  • the data project (hoping it will still exist): it is basically a copy-paste of our consumer, it is on Python 3, and they have already expressed interest in using it.
  • sentry is, as you mentioned, on its way. Even if they do not go all the way into typing, I think there is a fairly easy case to make for type checking the code that changes the event objects they receive from the outside and provide to an external dependency (snuba).

More generally, regarding downloading the schema and keeping it up to date, here is my reasoning for pypi:
Building the schema: I want to try to build a pypi package that exports the built TypedDict, not just the schema, so no logic to build the schema would be needed in snuba (or in the data project). This would be especially useful since we have not settled with 100% certainty on using this package to generate the type from the jsonschema; there are still alternatives to try.
I am not 100% sure this is actually feasible, but I will figure it out in the next few days; in the worst case we would just import the schema.
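
To make that concrete, consuming such a package could look roughly like the sketch below; the sentry_data_schemas distribution and its EventData export are hypothetical, since the package does not exist yet.

    # Hypothetical: neither this package nor the EventData name exists yet;
    # this only sketches the intended shape of the published interface.
    from sentry_data_schemas import EventData  # a TypedDict built from the jsonschema


    def log_tags(event: EventData) -> None:
        # mypy checks this access against the published TypedDict, with no
        # schema-building logic living in snuba (or the data project) itself.
        print(event.get("tags"))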

About downloading the schema:
I think this is where pypi makes our lives easier. It is the idiomatic way to import dependencies: once the right version is in requirements.txt, CI runs on that specific version (which means that as soon as we start running mypy in CI, it will block any error), and clients are kept up to date with a single line of automation (which we should already have) that keeps the installed packages in sync with requirements.txt. It takes care of downloading the file from the right place, only when needed, without us writing anything. Committing the schema, by contrast, requires us to add automation to download it, validate it, and ensure we do not commit changes to the schema by mistake.

About keeping the schema up to date:
It is true that automation is needed in all cases (unless we decide to bump versions manually). For pypi, though, we would just need to automatically create a PR that bumps the version in requirements.txt (which can easily be tied to the release of a new version of the schema). I like this because schema updates would always be isolated from functional PRs, and we could potentially have independent versions per schema (with multiple schemas).
For submodules, we would have to automate the initial load of the project submodules, and we would still need similar automation to generate PRs that pull the submodule and push it back to snuba. So there is the initial setup and then the periodic update.
When committing the schema, we would also have to verify that no local changes are made and committed, which is not needed in the other cases. If we do not just follow master on the schema repo (this depends on the release process, which is not defined yet), we would still need additional automation to bump the version.

Member Author


we would still need similar automation to generate PRs to pull the submodule and push it back to snuba

fwiw I think there is virtually no difference between submodules and pypi in this regard, you can use dependabot for both. I guess the difference comes mostly down to release schedule and submodule initialization for local checkout

fetch-and-validate-schema:
mkdir -p schema
curl https://raw.githubusercontent.com/getsentry/sentry-data-schemas/main/relay/event.schema.json -o schema/event.schema.json
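# mypy exits 0 on success, 1 when it only finds type errors, and 2 on a fatal
# error (e.g. the schema file is missing or malformed); only the fatal case
# should fail this target, hence the "-gt 1" check below.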
mypy snuba > /dev/null || (if [ "$$?" -gt 1 ]; then exit 1; fi)
Contributor


What cases is this trying to check?

Never mind, figured this one out — a short comment here that explains the significance of the exit code would be useful.

snuba/datasets/events_processor.py (resolved)
@@ -1,9 +1,9 @@
 import logging
 from abc import ABC, abstractmethod
 from datetime import datetime
-from typing import Any, Mapping, MutableMapping, Optional, Sequence
+from typing import Any, Mapping, MutableMapping, Optional, Sequence, TypedDict
Contributor


Thanks, I was thinking last night about going through and cleaning up TypedDict everywhere since it's standard library in 3.8.
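
For reference, the "3.7-safe import" mentioned in the commit list above is usually a version guard along these lines (a sketch; snuba's actual shim may differ):

    import sys

    # TypedDict entered the standard library in Python 3.8; on 3.7 it comes
    # from the typing_extensions backport.
    if sys.version_info >= (3, 8):
        from typing import TypedDict
    else:
        from typing_extensions import TypedDict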

snuba/datasets/events_processor_base.py (outdated, resolved)
Contributor

@tkaemming tkaemming left a comment


I'm not sure I understand why it'd be preferable to use a post-merge hook here rather than just enforcing that the schema is up to date as part of the CI process at the time it's committed to master.

Comment on lines +7 to +13
# Ensure up-to-date event schemas are downloaded
mkdir -p schemas
curl https://raw.githubusercontent.com/getsentry/sentry-data-schemas/main/relay/event.schema.json -o schemas/event.schema.json
mypy snuba > /dev/null
if [ $? -gt 1 ]; then
echo [${red}${bold}!!!${reset}] Could not download schemas.
fi
Contributor


Isn't this just the same as the make step?

Contributor


Pretty much the same (the syntax is different because that one is in a Makefile). I added it in both places because the hook runs automatically and you cannot easily trigger it yourself (technically you can execute the hook, since it is a shell script), while the Makefile target is something you can always trigger manually.

@fpacifici
Contributor

I don't see CI and the post-merge hook as necessarily alternatives to each other. If CI finds a discrepancy, you will need to wait for CI to fail before knowing about the issue, and then manually trigger the download of the new schema and push again.
If every time you pull or merge master the schema gets updated for you, you only have to stage and commit the changes, and CI is less likely to fail in the first place (you may be out of date only if there was a change between your last pull and your push, or if you pulled and forgot to commit the updated file).

That said, I think updates to the schema (whether we commit it, use a submodule, or use pypi) should not be part of functional PRs. They should be separate commits, automated when a new release is generated, and should not impact developers' development cycle.

@fpacifici fpacifici marked this pull request as draft September 24, 2020 18:44
@fpacifici
Contributor

Turned this into a draft to verify some assumptions about pypi, so I can test it before polishing the existing automation.

@fpacifici
Contributor

Will resurrect when we get back to this

@fpacifici fpacifici closed this Jul 7, 2022