feat(event schema) Basic infrastructure to load the jsonschema definition and typecheck processor code against it. #1192
Conversation
The jsonschema produced by relay only covers event["data"], since that is the public-facing event. Using event["data"] when possible.
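As a rough illustration (not code from this PR, and with made-up field names) of what typing event["data"] buys once mypy checks processor code:

# Sketch only: the real TypedDict would be generated from the jsonschema; the
# fields below are hypothetical placeholders.
from typing import Any, Mapping, TypedDict

class EventData(TypedDict, total=False):
    type: str
    tags: Mapping[str, Any]

def process(event: Mapping[str, Any]) -> None:
    data: EventData = event["data"]
    event_type = data["type"]  # ok, a known key
    # data["typ"] would be rejected by mypy: TypedDict "EventData" has no key "typ"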
The previous PR #1197 broke two tests. I am investigating what the issue is (I suspect the tests are wrong), but this is needed to get a green build
…#1198) * Added Makefile entry to fetch schema; eliminate type errors * Added package name to custom requirement for setuptools * Add 3.7-safe import
…se. (#1201) When I merged master I forgot to remove the duplicate Event type. Now there is only one and it has the right structure.
… properly. (#1202) Removes the jsonschema initially committed, since we download the schema during make develop, and avoids accidentally merging the schema downloaded during make. Also runs mypy right after downloading the schema: if the schema is not there, malformed, or in the wrong place, mypy fails with code 2, so make will fail instead of silently succeeding until we run it later.
This is pretty cool.
The question about .gitignore is the only thing I'd like more information about before feeling comfortable approving.
.gitignore
@@ -7,3 +7,4 @@
 .venv
 htmlcov/
 snuba.egg-info/
+schema/
I am not sure that we should ignore this file from the repository, since that will make it more challenging to go back in the repository history while also being able to reference the correct schema version (and not whatever the latest is, or having to cross-reference commit history between two independent repositories.)
I think that we should check these files in, and ensure that the checked in version is up to date as part of the CI process.
This folder will also presumably include more than one schema eventually — shouldn't it be plural (schemas or schemata?)
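For example (purely a sketch reusing the download URL from this PR; the checked-in path schema/event.schema.json is an assumption), CI could re-download the schema and fail when the committed copy is stale:

# Hypothetical CI step: fetch the canonical schema and compare it to the checked-in one.
curl -sSf https://raw.githubusercontent.com/getsentry/sentry-data-schemas/main/relay/event.schema.json -o /tmp/event.schema.json
diff -q /tmp/event.schema.json schema/event.schema.json || { echo "checked-in schema is out of date"; exit 1; }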
Git submodules! We use them in relay for various things and nobody has complained yet. We drive updates through the Makefile.
I think whether to commit the schema or not is not the real issue.
To me the schema should be imported like all other dependencies, that is, as pypi packages. That way:
- we do not need to build custom automation to download the right version of the schema and to maintain it locally
- we do not have to reproduce this same automation for each python repo that needs to load the schema
- it is not up to snuba to decide how to turn the jsonschema into a type manageable by mypy; that is something we can generalize and reuse for all the python repos that need access to the schemas
- we can still track back which version we were relying on at a given point in time.
- we can still build automation to keep requirements.txt up to date if we wanted.
Now I cannot move to pypi packages here in this PR because that package does not exist. Creating it would be my next step. In the meantime I added the update in the post-merge hook so, at least, we are sure we keep the local schema up to date.
What do you think about this approach? If we are on board with the pypi dependency, then I would have no strong opinion regarding checking the schema in until the package is ready.
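For illustration only, the download-the-right-version problem would then collapse into a single pinned line in requirements.txt (the package name and version below are hypothetical, since the package does not exist yet):

# requirements.txt (hypothetical entry)
sentry-data-schemas==0.0.1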
So I temporarily committed the schema while I work on the pypi package. I went with this instead of the submodule because it will be easier to revert, and the automation to keep it in sync is simpler, considering this is a throwaway solution that will not be needed as soon as the schema is packaged properly.
No strong preference on separately packaging the schema data. One way or another we're going to have to be able to download and build the schemas, and keep them up to date with a canonical source. I'm not sure it makes much of a difference if both of those steps happen in this package or are split across two packages until there is another project that actually makes use of that shared library. (I'm not sure who else needs this, maybe someone on the data side? Sentry is getting closer to Python 3 but I'm not sure their perspective on typing yet.)
Regarding others needing this:
- the data project. Hoping it will still exist; it is basically a copy-paste of our consumer, it is on Python 3, and they have already expressed interest in using it.
- sentry is, as you mentioned, on its way. Even if they do not go all the way into typing, I think there is a fairly easy case we could make for typechecking the code that modifies the event objects they receive from outside and provide to an external dependency (snuba).
More generally, regarding downloading the schema and keeping it up to date, here is my reasoning for pypi:
About building the schema: I want to try to build a pypi package that exports the built TypedDict, not just the schema, so that no logic to build the schema is needed in snuba (or in the data project). This would be especially useful since we have not really settled with 100% certainty on using this package for generating the type from the jsonschema; there are still alternatives to try.
I am not 100% sure this is actually feasible, but I will figure it out in the next few days; in the worst-case scenario we would just import the schema.
About downloading the schema:
I think this is where pypi makes our lives easier. It is the idiomatic way to import dependencies: once the right version is in requirements.txt, CI runs on that specific version (which means that, as soon as we start running mypy in CI, it will block on any error), and clients are kept up to date with only one line of automation (which we should already have) to keep the installed packages in sync with requirements.txt. Pypi takes care of downloading the file from the right place, only when needed, without us writing anything, while committing the schema requires us to add automation to download it, validate it, and ensure we do not commit changes to the schema by mistake.
About keeping the schema up to date:
It is true that automation is needed in all cases (unless we decide to bump versions manually). For pypi, though, we would just need to automatically create a PR that bumps the version in requirements.txt (which can easily be tied to the release of a new version of the schema). I like this because schema updates would always be isolated from functional PRs, and we could potentially have independent versions per schema (once there are multiple schemas).
For submodules we would have to automate the initial load of the project submodules and we would still need similar automation to generate PRs to pull the submodule and push it back to snuba. So there is the initial setup and then the periodic update.
When committing the schema we would also have to verify that no local changes are made and committed, which is not needed in the other cases. If we do not simply follow master on the schema repo (this depends on the release process, which is not defined yet), we would still need additional automation to bump the version.
we would still need similar automation to generate PRs to pull the submodule and push it back to snuba
fwiw I think there is virtually no difference between submodules and pypi in this regard; you can use dependabot for both. I guess the difference comes down mostly to release schedule and submodule initialization for local checkout.
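For instance (a sketch only; the schedule and directories are illustrative), a single dependabot config could cover either option:

# .github/dependabot.yml (illustrative sketch)
version: 2
updates:
  - package-ecosystem: "pip"            # picks up a pypi-packaged schema pinned in requirements.txt
    directory: "/"
    schedule:
      interval: "daily"
  - package-ecosystem: "gitsubmodule"   # or bumps a schema submodule instead
    directory: "/"
    schedule:
      interval: "daily"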
fetch-and-validate-schema:
	mkdir -p schema
	curl https://raw.githubusercontent.com/getsentry/sentry-data-schemas/main/relay/event.schema.json -o schema/event.schema.json
	mypy snuba > /dev/null || (if [ "$$?" -gt 1 ]; then exit 1; fi)
What cases is this trying to check?
Never mind, figured this one out — a short comment here that explains the significance of the exit code would be useful.
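For instance, something along these lines, based on the explanation in the PR description (mypy exits with 1 for ordinary type errors but with a higher code when it cannot run at all):

	# mypy exits with 1 when it merely finds type errors, but with 2 when it
	# cannot run at all, e.g. the schema file is missing, malformed or in the
	# wrong place; only that second case should fail this target.
	mypy snuba > /dev/null || (if [ "$$?" -gt 1 ]; then exit 1; fi)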
@@ -1,9 +1,9 @@
 import logging
 from abc import ABC, abstractmethod
 from datetime import datetime
-from typing import Any, Mapping, MutableMapping, Optional, Sequence
+from typing import Any, Mapping, MutableMapping, Optional, Sequence, TypedDict
Thanks, I was thinking last night about going through and cleaning up TypedDict everywhere since it's standard library in 3.8.
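For reference, the usual 3.7-compatible idiom (a sketch of the common pattern, not necessarily exactly what this PR does) would be:

try:
    from typing import TypedDict  # standard library from Python 3.8 onwards
except ImportError:
    from typing_extensions import TypedDict  # backport for Python 3.7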
I'm not sure I understand why it'd be preferable to use a post-merge hook here rather than just enforcing that the schema is up to date as part of the CI process at the time it's committed to master.
# Ensures up-to-date event schemas are downloaded
mkdir -p schemas
curl https://raw.githubusercontent.com/getsentry/sentry-data-schemas/main/relay/event.schema.json -o schemas/event.schema.json
mypy snuba > /dev/null
if [ $? -gt 1 ]; then
	echo [${red}${bold}!!!${reset}] Could not download schemas.
fi
Isn't this just the same as the make step?
Pretty much the same (the syntax is different because that one is in a Makefile). I added it in both places because the hook runs automatically and you cannot easily trigger it yourself (technically you can execute the hook, it being a shell script), while the Makefile target is something you can always trigger manually.
I don't see CI and the post-merge hook as necessarily being alternatives to each other. If CI finds a discrepancy, you will need to wait for CI to fail before knowing about the issue, then manually trigger the download of the new schema and push again. Though I think updates to the schema (whether we commit it, use a submodule, or use pypi) should not be part of your functional PR: they should be separate commits, automated when a new release is generated, so they do not impact developers' development cycle.
Turned this into a draft to verify some assumptions on pypi so I can test it before polishing the existing automation.
Will resurrect when we get back to this.
[Edit by filippo]
This sets up the basic infrastructure to import the shared jsonschema event schema from https://github.com/getsentry/sentry-data-schemas. It uses jsonschema-typed-v2 to parse the jsonschema and turn it into a TypedDict that mypy uses to typecheck the event structure.
This requires running make develop once to download the schema (as is already the procedure to set up the development environment); otherwise mypy will not find anything from which to generate the TypedDict. It will fail loudly, though, so it is impossible to miss.
This is only a first step towards a proper shared schema. Further steps beyond this PR are:
older description
This is just committing everything I have there for the jsonschema hackweek project. The schema is just checked verbatim into source, and snuba now depends on a forked version of a forked version of an unmaintained library (lol)
https://hackweek.getsentry.net/projects/-M5M-fq5BvwqFVewKEE8/document-event-schema-as-json-schema