feat(event schema) Basic infrastructure to load typecheck processor code out of the jsonschema definition. #1192

Closed
wants to merge 24 commits

Conversation

Member

@untitaker untitaker commented Jul 31, 2020

[Edit by filippo]
This sets up the basic infrastructure to import the shared jsonschema event schema from https://github.com/getsentry/sentry-data-schemas. It uses jsonschema-typed-v2 to parse the jsonschema and turn it into a TypedDict that mypy uses to typecheck the event structure.
This requires running make develop once to download the schema (which is already the procedure to set up the development environment); otherwise mypy will not find anything from which to generate the TypedDict. It fails loudly in that case, though, so it is impossible to miss.
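
For reference, the intended usage pattern looks roughly like the sketch below, assuming jsonschema-typed-v2 keeps the upstream jsonschema-typed convention of a JSONSchema annotation resolved by a mypy plugin; the plugin path, the schema location, and the field access are illustrative assumptions, not snuba's actual code.

    # Sketch only: assumes the jsonschema_typed module and its mypy plugin
    # (enabled via `plugins = jsonschema_typed.plugin` in the mypy config),
    # plus a schema downloaded to schema/event.schema.json by `make develop`.
    from __future__ import annotations

    from jsonschema_typed import JSONSchema


    def process(event: JSONSchema["schema/event.schema.json"]) -> None:
        # mypy expands the annotation into a TypedDict built from the schema,
        # so unknown keys and wrong value types are reported at check time.
        event_id = event.get("event_id")
        print(event_id)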

This is only a first step to have a proper shared schema. Further steps beyond this PR are:

  • well, actually fix all the type errors in the consumer
  • define a release cycle for https://github.com/getsentry/sentry-data-schemas so we can decide which version snuba should rely on: whether to always track master or to take the latest release.
  • revisit whether we should keep using jsonschema-typed-v2 in the long term or switch to an alternative.
  • provide a schema for all event types.

older description

this is just committing everything I have there for the jsonschema hackweek project. the schema is just verbatim checked into source and snuba now depends on a forked version of a forked version of an unmaintained library (lol)

https://hackweek.getsentry.net/projects/-M5M-fq5BvwqFVewKEE8/document-event-schema-as-json-schema

untitaker and others added 3 commits July 31, 2020 11:32
The jsonschema produced by relay only covers event["data"] since that is the public facing event.
Using event["data"] when possible.
@fpacifici fpacifici changed the title from "wip on jsonschema project for snuba" to "hackweek: wip on jsonschema project for snuba" Aug 4, 2020
fpacifici and others added 8 commits August 4, 2020 13:54
The previous PR #1197 broke two tests.
I am investigating what the issue is (I suspect the tests are wrong), but this is needed to get a green build
…#1198)

* Added Makefile entry to fetch schema; eliminate type errors
* Added package name to custom requirement for setuptools
* Add 3.7-safe import
…se. (#1201)

When I merged master I forgot to remove the duplicate Event type. Now there is only one and it has the right structure.
… properly. (#1202)

Removes the initially committed jsonschema, since we download the schema during make develop; this avoids accidentally merging a schema downloaded during make.

Also runs mypy right after downloading the schema. If the schema is missing, malformed, or in the wrong place, mypy fails with exit code 2, so make fails immediately instead of silently succeeding until we run mypy later.
@fpacifici fpacifici requested a review from a team September 10, 2020 04:43
@fpacifici fpacifici changed the title from "hackweek: wip on jsonschema project for snuba" to "feat(event schema) Basic infrastructure to load tpyecheck processor code out of the jsonschema cefinition." Sep 10, 2020
@fpacifici fpacifici marked this pull request as ready for review September 10, 2020 04:46
@fpacifici fpacifici changed the title from "feat(event schema) Basic infrastructure to load tpyecheck processor code out of the jsonschema cefinition." to "feat(event schema) Basic infrastructure to load tpyecheck processor code out of the jsonschema definition." Sep 10, 2020
Contributor

@tkaemming tkaemming left a comment


This is pretty cool.

The question about .gitignore is the only thing I'd like more information about before feeling comfortable approving.

.gitignore Outdated
@@ -7,3 +7,4 @@
 .venv
 htmlcov/
 snuba.egg-info/
+schema/
Contributor


I am not sure that we should exclude this file from the repository, since that will make it more challenging to go back in the repository history while still being able to reference the correct schema version (rather than whatever the latest is, or having to cross-reference commit history between two independent repositories).

I think that we should check these files in, and ensure that the checked in version is up to date as part of the CI process.

This folder will also presumably include more than one schema eventually — shouldn't it be plural (schemas or schemata?)

Member Author


git submodules! we use them in relay for various things and nobody complained yet. We drive updates through the makefile.

Contributor


I think whether to commit the schema or not is not the real issue.
To me, the schema should be imported like all other dependencies, that is, as a pypi package. That way:

  • we do not need to build custom automation to download the right version of the schema and to maintain it locally
  • we do not have to reproduce this same automation for each python repo that needs to load the schema
  • it is not up to snuba to decide how to turn the jsonschema into a type manageable by mypy. It is something we can generalize and reuse for all the python repos that need access to the schemas
  • we can still track which version we were relying on at a given point in time.
  • we can still build automation to keep requirements.txt up to date if we wanted.

Now, I cannot move to a pypi package in this PR because that package does not exist yet; creating it would be my next step. In the meantime I added the update to the post-merge hook so that, at least, we are sure the local schema stays up to date.

What do you think about this approach? If we are on board with the pypi dependency, then I have no strong opinion about checking the schema in until the package is ready.

Contributor


So I temporarily committed the schema while I work on the pypi package, instead of using a submodule, because it will be easier to revert and the automation to keep it in sync is simpler, given that this is a throwaway solution that will not be needed once the schema is packaged properly.

Contributor


No strong preference on separately packaging the schema data. One way or another we're going to have to be able to download and build the schemas, and keep them up to date with a canonical source. I'm not sure it makes much of a difference whether both of those steps happen in this package or are split across two packages until there is another project that actually makes use of that shared library. (I'm not sure who else needs this, maybe someone on the data side? Sentry is getting closer to Python 3 but I'm not sure of their perspective on typing yet.)

Contributor


Regarding others needing this:

  • the data project (hoping it will still exist): it is basically a copy-paste of our consumer, it is on Python 3, and they have already expressed interest in using it.
  • sentry is, as you mentioned, on its way. Even if they do not go all the way into typing, I think there is a fairly easy case to make for type checking the code that changes the event objects they receive from the outside and provide to an external dependency (snuba).

More generally, regarding downloading the schema and keeping it up to date, here is my reasoning for pypi:
Building the schema: I want to try to build a pypi package that exports the built TypedDict, not just the schema, so no logic to build the schema would be needed in snuba (or in the data project). This would be especially useful since we have not settled with 100% certainty on using this package to generate the type from the jsonschema; there are still alternatives to try.
I am not 100% sure this is actually feasible, but I will figure it out in the next few days; in the worst case we would just import the schema.
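
To make that concrete, consuming such a package could look roughly like the sketch below; the sentry_data_schemas distribution and its EventData export are hypothetical, since the package does not exist yet.

    # Hypothetical: neither this package nor the EventData name exists yet;
    # this only sketches the intended shape of the published interface.
    from sentry_data_schemas import EventData  # a TypedDict built from the jsonschema


    def log_tags(event: EventData) -> None:
        # mypy checks this access against the published TypedDict, with no
        # schema-building logic living in snuba (or the data project) itself.
        print(event.get("tags"))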

About downloading the schema:
I think this is where pypi makes our lives easier. It is the idiomatic way to import dependencies: once the right version is in requirements.txt, CI runs on that specific version (which means that as soon as we start running mypy in CI, it will block any error), and clients are kept up to date with a single line of automation (which we should already have) that keeps the installed packages in sync with requirements.txt. It takes care of downloading the file from the right place, only when needed, without us writing anything. Committing the schema, by contrast, requires us to add automation to download it, validate it, and ensure we do not commit changes to the schema by mistake.

About keeping the schema up to date:
It is true that automation is needed in all cases (unless we decide to bump versions manually). For pypi, though, we would just need to automatically create a PR that bumps the version in requirements.txt (which can easily be tied to the release of a new version of the schema). I like this because schema updates would always be isolated from functional PRs, and we could potentially have independent versions per schema (with multiple schemas).
For submodules, we would have to automate the initial load of the project submodules, and we would still need similar automation to generate PRs that pull the submodule and push it back to snuba. So there is the initial setup and then the periodic update.
When committing the schema, we would also have to verify that no local changes are made and committed, which is not needed in the other cases. If we do not just follow master on the schema repo (this depends on the release process, which is not defined yet), we would still need additional automation to bump the version.

Member Author


we would still need similar automation to generate PRs to pull the submodule and push it back to snuba

fwiw I think there is virtually no difference between submodules and pypi in this regard, you can use dependabot for both. I guess the difference comes mostly down to release schedule and submodule initialization for local checkout

fetch-and-validate-schema:
mkdir -p schema
curl https://raw.githubusercontent.com/getsentry/sentry-data-schemas/main/relay/event.schema.json -o schema/event.schema.json
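# mypy exits 0 on success, 1 when it only finds type errors, and 2 on a fatal
# error (e.g. the schema file is missing or malformed); only the fatal case
# should fail this target, hence the "-gt 1" check below.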
mypy snuba > /dev/null || (if [ "$$?" -gt 1 ]; then exit 1; fi)
Contributor


What cases is this trying to check?

Never mind, figured this one out — a short comment here that explains the significance of the exit code would be useful.

snuba/datasets/events_processor.py (resolved)
@@ -1,9 +1,9 @@
 import logging
 from abc import ABC, abstractmethod
 from datetime import datetime
-from typing import Any, Mapping, MutableMapping, Optional, Sequence
+from typing import Any, Mapping, MutableMapping, Optional, Sequence, TypedDict
Contributor


Thanks, I was thinking last night about going through and cleaning up TypedDict everywhere since it's standard library in 3.8.
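
For reference, the "3.7-safe import" mentioned in the commit list above is usually a version guard along these lines (a sketch; snuba's actual shim may differ):

    import sys

    # TypedDict entered the standard library in Python 3.8; on 3.7 it comes
    # from the typing_extensions backport.
    if sys.version_info >= (3, 8):
        from typing import TypedDict
    else:
        from typing_extensions import TypedDict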

snuba/datasets/events_processor_base.py (outdated, resolved)
Contributor

@tkaemming tkaemming left a comment


I'm not sure I understand why it'd be preferable to use a post-merge hook here rather than just enforcing that the schema is up to date as part of the CI process at the time it's committed to master.

Comment on lines +7 to +13
# Ensure up-to-date event schemas are downloaded
mkdir -p schemas
curl https://raw.githubusercontent.com/getsentry/sentry-data-schemas/main/relay/event.schema.json -o schemas/event.schema.json
mypy snuba > /dev/null
if [ $? -gt 1 ]; then
echo [${red}${bold}!!!${reset}] Could not download schemas.
fi
Contributor


Isn't this just the same as the make step?

Contributor


Pretty much the same (the syntax is different because that one is in a Makefile). I added it in both places because the hook runs automatically and you cannot easily trigger it yourself (technically you can execute the hook, since it is a shell script), while the Makefile target is something you can always trigger manually.

@fpacifici
Contributor

I don't see CI and the post-merge hook as necessarily alternatives to each other. If CI finds a discrepancy, you will need to wait for CI to fail before knowing about the issue, and then manually trigger the download of the new schema and push again.
If every time you pull or merge master the schema gets updated for you, you only have to stage and commit the changes, and CI is less likely to fail in the first place (you may be out of date only if there was a change between your last pull and your push, or if you pulled and forgot to commit the updated file).

That said, I think updates to the schema (whether we commit it, use a submodule, or use pypi) should not be part of functional PRs. They should be separate commits, automated when a new release is generated, and should not impact developers' development cycle.

@fpacifici fpacifici marked this pull request as draft September 24, 2020 18:44
@fpacifici
Contributor

Turned this into a draft to verify some assumptions about pypi, so I can test it before polishing the existing automation.

@fpacifici
Contributor

Will resurrect when we get back to this

@fpacifici fpacifici closed this Jul 7, 2022