Configs and splits #702
Conversation
Codecov Report
Base: 90.42% // Head: 89.11% // Decreases project coverage by -1.31%

```
@@            Coverage Diff             @@
##             main     #702      +/-   ##
==========================================
- Coverage   90.42%   89.11%   -1.31%
==========================================
  Files          65       36      -29
  Lines        3163     1213    -1950
==========================================
- Hits         2860     1081    -1779
+ Misses        303      132     -171
```
It's more explicit, even if it's a bit longer. It also prepares the next endpoint, `/split-names`: it will be easier to implement if it's not named like the existing, soon-to-be-deprecated `/splits`.
Force-pushed from bdebc51 to 690d1cd ("missing config input type in api and admin").
I learnt a lot, thanks! I added some comments and questions - mostly for my personal knowledge; I don't have any comment on the code itself :)
```diff
   },
   "workers": {
-    "datasets_based": "huggingface/datasets-server-workers-datasets_based:sha-5364f81"
+    "datasets_based": "huggingface/datasets-server-workers-datasets_based:sha-690d1cd"
```
This is to update which docker image to deploy, right?
I understand you have to update admin, api and datasets_based (you made changes to all of them), but why do you update mongodbMigration?
> This is to update which docker image to deploy, right?

Yes. It's used in the local docker compose, in the CI, and in the kubernetes deploys (dev and prod).

> why do you update mongodbMigration

Hmmm, good question. As it depends on libcommon, and libcommon has been updated, a new docker image has been built (https://github.com/huggingface/datasets-server/actions/runs/4013659307/jobs/6893262606). It's true that nothing should have changed, but I make a habit of ensuring we always run the latest versions, in order to get quick feedback in case of bugs.
```yaml
requests:
  cpu: 0.01
limits:
  cpu: 1
```
Is this in case we want to deploy a dev cluster on ephemeral?
Yes, that's true.
```yaml
DATASETS_BASED_ENDPOINT: "/split-names" # hard-coded
depends_on:
  - mongodb
restart: always
```
We can use this for a local dev env, right?
Yes, you can launch `make start` at the root, and it will use this docker compose file to deploy all the images.

Until now, I've not really used a "local dev env":

- if I want to try something, I generally create a test (be it a unit test if it only affects a service or a worker, or an e2e test if I want to test the relation between them)
- alternately, I just go to one of the services or workers, launch `poetry run python`, then interact with the prompt to test the methods
- if I need to test the whole app manually, I deploy on the ephemeral k8s cluster

But now that several of us are working on the project, the last point is not really a good idea anymore, so it would be better to launch the app locally when one wants to test it manually.
"/config-names": {"input_type": "dataset"}, | ||
"/split-names": {"input_type": "config", "requires": "/config-names"}, | ||
"/splits": {"input_type": "dataset", "required_by_dataset_viewer": True}, # to be deprecated |
This is where you define the dependencies between the processing steps?
Yes! It's a bit hidden for now, and it's hard-coded even if it looks like a config option. I always thought of this as a temporary workaround, but I'm unsure how we want this to evolve.
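To illustrate how such a `requires` field defines a dependency graph: the steps can be resolved into a processing order with a simple topological walk. This is only a sketch of the idea, not the project's actual implementation:

```python
# Sketch: resolving the hard-coded step graph into a processing order.
# The dict mirrors the one quoted above; the traversal is illustrative only.
from typing import Any

PROCESSING_STEPS: dict[str, dict[str, Any]] = {
    "/config-names": {"input_type": "dataset"},
    "/split-names": {"input_type": "config", "requires": "/config-names"},
    "/splits": {"input_type": "dataset", "required_by_dataset_viewer": True},
}

def processing_order(steps: dict[str, dict[str, Any]]) -> list[str]:
    ordered: list[str] = []

    def visit(name: str) -> None:
        parent = steps[name].get("requires")
        if parent is not None and parent not in ordered:
            visit(parent)  # process the required step first
        if name not in ordered:
            ordered.append(name)

    for name in steps:
        visit(name)
    return ordered

print(processing_order(PROCESSING_STEPS))
# ['/config-names', '/split-names', '/splits']
```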
```diff
@@ -0,0 +1,132 @@
+# SPDX-License-Identifier: Apache-2.0
```
(Just curious) why did you choose to add a new `/split-names` when you could have just extended `/splits` to support a `config=` query parameter?
I find it a lot easier to create a new endpoint, populate its cache, then switch (blue/green method) than to manage two different logics at the same time.
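To make the switch concrete: once `/split-names` is stable, the old route can simply be pointed at the new handler. A minimal sketch of that aliasing idea, assuming Starlette (which the services are built on); the handler body and names here are illustrative, not the project's code:

```python
# Sketch of the blue/green switch: both routes serve the new logic,
# so /splits can be deprecated without breaking existing clients.
from starlette.applications import Starlette
from starlette.requests import Request
from starlette.responses import JSONResponse
from starlette.routing import Route

async def split_names_endpoint(request: Request) -> JSONResponse:
    dataset = request.query_params.get("dataset")
    config = request.query_params.get("config")
    # a real handler would look up the cached response for (dataset, config)
    return JSONResponse({"dataset": dataset, "config": config, "splits": []})

app = Starlette(routes=[
    Route("/split-names", split_names_endpoint),
    Route("/splits", split_names_endpoint),  # alias during the transition
])
```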
```python
dataset = hub_public_csv
worker = get_worker(dataset, app_config)
assert worker.process() is True
cached_response = get_response(kind=worker.processing_step.cache_kind, dataset=hub_public_csv)
```
Is it using an actual mongo database for the cache when running the tests?
Yes:

- we launch `make up` before launching the tests, then `make down`: https://github.com/huggingface/datasets-server/blob/e32590450a1f0c33439b41bb5b0b8a44f58fd9db/tools/PythonTest.mk#L4-L7
- which corresponds to launching the docker compose file described in the `DOCKER_COMPOSE` variable: https://github.com/huggingface/datasets-server/blob/e32590450a1f0c33439b41bb5b0b8a44f58fd9db/tools/Docker.mk#L1-L7
- which turns out to be only a mongo container: https://github.com/huggingface/datasets-server/blob/e32590450a1f0c33439b41bb5b0b8a44f58fd9db/services/api/Makefile#L6-L7
- see https://github.com/huggingface/datasets-server/blob/e32590450a1f0c33439b41bb5b0b8a44f58fd9db/tools/docker-compose-mongo.yml#L1-L10

Every service or worker has a different mongo container with a different exposed port, so that we can run multiple tests at the same time (just in case).
Also: in the tests, we drop the content of the database before each test.
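That last point (dropping the database before each test) is typically done with an autouse fixture. Here is a minimal, self-contained sketch of the pattern using pymongo; the URL and database name are placeholders, not the project's actual values:

```python
# Sketch: wipe the cache database before each test so tests stay independent.
# MONGO_URL and CACHE_DB are placeholders, not the project's real settings.
import pytest
from pymongo import MongoClient

MONGO_URL = "mongodb://localhost:27017"
CACHE_DB = "datasets_server_cache_test"

@pytest.fixture(autouse=True)
def clean_mongo_database() -> None:
    MongoClient(MONGO_URL).drop_database(CACHE_DB)
```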
("gated", False, "ConfigNamesError", "FileNotFoundError"), | ||
("private", False, "ConfigNamesError", "FileNotFoundError"), | ||
], | ||
) |
that's what I call testing <3
Possibly it's too much and redundant. We could surely optimize the testing time.
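For context, the snippet above is the case list of a `pytest.mark.parametrize` decorator: pytest generates one independent test per tuple, which is why the matrix can grow large. A self-contained sketch of the pattern (the test body here is illustrative; the real tests run a worker against Hub dataset fixtures):

```python
# Illustration of the parametrized-test pattern quoted above.
import pytest

@pytest.mark.parametrize(
    "name,use_token,error_code,cause",
    [
        ("gated", False, "ConfigNamesError", "FileNotFoundError"),
        ("private", False, "ConfigNamesError", "FileNotFoundError"),
    ],
)
def test_compute_error(name: str, use_token: bool, error_code: str, cause: str) -> None:
    # the real test would run the worker on the `name` dataset fixture and
    # assert that the cached response carries `error_code` and `cause`
    assert error_code and cause
```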
I put a couple of minor comments/questions; please review whether they are valid. Also, I noticed there are no new e2e tests for the new routes; are they needed?
```
`SplitNamesResponseContent`: An object with the list of split names for the dataset and config.
<Tip>
Raises the following errors:
    - [`~workers.splits.EmptyDatasetError`]
```
Should this be `~workers.split_names.EmptyDatasetError` instead?
```
Raises the following errors:
    - [`~workers.splits.EmptyDatasetError`]
      The dataset is empty.
    - [`~workers.splits.SplitsNamesError`]
```
Should this be `~workers.split_names.SplitsNamesError` instead?
```python
@staticmethod
def get_version() -> str:
    return "2.0.0"
```
I am not sure, but since this is a new route, why is it not 1.0.0?
Copy/paste from `split.py`, and I forgot to change it. Thanks!
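(As an aside, the worker version is what lets the cache distinguish stale entries from fresh ones. A minimal sketch of how a major-version comparison could work; this is an assumption about the mechanism, not the project's actual code:)

```python
# Sketch: a cache entry computed by an older major version can be
# considered stale and recomputed. Illustrative, not the real code.
def major(version: str) -> int:
    return int(version.split(".")[0])

def is_entry_stale(cached_version: str, worker_version: str) -> bool:
    return major(cached_version) < major(worker_version)

assert is_entry_stale("1.0.0", "2.0.0")      # bumped major: recompute
assert not is_entry_stale("2.0.0", "2.1.0")  # same major: keep the cache
```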
Thanks for your comments! I'm merging and deploying.

In prod:

I'll add an admin endpoint to backfill the cache.

EDIT: See #705, where I introduce a …
I don't know if all the current e2e tests are needed. I only added to `test_11_auth.py`, which already checks that all the responses have been created. The unit tests should already cover more details, and if we find integration bugs, we should create new e2e tests for those specifically. As such, I'm afraid we already have some redundant tests (splits, first rows).
See #701
This PR does:

- add a new processing step and endpoint: `/config-names`. It gives the list of config names for a dataset
- add a new processing step and endpoint: `/split-names`. It gives the list of split names for a config (not for a dataset, which is the difference with `/splits` for now). Note that `/split-names`, as `/splits`, may depend on the streaming mode and might fail for this reason.
- … `/split-names`.

Once this PR is merged, the plan is to:

- deprecate `/splits`
- remove the `/splits` processing step, delete the cache for this endpoint, make the endpoint an alias to `/split-names`, and add the ability for `/split-names` to concatenate the responses for all the configs if the only parameter is `dataset` (no `config` passed)
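That last point can be sketched as follows: a toy illustration of concatenating per-config cached responses when only `dataset` is passed (all names here are stand-ins for the project's cache helpers):

```python
# Sketch of the planned behavior: when only `dataset` is passed, /split-names
# concatenates the cached responses of every config (all names illustrative).
def get_config_names(dataset: str) -> list[str]:
    return ["config_a", "config_b"]  # stand-in for a cache lookup

def get_split_names(dataset: str, config: str) -> list[dict[str, str]]:
    # stand-in for a cache lookup of the per-config response
    return [{"dataset": dataset, "config": config, "split": "train"}]

def get_all_split_names(dataset: str) -> list[dict[str, str]]:
    return [
        split
        for config in get_config_names(dataset)
        for split in get_split_names(dataset, config)
    ]

print(get_all_split_names("my_dataset"))  # two entries, one per config
```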