Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataset info big content error #780

Merged

Conversation

AndreaFrancis
Copy link
Contributor

@AndreaFrancis AndreaFrancis commented Feb 7, 2023

Adding content size validation before trying to insert/update in db.
Fix for #762 and #770
New DATASETS_BASED_CONTENT_MAX_BYTES configuration is added to limit the size of the content result of a worker compute process.

@AndreaFrancis AndreaFrancis marked this pull request as draft February 7, 2023 15:26
@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Feb 7, 2023

The documentation is not available anymore as the PR was closed or merged.

@AndreaFrancis AndreaFrancis marked this pull request as ready for review February 7, 2023 15:43
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome ! thank you :)

@@ -16,7 +16,7 @@ Set the common environment variables to configure the following aspects:
- `COMMON_HF_ENDPOINT`: URL of the HuggingFace Hub. Defaults to `https://huggingface.co`.
- `COMMON_HF_TOKEN`: App Access Token (ask moonlanding administrators to get one, only the `read` role is required) to access the gated datasets. Defaults to empty.
- `COMMON_LOG_LEVEL`: log level, among `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`. Defaults to `INFO`.

- `COMMON_CONTENT_MAX_SIZE`: the maximum size in bytes of the content to be stored in the cache. Defaults to `100_000_000`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move it to the "CACHE_" namespace? It would make more sense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will I have to move also that config to CacheConfig? If so, I am afraid to have the post_init connect_to_cache_database running.
Maybe add another namespace but will have to create another config class no handle it

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm currently working on removing all the "post_init" things from the config: we will have a new "Resource" object, that will be allocated and then released (database, log, storage, etc.)
So, if I finish soon enough, we could move this parameter to the cache config, it would be more logical.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per last commit, I kept the logic in the worker

@codecov-commenter
Copy link

codecov-commenter commented Feb 7, 2023

Codecov Report

Base: 90.69% // Head: 92.16% // Increases project coverage by +1.46% 🎉

Coverage data is based on head (f4131f2) compared to base (3ca72c6).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #780      +/-   ##
==========================================
+ Coverage   90.69%   92.16%   +1.46%     
==========================================
  Files          70       33      -37     
  Lines        3493     2271    -1222     
==========================================
- Hits         3168     2093    -1075     
+ Misses        325      178     -147     
Flag Coverage Δ
services_admin ?
services_api ?
workers_datasets_based 92.16% <100.00%> (+0.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...atasets_based/src/datasets_based/worker_factory.py 91.11% <ø> (ø)
.../datasets_based/tests/workers/test_dataset_info.py 100.00% <ø> (ø)
...rkers/datasets_based/tests/workers/test_parquet.py 100.00% <ø> (ø)
workers/datasets_based/tests/workers/test_sizes.py 100.00% <ø> (ø)
...orkers/datasets_based/src/datasets_based/config.py 99.00% <100.00%> (+0.02%) ⬆️
...orkers/datasets_based/src/datasets_based/worker.py 87.93% <100.00%> (+0.65%) ⬆️
...c/datasets_based/workers/_datasets_based_worker.py 100.00% <100.00%> (ø)
workers/datasets_based/tests/conftest.py 90.90% <100.00%> (+0.13%) ⬆️
workers/datasets_based/tests/test_worker.py 100.00% <100.00%> (ø)
workers/datasets_based/tests/test_worker_loop.py 100.00% <100.00%> (ø)
... and 38 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@AndreaFrancis AndreaFrancis force-pushed the dataset-info-big-content-error branch from bf6b956 to 4058200 Compare February 7, 2023 18:05
@@ -16,7 +16,7 @@ Set the common environment variables to configure the following aspects:
- `COMMON_HF_ENDPOINT`: URL of the HuggingFace Hub. Defaults to `https://huggingface.co`.
- `COMMON_HF_TOKEN`: App Access Token (ask moonlanding administrators to get one, only the `read` role is required) to access the gated datasets. Defaults to empty.
- `COMMON_LOG_LEVEL`: log level, among `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`. Defaults to `INFO`.

- `COMMON_CONTENT_MAX_SIZE`: the maximum size in bytes of the content to be stored in the cache. Defaults to `100_000_000`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm currently working on removing all the "post_init" things from the config: we will have a new "Resource" object, that will be allocated and then released (database, log, storage, etc.)
So, if I finish soon enough, we could move this parameter to the cache config, it would be more logical.

@AndreaFrancis AndreaFrancis marked this pull request as draft February 7, 2023 18:22
@severo
Copy link
Collaborator

severo commented Feb 8, 2023

I have changed my mind again 😬. Thinking more about it, I now believe that we want:

  • to handle the MongoDB 16MB limit nicely (currently, it returns UnexpectedError); here, we don't want to reduce this size, why would we limit the storage?
  • to limit the size of the REST API responses, at least some of them, to avoid wasting "bandwidth" and having big latencies.
    • for the lists of things (split names, config names, parquet files, etc.), we can paginate. Additionally, we could use the maximum size (instead of the number of elements) to define where to cut for the next page (not sure if it's a good idea).
    • for the other things (dataset_info, sizes), we should find a solution. It could be:
      • limiting the total size of the response
      • changing the format of the response so that it's small or it's a list of things that can be paginated. For example: /dataset-info is a nested dict that contains a list of features that can exceed the maximum size. Maybe we should not expose /dataset-info as such, but expose /features and /metadata (the rest)... I don't know. Also: maybe /first-rows could be paginated; that way, we can send bigger responses if needed, and the client decides to get more or fewer rows. Beware: it means changing how we store the response in the MongoDB, one response per page?

My ideas on this are not clear, so for now (this PR), I think we can:

  • add a limit on the response size, i.e.: in the worker/datasets_based/ directory (sorry to make you change again)

and create issues to:

  • handle the 16MB limit in MongoDB with a dedicated error
  • paginate the responses
  • change the format of some endpoints

@AndreaFrancis AndreaFrancis force-pushed the dataset-info-big-content-error branch 2 times, most recently from ad98e68 to c0b699c Compare February 8, 2023 16:31
@AndreaFrancis AndreaFrancis marked this pull request as ready for review February 8, 2023 16:34
@AndreaFrancis AndreaFrancis requested a review from severo February 8, 2023 16:51
Copy link
Collaborator

@severo severo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it's better that way.

@@ -37,6 +37,7 @@ jobs:
ADMIN_UVICORN_NUM_WORKERS: "2"
ADMIN_UVICORN_PORT: "8081"
COMMON_LOG_LEVEL: "DEBUG"
COMMON_CONTENT_MAX_SIZE : "10_000_000"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you move it to DATASETS_BASED_CONTENT_MAX_BYTES? (here, and elsewhere in the code: chart/, datasets_based/, tools/)

  • COMMON_ -> DATASETS_BASED_: what we really want would be WORKER_CONTENT_MAX_SIZE (or WORKER_RESPONSE_CONTENT_MAX_SIZE), but we will create this "namespace" when working on Use a generic worker #737, for now DATASETS_BASED is the namespace for all the workers)
  • SIZE -> BYTES: giving the unit avoids errors when modifying the configurations

@@ -47,6 +47,7 @@ common:
hfEndpoint: ""
# Log level
logLevel: "INFO"
contentMaxSize: 1000000
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"10_000_000", no?

@@ -61,7 +61,8 @@ common:
hfEndpoint: "https://huggingface.co"
# Log level
logLevel: "INFO"

contentMaxSize: 10_000_000
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it must be a string, not a number: "10_000_000"

@@ -71,6 +71,7 @@ common:
hfEndpoint: "https://huggingface.co"
# Log level
logLevel: "DEBUG"
contentMaxSize: 10_000_000
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it must be a string, not a number: "10_000_000"

@@ -16,7 +16,7 @@ Set the common environment variables to configure the following aspects:
- `COMMON_HF_ENDPOINT`: URL of the HuggingFace Hub. Defaults to `https://huggingface.co`.
- `COMMON_HF_TOKEN`: App Access Token (ask moonlanding administrators to get one, only the `read` role is required) to access the gated datasets. Defaults to empty.
- `COMMON_LOG_LEVEL`: log level, among `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`. Defaults to `INFO`.

- `COMMON_CONTENT_MAX_SIZE`: the maximum size in bytes of the content to be stored in the cache. Defaults to `10_000_000`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(to be moved to workers/datasets_based/README.md)

what about:

the maximum size in bytes of the response content computed by a worker (to prevent returning big responses in the REST API). Defaults to 10_000_000.

@@ -6,6 +6,7 @@ services:
COMMON_HF_ENDPOINT: ${COMMON_HF_ENDPOINT-https://hub-ci.huggingface.co}
COMMON_HF_TOKEN: ${COMMON_HF_TOKEN-hf_app_datasets-server_token}
COMMON_LOG_LEVEL: ${COMMON_LOG_LEVEL-INFO}
COMMON_CONTENT_MAX_SIZE: ${COMMON_CONTENT_MAX_SIZE-10_00_000}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10_000_000

@AndreaFrancis AndreaFrancis force-pushed the dataset-info-big-content-error branch from 51cf310 to 06b3e81 Compare February 9, 2023 14:04
@AndreaFrancis AndreaFrancis marked this pull request as draft February 9, 2023 14:13
@AndreaFrancis AndreaFrancis marked this pull request as ready for review February 9, 2023 16:30
Copy link
Collaborator

@severo severo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Details, and it's OK

@AndreaFrancis AndreaFrancis merged commit afd2e49 into huggingface:main Feb 9, 2023
@AndreaFrancis AndreaFrancis deleted the dataset-info-big-content-error branch February 9, 2023 17:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants