Dataset info big content error #780

AndreaFrancis · 2023-02-07T15:26:20Z

Adding content size validation before trying to insert/update in db.
Fix for #762 and #770
New DATASETS_BASED_CONTENT_MAX_BYTES configuration is added to limit the size of the content result of a worker compute process.

HuggingFaceDocBuilderDev · 2023-02-07T15:29:40Z

The documentation is not available anymore as the PR was closed or merged.

lhoestq

Awesome ! thank you :)

libs/libcommon/src/libcommon/config.py

.github/workflows/_e2e_tests.yml

severo · 2023-02-07T15:53:06Z

libs/libcommon/README.md

@@ -16,7 +16,7 @@ Set the common environment variables to configure the following aspects:
 - `COMMON_HF_ENDPOINT`: URL of the HuggingFace Hub. Defaults to `https://huggingface.co`.
 - `COMMON_HF_TOKEN`: App Access Token (ask moonlanding administrators to get one, only the `read` role is required) to access the gated datasets. Defaults to empty.
 - `COMMON_LOG_LEVEL`: log level, among `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`. Defaults to `INFO`.
-
+-  `COMMON_CONTENT_MAX_SIZE`: the maximum size in bytes of the content to be stored in the cache. Defaults to `100_000_000`


Could you move it to the "CACHE_" namespace? It would make more sense

Will I have to move also that config to CacheConfig? If so, I am afraid to have the post_init connect_to_cache_database running.
Maybe add another namespace but will have to create another config class no handle it

I'm currently working on removing all the "post_init" things from the config: we will have a new "Resource" object, that will be allocated and then released (database, log, storage, etc.)
So, if I finish soon enough, we could move this parameter to the cache config, it would be more logical.

As per last commit, I kept the logic in the worker

libs/libcommon/src/libcommon/config.py

workers/datasets_based/src/datasets_based/worker.py

codecov-commenter · 2023-02-07T18:00:57Z

Codecov Report

Base: 90.69% // Head: 92.16% // Increases project coverage by +1.46% 🎉

Coverage data is based on head (f4131f2) compared to base (3ca72c6).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #780      +/-   ##
==========================================
+ Coverage   90.69%   92.16%   +1.46%     
==========================================
  Files          70       33      -37     
  Lines        3493     2271    -1222     
==========================================
- Hits         3168     2093    -1075     
+ Misses        325      178     -147

Flag	Coverage Δ
services_admin	`?`
services_api	`?`
workers_datasets_based	`92.16% <100.00%> (+0.08%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
...atasets_based/src/datasets_based/worker_factory.py	`91.11% <ø> (ø)`
.../datasets_based/tests/workers/test_dataset_info.py	`100.00% <ø> (ø)`
...rkers/datasets_based/tests/workers/test_parquet.py	`100.00% <ø> (ø)`
workers/datasets_based/tests/workers/test_sizes.py	`100.00% <ø> (ø)`
...orkers/datasets_based/src/datasets_based/config.py	`99.00% <100.00%> (+0.02%)`	⬆️
...orkers/datasets_based/src/datasets_based/worker.py	`87.93% <100.00%> (+0.65%)`	⬆️
...c/datasets_based/workers/_datasets_based_worker.py	`100.00% <100.00%> (ø)`
workers/datasets_based/tests/conftest.py	`90.90% <100.00%> (+0.13%)`	⬆️
workers/datasets_based/tests/test_worker.py	`100.00% <100.00%> (ø)`
workers/datasets_based/tests/test_worker_loop.py	`100.00% <100.00%> (ø)`
... and 38 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

chart/env/dev.yaml

chart/env/prod.yaml

severo · 2023-02-07T18:15:58Z

libs/libcommon/README.md

@@ -16,7 +16,7 @@ Set the common environment variables to configure the following aspects:
 - `COMMON_HF_ENDPOINT`: URL of the HuggingFace Hub. Defaults to `https://huggingface.co`.
 - `COMMON_HF_TOKEN`: App Access Token (ask moonlanding administrators to get one, only the `read` role is required) to access the gated datasets. Defaults to empty.
 - `COMMON_LOG_LEVEL`: log level, among `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`. Defaults to `INFO`.
-
+-  `COMMON_CONTENT_MAX_SIZE`: the maximum size in bytes of the content to be stored in the cache. Defaults to `100_000_000`


I'm currently working on removing all the "post_init" things from the config: we will have a new "Resource" object, that will be allocated and then released (database, log, storage, etc.)
So, if I finish soon enough, we could move this parameter to the cache config, it would be more logical.

libs/libcommon/src/libcommon/simple_cache.py

workers/datasets_based/src/datasets_based/worker.py

severo · 2023-02-08T08:47:32Z

I have changed my mind again 😬. Thinking more about it, I now believe that we want:

to handle the MongoDB 16MB limit nicely (currently, it returns UnexpectedError); here, we don't want to reduce this size, why would we limit the storage?
to limit the size of the REST API responses, at least some of them, to avoid wasting "bandwidth" and having big latencies.
- for the lists of things (split names, config names, parquet files, etc.), we can paginate. Additionally, we could use the maximum size (instead of the number of elements) to define where to cut for the next page (not sure if it's a good idea).
- for the other things (dataset_info, sizes), we should find a solution. It could be:
  - limiting the total size of the response
  - changing the format of the response so that it's small or it's a list of things that can be paginated. For example: /dataset-info is a nested dict that contains a list of features that can exceed the maximum size. Maybe we should not expose /dataset-info as such, but expose /features and /metadata (the rest)... I don't know. Also: maybe /first-rows could be paginated; that way, we can send bigger responses if needed, and the client decides to get more or fewer rows. Beware: it means changing how we store the response in the MongoDB, one response per page?

My ideas on this are not clear, so for now (this PR), I think we can:

add a limit on the response size, i.e.: in the worker/datasets_based/ directory (sorry to make you change again)

and create issues to:

handle the 16MB limit in MongoDB with a dedicated error
paginate the responses
change the format of some endpoints

severo

Thanks, it's better that way.

severo · 2023-02-09T10:07:42Z

.github/workflows/_e2e_tests.yml

@@ -37,6 +37,7 @@ jobs:
          ADMIN_UVICORN_NUM_WORKERS: "2"
          ADMIN_UVICORN_PORT: "8081"
          COMMON_LOG_LEVEL: "DEBUG"
+          COMMON_CONTENT_MAX_SIZE : "10_000_000"


could you move it to DATASETS_BASED_CONTENT_MAX_BYTES? (here, and elsewhere in the code: chart/, datasets_based/, tools/)

COMMON_ -> DATASETS_BASED_: what we really want would be WORKER_CONTENT_MAX_SIZE (or WORKER_RESPONSE_CONTENT_MAX_SIZE), but we will create this "namespace" when working on Use a generic worker #737, for now DATASETS_BASED is the namespace for all the workers)

SIZE -> BYTES: giving the unit avoids errors when modifying the configurations

severo · 2023-02-09T10:11:12Z

chart/values.yaml

@@ -47,6 +47,7 @@ common:
  hfEndpoint: ""
  # Log level
  logLevel: "INFO"
+  contentMaxSize: 1000000


"10_000_000", no?

severo · 2023-02-09T10:11:38Z

chart/env/prod.yaml

@@ -61,7 +61,8 @@ common:
  hfEndpoint: "https://huggingface.co"
  # Log level
  logLevel: "INFO"
-
+  contentMaxSize: 10_000_000


I think it must be a string, not a number: "10_000_000"

severo · 2023-02-09T10:11:42Z

chart/env/dev.yaml

@@ -71,6 +71,7 @@ common:
  hfEndpoint: "https://huggingface.co"
  # Log level
  logLevel: "DEBUG"
+  contentMaxSize: 10_000_000


I think it must be a string, not a number: "10_000_000"

severo · 2023-02-09T10:15:32Z

libs/libcommon/README.md

@@ -16,7 +16,7 @@ Set the common environment variables to configure the following aspects:
 - `COMMON_HF_ENDPOINT`: URL of the HuggingFace Hub. Defaults to `https://huggingface.co`.
 - `COMMON_HF_TOKEN`: App Access Token (ask moonlanding administrators to get one, only the `read` role is required) to access the gated datasets. Defaults to empty.
 - `COMMON_LOG_LEVEL`: log level, among `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`. Defaults to `INFO`.
-
+-  `COMMON_CONTENT_MAX_SIZE`: the maximum size in bytes of the content to be stored in the cache. Defaults to `10_000_000`


(to be moved to workers/datasets_based/README.md)

what about:

the maximum size in bytes of the response content computed by a worker (to prevent returning big responses in the REST API). Defaults to 10_000_000.

severo · 2023-02-09T10:16:31Z

tools/docker-compose-dev-base.yml

@@ -6,6 +6,7 @@ services:
      COMMON_HF_ENDPOINT: ${COMMON_HF_ENDPOINT-https://hub-ci.huggingface.co}
      COMMON_HF_TOKEN: ${COMMON_HF_TOKEN-hf_app_datasets-server_token}
      COMMON_LOG_LEVEL: ${COMMON_LOG_LEVEL-INFO}
+      COMMON_CONTENT_MAX_SIZE: ${COMMON_CONTENT_MAX_SIZE-10_00_000}


workers/datasets_based/src/datasets_based/worker.py

workers/datasets_based/tests/workers/test__datasets_based_worker.py

Co-authored-by: Sylvain Lesage <[email protected]>

severo

Details, and it's OK

chart/values.yaml

tools/docker-compose-dev-base.yml

workers/datasets_based/tests/conftest.py

Co-authored-by: Sylvain Lesage <[email protected]>

AndreaFrancis marked this pull request as draft February 7, 2023 15:26

AndreaFrancis marked this pull request as ready for review February 7, 2023 15:43

AndreaFrancis requested review from severo, albertvillanova and lhoestq February 7, 2023 15:43

lhoestq approved these changes Feb 7, 2023

View reviewed changes

libs/libcommon/src/libcommon/config.py Outdated Show resolved Hide resolved

severo requested changes Feb 7, 2023

View reviewed changes

AndreaFrancis force-pushed the dataset-info-big-content-error branch from bf6b956 to 4058200 Compare February 7, 2023 18:05

severo reviewed Feb 7, 2023

View reviewed changes

AndreaFrancis marked this pull request as draft February 7, 2023 18:22

AndreaFrancis force-pushed the dataset-info-big-content-error branch 2 times, most recently from ad98e68 to c0b699c Compare February 8, 2023 16:31

AndreaFrancis marked this pull request as ready for review February 8, 2023 16:34

AndreaFrancis requested a review from severo February 8, 2023 16:51

This was referenced Feb 8, 2023

Handle the 16MB limit in MongoDB with a dedicated error #795

Closed

Paginate responses #796

Closed

severo reviewed Feb 9, 2023

View reviewed changes

AndreaFrancis and others added 9 commits February 9, 2023 10:03

Adding validation and test

b4138bc

Adding new configuration

6c31561

Applying suggestions

2ee7790

Fixing typo

1741bc5

Rollback to previous changes in worker

15da4a5

Removing unused code

8a77a52

Removing unused code in simple cache

8b1c789

Apply suggestions from code review

51b361e

Co-authored-by: Sylvain Lesage <[email protected]>

Applying suggestions on code review

06b3e81

AndreaFrancis force-pushed the dataset-info-big-content-error branch from 51cf310 to 06b3e81 Compare February 9, 2023 14:04

AndreaFrancis marked this pull request as draft February 9, 2023 14:13

AndreaFrancis added 2 commits February 9, 2023 11:28

Fix broken classes

cef3101

Make style

3b5690c

AndreaFrancis marked this pull request as ready for review February 9, 2023 16:30

AndreaFrancis requested review from lhoestq and severo February 9, 2023 16:30

severo approved these changes Feb 9, 2023

View reviewed changes

chart/values.yaml Outdated Show resolved Hide resolved

tools/docker-compose-dev-base.yml Outdated Show resolved Hide resolved

workers/datasets_based/tests/conftest.py Outdated Show resolved Hide resolved

Apply suggestions from code review

f4131f2

Co-authored-by: Sylvain Lesage <[email protected]>

AndreaFrancis merged commit afd2e49 into huggingface:main Feb 9, 2023

AndreaFrancis deleted the dataset-info-big-content-error branch February 9, 2023 17:55

severo mentioned this pull request Feb 13, 2023

Handle the case where the DatasetInfo is too big #762

Closed

AndreaFrancis mentioned this pull request Feb 13, 2023

Add a check in /first-rows worker if truncating didn't succeed #770

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dataset info big content error #780

Dataset info big content error #780

AndreaFrancis commented Feb 7, 2023 •

edited

Loading

HuggingFaceDocBuilderDev commented Feb 7, 2023 •

edited

Loading

lhoestq left a comment

severo Feb 7, 2023

AndreaFrancis Feb 7, 2023

severo Feb 7, 2023

AndreaFrancis Feb 8, 2023

codecov-commenter commented Feb 7, 2023 •

edited

Loading

severo Feb 7, 2023

severo commented Feb 8, 2023 •

edited

Loading

severo left a comment

severo Feb 9, 2023

severo Feb 9, 2023

severo Feb 9, 2023

severo Feb 9, 2023

severo Feb 9, 2023

severo Feb 9, 2023

severo left a comment

Dataset info big content error #780

Dataset info big content error #780

Conversation

AndreaFrancis commented Feb 7, 2023 • edited Loading

HuggingFaceDocBuilderDev commented Feb 7, 2023 • edited Loading

lhoestq left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov-commenter commented Feb 7, 2023 • edited Loading

Codecov Report

Choose a reason for hiding this comment

severo commented Feb 8, 2023 • edited Loading

severo left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

severo left a comment

Choose a reason for hiding this comment

AndreaFrancis commented Feb 7, 2023 •

edited

Loading

HuggingFaceDocBuilderDev commented Feb 7, 2023 •

edited

Loading

codecov-commenter commented Feb 7, 2023 •

edited

Loading

severo commented Feb 8, 2023 •

edited

Loading