-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dataset info big content error #780
Dataset info big content error #780
Conversation
The documentation is not available anymore as the PR was closed or merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome ! thank you :)
libs/libcommon/README.md
Outdated
@@ -16,7 +16,7 @@ Set the common environment variables to configure the following aspects: | |||
- `COMMON_HF_ENDPOINT`: URL of the HuggingFace Hub. Defaults to `https://huggingface.co`. | |||
- `COMMON_HF_TOKEN`: App Access Token (ask moonlanding administrators to get one, only the `read` role is required) to access the gated datasets. Defaults to empty. | |||
- `COMMON_LOG_LEVEL`: log level, among `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`. Defaults to `INFO`. | |||
|
|||
- `COMMON_CONTENT_MAX_SIZE`: the maximum size in bytes of the content to be stored in the cache. Defaults to `100_000_000` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move it to the "CACHE_" namespace? It would make more sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will I have to move also that config to CacheConfig
? If so, I am afraid to have the post_init
connect_to_cache_database
running.
Maybe add another namespace but will have to create another config class no handle it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm currently working on removing all the "post_init" things from the config: we will have a new "Resource" object, that will be allocated and then released (database, log, storage, etc.)
So, if I finish soon enough, we could move this parameter to the cache config, it would be more logical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per last commit, I kept the logic in the worker
Codecov ReportBase: 90.69% // Head: 92.16% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #780 +/- ##
==========================================
+ Coverage 90.69% 92.16% +1.46%
==========================================
Files 70 33 -37
Lines 3493 2271 -1222
==========================================
- Hits 3168 2093 -1075
+ Misses 325 178 -147
Flags with carried forward coverage won't be shown. Click here to find out more.
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
bf6b956
to
4058200
Compare
libs/libcommon/README.md
Outdated
@@ -16,7 +16,7 @@ Set the common environment variables to configure the following aspects: | |||
- `COMMON_HF_ENDPOINT`: URL of the HuggingFace Hub. Defaults to `https://huggingface.co`. | |||
- `COMMON_HF_TOKEN`: App Access Token (ask moonlanding administrators to get one, only the `read` role is required) to access the gated datasets. Defaults to empty. | |||
- `COMMON_LOG_LEVEL`: log level, among `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`. Defaults to `INFO`. | |||
|
|||
- `COMMON_CONTENT_MAX_SIZE`: the maximum size in bytes of the content to be stored in the cache. Defaults to `100_000_000` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm currently working on removing all the "post_init" things from the config: we will have a new "Resource" object, that will be allocated and then released (database, log, storage, etc.)
So, if I finish soon enough, we could move this parameter to the cache config, it would be more logical.
I have changed my mind again 😬. Thinking more about it, I now believe that we want:
My ideas on this are not clear, so for now (this PR), I think we can:
and create issues to:
|
ad98e68
to
c0b699c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, it's better that way.
.github/workflows/_e2e_tests.yml
Outdated
@@ -37,6 +37,7 @@ jobs: | |||
ADMIN_UVICORN_NUM_WORKERS: "2" | |||
ADMIN_UVICORN_PORT: "8081" | |||
COMMON_LOG_LEVEL: "DEBUG" | |||
COMMON_CONTENT_MAX_SIZE : "10_000_000" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you move it to DATASETS_BASED_CONTENT_MAX_BYTES
? (here, and elsewhere in the code: chart/, datasets_based/, tools/)
COMMON_
->DATASETS_BASED_
: what we really want would beWORKER_CONTENT_MAX_SIZE
(orWORKER_RESPONSE_CONTENT_MAX_SIZE
), but we will create this "namespace" when working on Use a generic worker #737, for nowDATASETS_BASED
is the namespace for all the workers)SIZE
->BYTES
: giving the unit avoids errors when modifying the configurations
chart/values.yaml
Outdated
@@ -47,6 +47,7 @@ common: | |||
hfEndpoint: "" | |||
# Log level | |||
logLevel: "INFO" | |||
contentMaxSize: 1000000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"10_000_000", no?
chart/env/prod.yaml
Outdated
@@ -61,7 +61,8 @@ common: | |||
hfEndpoint: "https://huggingface.co" | |||
# Log level | |||
logLevel: "INFO" | |||
|
|||
contentMaxSize: 10_000_000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it must be a string, not a number: "10_000_000"
chart/env/dev.yaml
Outdated
@@ -71,6 +71,7 @@ common: | |||
hfEndpoint: "https://huggingface.co" | |||
# Log level | |||
logLevel: "DEBUG" | |||
contentMaxSize: 10_000_000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it must be a string, not a number: "10_000_000"
libs/libcommon/README.md
Outdated
@@ -16,7 +16,7 @@ Set the common environment variables to configure the following aspects: | |||
- `COMMON_HF_ENDPOINT`: URL of the HuggingFace Hub. Defaults to `https://huggingface.co`. | |||
- `COMMON_HF_TOKEN`: App Access Token (ask moonlanding administrators to get one, only the `read` role is required) to access the gated datasets. Defaults to empty. | |||
- `COMMON_LOG_LEVEL`: log level, among `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`. Defaults to `INFO`. | |||
|
|||
- `COMMON_CONTENT_MAX_SIZE`: the maximum size in bytes of the content to be stored in the cache. Defaults to `10_000_000` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(to be moved to workers/datasets_based/README.md)
what about:
the maximum size in bytes of the response content computed by a worker (to prevent returning big responses in the REST API). Defaults to 10_000_000
.
tools/docker-compose-dev-base.yml
Outdated
@@ -6,6 +6,7 @@ services: | |||
COMMON_HF_ENDPOINT: ${COMMON_HF_ENDPOINT-https://hub-ci.huggingface.co} | |||
COMMON_HF_TOKEN: ${COMMON_HF_TOKEN-hf_app_datasets-server_token} | |||
COMMON_LOG_LEVEL: ${COMMON_LOG_LEVEL-INFO} | |||
COMMON_CONTENT_MAX_SIZE: ${COMMON_CONTENT_MAX_SIZE-10_00_000} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
10_000_000
workers/datasets_based/tests/workers/test__datasets_based_worker.py
Outdated
Show resolved
Hide resolved
Co-authored-by: Sylvain Lesage <[email protected]>
51cf310
to
06b3e81
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Details, and it's OK
Co-authored-by: Sylvain Lesage <[email protected]>
Adding content size validation before trying to insert/update in db.
Fix for #762 and #770
New
DATASETS_BASED_CONTENT_MAX_BYTES
configuration is added to limit the size of the content result of a worker compute process.