feat(clp-package): Add support for extracting JSON streams from archives stored on S3. #678
Conversation
Walkthrough

This pull request focuses on enhancing the handling of storage types, particularly for S3 storage, across two key components: the decompression utility script and the stream extraction task. The changes introduce more granular validation, adding support for checking both the storage type and the storage engine before an extraction is allowed to proceed.
Actionable comments posted: 1
🧹 Nitpick comments (1)

components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py (1)

Lines 109-112: Avoid including all environment variables when setting AWS credentials

Including the entire os.environ in env_vars may unintentionally expose unnecessary or sensitive environment variables to the subprocess. It's safer to include only the required AWS credentials. Apply this diff to set only the necessary environment variables:
```diff
-env_vars = {
-    **os.environ,
-    "AWS_ACCESS_KEY_ID": s3_config.credentials.access_key_id,
-    "AWS_SECRET_ACCESS_KEY": s3_config.credentials.secret_access_key,
-}
+env_vars = {
+    "AWS_ACCESS_KEY_ID": aws_access_key_id,
+    "AWS_SECRET_ACCESS_KEY": aws_secret_access_key,
+}
```
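As a concrete illustration of the suggestion, here is a minimal, self-contained sketch (the function and variable names are illustrative, not from the PR) of launching a subprocess with only the credentials it needs:

```python
import subprocess
from typing import List


def run_with_scoped_credentials(
    command: List[str], access_key_id: str, secret_access_key: str
) -> subprocess.CompletedProcess:
    # Passing `env=` replaces the parent's environment entirely, so nothing
    # else from os.environ (tokens, proxies, etc.) leaks into the child.
    env_vars = {
        "AWS_ACCESS_KEY_ID": access_key_id,
        "AWS_SECRET_ACCESS_KEY": secret_access_key,
    }
    return subprocess.run(command, env=env_vars, check=True)
```

One trade-off worth noting: because env= replaces the environment wholesale, the child also loses PATH, so the command should be given as an absolute path or PATH should be added back explicitly.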
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
components/clp-package-utils/clp_package_utils/scripts/decompress.py (3 hunks)
components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py (4 hunks)
🔇 Additional comments (3)

components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py (1)

Lines 142-172: Well-structured command generation logic

The introduction of the _make_command_and_env_vars function and the separation of command generation based on storage_engine enhance code modularity and readability.
components/clp-package-utils/clp_package_utils/scripts/decompress.py (2)

Lines 85-90: Enhanced validation of storage type and engine

Adding the check for both StorageType.S3 and StorageEngine.CLP improves error handling by providing clearer context when file extraction is unsupported.

Lines 169-173: Improved error handling for unsupported stream extraction

Validating both storage_type and storage_engine before proceeding ensures that unsupported extraction scenarios are correctly identified and handled.
```python
if StorageType.FS != storage_type:
    logger.error(f"File extraction is not supported for archive storage type: {storage_type}.")
storage_engine = clp_config.package.storage_engine
if StorageType.S3 == storage_type and StorageEngine.CLP == storage_engine:
```
Just to confirm, do we now have support for decompressing clp-s archives from S3 in our decompression script?
Yep, added by this PR.
Are you sure? File decompression doesn't go through _task.py, though; it's handled directly by native/decompress.py.
Ah I see what you mean. The extract file command doesn't seem to be supported for clp-s at all (the script is hardcoded for clp). We didn't previously have a check to prevent using it for clp-s, but I'll add one.
I believe we will need a check to ensure it's only supported for the CLP & FS combination.
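A guard along those lines might look like the following sketch; the helper name is hypothetical, and StorageType/StorageEngine are the enums visible in the diffs above:

```python
def check_file_extraction_supported(storage_type, storage_engine, logger) -> bool:
    # Sketch: file extraction is only supported for the CLP storage engine
    # combined with filesystem-backed (FS) archive storage.
    if StorageType.FS != storage_type or StorageEngine.CLP != storage_engine:
        logger.error(
            "File extraction requires the CLP storage engine with FS archive "
            f"storage; got engine '{storage_engine}' and type '{storage_type}'."
        )
        return False
    return True
```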
majorly nit
…extract_stream_task.py

Co-authored-by: haiqi96 <[email protected]>
```
@@ -162,9 +166,11 @@ def handle_extract_stream_cmd(
        return -1

    storage_type = clp_config.archive_output.storage.type
    if StorageType.FS != storage_type:
    storage_engine = clp_config.package.storage_engine
    if StorageType.S3 == storage_type and StorageEngine.CLP == storage_engine:
```
nit: I might write if StorageType.S3 == storage_type and StorageEngine.CLP_S != storage_engine, since we only expect CLP-S to support S3. But not a big deal.
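For contrast, the suggested variant would flag S3 with any engine other than CLP-S, instead of naming CLP specifically — a sketch, with the error-handling convention assumed:

```python
# Reviewer's suggested shape (sketch): only CLP-S is expected to support
# S3-backed archives, so reject S3 under every other engine.
if StorageType.S3 == storage_type and StorageEngine.CLP_S != storage_engine:
    logger.error(f"Stream extraction is not supported for archive storage type: {storage_type}.")
    return -1
```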
For the title, how about:
feat(clp-package): Add support for extracting IR/JSON streams from S3.
or
feat(clp-package): Add support for extracting IR/JSON streams from archives stored on S3.
Since we don't actually add support for extracting to IR, how about: feat(clp-package): Add support for extracting JSON streams from archives stored on S3.

Right, right. The title you suggested makes more sense.
Description
This PR adds support in the package for extracting JSON streams from archives that are stored on S3. This feature is supported by pulling credentials from WorkerConfig, much like in #674.
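For illustration, the credential flow might look roughly like this; the attribute path to s3_config is an assumption about WorkerConfig's shape, while the credentials.* fields match the review diff above:

```python
# Sketch: read S3 credentials from the worker's config and hand only those
# to the extraction subprocess as environment variables.
s3_config = worker_config.archive_output.storage.s3_config  # assumed path
env_vars = {
    "AWS_ACCESS_KEY_ID": s3_config.credentials.access_key_id,
    "AWS_SECRET_ACCESS_KEY": s3_config.credentials.secret_access_key,
}
```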
Validation performed
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Refactor
These updates improve the system's reliability and flexibility when handling different storage types during extraction processes.