feat(clp-s): Add support for ingesting logs from S3. #639

gibber9809 · 2024-12-16T22:24:34Z

Description

This PR adds support for ingesting logs from S3, and majorly refactors the command line interface to simplify specifying resources that exist at some network location. A follow-up PR will build on the utilties added by this PR to implement search on archives located on s3, as well as general search on single-file archives.

The major changes are as follows

Added InputConfig.{hpp,cpp} which introduces new Path and NetworkAuth structs which can be used in combination with new utilties in ReaderUtils to open readers for resources which may exist on the network or local filesystem
Modified JsonFileIterator and JsonParser to transparently support ingestion from the network or filesystem via new Path and NetworkAuth structs
Modified ArchiveReader to accept Path and NetworkAuth, and just return an error when the requested archive is a SFA or exists on the network.
Modified the command line arguments to allow specifying the full path to an archive (the --archive-id option still exists, but is not required). Archive id is also no longer passed around explicitly, but rather determined based on the path to an archive.
Made CommandLineArguments fully determine the paths to all relevant files/archives based on the command line options to allow simplifying clp-s.cpp
Removed uses of boost::filesystem

Validation performed

Validated that logs from s3 can be ingested succesfully
Validated that search/extraction correctly error out when trying to interact with S3 or any single file archive
Validated that compression, decompression, and search function as normal for multi-file archives

Summary by CodeRabbit

Release Notes

New Features
- Introduced support for network authentication and input path handling in command-line arguments.
- Added new functionalities for managing input configurations and creating readers for file and network sources.
- Enhanced command-line argument parsing with new options for authentication types and input paths.
- Integrated new source files for improved functionality in the application.
Bug Fixes
- Enhanced error handling for archive operations and input path validations.
Documentation
- Updated method signatures and documentation to reflect changes in input path and authentication handling.
Refactor
- Transitioned from Boost Filesystem to C++17's standard filesystem library for improved maintainability.
- Streamlined the handling of archive paths and JSON parsing logic for clarity and efficiency.

…rce including network auth

…dd utilities for finding archive Id given a Path

coderabbitai · 2024-12-16T22:24:42Z

Walkthrough

This pull request introduces significant changes to the CLP (Compressed Log Processing) project, focusing on enhancing input path handling, network authentication, and filesystem operations. The modifications primarily involve transitioning from Boost Filesystem to C++17 standard filesystem, introducing new input configuration management, and updating various components to support more flexible file and archive handling. The changes include adding new classes like InputConfig, modifying reader utilities, and updating command-line argument processing to support network and filesystem sources.

Changes

File	Change Summary
`components/core/CMakeLists.txt`	Added `Boost::url` library linking, updated source files.
`components/core/src/clp_s/CMakeLists.txt`	Added multiple new source files, updated library dependencies.
`components/core/src/clp_s/InputConfig.{cpp,hpp}`	New files introducing input configuration management, path source determination, and authentication options.
`components/core/src/clp_s/CommandLineArguments.{cpp,hpp}`	Updated to handle new input paths and network authentication.
`components/core/src/clp_s/ArchiveReader.{cpp,hpp}`	Modified `open` method to use new path and authentication handling.
`components/core/src/clp_s/ReaderUtils.{cpp,hpp}`	Added methods for creating readers with network and file support.
`components/core/src/clp_s/Utils.{cpp,hpp}`	Transitioned from Boost to standard filesystem, added new utility methods.
`components/core/src/clp_s/JsonConstructor.{cpp,hpp}`	Updated to utilize new path and authentication handling.
`components/core/src/clp_s/JsonParser.{cpp,hpp}`	Modified to support new input paths and network authentication.
`components/core/src/clp_s/JsonFileIterator.{cpp,hpp}`	Changed to use `ReaderInterface` for file reading.
`components/core/src/clp_s/ZstdDecompressor.cpp`	Transitioned to standard filesystem, updated error handling.
`components/core/src/clp_s/clp-s.cpp`	Updated command line argument handling for archives.
`components/core/src/clp_s/search/kql/CMakeLists.txt`	Updated library dependencies.
`components/core/tests/test-clp_s-end_to_end.cpp`	Adjusted tests to reflect new input path handling.

Suggested Reviewers

haiqi96
kirkrodrigues

Possibly Related PRs

test(clp-s): Add end-to-end test case for compression and extraction. #595: This PR modifies the CMakeLists.txt file to include new source files, including InputConfig.cpp and InputConfig.hpp, which are also referenced in the main PR, indicating a direct connection in terms of file management and dependencies.
feat(clp-s): Chunk output by size (in bytes) during ordered decompression. #600: This PR introduces changes to the CommandLineArguments class, including the handling of chunk sizes, which relates to the new functionalities added in the main PR regarding input configuration and handling.
feat(clp-s): Add command line options for stubbed out kv-pair-IR ingestion. #618: This PR adds command-line options for file types, which may interact with the new input configuration functionalities introduced in the main PR, particularly in how different input types are processed.
feat(clp-s): Add support for kv-pair-IR ingestion. #630: This PR adds support for key-value pair IR ingestion, which is relevant to the new functionalities in the main PR that enhance input handling and configuration.
fix(clp-package): Remove faulty error handling for parsing archive compression stats. #640: This PR addresses a bug in the compression workers' logic, which may relate to the overall handling of input and output processes that are being enhanced in the main PR.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 9

🔭 Outside diff range comments (1)

components/core/src/clp_s/JsonFileIterator.hpp (1)

Line range hint 17-26: Update constructor documentation to reflect parameter changes.

The Doxygen comments for the constructor still refer to file_name, but the parameter has been changed to reader. Please update the documentation to match the new parameter.

🧹 Nitpick comments (22)

components/core/src/clp_s/ZstdDecompressor.cpp (2)
6-8: Consider evaluating std::filesystem's memory mapping capabilities

While the transition to std::filesystem is a positive step, the code still relies on Boost for memory mapping. Consider evaluating whether std::filesystem's memory mapping capabilities could replace boost::iostreams for a more consistent standard library approach.

206-212: Align condition with coding guidelines

According to the coding guidelines, prefer false == <expression> over !<expression>. Consider updating the error check:
-    if (error_code) {
+    if (false == !error_code) {
components/core/src/clp_s/ReaderUtils.hpp (1)
73-74: Add documentation for the new method.

The method name suggests it might fail, but there's no documentation explaining:

The purpose of the method

Expected failure cases

Return value semantics (when nullptr might be returned)

Parameter requirements

Add documentation like this:
+    /**
+     * Attempts to create a reader for the given path with optional network authentication
+     * @param path The path to the resource (local or network)
+     * @param network_auth Network authentication options if required
+     * @return A reader interface if successful, nullptr otherwise
+     * @throws OperationFailed if the resource exists but cannot be read
+     */
     static std::shared_ptr<clp::ReaderInterface>
     try_create_reader(Path const& path, NetworkAuthOption const& network_auth);
components/core/tests/test-clp_s-end_to_end.cpp (1)
103-112: Enhance error handling for archive path operations.

While the implementation correctly uses the new Path struct, consider adding error handling for scenarios where:

The archive path is inaccessible

The archive path permissions are insufficient

Consider this improvement:
 for (auto const& entry : std::filesystem::directory_iterator(cTestEndToEndArchiveDirectory)) {
     if (false == entry.is_directory()) {
         // Skip non-directories
         continue;
     }

+    try {
         constructor_option.archive_path = clp_s::Path{
                 .source{clp_s::InputSource::Filesystem},
                 .path{entry.path().string()}
         };
         clp_s::JsonConstructor constructor{constructor_option};
         constructor.store();
+    } catch (const std::filesystem::filesystem_error& e) {
+        FAIL("Failed to process archive: " << e.what());
+    }
 }
components/core/src/clp_s/Utils.hpp (1)
48-61: Consider additional URI edge cases.

The method documentation provides good examples, but consider handling these edge cases:

Empty URIs

URIs with query parameters but no path

URIs with fragments

URIs with encoded characters

Consider adding validation for these cases and updating the documentation with examples:
 /**
  * Gets the last component of a uri.
  *
  * For example:
  * https://www.something.org/abc-xyz -> abc-xyz
  * https://www.something.org/aaa/bbb/abc-xyz?something=something -> abc-xyz
+ * https://www.something.org -> "" (empty URI path)
+ * https://www.something.org?param=value -> "" (no path component)
+ * https://www.something.org/abc%20xyz -> "abc xyz" (encoded characters)
+ * https://www.something.org/abc#section -> "abc" (with fragment)
  *
  * @param uri
  * @param name Returned component name
  * @return true on success, false otherwise
  */
components/core/src/clp_s/CMakeLists.txt (2)
221-222: Verify version constraints for new dependencies

The addition of Boost::regex, Boost::url, CURL, and OpenSSL::Crypto libraries requires careful version management:

Consider adding minimum version requirements for these dependencies

Document any specific version requirements in the project's README

Consider adding version checks in CMake:
find_package(Boost 1.81 REQUIRED COMPONENTS regex url)
find_package(CURL 7.88 REQUIRED)
find_package(OpenSSL 3.0 REQUIRED)
Also applies to: 228-228

82-83: Consider separating network configuration

The new InputConfig component appears to handle both local and network paths. Consider:

Separating network authentication configuration into a dedicated component

Using the factory pattern for reader creation based on path type
components/core/src/clp_s/InputConfig.hpp (2)

16-16: Typographical error in comment: "definining" should be "defining"

There is a typo in the comment on line 16. The word "definining" should be corrected to "defining".

49-49: Incomplete parameter documentation in function comments

The @param annotations lack descriptions for the parameters. Please provide brief explanations for each parameter to improve code readability and maintainability.

Also applies to: 55-55, 63-63, 72-72, 80-80, 89-89, 98-98

components/core/src/clp_s/ArchiveReader.hpp (1)

34-35: Update parameter documentation.

The parameter documentation should be updated to describe the new parameters:

archive_path: Path to the archive

network_auth: Network authentication options
components/core/src/clp_s/clp-s.cpp (1)
355-361: Enhance error message specificity.

While the error handling is good, consider making the error message more specific by including the failure reason:
-                SPDLOG_ERROR("Failed to open archive - {}", e.what());
+                SPDLOG_ERROR("Failed to open archive at {} - {}", archive_path.path, e.what());
components/core/src/clp_s/JsonParser.cpp (1)
427-429: Add specific error message for reader creation failure.

While the error handling exists, it would be helpful to log the specific reason for reader creation failure:
         auto reader{ReaderUtils::try_create_reader(path, m_network_auth)};
         if (nullptr == reader) {
+            SPDLOG_ERROR("Failed to create reader for path {}", path.path);
             m_archive_writer->close();
             return false;
         }
components/core/src/clp_s/InputConfig.cpp (3)
11-17: Consider logging exceptions in get_source_for_path for better debugging

In the function get_source_for_path, exceptions are caught but not logged. Logging the exception details can assist in diagnosing issues when the function defaults to InputSource::Network due to an exception.

Apply the following diff to add exception logging:
 try {
     return std::filesystem::exists(path) ? InputSource::Filesystem : InputSource::Network;
 } catch (std::exception const& e) {
+    Logger::error("Exception occurred while checking path existence: {}", e.what());
     return InputSource::Network;
 }
76-86: Remove unreachable code after the switch statement in get_archive_id_from_path

The return true; statement at line 85 is unreachable because all paths in the switch statement end with a return statement. Removing this line will clean up the code and prevent any confusion.

Apply the following diff:
     default:
         return false;
     }
-    return true;
 }
27-47: Refactor duplicated logic into a shared utility function

The functions get_input_files_for_path and get_input_archives_for_path share similar logic for handling input paths and populating vectors. Consider refactoring the common code into a shared helper function to reduce duplication and enhance maintainability.

Also applies to: 54-74
components/core/src/clp_s/ReaderUtils.cpp (2)
163-180: Log exceptions when URL signing fails

In try_sign_url, exceptions are caught but not logged. Logging exceptions can provide valuable information during debugging.

Apply this diff to add error logging:
     } catch (std::exception const& e) {
+        SPDLOG_ERROR("Exception during URL signing: {}", e.what());
         return false;
     }
     return true;
 }
206-215: Add logging for unhandled InputSource types

When path.source is neither Filesystem nor Network, the function returns nullptr without any logging. For better debugging and maintenance, consider logging a warning when an unrecognized InputSource is encountered.

Apply this diff:
     } else {
+        SPDLOG_WARN("Unhandled InputSource type: {}", static_cast<int>(path.source));
         return nullptr;
     }
 }
components/core/src/clp_s/JsonFileIterator.hpp (1)

6-7: Remove unnecessary include of FileReader.hpp if it's no longer used.

Since FileReader is no longer utilized in this class, consider removing the include directive for FileReader.hpp to clean up the code.
components/core/src/clp_s/Utils.cpp (2)
68-100: Consider refactoring constant comparisons for maintainability

In directory_is_multi_file_archive, the multiple comparisons of formatted_name against constants can be simplified for better readability and maintainability. Consider using a std::unordered_set or std::set to store the constants and check for membership.

Here is a possible refactor:
// Define the set of archive filenames
static const std::unordered_set<std::string> archive_files = {
    constants::cArchiveTimestampDictFile,
    constants::cArchiveSchemaTreeFile,
    constants::cArchiveSchemaMapFile,
    constants::cArchiveVarDictFile,
    constants::cArchiveLogDictFile,
    constants::cArchiveArrayDictFile,
    constants::cArchiveTableMetadataFile,
    constants::cArchiveTablesFile
};

//...

if (archive_files.count(formatted_name) > 0) {
    continue;
} else {
    // ... Remaining code
}
91-97: Avoid using exceptions for control flow when parsing integers

Using exceptions for control flow can be inefficient. Instead of using std::stoi inside a try-catch block, consider checking if file_name is numeric before attempting to convert it.

Here is a possible refactor:
// Check if file_name is numeric
if (std::all_of(file_name.begin(), file_name.end(), ::isdigit)) {
    // It's a number, continue processing
    continue;
} else {
    return false;
}
components/core/src/clp_s/CommandLineArguments.cpp (2)

426-451: Refactor duplicated code for handling archive_path and archive_id

The logic for processing archive_path and archive_id is duplicated in multiple commands (Extract and Search). Refactoring this code into a shared function can improve maintainability and reduce code duplication.

Also applies to: 694-719

301-306: Refactor authentication handling into a common utility function

The authentication type handling logic is repeated across different command parsing sections (Compress, Extract, and Search). Consider abstracting this into a shared utility function to reduce redundancy and enhance code maintainability.

Also applies to: 447-451, 715-719

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 880a741 and a64bd6b.

📒 Files selected for processing (23)

components/core/CMakeLists.txt (2 hunks)
components/core/src/clp_s/ArchiveReader.cpp (1 hunks)
components/core/src/clp_s/ArchiveReader.hpp (2 hunks)
components/core/src/clp_s/ArchiveWriter.hpp (0 hunks)
components/core/src/clp_s/CMakeLists.txt (3 hunks)
components/core/src/clp_s/CommandLineArguments.cpp (15 hunks)
components/core/src/clp_s/CommandLineArguments.hpp (3 hunks)
components/core/src/clp_s/InputConfig.cpp (1 hunks)
components/core/src/clp_s/InputConfig.hpp (1 hunks)
components/core/src/clp_s/JsonConstructor.cpp (3 hunks)
components/core/src/clp_s/JsonConstructor.hpp (2 hunks)
components/core/src/clp_s/JsonFileIterator.cpp (2 hunks)
components/core/src/clp_s/JsonFileIterator.hpp (3 hunks)
components/core/src/clp_s/JsonParser.cpp (7 hunks)
components/core/src/clp_s/JsonParser.hpp (5 hunks)
components/core/src/clp_s/ReaderUtils.cpp (2 hunks)
components/core/src/clp_s/ReaderUtils.hpp (2 hunks)
components/core/src/clp_s/Utils.cpp (2 hunks)
components/core/src/clp_s/Utils.hpp (2 hunks)
components/core/src/clp_s/ZstdDecompressor.cpp (2 hunks)
components/core/src/clp_s/clp-s.cpp (7 hunks)
components/core/src/clp_s/search/kql/CMakeLists.txt (1 hunks)
components/core/tests/test-clp_s-end_to_end.cpp (3 hunks)

💤 Files with no reviewable changes (1)

components/core/src/clp_s/ArchiveWriter.hpp

🧰 Additional context used

📓 Path-based instructions (19)

components/core/tests/test-clp_s-end_to_end.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ArchiveReader.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ZstdDecompressor.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ArchiveReader.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/JsonParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/JsonConstructor.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/JsonConstructor.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/JsonParser.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/JsonFileIterator.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/Utils.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/InputConfig.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ReaderUtils.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/JsonFileIterator.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/CommandLineArguments.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/clp-s.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ReaderUtils.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/CommandLineArguments.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/Utils.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/InputConfig.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

🔇 Additional comments (33)

components/core/src/clp_s/ReaderUtils.hpp (2)

4-8: LGTM! Include directives are well-organized.

The new includes are properly ordered (standard library before project-specific headers) and necessary for the new functionality.

Line range hint 1-99: Verify the impact of removing get_archives.

The removal of get_archives method might affect existing code that relied on listing archives from a directory.

Let's verify the usage and potential impact:

✅ Verification successful

The removal of get_archives method appears to be safe

The verification shows that:

There are no direct references to ReaderUtils::get_archives in the codebase
The functionality for getting archives is properly handled through CommandLineArguments::get_archives_dir() and GlobalSQLiteMetadataDB
No test files reference this specific method

The removal aligns with the codebase's architecture where archive handling is managed through dedicated components rather than utility classes.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining references to the removed get_archives method

# Search for direct method calls
echo "Searching for direct method calls to get_archives..."
rg "get_archives\(" --type cpp

# Search for potential comments or TODOs referencing the method
echo "Searching for references in comments..."
rg "get_archives" --type cpp

# Search for tests that might need updating
echo "Searching for test files that might need updating..."
fd "test.*\.(cpp|hpp)$" | xargs rg "get_archives"

Length of output: 3638

Script:

#!/bin/bash
# Let's check for any references to ReaderUtils::get_archives specifically
echo "Searching for ReaderUtils::get_archives usage..."
rg "ReaderUtils::get_archives" --type cpp

# Let's also check if there are any references in the test files specifically for ReaderUtils
echo "Searching for ReaderUtils tests..."
fd "test.*\.(cpp|hpp)$" | xargs rg "ReaderUtils"

# Let's check the git history for the specific removal
echo "Checking git history for the removal..."
git log -p --no-merges -S "get_archives" -- "components/core/src/clp_s/ReaderUtils.hpp"

Length of output: 365

components/core/tests/test-clp_s-end_to_end.cpp (2)

12-12: LGTM!

The new include is properly placed and necessary for the updated path handling functionality.

74-76: Consider adding test cases for network sources.

While the implementation correctly uses the new Path struct for filesystem sources, we should also verify the S3 ingestion functionality mentioned in the PR objectives.

Let's check if there are any existing S3-related test cases:

Consider adding test cases that:

Verify S3 path handling
Test error scenarios with invalid S3 credentials
Validate behaviour with non-existent S3 buckets

components/core/src/clp_s/Utils.hpp (3)

8-9: LGTM: Required includes added and properly ordered.

The new includes for <string_view> and <vector> are necessary for the updated method signatures.

20-21: LGTM: Method name is now more descriptive.

The rename from find_all_files to find_all_files_in_directory better describes the method's purpose.

34-45: LGTM: Well-documented utility method with clear examples.

The get_last_non_empty_path_component method is thoroughly documented with examples that clarify its behaviour.

components/core/src/clp_s/CommandLineArguments.hpp (3)

15-15: LGTM: Include statement properly added

The addition of InputConfig.hpp is necessary for the new Path and NetworkAuthOption types.

55-57: LGTM: Well-designed getter methods

The getter methods are properly implemented with:

Const correctness
Efficient return by const reference
Clear naming that reflects their purpose

181-182: LGTM: Member variables properly structured

The member variables are well-organized and properly encapsulated. Let's verify the migration from file paths to input paths across the codebase.

✅ Verification successful

Migration from m_file_paths to m_input_paths verified

The codebase shows consistent usage of the new Path type and proper initialization of NetworkAuthOption:

The old m_file_paths member has been completely removed with no remaining references
m_input_paths is consistently used with the Path type across multiple components:
- CommandLineArguments.hpp
- JsonParser.hpp/cpp
- InputConfig.hpp/cpp
NetworkAuthOption is properly initialized with default construction where used

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify consistent usage of Path instead of std::string for paths
# and proper initialization of NetworkAuthOption across the codebase

# Check for any remaining uses of m_file_paths
rg "m_file_paths" --type cpp --type hpp

# Check for consistent use of Path type with m_input_paths
ast-grep --pattern 'std::vector<std::string> $path_var'

# Check for proper initialization of NetworkAuthOption
ast-grep --pattern 'NetworkAuthOption $var;'

Length of output: 172

Script:

#!/bin/bash
# Let's try with correct file extensions and more specific patterns

# Check for any remaining uses of m_file_paths
rg "m_file_paths" -t cpp -t h -t hpp

# Check for path-related vector declarations
rg "std::vector<.*Path.*>" -t cpp -t h -t hpp

# Check NetworkAuthOption usage
rg "NetworkAuthOption" -t cpp -t h -t hpp

# Look for input paths usage
rg "m_input_paths" -t cpp -t h -t hpp

Length of output: 276

Script:

#!/bin/bash
# Search using just the basic extensions and include context

# Check for any remaining uses of m_file_paths
rg "m_file_paths" -t cpp -t h --context 2

# Check for path-related vector declarations
rg "std::vector<.*Path.*>" -t cpp -t h --context 2

# Check NetworkAuthOption usage
rg "NetworkAuthOption" -t cpp -t h --context 2

# Look for input paths usage
rg "m_input_paths" -t cpp -t h --context 2

# Also try finding all header files to understand available extensions
fd -e h -e hpp -e hxx

Length of output: 36707

components/core/src/clp_s/search/kql/CMakeLists.txt (1)

28-28: LGTM! Removal of Boost::filesystem aligns with C++17 transition.

The change correctly removes the Boost::filesystem dependency since the target already uses C++20 which includes std::filesystem.

components/core/CMakeLists.txt (1)

260-261: LGTM! Addition of InputConfig files.

The new InputConfig files align with the PR's objective to support network resource handling.

components/core/src/clp_s/CMakeLists.txt (1)

22-23: Review thread safety of file operations

The addition of FileReader, NetworkReader, and related utilities suggests concurrent operations. Ensure:

Thread-safe file and network operations
Proper resource cleanup in error cases

Also applies to: 29-30, 37-38, 43-43, 46-47, 49-49

✅ Verification successful

Thread safety and resource management are properly implemented

The NetworkReader implementation demonstrates robust thread safety and resource management through:

Proper synchronization using std::mutex, std::condition_variable, and std::atomic variables
RAII-based resource management with clear cleanup in the destructor
Thread-safe buffer pool management with synchronized acquire/release operations
Proper error handling with try-catch blocks and state management

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for thread safety mechanisms and resource cleanup
ast-grep --pattern 'class $_Reader {
  $$$
  std::mutex $_
  $$$
}'

Length of output: 69

Script:

#!/bin/bash
# Search for FileReader implementation and mutex usage
rg -U "class .*Reader" -A 10

# Check for resource cleanup patterns
rg -U "~.*Reader|close\(\)|cleanup\(\)" -A 5

# Look for thread-safety related includes
rg "#include.*mutex|thread|atomic"

# Search for RAII patterns and resource management
rg "std::(unique_ptr|shared_ptr|lock_guard|scoped_lock)"

Length of output: 68699

Script:

#!/bin/bash
# Search for NetworkReader implementation to check thread safety
rg -U "class NetworkReader" -A 30

# Check for mutex and thread synchronization in NetworkReader
rg -U "std::mutex|std::thread|std::condition_variable" components/core/src/clp/NetworkReader

# Look for resource cleanup in NetworkReader destructor and error handling
rg -U "~NetworkReader|NetworkReader::cleanup" -A 10

Length of output: 4189

Script:

#!/bin/bash
# Search for synchronization primitives in NetworkReader implementation
rg -U "std::mutex|std::thread|std::condition_variable|std::atomic" components/core/src/clp/NetworkReader.hpp

# Look for buffer management and thread safety in NetworkReader
rg -U "class Buffer|BufferPool|synchronized|thread_safe" components/core/src/clp/NetworkReader.hpp

# Check for error handling and resource cleanup patterns
rg -U "try|catch|RAII|cleanup|release" components/core/src/clp/NetworkReader.cpp

Length of output: 1193

components/core/src/clp_s/ArchiveReader.cpp (1)

19-30: Proper input validation in the open method

The input validation checks in the open method enhance robustness by ensuring that the archive path is valid, the source is the filesystem, and that the path is a directory. This is a good practice to prevent potential errors during runtime.

components/core/src/clp_s/JsonConstructor.hpp (1)

30-35: Consistent update of JsonConstructorOption with new input structures

The modifications to JsonConstructorOption, replacing archives_dir and archive_id with archive_path and network_auth, are consistent with the updated input handling mechanism. This improves clarity and maintainability.

components/core/src/clp_s/JsonParser.hpp (1)

Line range hint 35-48: Enhancement of input handling and network authentication

The replacement of file_paths with input_paths of type std::vector<Path> and the addition of network_auth enhance the flexibility and scalability of the JsonParser. The changes are consistently applied in both the option struct and class members, aligning with the new input configuration approach.

Also applies to: 115-116

components/core/src/clp_s/ArchiveReader.hpp (2)

11-11: LGTM! Include changes align with network resource handling.

The addition of InputConfig.hpp and removal of boost/filesystem.hpp reflects the transition to the new input configuration management system.

37-37: LGTM! Method signature change supports network resources.

The updated signature properly supports both local and network-based archive access.

components/core/src/clp_s/clp-s.cpp (3)

15-15: LGTM! Proper curl initialization for network operations.

The addition of CurlGlobalInstance ensures proper initialization of libcurl for network operations.

Also applies to: 287-287

310-310: LGTM! Extract command properly handles network archives.

The changes correctly integrate network authentication and path handling for archive extraction.

Also applies to: 318-319

363-368: LGTM! Proper function call formatting.

The search_archive function call maintains good readability with proper parameter alignment.

components/core/src/clp_s/JsonParser.cpp (2)

23-24: LGTM! Proper member initialization.

The new member variables are correctly initialized in the constructor's initialization list.

439-439: LGTM! Consistent path reporting in error messages.

Error messages consistently use path.path for reporting file locations.

Also applies to: 473-473, 498-498, 532-532, 542-542

components/core/src/clp_s/JsonConstructor.cpp (2)

38-38: Function call updated with network authentication—Looks good!

The addition of m_option.network_auth to the m_archive_reader->open call correctly integrates network authentication into the archive opening process.

115-115: Duplicate concern regarding sanitization of get_archive_id()

As previously mentioned, when using m_archive_reader->get_archive_id()—here within the BSON document—it is important to validate and sanitize the archive ID to ensure data integrity and security.

components/core/src/clp_s/ReaderUtils.cpp (2)

3-9: Includes are appropriate and necessary

The added include directives are correct and required for the new functionality.

154-161: Proper exception handling in try_create_file_reader

The function correctly handles exceptions and logs errors when opening files. Good job.

components/core/src/clp_s/JsonFileIterator.cpp (1)

10-17: Constructor implementation is correct.

The constructor now correctly initializes m_reader with the provided clp::ReaderInterface& reader.

components/core/src/clp_s/Utils.cpp (5)

23-23: No issues found

The check for whether path is a directory is correctly implemented using std::filesystem::is_directory.

29-29: No issues found

Correctly checks if the directory is empty using std::filesystem::is_empty.

42-43: Efficient handling of empty directories

The code correctly disables recursion into empty directories, improving traversal efficiency.

126-148: No issues found

The function get_last_non_empty_path_component correctly retrieves the last non-empty component of a path.

150-161: No issues found

The function get_last_uri_component properly parses the URI and retrieves the last component using boost::urls.

components/core/src/clp_s/Utils.hpp

components/core/CMakeLists.txt

components/core/src/clp_s/CMakeLists.txt

components/core/src/clp_s/JsonConstructor.cpp

components/core/src/clp_s/ReaderUtils.cpp

components/core/src/clp_s/JsonFileIterator.hpp

components/core/src/clp_s/CommandLineArguments.cpp

haiqi96 · 2024-12-18T19:54:41Z

components/core/src/clp_s/ReaderUtils.cpp

-            archive_paths.push_back(iter->path().string());
+bool try_sign_url(std::string& url) {
+    clp::aws::AwsAuthenticationSigner signer{
+            std::getenv(cAwsAccessKeyIdEnvVar),


noticed that if I forget to pass in the access key, clp-s will throw a quite confusing message
terminate called after throwing an instance of 'std::logic_error' what(): basic_string::_M_construct null not valid Aborted

We should let signer parse the env vars, verify they are not null and then initilaize the member variables

Or we can implement our own helper which either parse the envvar or throw a more meaningful exception

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

components/core/src/clp_s/JsonParser.cpp (1)
474-477: Enhance error message clarity

The error messages could be more descriptive by indicating whether the path is a network URL or local file path.
-        "Encountered non-json-object while trying to parse {} after parsing {} bytes",
-        path.path,
+        "Encountered non-json-object while trying to parse {} ({}) after parsing {} bytes",
+        path.path,
+        path.is_network_path() ? "network path" : "local file",
Apply similar changes to other error messages for consistency.

Also applies to: 499-502, 533-536, 543-545

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 561634c and 307be9e.

📒 Files selected for processing (1)

components/core/src/clp_s/JsonParser.cpp (7 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

components/core/src/clp_s/JsonParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

🔇 Additional comments (1)

components/core/src/clp_s/JsonParser.cpp (1)

24-26: LGTM: Constructor member initialization list updated correctly

The new member variables are properly initialized in the constructor's initialization list, maintaining the class's initialization consistency.

components/core/src/clp_s/JsonParser.cpp

coderabbitai · 2024-12-19T22:19:41Z

components/core/src/clp_s/JsonParser.cpp

+    for (auto const& path : m_input_paths) {
+        auto reader{ReaderUtils::try_create_reader(path, m_network_auth)};
+        if (nullptr == reader) {


🛠️ Refactor suggestion

Add error logging for reader creation failure

When reader creation fails, we should log an error message before returning false to help with debugging.

auto reader{ReaderUtils::try_create_reader(path, m_network_auth)}; if (nullptr == reader) { + SPDLOG_ERROR("Failed to create reader for path: {}", path.path); m_archive_writer->close(); return false; }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

for (auto const& path : m_input_paths) {

auto reader{ReaderUtils::try_create_reader(path, m_network_auth)};

if (nullptr == reader) {

for (auto const& path : m_input_paths) {

auto reader{ReaderUtils::try_create_reader(path, m_network_auth)};

if (nullptr == reader) {

SPDLOG_ERROR("Failed to create reader for path: {}", path.path);

m_archive_writer->close();

return false;

}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

components/core/src/clp_s/clp-s.cpp (2)

Line range hint 318-324: Enhance error message specificity

Consider providing more specific error messages to distinguish between network authentication failures and other types of archive access failures.

-            SPDLOG_ERROR("{}", e.what());
+            SPDLOG_ERROR("Failed to decompress archive {}: {}", archive_path, e.what());

363-368: Improve readability of function call

The function call formatting could be improved for better readability.

-            if (false
-                == search_archive(
-                        command_line_arguments,
-                        archive_reader,
-                        expr->copy(),
-                        reducer_socket_fd
-                ))
+            if (false == search_archive(
+                    command_line_arguments,
+                    archive_reader,
+                    expr->copy(),
+                    reducer_socket_fd))

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 307be9e and 7557c84.

📒 Files selected for processing (1)

components/core/src/clp_s/clp-s.cpp (6 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

components/core/src/clp_s/clp-s.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

🔇 Additional comments (3)

components/core/src/clp_s/clp-s.cpp (3)

15-15: LGTM: CURL integration for S3 support

The addition of CURL support is appropriate for handling S3 operations.

Also applies to: 287-287

310-316: LGTM: Network authentication support for extraction

The network authentication support is properly integrated into the extraction process.

91-92: Verify network authentication handling

The addition of network authentication support aligns with S3 integration. However, we should verify the error handling for network authentication failures.

✅ Verification successful

Let me gather more information about the error handling implementation.

Let me check the network authentication implementation further.

Let me check the error handling in the network reader implementation.

Based on the review of the implementation, I can now provide a final response.

Network authentication error handling is properly implemented

The codebase shows comprehensive error handling for network authentication:

Network reader operations are wrapped in try-catch blocks with proper error logging
Authentication failures are caught and reported through the error handling system
The implementation includes proper cleanup of resources in error scenarios
Error propagation is maintained through multiple layers (ReaderUtils -> NetworkReader -> CurlDownloadHandler)

Key error handling locations:

components/core/src/clp_s/ReaderUtils.cpp: Catches and logs network reader failures
components/core/src/clp_s/clp-s.cpp: Handles archive opening failures with authentication
components/core/src/clp/NetworkReader.cpp: Implements robust error handling for network operations

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check error handling for network authentication
# Look for try-catch blocks around network operations

rg -A 5 "network_auth" --type cpp

Length of output: 9649

Script:

#!/bin/bash
# Check error handling in network reader creation and authentication
ast-grep --pattern 'try_create_network_reader($path, $auth) {
  $$$
}'

# Check for network-related error handling
rg -B 2 -A 5 "catch.*exception" --type cpp

Length of output: 34622

Script:

#!/bin/bash
# Check network authentication implementation
ast-grep --pattern 'try_create_network_reader($path, $auth) {
  $$$
}'

# Check network reader implementation
rg -A 10 "class.*NetworkReader" --type cpp

Length of output: 906

Script:

#!/bin/bash
# Check NetworkReader implementation and error handling
rg -B 2 -A 10 "NetworkReader::" --type cpp

# Check for network-related error codes
rg "ErrorCode.*Network" --type hpp

Length of output: 14029

components/core/src/clp_s/clp-s.cpp

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (2)

components/core/src/clp_s/JsonParser.cpp (2)

429-431: 🛠️ Refactor suggestion

Add detailed error logging for reader creation failure

When reader creation fails, we should provide more context about the failure reason.

 auto reader{ReaderUtils::try_create_reader(path, m_network_auth)};
 if (nullptr == reader) {
+    SPDLOG_ERROR("Failed to create reader for path: {} - Type: {}", path.path, path.type);
     m_archive_writer->close();
     return false;
 }

548-563: ⚠️ Potential issue

Improve network error handling structure

The current implementation has several issues that were previously identified:

The archive writer is closed inside the network error handling block
The error message could be more specific by including the HTTP status code

Apply this diff to improve the error handling:

 if (auto network_reader = std::dynamic_pointer_cast<clp::NetworkReader>(reader);
     nullptr != network_reader)
 {
     if (auto const rc = network_reader->get_curl_ret_code();
         rc.has_value() && CURLcode::CURLE_OK != rc.value())
     {
         auto curl_error_message = network_reader->get_curl_error_msg();
-        std::string error_msg_str;
-        if (curl_error_message.has_value()) {
-            error_msg_str = curl_error_message.value();
-        }
-        SPDLOG_ERROR("Encountered curl error during ingestion - {}", error_msg_str);
-        m_archive_writer->close();
+        SPDLOG_ERROR(
+            "Encountered curl error during ingestion of {} - Code: {} - {}",
+            path.path,
+            rc.value(),
+            curl_error_message.value_or("Unknown error")
+        );
         return false;
     }
 }

🧹 Nitpick comments (1)

components/core/src/clp_s/JsonParser.cpp (1)
441-441: Enhance error messages with path type information

Consider including the path type in error messages to help with debugging network-related issues.

Example for one of the messages:
 SPDLOG_ERROR(
-        "Encountered error - {} - while trying to parse {} after parsing {} bytes",
+        "Encountered error - {} - while trying to parse {} (type: {}) after parsing {} bytes",
         simdjson::error_message(json_file_iterator.get_error()),
         path.path,
+        path.type,
         bytes_consumed_up_to_prev_record
 );
Also applies to: 475-475, 500-500, 534-534, 544-544

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7557c84 and fc8fad6.

📒 Files selected for processing (1)

components/core/src/clp_s/JsonParser.cpp (7 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

components/core/src/clp_s/JsonParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

🔇 Additional comments (1)

components/core/src/clp_s/JsonParser.cpp (1)

24-26: LGTM: Constructor changes properly initialize new members

The initialization of new members follows the class's existing initialization pattern and aligns with the PR objectives.

… url authentication

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (6)

components/core/src/clp_s/ReaderUtils.cpp (5)
154-161: Enhance error message clarity

Consider adding the specific error code to the log message for better debugging:
-        SPDLOG_ERROR("Failed to open file for reading - {} - {}", file_path, e.what());
+        SPDLOG_ERROR("Failed to open file for reading - {} - Error: {} (code: {})", file_path, e.what(), e.get_error_code());
184-185: Enhance error logging in catch block

The catch block should log the specific error for better debugging:
-    } catch (std::exception const& e) {
+    } catch (std::exception const& e) {
+        SPDLOG_ERROR("Failed to sign URL: {}", e.what());
         return false;
201-203: Add logging for unsupported auth method

Consider adding a log message when an unsupported authentication method is encountered:
         default:
+            SPDLOG_ERROR("Unsupported authentication method: {}", static_cast<int>(auth.method));
             return nullptr;
208-209: Include sanitized URL in error message

Consider including a sanitized version of the URL in the error message for better debugging:
-        SPDLOG_ERROR("Failed to open url for reading - {}", e.what());
+        SPDLOG_ERROR("Failed to open url for reading - {} - {}", sanitize_url(request_url), e.what());
Note: You'll need to implement a sanitize_url function that removes sensitive information like access keys.

214-223: Add debug logging for input source

Consider adding debug logging to help with troubleshooting:
 std::shared_ptr<clp::ReaderInterface>
 ReaderUtils::try_create_reader(Path const& path, NetworkAuthOption const& network_auth) {
+    SPDLOG_DEBUG("Creating reader for {} source: {}", 
+        path.source == InputSource::Filesystem ? "filesystem" : "network",
+        path.path);
     if (InputSource::Filesystem == path.source) {
components/core/src/clp_s/JsonParser.cpp (1)

441-441: Consider using a more descriptive name for the path member.

The usage of path.path could be confusing. Consider renaming the member to something more descriptive like file_path or resource_path to better indicate its purpose.

Also applies to: 475-475, 500-500, 534-534, 544-544

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fc8fad6 and 2a87da2.

📒 Files selected for processing (2)

components/core/src/clp_s/JsonParser.cpp (7 hunks)
components/core/src/clp_s/ReaderUtils.cpp (2 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

components/core/src/clp_s/JsonParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ReaderUtils.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

🔇 Additional comments (5)

components/core/src/clp_s/ReaderUtils.cpp (2)

3-9: LGTM: Appropriate includes added

The new includes properly support the added functionality for logging, AWS authentication, and network operations.

164-172: AWS credentials validation implemented correctly

The implementation now properly checks for null environment variables and provides clear error messages.

components/core/src/clp_s/JsonParser.cpp (3)

24-26: LGTM: Constructor member initialization list is well-structured.

The initialization of new members follows the existing pattern and maintains consistency.

429-431: Add specific error logging for reader creation failure.

The error handling should include specific error messages to aid in debugging.

548-564: 🛠️ Refactor suggestion

Improve error handling structure for network operations.

While the error message formatting has improved, closing the archive writer inside the network error handling block could prevent processing remaining paths. Consider restructuring to handle cleanup after processing all paths.

 if (auto network_reader = std::dynamic_pointer_cast<clp::NetworkReader>(reader);
     nullptr != network_reader)
 {
     if (auto const rc = network_reader->get_curl_ret_code();
         rc.has_value() && CURLcode::CURLE_OK != rc.value())
     {
         auto const curl_error_message = network_reader->get_curl_error_msg();
         SPDLOG_ERROR(
                 "Encountered curl error while ingesting {} - Code: {} - Message: {}",
                 path.path,
                 rc.value(),
                 curl_error_message.value_or("Unknown error")
         );
-        m_archive_writer->close();
         return false;
     }
 }

Likely invalid or redundant comment.

wraymo

Most comments focus on fixing docstrings, some are just personal preferences.

components/core/src/clp_s/InputConfig.hpp

wraymo · 2024-12-20T21:50:29Z

components/core/src/clp_s/CommandLineArguments.cpp

+            )(
+                    "auth",
+                    po::value<std::string>(&auth)
+                        ->value_name("AUTH_TYPE")


Can we use the same name for this value_name and the class name in InputConfig? Either AUTH_METHOD (AuthMethod) or AUTH_TYPE (AuthType)

wraymo · 2024-12-23T02:58:36Z

components/core/src/clp_s/CommandLineArguments.cpp

            )(
                    "output-dir",
                    po::value<std::string>(&m_output_dir),
                    "The output directory for the decompressed file"
            );
            // clang-format on

-            po::options_description input_options("Input Options");


Why do we move it to decompression options?

By accident -- I took out the option then ended up adding it back. I can move it back to Input Options if you want?

Yeah, I think both are fine.

wraymo · 2024-12-23T03:18:33Z

components/core/src/clp_s/CommandLineArguments.cpp

+            if (archive_path.empty()) {
+                throw std::invalid_argument("No archive path specified");
+            }
+
+            if (false == archive_id.empty()) {
+                auto archive_fs_path = std::filesystem::path(archive_path) / archive_id;
+                if (false == std::filesystem::exists(archive_fs_path)) {
+                    throw std::invalid_argument("Requested archive does not exist");
+                }
+                m_input_paths.emplace_back(clp_s::Path{
+                        .source{clp_s::InputSource::Filesystem},
+                        .path{archive_fs_path.string()}
+                });
+            } else if (false == get_input_archives_for_raw_path(archive_path, m_input_paths)) {
+                throw std::invalid_argument("Invalid archive path");
+            }
+
+            if (m_input_paths.empty()) {
+                throw std::invalid_argument("No archive paths specified");
+            }
+
+            if (cS3Auth == auth) {
+                m_network_auth.method = AuthMethod::S3PresignedUrlV4;
+            } else if (cNoAuth != auth) {
+                throw std::invalid_argument(fmt::format("Invalid authentication type \"{}\"", auth)
+                );


Can we deduplicate the code a bit?

wraymo · 2024-12-23T03:23:07Z

components/core/src/clp_s/CommandLineArguments.cpp

+                throw std::invalid_argument("No archive path specified");
+            }
+
+            if (false == archive_id.empty()) {


It seems that without archive id, the path can be either a url for a filesystem path, but with archive id, only filesystem paths are allowed?

Yes -- I considered taking out the archive id option entirely, but decided to leave it in to maintain our existing behaviour for archives on a filesystem. Not really any point in supporting the archive-id option for network addresses since we just figure out the archive id based on the url.

For network addresses, do we consider supporting archive_path + archive_id?

Probably not because it needs to be handled with more care/might not be possible to support in a simple generic way. E.g. consider a url with some params like https://www.something.com/archive_id?auth=typeA or even something like https://www.something.com/endpoint?id=abc-xyz.

Yeah, you are right.

wraymo · 2024-12-23T03:26:48Z

components/core/src/clp_s/JsonFileIterator.hpp

@@ -22,7 +23,7 @@ class JsonFileIterator {
     * @param buf_size the initial buffer size
     */
    explicit JsonFileIterator(
-            std::string const& file_name,
+            clp::ReaderInterface& reader,


Can we update the description?

Co-authored-by: wraymo <[email protected]>

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (3)

components/core/src/clp_s/Utils.hpp (2)
34-45: Consider adding edge case examples to documentation

The documentation is clear but could be more comprehensive. Consider adding examples for edge cases:
     * For example:
     * ./foo/bar.baz -> bar.baz
     * ./foo/bar.baz/ -> bar.baz
+     * ./foo/ -> foo
+     * / -> [empty result]
+     * ./ -> [empty result]
48-61: Consider adding URI validation and error cases

While the method handles basic URI parsing, consider adding validation for:

Malformed URIs

URIs with empty paths

URIs with query parameters but no path

Also, consider adding examples for error cases in the documentation:
     * For example:
     * https://www.something.org/abc-xyz -> abc-xyz
     * https://www.something.org/aaa/bbb/abc-xyz?something=something -> abc-xyz
+     * https://www.something.org/ -> [returns false]
+     * https://www.something.org?param=value -> [returns false]
components/core/src/clp_s/Utils.cpp (1)
150-161: Consider validating URI scheme

The function accepts any URI without validating the scheme, which could lead to unexpected behaviour with non-file URIs.

Consider adding scheme validation:
 bool UriUtils::get_last_uri_component(std::string_view const uri, std::string& name) {
     auto parsed_result = boost::urls::parse_uri(uri);
     if (false == parsed_result.has_value()) {
         return false;
     }
     auto parsed_uri = parsed_result.value();
+    // Validate supported schemes
+    auto scheme = parsed_uri.scheme();
+    if (!scheme.empty() && scheme != "file" && scheme != "s3") {
+        return false;
+    }
     auto path_segments_view = parsed_uri.segments();
     if (path_segments_view.empty()) {
         return false;
     }
     name = path_segments_view.back();
     return true;
 }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2a87da2 and 1f7660b.

📒 Files selected for processing (3)

components/core/src/clp_s/InputConfig.hpp (1 hunks)
components/core/src/clp_s/Utils.cpp (2 hunks)
components/core/src/clp_s/Utils.hpp (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

components/core/src/clp_s/InputConfig.hpp

🧰 Additional context used

📓 Path-based instructions (2)

components/core/src/clp_s/Utils.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/Utils.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

🔇 Additional comments (4)

components/core/src/clp_s/Utils.hpp (3)

8-9: LGTM: Modern C++ includes added appropriately

The addition of <string_view> and <vector> headers aligns with modern C++ practices and supports the new method signatures.

20-21: LGTM: Method name is now more descriptive

The rename from find_all_files to find_all_files_in_directory improves clarity by explicitly stating the method's scope.

24-32: Documentation needs clarification about directory inclusion conditions

The documentation states "including the directory itself" without specifying the conditions under which the directory is included. Based on past review comments, this behaviour is conditional.

Please update the documentation to clarify:

-     * Finds all archives in a directory, including the directory itself
+     * Finds all archives in a directory. The directory itself is included only
+     * when it qualifies as a multi-file archive (contains only files).

components/core/src/clp_s/Utils.cpp (1)

126-148: LGTM! Clean implementation of path component extraction

The function correctly handles path normalization and empty component cases using std::filesystem.

coderabbitai · 2024-12-30T23:56:32Z

components/core/src/clp_s/Utils.cpp

+        std::filesystem::recursive_directory_iterator iter(
                path,
-                boost::filesystem::directory_options::follow_directory_symlink
+                std::filesystem::directory_options::follow_directory_symlink
        );


⚠️ Potential issue

Security: Consider restricting symlink following

The recursive directory iterator is configured to follow all symlinks unconditionally, which could lead to:

Infinite loops if symlinks create cycles

Access to unintended directories

Potential security vulnerabilities in a multi-user environment

Consider adding validation or restricting symlink following. Here's a suggested implementation:

- std::filesystem::recursive_directory_iterator iter( - path, - std::filesystem::directory_options::follow_directory_symlink - ); + std::filesystem::recursive_directory_iterator iter(path);

If symlink following is required, add validation:

if (std::filesystem::is_symlink(entry)) { auto real_path = std::filesystem::read_symlink(entry); // Add your validation logic here if (!is_safe_path(real_path)) { iter.disable_recursion_pending(); continue; } }

components/core/src/clp_s/Utils.cpp

coderabbitai

Actionable comments posted: 4

♻️ Duplicate comments (2)

components/core/src/clp_s/Utils.cpp (2)
35-38: ⚠️ Potential issue

Security: Consider restricting symlink following

The recursive directory iterator is configured to follow all symlinks unconditionally, which could lead to security vulnerabilities. Consider restricting symlink following or adding validation.
-std::filesystem::recursive_directory_iterator iter(
-        path,
-        std::filesystem::directory_options::follow_directory_symlink
-);
+std::filesystem::recursive_directory_iterator iter(path);
If symlink following is required, add validation:
if (std::filesystem::is_symlink(entry)) {
    auto real_path = std::filesystem::read_symlink(entry);
    // Add validation logic here
    if (!is_safe_path(real_path)) {
        iter.disable_recursion_pending();
        continue;
    }
}
74-80: 🛠️ Refactor suggestion

Performance: Avoid string formatting in loop

String formatting operations for each file name are inefficient, especially in directories with many files.
-std::string file_name;
-if (false == FileUtils::get_last_non_empty_path_component(entry.path().string(), file_name))
-{
-    return false;
-}
-auto formatted_name = fmt::format("/{}", file_name);
+auto const& file_name = entry.path().filename().string();
+std::string_view formatted_name{file_name};

🧹 Nitpick comments (10)

components/core/src/clp_s/Utils.hpp (2)
15-21: Update documentation to match renamed method.

The documentation should be updated to reflect the new method name find_all_files_in_directory.
-     * Finds all files in a directory
+     * Finds all files in a directory recursively
48-61: Enhance documentation with error cases.

The method documentation would benefit from including examples of error cases where the method returns false.
     * For example:
     * https://www.something.org/abc-xyz -> abc-xyz
     * https://www.something.org/aaa/bbb/abc-xyz?something=something -> abc-xyz
+     * 
+     * Error cases (returns false):
+     * Empty URI
+     * URI ending with '/'
+     * Malformed URI
components/core/src/clp_s/JsonParser.cpp (1)

856-861: Track the TODO for network source support in IR parsing.

The early return for non-filesystem sources is correct, but we should track the TODO for adding network source support.

Would you like me to create a GitHub issue to track the implementation of IR ingestion from network sources?
components/core/src/clp_s/JsonFileIterator.hpp (2)
13-21: Enhance documentation with reader interface requirements

The documentation should specify:

The expected lifetime of the reader object

Any requirements or assumptions about the reader implementation

Whether the reader needs to be configured/opened before being passed

26-28: Consider adding validation in the constructor

To improve robustness, consider adding validation in the constructor to ensure the reader is in a valid state before proceeding.
 explicit JsonFileIterator(
         clp::ReaderInterface& reader,
         size_t max_document_size,
-        size_t buf_size = 1024 * 1024 /*1MB default*/
+        size_t buf_size = 1024 * 1024 /*1MB default*/
 );
+
+// Add to the implementation file:
+JsonFileIterator::JsonFileIterator(clp::ReaderInterface& reader,
+        size_t max_document_size, size_t buf_size)
+        : m_reader(reader) {
+    if (!reader.is_open()) {
+        throw std::invalid_argument("Reader must be open");
+    }
+    // ... rest of the implementation
+}
components/core/src/clp_s/clp-s.cpp (1)
309-315: Enhance error handling for network authentication

While the addition of network authentication support is good, consider enhancing the error handling to distinguish between network and local filesystem failures.
 option.network_auth = command_line_arguments.get_network_auth();
 if (false == command_line_arguments.get_mongodb_uri().empty()) {
     option.metadata_db
             = {command_line_arguments.get_mongodb_uri(),
                command_line_arguments.get_mongodb_collection()};
 }
+try {
+    validate_network_auth(option.network_auth);
+} catch (const NetworkAuthException& e) {
+    SPDLOG_ERROR("Invalid network authentication configuration: {}", e.what());
+    return false;
+}
components/core/src/clp_s/Utils.cpp (1)
80-88: Optimize file name comparison using a set

The multiple string comparisons could be optimized by using a set of known archive file names for constant-time lookups.
+namespace {
+const std::set<std::string_view> kArchiveFiles = {
+    constants::cArchiveTimestampDictFile,
+    constants::cArchiveSchemaTreeFile,
+    constants::cArchiveSchemaMapFile,
+    constants::cArchiveVarDictFile,
+    constants::cArchiveLogDictFile,
+    constants::cArchiveArrayDictFile,
+    constants::cArchiveTableMetadataFile,
+    constants::cArchiveTablesFile
+};
+}

bool is_multi_file_archive(std::string_view const path) {
    for (auto const& entry : std::filesystem::directory_iterator{path}) {
        if (entry.is_directory()) {
            return false;
        }

        auto const& file_name = entry.path().filename().string();
        auto formatted_name = fmt::format("/{}", file_name);
-        if (constants::cArchiveTimestampDictFile == formatted_name
-            || constants::cArchiveSchemaTreeFile == formatted_name
-            || constants::cArchiveSchemaMapFile == formatted_name
-            || constants::cArchiveVarDictFile == formatted_name
-            || constants::cArchiveLogDictFile == formatted_name
-            || constants::cArchiveArrayDictFile == formatted_name
-            || constants::cArchiveTableMetadataFile == formatted_name
-            || constants::cArchiveTablesFile == formatted_name)
+        if (kArchiveFiles.contains(formatted_name))
components/core/src/clp_s/CMakeLists.txt (1)

252-252: Consider adding Boost::ASIO for network operations

Since the PR adds network functionality, consider using Boost::ASIO alongside Boost::url for robust network operations.
components/core/src/clp_s/CommandLineArguments.cpp (2)
66-78: Enhance the network authentication validation.

Consider these improvements to make the validation more robust and user-friendly:

Add case-insensitive comparison for auth methods

Include valid options in the error message
 void validate_network_auth(std::string_view auth_method, NetworkAuthOption& auth) {
-    if (cS3Auth == auth_method) {
+    auto to_lower = [](std::string_view str) {
+        std::string result(str);
+        std::transform(result.begin(), result.end(), result.begin(), ::tolower);
+        return result;
+    };
+    
+    auto method = to_lower(auth_method);
+    if (to_lower(cS3Auth) == method) {
         auth.method = AuthMethod::S3PresignedUrlV4;
-    } else if (cNoAuth != auth_method) {
-        throw std::invalid_argument(fmt::format("Invalid authentication type \"{}\"", auth_method));
+    } else if (to_lower(cNoAuth) != method) {
+        throw std::invalid_argument(
+            fmt::format("Invalid authentication type \"{}\". Valid options are: {} or {}",
+                       auth_method, cNoAuth, cS3Auth));
     }
 }
292-296: Consider validating paths before processing.

The current implementation doesn't validate if the paths exist before attempting to process them. This could lead to confusing error messages later in the processing pipeline.
 for (auto const& path : input_paths) {
+    try {
+        if (false == std::filesystem::exists(path)) {
+            throw std::invalid_argument(fmt::format("Path does not exist: \"{}\"", path));
+        }
+    } catch (const std::filesystem::filesystem_error& e) {
+        throw std::invalid_argument(fmt::format("Error accessing path \"{}\": {}", path, e.what()));
+    }
     if (false == get_input_files_for_raw_path(path, m_input_paths)) {
         throw std::invalid_argument(fmt::format("Invalid input path \"{}\".", path));
     }
 }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1f7660b and 0b2c41a.

📒 Files selected for processing (10)

components/core/CMakeLists.txt (2 hunks)
components/core/src/clp_s/CMakeLists.txt (5 hunks)
components/core/src/clp_s/CommandLineArguments.cpp (17 hunks)
components/core/src/clp_s/JsonFileIterator.hpp (2 hunks)
components/core/src/clp_s/JsonParser.cpp (9 hunks)
components/core/src/clp_s/JsonParser.hpp (4 hunks)
components/core/src/clp_s/ReaderUtils.hpp (2 hunks)
components/core/src/clp_s/Utils.cpp (2 hunks)
components/core/src/clp_s/Utils.hpp (1 hunks)
components/core/src/clp_s/clp-s.cpp (6 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

components/core/CMakeLists.txt
components/core/src/clp_s/JsonParser.hpp

🧰 Additional context used

📓 Path-based instructions (7)

components/core/src/clp_s/ReaderUtils.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/JsonParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/CommandLineArguments.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/JsonFileIterator.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/clp-s.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/Utils.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/Utils.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

🔇 Additional comments (20)

components/core/src/clp_s/ReaderUtils.hpp (2)

4-8: LGTM! The includes are well-organized and necessary.

The new includes support the added functionality while maintaining proper organization (standard library followed by project headers).

74-80: Enhance method documentation for clarity and completeness.

The method implementation looks good, but the documentation could be more comprehensive. Consider adding:

Error conditions that would cause nullptr return
Ownership semantics of the returned smart pointer
Thread safety guarantees

Here's a suggested documentation improvement:

 /**
- * Tries to open a clp::ReaderInterface using the given Path and NetworkAuthOption.
+ * Creates a reader for accessing content at the specified path with optional network authentication.
+ * 
+ * @param path The path to the content to read, can be local or network-based
+ * @param network_auth Authentication details for network-based paths
+ * @return A shared pointer to the reader interface, or nullptr if:
+ *         - The path is invalid or inaccessible
+ *         - Network authentication fails
+ *         - The content format is unsupported
+ * 
+ * @thread_safety This method is thread-safe and can be called from multiple threads
+ * @note The caller takes shared ownership of the returned reader
  */

components/core/src/clp_s/Utils.hpp (3)

8-9: LGTM! Necessary includes added and properly ordered.

The new includes for <string_view> and <vector> are required for the updated method signatures.

24-32: Documentation needs clarification about directory inclusion conditions.

The documentation states that the directory itself is included, but based on past review comments, this only happens under specific conditions (when it's a multi-file archive).

-     * Finds all archives in a directory, including the directory itself
+     * Finds all archives in a directory. The directory itself is included only
+     * when it qualifies as a multi-file archive (contains only files).

34-45: LGTM! Well-documented method with clear examples.

The method is well-documented with clear examples and uses string_view for efficient string passing.

components/core/src/clp_s/JsonParser.cpp (2)

87-89: LGTM: Constructor initialization is correct.

The new member initializations properly set up the network authentication and input paths support.

492-494: LGTM: Reader creation and error handling.

The reader creation and initial error handling look correct.

components/core/src/clp_s/JsonFileIterator.hpp (1)

84-84: Ensure proper lifetime management of the reader reference

Since m_reader is a reference member variable, the ReaderInterface object must outlive the JsonFileIterator instance to prevent dangling references.

Consider adding a note in the class documentation about this lifetime requirement.

components/core/src/clp_s/clp-s.cpp (4)

15-15: LGTM: CurlGlobalInstance inclusion supports S3 functionality

The addition of CurlGlobalInstance.hpp is appropriate for implementing S3 support and follows the project's include organization pattern.

354-360: Consider graceful handling of individual archive failures

The current implementation stops processing when an archive fails to open. Consider allowing the operation to continue with remaining archives, especially when processing multiple S3 archives.

 for (auto const& archive_path : command_line_arguments.get_input_paths()) {
     try {
         archive_reader->open(archive_path, command_line_arguments.get_network_auth());
     } catch (std::exception const& e) {
-        SPDLOG_ERROR("Failed to open archive - {}", e.what());
-        return 1;
+        SPDLOG_ERROR("Failed to open archive {} - {}", archive_path, e.what());
+        continue;
     }

362-367: LGTM: Search functionality properly maintained

The search archive invocation correctly maintains expression isolation between archives by using copy(), and the parameter passing follows the established pattern.

91-92: Verify consistent usage of input_paths and network_auth

The transition from file_paths to input_paths and addition of network_auth aligns with S3 support. Let's verify the consistency of these changes across the codebase.

✅ Verification successful

Transition from file_paths to input_paths and addition of network_auth is consistent

The codebase analysis confirms that:

The change from file_paths to input_paths is part of a broader transition across the codebase, with input_paths being used consistently in the CLP-S component
The network_auth parameter is properly integrated and consistently paired with input_paths in relevant network-related operations
Remaining instances of file_paths are in different components (CLP and GLT) and are not affected by this change

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify consistent usage of input_paths and network_auth across the codebase

# Check for any remaining instances of file_paths
rg "file_paths" --type cpp --type hpp

# Check for consistent pairing of input_paths with network_auth
ast-grep --pattern 'input_paths = $_;
    $$$
    network_auth = $_;'

Length of output: 137

Script:

#!/bin/bash
# Search for file_paths in C++ files
rg "file_paths" -t cpp

# Search for input_paths usage
rg "input_paths" -t cpp -A 3

# Search for network_auth usage
rg "network_auth" -t cpp -A 3

# Find header files with a more generic pattern
fd -e h -e hpp

Length of output: 44058

components/core/src/clp_s/Utils.cpp (3)

103-124: LGTM! Well-structured implementation with proper error handling

The function correctly handles both multi-file archives and regular files, with appropriate error handling.

126-148: LGTM! Clean transition to std::filesystem

The function has been successfully migrated to use std::filesystem with proper error handling and path normalization.

150-161: LGTM! Robust URI parsing implementation

The function correctly uses boost::urls for URI parsing with proper error handling.

components/core/src/clp_s/CMakeLists.txt (4)

5-15: Verify AWS and CURL components security configuration

The addition of AWS and CURL components requires careful security considerations. Previous review comments have highlighted concerns about SSL/TLS verification settings.

Run the following script to verify the security configurations:

#!/bin/bash
# Description: Check for SSL/TLS verification settings in AWS and CURL components

# Check AWS authentication components
echo "Checking AWS authentication components..."
rg -A 5 "SSL|TLS|verify" ../clp/aws/AwsAuthenticationSigner.{cpp,hpp}

# Check CURL components
echo "Checking CURL components..."
rg -A 5 "CURLOPT_(SSL|VERIFY|CERT)" ../clp/CurlDownloadHandler.{cpp,hpp} ../clp/CurlEasyHandle.hpp

252-253: Review dependency version constraints

The addition of network-related dependencies (Boost::url, CURL, OpenSSL::Crypto) requires careful version management:

Boost::url is relatively new - ensure compatibility
CURL version should be specified to avoid security issues
OpenSSL::Crypto version should be specified to avoid known vulnerabilities

Also applies to: 259-259

113-114: Verify input validation for network paths

The new InputConfig component handles both local and network paths. Ensure proper validation of network paths to prevent security issues like path traversal.

✅ Verification successful

Network path validation is properly implemented

After reviewing the codebase, I can confirm that network paths are properly validated and handled securely:

The InputConfig module implements proper path source detection and validation:
- get_source_for_path() correctly identifies network vs filesystem paths
- Network paths are handled through NetworkReader with proper error handling
- URL signing is implemented for S3 paths with proper authentication
Network path handling includes:
- Proper URL parsing using boost::urls
- Error handling for malformed URLs
- Authentication support for S3 presigned URLs
- CURL error checking and reporting
Security measures in place:
- Network paths are isolated from filesystem operations
- URL components are properly validated before use
- Authentication credentials are properly handled
- Error conditions are properly caught and reported

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check input validation in InputConfig

echo "Checking input validation..."
rg -A 5 "validate|sanitize|check" InputConfig.{cpp,hpp}

echo "Checking path handling..."
rg -A 5 "path|url|uri" InputConfig.{cpp,hpp}

Length of output: 476

Script:

#!/bin/bash
# Search for InputConfig files in the repository
echo "Searching for InputConfig files..."
fd -t f "InputConfig\.(cpp|hpp)$"

# Try searching in components/core/src/clp_s directory
echo "Checking specific directory..."
ls -la components/core/src/clp_s/InputConfig.*

# Search for path-related code in the clp_s directory
echo "Searching for path handling in clp_s directory..."
rg -A 5 "path|url|uri" components/core/src/clp_s/

Length of output: 66242

42-43: Ensure consistent error handling across reader implementations

The addition of FileReader and NetworkReader suggests a dual-path approach for handling local and remote resources. Ensure consistent error handling and resource cleanup across both implementations.

Also applies to: 61-62

✅ Verification successful

Error handling is consistent across reader implementations

Based on the code analysis, both FileReader and NetworkReader demonstrate consistent error handling patterns:

Both use a uniform error code system (ErrorCode enum)
Both implement thorough error checking:
- FileReader checks for initialization, file existence, I/O errors, and EOF conditions
- NetworkReader handles network errors, timeouts, and buffer management errors
Both provide detailed error reporting:
- FileReader uses OperationFailed exception with traceable context
- NetworkReader uses OperationFailed exception with additional CURL error details
Both implement proper resource cleanup:
- FileReader ensures file handles are closed
- NetworkReader manages buffer pools and network connections

The implementations show a well-designed error handling strategy that maintains consistency across local and remote resource access.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check error handling consistency across reader implementations

# Compare error handling patterns
echo "Comparing error handling in FileReader and NetworkReader..."
rg -A 5 "throw|catch|error|exception" ../clp/FileReader.{cpp,hpp} ../clp/NetworkReader.{cpp,hpp}

Length of output: 516

Script:

#!/bin/bash
# Let's first find the correct paths for these files
fd -e cpp -e hpp "FileReader|NetworkReader"

Length of output: 846

Script:

#!/bin/bash
# Now let's examine error handling in the relevant files
echo "=== FileReader error handling patterns ==="
rg -A 5 "throw|catch|error|exception" "components/core/src/clp_s/FileReader.{cpp,hpp}"

echo -e "\n=== NetworkReader error handling patterns ==="
rg -A 5 "throw|catch|error|exception" "components/core/src/clp/NetworkReader.{cpp,hpp}"

# Let's also check if these are derived from a common base class
echo -e "\n=== Checking for base class inheritance ==="
rg "class.*Reader.*:" "components/core/src/clp/Reader.hpp" "components/core/src/clp_s/Reader.hpp" || true

Length of output: 913

Script:

#!/bin/bash
# Let's examine the actual content of these files
echo "=== FileReader content ==="
cat components/core/src/clp_s/FileReader.cpp || true
cat components/core/src/clp_s/FileReader.hpp || true

echo -e "\n=== NetworkReader content ==="
cat components/core/src/clp/NetworkReader.cpp || true
cat components/core/src/clp/NetworkReader.hpp || true

# Let's also check if there are any error-related tests
echo -e "\n=== Error handling tests ==="
rg -A 5 "TEST.*Reader.*error|exception" components/core/tests/test-*.cpp || true

Length of output: 39795

components/core/src/clp_s/CommandLineArguments.cpp (1)

713-724: Handle exceptions from filesystem operations.

The filesystem operations should handle potential exceptions.

This is similar to a previous review comment. The same exception handling pattern should be applied here.

components/core/src/clp_s/JsonParser.cpp

components/core/src/clp_s/JsonFileIterator.hpp

components/core/src/clp_s/Utils.cpp

components/core/src/clp_s/CommandLineArguments.cpp

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (5)

components/core/src/clp_s/JsonFileIterator.hpp (3)
12-20: Update documentation to consistently use "input stream" terminology.

The class documentation has been updated to use "input stream" in some places, but method descriptions still use "file". For consistency, update all documentation to use "input stream" or "reader" terminology.

Line range hint 41-42: Update method documentation to reflect stream-based implementation.

Methods get_num_bytes_read() and get_num_bytes_consumed() still reference "file" in their documentation. Update to use "input stream" or "reader" terminology for consistency.
-     * @return total number of bytes read from the file
+     * @return total number of bytes read from the input stream
-     * @return total number of bytes consumed from the file via get_json
+     * @return total number of bytes consumed from the input stream via get_json
Also applies to: 46-47

Line range hint 51-52: Update error message documentation.

The documentation for get_error() still references "json file". Update to use consistent terminology.
-     * @return the last error code encountered when iterating over the json file
+     * @return the last error code encountered when iterating over the json input stream
components/core/src/clp_s/JsonParser.cpp (2)
504-504: Enhance error messages with source type

Consider including the source type (network/filesystem) in error messages for better debugging context.
-    SPDLOG_ERROR(
-            "Encountered error - {} - while trying to parse {} after parsing {} bytes",
-            simdjson::error_message(json_file_iterator.get_error()),
-            path.path
-    );
+    SPDLOG_ERROR(
+            "Encountered error - {} - while trying to parse {} (source: {}) after parsing {} bytes",
+            simdjson::error_message(json_file_iterator.get_error()),
+            path.path,
+            path.source == InputSource::Network ? "network" : "filesystem"
+    );
Also applies to: 538-538, 563-563, 597-597, 607-607

856-861: Track network source support TODO and improve error handling

The error handling for unsupported network sources could be more informative.

Would you like me to create a GitHub issue to track the TODO for network source support?

Consider this improvement for error handling:
     if (InputSource::Filesystem != path.source) {
+        SPDLOG_ERROR(
+            "IR ingestion from network sources is not yet supported (path: {})",
+            path.path
+        );
         m_archive_writer->close();
         return false;
     }
Also applies to: 865-865

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0b2c41a and 93a1696.

📒 Files selected for processing (8)

components/core/src/clp_s/ArchiveReader.cpp (2 hunks)
components/core/src/clp_s/ArchiveWriter.cpp (1 hunks)
components/core/src/clp_s/CommandLineArguments.cpp (17 hunks)
components/core/src/clp_s/InputConfig.cpp (1 hunks)
components/core/src/clp_s/JsonFileIterator.hpp (2 hunks)
components/core/src/clp_s/JsonParser.cpp (9 hunks)
components/core/src/clp_s/Utils.cpp (2 hunks)
components/core/tests/test-clp_s-end_to_end.cpp (3 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

components/core/tests/test-clp_s-end_to_end.cpp
components/core/src/clp_s/InputConfig.cpp

🧰 Additional context used

📓 Path-based instructions (6)

components/core/src/clp_s/ArchiveWriter.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ArchiveReader.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/JsonParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/Utils.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/CommandLineArguments.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/JsonFileIterator.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

🔇 Additional comments (21)

components/core/src/clp_s/JsonFileIterator.hpp (1)

25-25: Ensure the lifetime of reader exceeds that of JsonFileIterator.

Since m_reader is a reference member variable, ensure that the ReaderInterface object it references outlives the JsonFileIterator instance to prevent dangling references.

Also applies to: 83-83

components/core/src/clp_s/ArchiveReader.cpp (4)

7-7: LGTM! Include statement properly added.

The addition of InputConfig.hpp is necessary for the new Path and NetworkAuthOption types used in the refactored code.

13-13: LGTM! Method signature updated for network resource support.

The signature change to accept Path and NetworkAuthOption aligns with the PR objectives for supporting network-based resources.

19-29: LGTM! Robust validation logic implemented.

The validation checks are comprehensive and follow good practices:

Archive ID extraction validation

Filesystem source verification

Directory existence check

Proper error codes for each failure case

211-216: LGTM! Switch cases properly organized.

The changes improve code organization by:

Adding support for NodeType::Metadata

Grouping similar node types that don't require readers

Following consistent fall-through pattern

components/core/src/clp_s/JsonParser.cpp (3)

87-89: LGTM: Constructor initialization looks good, eh!

The initialization of network-related member variables follows the existing pattern and maintains consistency.

492-494: Add error logging for reader creation failure

When reader creation fails, we should log an error message before returning false to help with debugging.

611-627: Improve network error handling structure

The current implementation has several issues:

The archive writer is closed inside the network error handling block, which could prevent processing remaining paths

The error message could be more specific by including the HTTP status code if available

components/core/src/clp_s/CommandLineArguments.cpp (8)

20-22: LGTM! Well-defined constants for authentication methods.

The constants are appropriately scoped and named following the project's conventions.

66-78: LGTM! Clean implementation of network authentication validation.

The function is well-documented and handles validation appropriately with clear error messages.

80-113: LGTM! Robust implementation of archive path validation.

The function properly handles filesystem errors using std::error_code, which is the recommended approach for handling filesystem operations in modern C++.

192-200: LGTM! Improved input path handling.

The change from member variable to local variable improves encapsulation and makes the code flow clearer.

271-277: Consider consistent naming with InputConfig.

Based on past review comments, the value name for auth should be consistent with the class name in InputConfig.

387-392: LGTM! Clear documentation for archive path option.

The option description clearly explains both use cases: directory of archives or single archive path.

Line range hint 401-427: Consider moving auth option back to extraction options.

Based on past review comments, the auth option was accidentally moved to decompression options. Consider moving it back to extraction options for better organization.

Line range hint 514-573: LGTM! Consistent implementation of archive path and auth options.

The implementation follows the same pattern as the extraction command, maintaining consistency across the codebase.

components/core/src/clp_s/Utils.cpp (4)

1-12: Remove duplicate include of Utils.hpp

The header file is included twice at lines 1 and 12.

34-37: Security: Consider restricting symlink following

The recursive directory iterator follows all symlinks unconditionally.

73-79: Performance: Avoid string formatting in loop

String formatting operation for each file name is inefficient.

149-160: LGTM! Well-structured URI component extraction

The function properly handles URI parsing with appropriate error checking.

components/core/src/clp_s/ArchiveWriter.cpp (1)

273-276: Confirm handling for NodeType::Metadata
NodeType::Metadata is grouped here with NodeType::NullValue, Object, and StructuredArray, meaning no columns are appended. If this is intentional, confirm that no storage or additional handling is necessary for metadata fields; otherwise, consider maintaining a separate case to process NodeType::Metadata differently.

components/core/src/clp_s/Utils.cpp

gibber9809 added 13 commits December 16, 2024 20:45

Add Path and NetworkAuth structs which describe how to access a resou…

f274a2e

…rce including network auth

Remove boost::filesystem usage from ZstdDecompressor

d0807c5

Add utility to create a reader for a resource given Path and NetworkAuth

cfbfa94

Add utilities for finding all files or archives in a directory, and a…

97f1fe7

…dd utilities for finding archive Id given a Path

Update CMakeLists to pull in boost::urls

b33e3ae

Update ArchiveReader to accept Path and NetworkAuth

528c0bd

Accept clp::ReaderInterface in JsonFileIterator

6cfdf46

Update JsonParser to accept Path and NetworkAuth

63d8bdc

Update JsonConstructor to accept Path and NetworkAuth

e8f5b37

Remove unnecessary dependency from kql build target

29049b3

Update clp_s end to end tests to account for new interface

df0144a

Update command line arguments and simplify clp-s.cpp

cc75ffe

Catch exception when failing to open archive during search

a64bd6b

gibber9809 requested a review from wraymo December 16, 2024 22:27

coderabbitai bot reviewed Dec 16, 2024

View reviewed changes

gibber9809 added 3 commits December 17, 2024 01:25

Attempt to fix build issue on macos

8fa5aa2

Attempt to fix macos build again

fe84cd4

Revert all changes to kql CMakeLists to simplify review

561634c

haiqi96 reviewed Dec 18, 2024

View reviewed changes

Properly detect http errors when ingesting over the network

307be9e

coderabbitai bot reviewed Dec 19, 2024

View reviewed changes

Fix bug introduced while splitting up changes

7557c84

coderabbitai bot reviewed Dec 19, 2024

View reviewed changes

components/core/src/clp_s/clp-s.cpp Show resolved Hide resolved

Fix obvious bug introduced in recent commit

fc8fad6

coderabbitai bot reviewed Dec 20, 2024

View reviewed changes

gibber9809 added 2 commits December 20, 2024 18:55

Improve error message for curl error during ingestion

84bc5c3

Log an error when environment variables are unavailable for presigned…

2a87da2

… url authentication

coderabbitai bot reviewed Dec 20, 2024

View reviewed changes

wraymo reviewed Dec 23, 2024

View reviewed changes

Apply suggestions from code review

1f7660b

Co-authored-by: wraymo <[email protected]>

coderabbitai bot reviewed Dec 30, 2024

View reviewed changes

gibber9809 added 4 commits December 31, 2024 00:01

Complete rename

dbc16fb

Address code review comments

86fa165

Merge remote-tracking branch 'upstream/main' into clp-s-s3-ingestion

2a1ae5d

Deduplicate some validation code in CommandLineArguments.cpp

0b2c41a

gibber9809 requested review from haiqi96 and wraymo December 31, 2024 01:01

coderabbitai bot reviewed Dec 31, 2024

View reviewed changes

Address rabbit comments and fix macos build

93a1696

coderabbitai bot reviewed Jan 3, 2025

View reviewed changes

components/core/src/clp_s/Utils.cpp Show resolved Hide resolved

wraymo approved these changes Jan 8, 2025

View reviewed changes

gibber9809 merged commit 5d3b671 into y-scope:main Jan 8, 2025
20 checks passed

coderabbitai bot mentioned this pull request Jan 8, 2025

feat(clp-s): Unescape string values during ingestion and fix support for search using escape sequences. #622

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(clp-s): Add support for ingesting logs from S3. #639

feat(clp-s): Add support for ingesting logs from S3. #639

gibber9809 commented Dec 16, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 16, 2024 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

haiqi96 Dec 18, 2024 •

edited

Loading

coderabbitai bot left a comment

coderabbitai bot Dec 19, 2024

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

wraymo left a comment

wraymo Dec 20, 2024

wraymo Dec 23, 2024

gibber9809 Dec 31, 2024

wraymo Jan 8, 2025

wraymo Dec 23, 2024

wraymo Dec 23, 2024

gibber9809 Dec 31, 2024

wraymo Jan 8, 2025

gibber9809 Jan 8, 2025

wraymo Jan 8, 2025

wraymo Dec 23, 2024

coderabbitai bot left a comment

coderabbitai bot Dec 30, 2024

coderabbitai bot left a comment

coderabbitai bot left a comment

feat(clp-s): Add support for ingesting logs from S3. #639

feat(clp-s): Add support for ingesting logs from S3. #639

Conversation

gibber9809 commented Dec 16, 2024 • edited by coderabbitai bot Loading

Description

Validation performed

Summary by CodeRabbit

Release Notes

coderabbitai bot commented Dec 16, 2024 • edited Loading

Walkthrough

Changes

Suggested Reviewers

Possibly Related PRs

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

haiqi96 Dec 18, 2024 • edited Loading

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Dec 19, 2024

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

wraymo left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Dec 30, 2024

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

gibber9809 commented Dec 16, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 16, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)

haiqi96 Dec 18, 2024 •

edited

Loading