Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: Added regex to sub special characters #545

Merged

Conversation

tinomerl
Copy link
Contributor

@tinomerl tinomerl commented Dec 2, 2024

Fixes #544.

I've tested the fix with three different sources.

Source File
import airbyte as ab

file_content = "col_1,col_2\nfoo,bar\n"
with open("test \ special chars.csv", "w") as f:
    f.write(file_content)

source: ab.Source = ab.get_source("source-file")
config = {
    "dataset_name": "test \ special chars",
    "format": "csv",
    "url": "test \ special chars.csv",
    "provider": {"storage": "local"}
}

source.set_config(config)
source.select_all_streams()
records = source.read()
records.streams.get("test \\ special chars").to_pandas()
PokeApi

import airbyte as ab

source: ab.Source = ab.get_source(    "source-pokeapi",
    config={"pokemon_name": "bulbasaur"},
    source_manifest=True,
)

source.check()
source.get_available_streams()
source.select_all_streams()
records = source.read()

It also worked on the custom Connector mentioned in the Bug Issue.

Summary by CodeRabbit

  • Bug Fixes
    • Improved file path generation by cleaning stream names to remove special characters, ensuring compatibility with operating system restrictions.

@tinomerl tinomerl changed the title 🐛Bugfix: added regex to sub special characters Fix: added regex to sub special characters Dec 2, 2024
Copy link

coderabbitai bot commented Dec 2, 2024

📝 Walkthrough
📝 Walkthrough

Walkthrough

The changes made in this pull request involve modifying the _get_new_cache_file_path method in the FileWriterBase class located in airbyte/_writers/file_writers.py. A new dependency, LowerCaseNormalizer, is introduced to sanitize the stream_name parameter by removing special characters that are not allowed in file names, specifically <>:"/\|?* and ASCII control characters (0-31). The sanitized normalized_stream_name is then utilized to construct the cache file path, which includes the batch_id and the default cache file suffix.

Changes

File Path Change Summary
airbyte/_writers/file_writers.py Modified _get_new_cache_file_path to sanitize the stream_name by removing special characters and ASCII control characters.

Assessment against linked issues

Objective Addressed Explanation
Remove special characters in stream names to prevent file creation issues (#544)

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between bff1359 and ee76045.

📒 Files selected for processing (1)
  • airbyte/_writers/file_writers.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte/_writers/file_writers.py

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)
airbyte/_writers/file_writers.py (3)

65-69: Add safety checks for edge cases? 🛡️

The current implementation might produce an empty string if the stream name only contains special characters. Maybe we could add some validation? Something like:

         cleaned_stream_name = re.sub(r"[^a-zA-Z0-9\s]", "", stream_name)
+        if not cleaned_stream_name.strip():
+            raise ValueError(f"Stream name '{stream_name}' contains no valid characters for filename")

What are your thoughts on this additional safety check?


65-69: Would you like help with test coverage? 🧪

It would be great to add some unit tests to verify the behavior with:

  • Stream names containing various special characters
  • Edge cases like empty strings or whitespace-only names
  • Names that would collide after cleaning

Would you like me to help draft some test cases or open an issue to track this?


65-67: Consider enhancing the comment? 📝

The current comment is good but could be more specific about which OS restrictions we're addressing. Maybe something like:

-        # If a stream contains a special Character, the temporary jsonl.gz
-        # file can't be created, because of OS restrictions. Therefore, we
-        # remove the special characters.
+        # Windows and certain Linux filesystems (especially on WSL) restrict specific characters
+        # in filenames. Common restrictions include: <>:"/\|?* and control characters.
+        # We sanitize the stream name to ensure compatibility across different OS environments.

What do you think about this more detailed explanation?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between f2529a6 and e19b95e.

📒 Files selected for processing (1)
  • airbyte/_writers/file_writers.py (2 hunks)
🔇 Additional comments (2)
airbyte/_writers/file_writers.py (2)

7-7: LGTM on the import!

Clean addition of the re module, properly placed with other standard library imports.


65-69: Verify impact on existing tests and related code 🔍

Let's check if this change might affect existing tests or if similar sanitization might be needed elsewhere:

✅ Verification successful

Stream name sanitization looks safe, but how about adding a test? 🤔

From the verification results, I can see that the existing tests don't rely on special characters in stream names - they use simple alphanumeric names like "stream1", "products", "users", etc. Also, this is the only place where we're sanitizing stream names for file paths.

However, would you be interested in adding a test case that explicitly verifies this sanitization behavior? Something like testing that a stream name with special characters still works correctly? wdyt?

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for tests that might rely on special characters in stream names
echo "Checking for tests using special characters in stream names..."
rg -l "stream_name.*[^a-zA-Z0-9\s]" tests/

# Check for other places that might need similar filename sanitization
echo "Checking for other file path construction patterns..."
ast-grep --pattern 'Path($1)' 

Length of output: 728


Script:

#!/bin/bash
# Let's check the content of these test files to see how they use stream names
echo "Checking test_progress.py for stream name usage:"
rg "stream_name" tests/unit_tests/test_progress.py -B 2 -A 2

echo "\nChecking test_source_faker_integration.py for stream name usage:"
rg "stream_name" tests/integration_tests/test_source_faker_integration.py -B 2 -A 2

echo "\nChecking test_state_handling.py for stream name usage:"
rg "stream_name" tests/integration_tests/test_state_handling.py -B 2 -A 2

# Let's also check for similar file path sanitization patterns
echo "\nChecking for similar file path sanitization patterns:"
rg "re\.sub.*stream_name" -B 2 -A 2

Length of output: 2984

airbyte/_writers/file_writers.py Outdated Show resolved Hide resolved
@tinomerl tinomerl changed the title Fix: added regex to sub special characters Fix: Added regex to sub special characters Dec 2, 2024
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Copy link
Contributor

@aaronsteers aaronsteers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tinomerl - This is a valuable update - thank you!

Should we include similar logic also when designating a table name?

@tinomerl
Copy link
Contributor Author

tinomerl commented Dec 3, 2024

Hey @aaronsteers,

thanks, that's great to hear. I think there is already some kind of logic implemented for table names. Just looked at the table name for my file test and it looks like this, in the default cache:

image

Maybe i should adapt the same replacement pattern for filenames. So instead of removing any of the characters with nothing, using the underscore could be better. I think that would also cover some edge cases better. Like for example if we have something like this:

  • myStreamDataFromA\B
  • myStreamDataFromAB

They would currently be named the same. But with the underscores they would differ. What do you think?

@aaronsteers
Copy link
Contributor

@tinomerl - thanks for confirming that the table names are working well already. The method used for table name normalization is in the LowerCaseNormalizer class. You may be able to use the same here for normalizing the stream name prior to creating the path.

I'm ok with either approach but if the lower case normalizer class works well, that might be slightly easier to maintain, since the same code would be applied for both cases.

Want to try and let me know what you find? If for any reason that doesn't work well, I think your approach here is perfectly sound.

@tinomerl
Copy link
Contributor Author

tinomerl commented Dec 3, 2024

@aaronsteers - thanks for the Infos. I'll try it out tomorrow and let you know the results.

@tinomerl
Copy link
Contributor Author

tinomerl commented Dec 4, 2024

@aaronsteers - So i just changed it to use the LowerCaseNormalizer instead of just the regex pattern. And it works just fine. I like it this way more because, like you said, it can now be maintained at just one spot.

@aaronsteers
Copy link
Contributor

@tinomerl - This is great. I'll run a few more tests and merge/release later today if all is passing! Thanks again for this contribution. 🙏

@tinomerl
Copy link
Contributor Author

tinomerl commented Dec 4, 2024

@aaronsteers - thanks for the kind words. I'm always happy to help out. :)

@aaronsteers
Copy link
Contributor

aaronsteers commented Dec 5, 2024

/test-pr

PR test job started... Check job output.

❌ Tests failed.

@aaronsteers
Copy link
Contributor

I meant to merge earlier but ran into some test failures. All is looking good now so I'll go ahead and merge+release.

@aaronsteers aaronsteers merged commit ea54711 into airbytehq:main Dec 10, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

🐛Bug: Problem with special Characters in Stream Names
2 participants