
Added Prowlarr feed scraping & improved advanced scraping capability for Prowlarr, Zilean, Torrentio & more bugfixes & improvements #286

Merged
merged 6 commits on Sep 16, 2024

Conversation

mhdzumair
Owner

@mhdzumair mhdzumair commented Sep 16, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced logging format for better traceability during debugging.
    • Introduced background task handling in stream retrieval functions.
    • New configuration options for Prowlarr feed scraping.
    • Added new scraper functionalities for Prowlarr feed and improved IMDb data fetching.
  • Bug Fixes

    • Improved stream retrieval process with case sensitivity handling.
  • Refactor

    • Transitioned several scraper classes to an object-oriented design for better structure and efficiency.
  • Chores

    • Updated homepage URLs for specific scraping entities.

These updates enhance the overall performance and flexibility of the application, providing users with a more robust streaming experience.

Contributor

coderabbitai bot commented Sep 16, 2024

Walkthrough

The pull request introduces various modifications across multiple files, primarily enhancing the functionality of the streaming data retrieval process. Key changes include updates to dependency management, logging improvements, the addition of background task handling, and the introduction of new scraping functionalities. The Settings class is also expanded with new attributes related to Prowlarr feed scraping. Overall, the changes streamline the scraping process and improve the management of torrent streams.

Changes

File Path Change Summary
Pipfile Changed parsett dependency from a specific Git repository to a wildcard version specification (*).
api/main.py, db/crud.py Enhanced logging format; modified get_streams function to accept background_tasks for task management.
api/scheduler.py Added scheduling for run_prowlarr_feed_scraper based on a configuration flag.
api/task.py Added import statement for prowlarr_feed to enhance functionality.
db/config.py Reformatted logo_url and added new configuration attributes for Prowlarr feed scraping.
db/crud.py Introduced run_scrapers function; updated streaming retrieval functions to support background tasks.
db/models.py Added indexer_flags to TorrentStreams class; implemented equality and hash methods.
resources/json/scraper_config.json Updated homepage URLs for "tamil_blasters" and "tamilmv".
scrapers/base_scraper.py Introduced BaseScraper class for web scraping tasks with error handling and retry logic.
scrapers/imdb_data.py Added asynchronous functions for fetching IMDb data related to specific episodes and seasons.
scrapers/prowlarr.py Refactored ProwlarrScraper class to improve structure and efficiency for scraping torrent data.
scrapers/prowlarr_feed.py Implemented a scraper for the Prowlarr feed to fetch and process torrent stream data.
scrapers/routes.py Added functions for handling torrent stream storage for movies and series.
scrapers/torrentio.py Introduced TorrentioScraper class to encapsulate torrent stream scraping functionality.
scrapers/utils.py Added run_scrapers function to orchestrate scraping from multiple sources.
scrapers/zilean.py Refactored ZileanScraper class to enhance functionality and structure.
streaming_providers/realdebrid/utils.py Added a conditional check in update_rd_cache_status to prevent unnecessary processing.
streaming_providers/routes.py Enhanced fetch_stream_or_404 function with a fallback mechanism for case sensitivity.
utils/network.py Modified call method in the circuit breaker class for better exception handling.
utils/parser.py Changed error logging from logging.error to logging.exception for improved error tracking.

Possibly related PRs

Poem

🐇 In the meadow where streams flow,
New features bloom, watch them grow!
With tasks in the background, swift and bright,
Scrapers dance in the soft moonlight.
From Prowlarr feeds to torrents galore,
Our code hops forward, forevermore! 🌟


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Outside diff range and nitpick comments (4)
scrapers/utils.py (1)

11-48: LGTM! The new run_scrapers function is well-structured and modular.

The function effectively orchestrates the scraping of media streams from multiple sources based on user-defined settings and metadata. The modular approach enhances the flexibility and scalability of the scraping process, enabling the integration of additional scrapers in the future if needed.

Some additional suggestions for improvement:

  1. Consider extracting the logic for adding scraping tasks into separate functions for each scraper. This will further improve the readability and maintainability of the code.
  2. Consider adding error handling and logging for each scraping task to better track and manage any potential issues that may arise during the scraping process.
  3. Consider adding type hints for the scraped_streams variable to improve code clarity and catch potential type-related issues early.

Overall, the changes look good and the function is well-designed.
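As a runnable illustration of the orchestration pattern described above, here is a minimal sketch; the scraper stubs and settings keys are hypothetical stand-ins, not the PR's actual run_scrapers implementation:

import asyncio
from typing import Dict, List


# Hypothetical stand-ins for the real scraper entry points.
async def scrape_prowlarr(metadata: Dict, catalog_type: str) -> List[str]:
    return [f"prowlarr:{metadata['id']}:{catalog_type}"]


async def scrape_torrentio(metadata: Dict, catalog_type: str) -> List[str]:
    return [f"torrentio:{metadata['id']}:{catalog_type}"]


async def run_scrapers_sketch(user_settings: Dict, metadata: Dict, catalog_type: str) -> List[str]:
    """Collect streams from every scraper the user has enabled."""
    tasks = []
    if not user_settings.get("disable_prowlarr"):
        tasks.append(scrape_prowlarr(metadata, catalog_type))
    if not user_settings.get("disable_torrentio"):
        tasks.append(scrape_torrentio(metadata, catalog_type))

    scraped_streams: List[str] = []
    # return_exceptions=True keeps one failing scraper from cancelling the rest.
    for result in await asyncio.gather(*tasks, return_exceptions=True):
        if isinstance(result, Exception):
            print(f"Scraper failed: {result!r}")  # real code would log this
            continue
        scraped_streams.extend(result)
    return scraped_streams


print(asyncio.run(run_scrapers_sketch({}, {"id": "tt0111161"}, "movie")))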

db/config.py (1)

87-87: Clarify the unit of measurement for prowlarr_feed_scrape_interval.

The addition of the prowlarr_feed_scrape_interval attribute enhances the configurability of the Settings class. However, please consider clarifying the unit of measurement (e.g., seconds, minutes, hours) for the interval value to improve code clarity.
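One way to make the unit explicit is to encode it in the field name or description; a small sketch with a hypothetical field name and default, not the PR's actual Settings class:

from pydantic import BaseModel, Field


class ProwlarrFeedSettings(BaseModel):
    # Putting the unit in the name (or the description) removes the ambiguity.
    prowlarr_feed_scrape_interval_hour: int = Field(
        default=3,  # assumed default, for illustration only
        description="How often the Prowlarr feed scraper runs, in hours.",
    )


print(ProwlarrFeedSettings().prowlarr_feed_scrape_interval_hour)  # 3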

scrapers/base_scraper.py (1)

82-105: LGTM with a minor suggestion!

The make_request method is a well-implemented method that makes an HTTP request with retry logic using the tenacity library. The method handles httpx.HTTPStatusError and httpx.RequestError exceptions and raises a ScraperError with appropriate error messages. The method implementation is correct and doesn't require any major changes.

However, as suggested by the static analysis tool, it's recommended to use raise ... from err or raise ... from None when raising exceptions within an except clause to distinguish them from errors in exception handling.

Apply this diff to update the exception raising:

-            raise ScraperError(f"HTTP error occurred: {e}")
+            raise ScraperError(f"HTTP error occurred: {e}") from e

-            raise ScraperError(f"An error occurred while requesting {e.request.url!r}.")
+            raise ScraperError(f"An error occurred while requesting {e.request.url!r}.") from e
Tools
Ruff

102-102: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


105-105: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)
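For reference, a self-contained sketch of the retry-plus-exception-chaining pattern being suggested; the class name, timeout, and retry policy are assumptions, not the PR's exact BaseScraper code:

import httpx
from tenacity import retry, stop_after_attempt, wait_exponential


class ScraperError(Exception):
    """Raised when a scraper request ultimately fails."""


class RetryingClient:
    def __init__(self) -> None:
        self.http_client = httpx.AsyncClient(timeout=30)

    # Retry up to 3 times with exponential backoff before giving up.
    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=10))
    async def make_request(self, url: str, **kwargs) -> httpx.Response:
        try:
            response = await self.http_client.get(url, **kwargs)
            response.raise_for_status()
            return response
        except httpx.HTTPStatusError as e:
            # "from e" preserves the original exception chain (the B904 fix).
            raise ScraperError(f"HTTP error occurred: {e}") from e
        except httpx.RequestError as e:
            raise ScraperError(f"An error occurred while requesting {e.request.url!r}.") from e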

scrapers/prowlarr.py (1)

669-671: Consider simplifying the expression.

The static analysis tool suggests replacing the current ternary operator with a more concise expression:

category_ids = category_ids if category_ids else [category["id"] for category in prowlarr_data.get("categories", [])]

This change improves readability and simplifies the code.

Tools
Ruff

669-671: Use category_ids if category_ids else [category["id"] for category in prowlarr_data.get("categories", [])] instead of [category["id"] for category in prowlarr_data.get("categories", [])] if not category_ids else category_ids

Replace with category_ids if category_ids else [category["id"] for category in prowlarr_data.get("categories", [])]

(SIM212)

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between c1857ff and 49ed72a.

Files ignored due to path filters (3)
  • Pipfile.lock is excluded by !**/*.lock
  • resources/images/logo_text.png is excluded by !**/*.png
  • resources/images/poster_template.jpg is excluded by !**/*.jpg
Files selected for processing (20)
  • Pipfile (1 hunks)
  • api/main.py (5 hunks)
  • api/scheduler.py (2 hunks)
  • api/task.py (1 hunks)
  • db/config.py (2 hunks)
  • db/crud.py (9 hunks)
  • db/models.py (3 hunks)
  • resources/json/scraper_config.json (2 hunks)
  • scrapers/base_scraper.py (1 hunks)
  • scrapers/imdb_data.py (3 hunks)
  • scrapers/prowlarr.py (2 hunks)
  • scrapers/prowlarr_feed.py (1 hunks)
  • scrapers/routes.py (5 hunks)
  • scrapers/torrentio.py (1 hunks)
  • scrapers/utils.py (1 hunks)
  • scrapers/zilean.py (1 hunks)
  • streaming_providers/realdebrid/utils.py (1 hunks)
  • streaming_providers/routes.py (2 hunks)
  • utils/network.py (4 hunks)
  • utils/parser.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • resources/json/scraper_config.json
Additional context used
Ruff
api/task.py

10-10: scrapers.tv imported but unused

Remove unused import

(F401)


10-10: scrapers.imdb_data imported but unused

Remove unused import

(F401)


10-10: scrapers.trackers imported but unused

Remove unused import

(F401)


10-10: scrapers.helpers imported but unused

Remove unused import

(F401)


10-10: scrapers.prowlarr imported but unused

Remove unused import

(F401)


10-10: scrapers.prowlarr_feed imported but unused

Remove unused import

(F401)

scrapers/base_scraper.py

102-102: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


105-105: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


107-113: BaseScraper.validate_response is an empty method in an abstract base class, but has no abstract decorator

(B027)


115-132: BaseScraper.parse_response is an empty method in an abstract base class, but has no abstract decorator

(B027)


193-198: Use a single if statement instead of nested if statements

(SIM102)

scrapers/torrentio.py

93-100: Use a single if statement instead of nested if statements

(SIM102)


119-119: Using .strip() with multi-character strings is misleading

(B005)

api/main.py

512-512: Do not perform function call BackgroundTasks in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

scrapers/prowlarr.py

669-671: Use category_ids if category_ids else [category["id"] for category in prowlarr_data.get("categories", [])] instead of [category["id"] for category in prowlarr_data.get("categories", [])] if not category_ids else category_ids

Replace with category_ids if category_ids else [category["id"] for category in prowlarr_data.get("categories", [])]

(SIM212)

Additional comments not posted (72)
Pipfile (1)

46-46: Clarify the rationale behind the change in parsett dependency declaration.

The parsett dependency declaration has been simplified from a specific Git repository reference to a wildcard version specification (*). While this allows for flexibility, it's important to consider:

  1. What is the rationale behind this change? Is there a specific reason to decouple from the Git repository?
  2. Does the parsett package follow strict semantic versioning? If not, using a wildcard version could lead to unexpected behavior if breaking changes are introduced in future releases.

Please provide more context on these aspects to ensure the stability and predictability of the dependency.

streaming_providers/realdebrid/utils.py (1)

97-98: LGTM!

The added conditional check improves the robustness of the function by ensuring that the subsequent logic, which updates the cached status of each stream, only executes when there is valid availability data. This change alters the control flow and avoids unnecessary operations when no data is available.

db/config.py (2)

30-32: LGTM!

The reformatting of the logo_url string enhances code clarity without affecting functionality.


88-89: LGTM!

The addition of the prowlarr_feed_scraper_crontab and disable_prowlarr_feed_scraper attributes enhances the configurability of the Prowlarr feed scraping functionality. The default values seem reasonable.

db/models.py (6)

5-5: LGTM!

The pytz library is a good choice for handling timezones in Python. It will be useful for converting timestamps to UTC in the validate_created_at method.


47-47: LGTM!

The indexer_flags field is a useful addition to store metadata related to the indexer. Making it optional with a default empty list is a good practice.


49-55: LGTM!

Implementing __eq__ and __hash__ methods based on the id field is a good practice. It allows instances of TorrentStreams to be compared and used in hash-based collections while ensuring unique identification.


57-59: LGTM!

Converting the id to lowercase in the field validator is a good practice. It ensures consistency and avoids case-sensitivity issues when storing and comparing IDs.


61-64: LGTM!

Converting the created_at timestamp to UTC in the field validator is a good practice. It ensures that all timestamps are stored in a consistent timezone, which is important for accurate comparisons and sorting.


133-137: LGTM!

Formatting the runtime as a string with "min" appended in the field validator is a good practice. It improves readability and ensures that the runtime is consistently stored as a string indicating the duration in minutes.
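A compact sketch of the equality/hash and validator pattern described above, using pydantic v2 syntax; the real TorrentStreams is a Beanie document with many more fields, so treat this as illustrative only:

from datetime import datetime

import pytz
from pydantic import BaseModel, Field, field_validator


class TorrentStreamsSketch(BaseModel):
    id: str
    indexer_flags: list[str] = Field(default_factory=list)
    created_at: datetime = Field(default_factory=lambda: datetime.now(pytz.UTC))

    # Identity is the (info-hash-like) id, so instances work in sets and dicts.
    def __eq__(self, other: object) -> bool:
        return isinstance(other, TorrentStreamsSketch) and self.id == other.id

    def __hash__(self) -> int:
        return hash(self.id)

    @field_validator("id")
    @classmethod
    def lowercase_id(cls, v: str) -> str:
        # Avoids case-sensitivity mismatches when streams are looked up later.
        return v.lower()

    @field_validator("created_at")
    @classmethod
    def to_utc(cls, v: datetime) -> datetime:
        # Normalise aware timestamps to UTC; assume naive ones are already UTC.
        return v.astimezone(pytz.UTC) if v.tzinfo else v.replace(tzinfo=pytz.UTC)


a = TorrentStreamsSketch(id="ABCDEF123")
b = TorrentStreamsSketch(id="abcdef123")
print(a == b, len({a, b}))  # True 1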

scrapers/zilean.py (5)

18-21: LGTM!

The __init__ method is correctly initializing the ZileanScraper instance by calling the parent class constructor and setting the necessary attributes.


23-47: LGTM!

The scrape_and_parse method is well-structured and correctly handles the scraping and parsing of streams from the Zilean API. The use of decorators for caching and rate limiting, error handling, response validation, and delegation of response parsing to a separate method are all good practices.


50-62: LGTM!

The fetch_stream_data method is correctly implemented to fetch stream data asynchronously from the Zilean API. The use of the httpx library, inclusion of necessary settings, error handling, logging, and returning of the response JSON are all done properly.


64-77: LGTM!

The parse_response method is properly implemented to parse the response from the Zilean API and return a list of TorrentStreams objects. The creation of tasks for each stream, concurrent execution using asyncio.gather, and filtering of None values from the results are all done correctly.


79-145: LGTM!

The process_stream method is well-implemented to process a single stream from the Zilean API response and return a TorrentStreams object. The use of the semaphore for concurrency control, filtering of inappropriate content, parsing and validation of the torrent title, creation of the TorrentStreams object with the appropriate catalog, and handling of season and episode data for series are all done correctly.
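The semaphore-plus-gather pattern noted above, reduced to a runnable sketch; the parsing logic and response shape are placeholders, not Zilean's actual schema:

import asyncio
from typing import Dict, List, Optional


class FeedProcessorSketch:
    def __init__(self, max_concurrency: int = 10) -> None:
        # The semaphore caps how many streams are parsed at the same time.
        self.semaphore = asyncio.Semaphore(max_concurrency)

    async def process_stream(self, raw: Dict) -> Optional[str]:
        async with self.semaphore:
            title = raw.get("raw_title", "")
            if not title:  # stand-in for title/adult-content validation
                return None
            await asyncio.sleep(0)  # placeholder for the real parsing work
            return title.lower()

    async def parse_response(self, response: Dict) -> List[str]:
        tasks = [self.process_stream(item) for item in response.get("streams", [])]
        results = await asyncio.gather(*tasks)
        # Drop streams that failed validation.
        return [r for r in results if r is not None]


print(asyncio.run(FeedProcessorSketch().parse_response(
    {"streams": [{"raw_title": "Some.Show.S01E01"}, {"raw_title": ""}]}
)))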

scrapers/prowlarr_feed.py (7)

40-72: LGTM!

The scrape_prowlarr_feed function provides a robust and efficient mechanism to scrape and process items from the Prowlarr feed. It employs a circuit breaker pattern to handle failures gracefully and processes items in batches to improve performance. The function is well-structured and follows best practices.


75-125: LGTM!

The process_feed_item function provides a comprehensive logic to process individual items from the Prowlarr feed. It handles various scenarios such as already processed items, blacklisted keywords, and unsupported categories. It fetches or creates metadata based on the available information and processes and stores the stream if metadata is available. The function is well-structured and follows best practices.


128-132: LGTM!

The get_metadata_by_id function provides a simple and clear way to fetch metadata based on the IMDb ID and media type. It abstracts the logic of fetching metadata for different media types by calling the appropriate function based on the media type. The function is concise and easy to understand.


135-142: LGTM!

The search_and_create_metadata function provides a way to search and create metadata based on the provided information. It ensures that metadata is only created for IMDb movies and series by checking the metadata ID. It fetches the newly created metadata to return the complete metadata information. The function is well-structured and follows best practices.


145-155: LGTM!

The run_prowlarr_feed_scraper function serves as the entry point for running the Prowlarr feed scraper. It is decorated with the @minimum_run_interval decorator to ensure that the scraper runs at a specified minimum interval, avoiding excessive or unnecessary scraping. The function is also decorated with the @dramatiq.actor decorator, allowing it to be executed as a background task using Dramatiq, with configurable settings for time limit, retries, backoff, and priority. The function is concise and effectively initiates the scraping process.


31-32: LGTM!

The is_item_processed function provides a simple and efficient way to check if an item has already been processed. It utilizes Redis to store and check the processed items, allowing for fast lookups. The function is concise and effectively determines the processing status of an item.


35-37: LGTM!

The mark_item_as_processed function provides a way to mark an item as processed by adding its ID to the set of processed items in Redis. It ensures that the processed items set has an expiry time based on the PROCESSED_ITEMS_EXPIRY constant, preventing indefinite growth and allowing for the reprocessing of items after a certain period. The function is concise and effectively marks an item as processed.
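A minimal sketch of the Redis processed-items pattern, assuming redis-py's asyncio client; the key name and expiry value are placeholders, not the PR's constants:

import asyncio

from redis import asyncio as aioredis

PROCESSED_ITEMS_KEY = "prowlarr_feed:processed_items"  # placeholder key name
PROCESSED_ITEMS_EXPIRY = 60 * 60 * 24 * 3              # placeholder: 3 days


async def is_item_processed(redis: aioredis.Redis, item_id: str) -> bool:
    # Set membership checks are O(1) in Redis.
    return bool(await redis.sismember(PROCESSED_ITEMS_KEY, item_id))


async def mark_item_as_processed(redis: aioredis.Redis, item_id: str) -> None:
    await redis.sadd(PROCESSED_ITEMS_KEY, item_id)
    # Refreshing the TTL lets items be re-scraped after the expiry window.
    await redis.expire(PROCESSED_ITEMS_KEY, PROCESSED_ITEMS_EXPIRY)


async def main() -> None:
    redis = aioredis.from_url("redis://localhost:6379")
    if not await is_item_processed(redis, "guid-123"):
        # ... process the feed item here ...
        await mark_item_as_processed(redis, "guid-123")
    await redis.aclose()  # redis-py >= 5; older versions use close()


asyncio.run(main())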

scrapers/base_scraper.py (7)

16-17: LGTM!

The ScraperError class is a simple custom exception class that inherits from Exception. The class definition is correct and doesn't require any changes.


21-24: LGTM!

The __init__ method is a standard constructor that initializes the class attributes correctly. The method implementation is correct and doesn't require any changes.


26-27: LGTM!

The __aenter__ method is a standard method that is part of the asynchronous context manager protocol. The method implementation is correct and doesn't require any changes.


29-30: LGTM!

The __aexit__ method is a standard method that is part of the asynchronous context manager protocol. The method implementation is correct and closes the httpx.AsyncClient instance properly.


32-38: LGTM!

The scrape_and_parse method is an abstract method that is intended to be implemented by subclasses. The method is correctly decorated with @abc.abstractmethod and includes a clear docstring describing its purpose. The method definition is correct and doesn't require any changes.


40-61: LGTM!

The cache method is a well-implemented decorator that caches the results of a method using Redis. The method generates a cache key, checks if the result is already cached, and either returns the cached result or calls the decorated function and stores the result in Redis with the specified TTL. The method implementation is correct and doesn't require any changes.


63-80: LGTM!

The rate_limit method is a well-implemented decorator that rate limits method calls using the ratelimit library. The method uses the @sleep_and_retry and @limits decorators to enforce the rate limit based on the provided calls and period parameters. The method implementation is correct and doesn't require any changes.
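To make the caching-decorator idea concrete, here is an in-memory stand-in; the PR's version stores results in Redis with a TTL, so the dict here is purely illustrative:

import asyncio
import functools
import json
import time
from typing import Awaitable, Callable, Optional


def cache(ttl: int = 3600, store: Optional[dict] = None):
    """Cache async results keyed by the call arguments (in memory, not Redis)."""
    store = store if store is not None else {}

    def decorator(func: Callable[..., Awaitable]):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            key = f"{func.__name__}:{json.dumps([args, kwargs], default=str)}"
            hit = store.get(key)
            if hit and hit[1] > time.monotonic():
                return hit[0]  # cached value is still fresh
            result = await func(*args, **kwargs)
            store[key] = (result, time.monotonic() + ttl)
            return result
        return wrapper
    return decorator


@cache(ttl=5)
async def fetch(url: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for the HTTP call
    return f"payload from {url}"


async def main() -> None:
    print(await fetch("https://example.com"))  # slow path
    print(await fetch("https://example.com"))  # served from the cache


asyncio.run(main())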

api/scheduler.py (1)

234-244: LGTM!

The changes introduce a new scheduled job for running the Prowlarr feed scraper. The job is conditionally added based on the settings.disable_prowlarr_feed_scraper flag, allowing for easy enabling/disabling of the feature. If enabled, the job is scheduled using a cron trigger based on the settings.prowlarr_feed_scrape_interval, ensuring periodic execution of the scraper.

The code segment follows the existing pattern for adding scheduled jobs and properly invokes the run_prowlarr_feed_scraper.send function when triggered. The cron expression is correctly passed as a keyword argument to the job.

Overall, the changes are well-structured, adhere to the existing codebase conventions, and enhance the functionality by automating the Prowlarr feed scraping process.
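A runnable sketch of the conditional cron registration described above, using APScheduler's AsyncIOScheduler; the settings object and crontab value are assumptions, and the print stands in for the dramatiq .send() call:

import asyncio
from types import SimpleNamespace

from apscheduler.schedulers.asyncio import AsyncIOScheduler
from apscheduler.triggers.cron import CronTrigger


def run_prowlarr_feed_scraper_send() -> None:
    # Stand-in for run_prowlarr_feed_scraper.send(), which enqueues the actor.
    print("enqueued Prowlarr feed scrape")


def setup_scheduler(scheduler: AsyncIOScheduler, settings) -> None:
    # Register the job only when the feature is enabled.
    if not settings.disable_prowlarr_feed_scraper:
        scheduler.add_job(
            run_prowlarr_feed_scraper_send,
            CronTrigger.from_crontab(settings.prowlarr_feed_scraper_crontab),
            name="prowlarr_feed_scraper",
        )


async def main() -> None:
    settings = SimpleNamespace(
        disable_prowlarr_feed_scraper=False,
        prowlarr_feed_scraper_crontab="0 */3 * * *",  # assumed schedule
    )
    scheduler = AsyncIOScheduler()
    setup_scheduler(scheduler, settings)
    scheduler.start()
    await asyncio.sleep(1)  # keep the loop alive briefly for the demo
    scheduler.shutdown()


asyncio.run(main())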

scrapers/imdb_data.py (6)

149-156: LGTM!

The use of batch_process_with_circuit_breaker to process movie IDs asynchronously is an efficient approach. The circuit breaker pattern is being used correctly to handle failures and prevent cascading failures. Retrying on IMDbDataAccessError is a good practice to handle temporary failures.


157-180: LGTM!

Processing the results from batch_process_with_circuit_breaker and updating the database entries immediately for each processed movie is an efficient approach. It improves responsiveness and potentially reduces memory usage compared to collecting all the results before updating the database.

The logging statement is helpful for tracking the updates.


218-242: LGTM!

The new function get_episode_by_date is a useful addition to retrieve a specific episode from a TV series based on the release date. It utilizes the TVSeries model and the web.update_title method effectively to fetch and filter episodes.

The logic is clear and concise:

  • It creates a TVSeries instance and updates the title with the episodes filtered by the expected year.
  • It then filters the episodes to find the one with the matching release date.
  • If no matching episode is found, it returns None.

244-256: LGTM!

The new function get_season_episodes is a useful addition to retrieve all episodes of a specific season from a TV series. It utilizes the TVSeries model and the web.update_title method effectively to fetch and filter episodes.

The logic is clear and concise:

  • It creates a TVSeries instance and updates the title with the episodes filtered by the specified season.
  • It then retrieves all the episodes of the specified season using the get_episodes_by_season method.
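The real helpers go through cinemagoerng's TVSeries and web.update_title; this plain-data sketch only illustrates the filtering step they perform:

from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Episode:
    season: int
    episode: int
    release_date: Optional[date]


def get_episode_by_date(episodes: List[Episode], expected: date) -> Optional[Episode]:
    # Filter fetched episodes by release date; return None when nothing matches.
    return next((ep for ep in episodes if ep.release_date == expected), None)


def get_season_episodes(episodes: List[Episode], season: int) -> List[Episode]:
    return [ep for ep in episodes if ep.season == season]


eps = [Episode(1, 1, date(2024, 9, 1)), Episode(1, 2, date(2024, 9, 8))]
print(get_episode_by_date(eps, date(2024, 9, 8)))
print(len(get_season_episodes(eps, 1)))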

2-2: LGTM!

The import of the date class from the datetime module is necessary for the new functions get_episode_by_date and get_season_episodes to handle date-related operations.


6-11: Verify the usage of the fuzz module.

The import of the math module and the TVSeries class from the cinemagoerng.model module are necessary for the changed code segments.

However, the fuzz module imported from the thefuzz package is not being used in the changed code segments. Please verify if it is being used in other parts of the code. If not, consider removing the unused import to keep the codebase clean.

scrapers/torrentio.py (8)

19-23: LGTM!

The TorrentioScraper class structure and constructor look good. The inheritance from BaseScraper is appropriate, and the necessary attributes are correctly initialized.


25-50: LGTM!

The scrape_and_parse method is well-structured and follows a clear flow. The use of decorators for caching and rate limiting is appropriate. Error handling is implemented correctly, and the response validation and parsing are delegated to separate methods, promoting separation of concerns.


52-53: LGTM!

The validate_response method performs a simple and clear validation check on the response structure. The logic is concise and easy to understand.


55-68: LGTM!

The parse_response method efficiently processes the stream data concurrently using asyncio.gather. It delegates the processing of individual streams to the process_stream method, promoting separation of concerns. The filtering of None values ensures that only valid streams are returned.


70-163: LGTM!

The process_stream method performs comprehensive processing of an individual stream. The adult content check and title validation are important safeguards. The parsing of the stream title is delegated to a separate method, promoting separation of concerns. The creation of TorrentStreams, Season, and Episode objects is handled appropriately based on the catalog type. Error handling ensures that exceptions during processing are caught and logged.

Tools
Ruff

93-100: Use a single if statement instead of nested if statements

(SIM102)


119-119: Using .strip() with multi-character strings is misleading

(B005)


165-182: LGTM!

The parse_stream_title method effectively parses the stream title and extracts relevant information. The use of PTT.parse_title is appropriate for parsing the torrent name. The extracted data is organized into a dictionary for easy access and further processing.


184-212: LGTM!

The static methods extract_seeders, extract_languages_from_title, extract_languages, and extract_size_string provide utility functions for extracting specific information from the stream data. The use of regular expressions is appropriate for pattern matching and extraction. The methods are focused and perform their specific tasks effectively.


93-100: Skipping static analysis hints.

The nested if statements at lines 93-100 are used for conditional logic and are not overly complex. Combining them into a single if statement may reduce readability. The use of .lstrip() at line 119 is appropriate for removing the "tracker:" prefix from the tracker URLs and is not misleading in this context.

Also applies to: 119-119

Tools
Ruff

93-100: Use a single if statement instead of nested if statements

(SIM102)
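For context on the B005 hint, a short demonstration of why .lstrip() with a multi-character string can surprise, and the exact-prefix alternative:

url = "tracker:udp://tracker.example.org:1337/announce"

# .lstrip("tracker:") strips any of the characters {t, r, a, c, k, e, :} from the
# left, which can eat into the value itself on unlucky inputs:
print("tracker:trackers.example.org".lstrip("tracker:"))  # -> "s.example.org"

# .removeprefix() (Python 3.9+) removes the exact prefix and nothing more:
print(url.removeprefix("tracker:"))  # -> "udp://tracker.example.org:1337/announce"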

utils/network.py (3)

36-62: Improved exception handling and state management in the call method.

The changes to the call method enhance its functionality and robustness:

  • The additional item parameter allows the method to return the item alongside the result or exception, enabling more graceful exception handling.
  • The refined state management logic, particularly in the half-open state, improves the behavior of the circuit breaker by checking the failure count against the half-open attempts threshold before transitioning to the closed state.

These modifications contribute to better error handling and more accurate state transitions in the circuit breaker implementation.


75-138: Improved performance, reliability, and observability in batch_process_with_circuit_breaker.

The modifications to the batch_process_with_circuit_breaker function bring several enhancements:

  • Yielding results as they become available optimizes memory usage and processing time, improving overall performance.
  • Utilizing asyncio.TaskGroup enables concurrent execution of tasks, further boosting performance.
  • The enhanced retry logic, with a clearer separation of successful results and retryable exceptions, improves the reliability of the batch processing by handling failures more effectively.
  • The added logging statements provide better visibility into the processing flow and retry attempts, aiding in debugging and monitoring.

These changes contribute to a more efficient, reliable, and observable batch processing function.
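A simplified, runnable sketch of batched concurrent processing with asyncio.TaskGroup (Python 3.11+); the real batch_process_with_circuit_breaker additionally tracks breaker state and retries failed items, which is only hinted at in the comments here:

import asyncio
from typing import AsyncGenerator, List, Sequence


async def fetch_item(item: int) -> int:
    await asyncio.sleep(0.01)  # stand-in for the real per-item network call
    if item == 3:
        raise ValueError("transient failure")
    return item * 10


async def batch_process(items: Sequence[int], batch_size: int = 2) -> AsyncGenerator[int, None]:
    """Run items in concurrent batches and yield results as each batch finishes."""
    for start in range(0, len(items), batch_size):
        chunk = items[start : start + batch_size]
        results: List[int] = []
        try:
            # TaskGroup (Python 3.11+) runs the whole chunk concurrently and
            # cancels siblings if one task raises.
            async with asyncio.TaskGroup() as tg:
                tasks = [tg.create_task(fetch_item(i)) for i in chunk]
            results = [t.result() for t in tasks]
        except* ValueError as eg:
            # A real circuit breaker would count these failures, retry the
            # failed items, and open the breaker past a threshold.
            print(f"{len(eg.exceptions)} item(s) failed in this batch")
        for result in results:
            yield result


async def main() -> None:
    async for value in batch_process([1, 2, 3, 4, 5]):
        print(value)


asyncio.run(main())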


205-205: Updated function signature to accurately reflect the return type.

The change in the function signature of get_user_public_ip from str to str | None accurately reflects the possibility of returning None when the user's IP address is a private IP address. This improves the clarity and correctness of the function's return type.

streaming_providers/routes.py (2)

69-77: LGTM! Remember to address the TODO comment in the future.

The changes improve the robustness of the stream retrieval process by adding a fallback mechanism for case sensitivity. This accommodates legacy data formats while maintaining compatibility.

Please ensure to remove the uppercase fallback in the future as indicated by the TODO comment, once the legacy data has been migrated.


188-188: LGTM!

Converting the info_hash to lowercase ensures consistency in how the hash is processed throughout the streaming provider endpoint. This aligns with the changes made in the fetch_stream_or_404 function and ensures that the caching and locking mechanisms operate consistently, regardless of the case of the provided info_hash.
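The fallback described above, sketched against an in-memory dict that stands in for the database lookup (the real code raises an HTTP 404 rather than KeyError):

import asyncio
from typing import Dict


async def fetch_stream_or_404(streams: Dict[str, dict], info_hash: str) -> dict:
    """Look up a stream by info hash, falling back to uppercase for legacy rows."""
    stream = streams.get(info_hash.lower())
    if stream is None:
        # TODO: drop this fallback once legacy uppercase hashes are migrated.
        stream = streams.get(info_hash.upper())
    if stream is None:
        raise KeyError(f"Stream not found for {info_hash}")  # real code: HTTP 404
    return stream


legacy_db = {"ABCDEF0123456789ABCDEF0123456789ABCDEF01": {"name": "legacy stream"}}
print(asyncio.run(fetch_stream_or_404(legacy_db, "abcdef0123456789abcdef0123456789abcdef01")))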

scrapers/routes.py (2)

296-323: LGTM!

The handle_movie_stream_store function encapsulates the logic for storing a movie torrent stream in a structured manner. It creates a new TorrentStreams instance using attributes from the parsed data, sets the updated_at timestamp to the current datetime, and logs the creation of the movie stream. The function ensures that the necessary attributes are populated in the TorrentStreams instance.


326-379: LGTM!

The handle_series_stream_store function encapsulates the logic for storing a series torrent stream in a structured manner. It ensures that the torrent pertains to a single season and prepares episode data based on the availability of detailed file data or basic episode numbers. If no valid episode data is found, it skips the torrent. The inclusion of the Season object in the TorrentStreams instance provides a structured representation of the episode data.

utils/parser.py (1)

90-90: Approved: Improved error logging.

The change from logging.error to logging.exception enhances the error handling by capturing and logging the full traceback of exceptions. This provides more context and detail about the error, making it easier for developers to trace the source of the issue and debug effectively.
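A tiny example of the difference: logging.exception attaches the active traceback, which logging.error alone would omit:

import logging

logging.basicConfig(level=logging.INFO)


def parse_title(raw: str) -> dict:
    try:
        return {"title": raw.rsplit(".", 1)[0], "year": int(raw.rsplit(".", 1)[1])}
    except (ValueError, IndexError):
        # logging.exception == logging.error(..., exc_info=True): the full
        # traceback is included in the log record.
        logging.exception("Failed to parse title %r", raw)
        return {}


parse_title("Some.Movie.Name")  # logs the error with a traceback, returns {}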

api/main.py (3)

16-16: LGTM!

The import statement is correct.


48-48: LGTM!

Including the filename and line number in the log messages improves traceability during debugging. The updated logging format is correct.


554-554: Verify the function signature change in the codebase.

The function signature is updated to include the background_tasks parameter. Ensure that all function calls to get_streams have been updated to pass the background_tasks argument.

Run the following script to verify the function usage:

Also applies to: 565-565

scrapers/prowlarr.py (10)

71-96: LGTM!

The scrape_and_parse method looks good. It handles the scraping and parsing of streams, catches exceptions, and logs relevant information.


98-118: LGTM!

The _scrape_and_parse method correctly routes the scraping based on the catalog type. It raises an error for unsupported catalog types, which is a good practice.


120-147: LGTM!

The scrape_movie method looks good. It scrapes movie streams using both IMDb ID and title search (if enabled), processes the scraped streams, and sends a background search task (if enabled).


153-191: LGTM!

The scrape_series method looks good. It scrapes series streams using both IMDb ID and title search (if enabled), processes the scraped streams, and sends a background search task (if enabled).


199-278: LGTM!

The process_streams method looks good. It efficiently processes streams from multiple generators concurrently using a queue. It respects the processing limits, handles exceptions, and logs relevant information.
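A stripped-down sketch of draining several async generators through one queue with an item limit; the PR's process_streams also applies filtering and richer error handling, so this only shows the merge mechanics:

import asyncio
from typing import AsyncGenerator


async def indexer_a() -> AsyncGenerator[str, None]:
    for name in ("a1", "a2"):
        await asyncio.sleep(0.01)
        yield name


async def indexer_b() -> AsyncGenerator[str, None]:
    for name in ("b1",):
        await asyncio.sleep(0.02)
        yield name


async def merge_streams(*generators, max_items: int = 10) -> AsyncGenerator[str, None]:
    """Drain several async generators concurrently through one queue."""
    queue: asyncio.Queue = asyncio.Queue()
    done = object()  # sentinel pushed when a producer finishes

    async def producer(gen):
        async for item in gen:
            await queue.put(item)
        await queue.put(done)

    producers = [asyncio.create_task(producer(g)) for g in generators]
    finished, yielded = 0, 0
    while finished < len(producers) and yielded < max_items:
        item = await queue.get()
        if item is done:
            finished += 1
            continue
        yielded += 1
        yield item
    for task in producers:
        task.cancel()  # stop early if the item limit was hit


async def main() -> None:
    async for stream in merge_streams(indexer_a(), indexer_b(), max_items=5):
        print(stream)


asyncio.run(main())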


280-290: LGTM!

The fetch_stream_data method looks good. It makes a request to the Prowlarr API using the provided parameters and returns the JSON response. It also raises any HTTP errors, which is a good practice.


292-306: LGTM!

The build_search_params method looks good. It correctly formats the search query based on the search type and returns the required parameters for the Prowlarr API.


562-603: LGTM!

The parse_prowlarr_data method looks good. It extracts the download URL, fetches the torrent data, handles exceptions, and updates the torrent data with additional metadata. The error handling and returning None in case of errors is a good practice.


696-725: LGTM!

The background_movie_title_search function looks good. It performs a background movie title search using the ProwlarrScraper. It fetches the movie metadata, generates title search queries, processes the scraped streams, and stores them. The use of minimum_run_interval and dramatiq.actor decorators is appropriate for a background task.


729-767: LGTM!

The background_series_title_search function looks good. It performs a background series title search using the ProwlarrScraper. It fetches the series metadata, generates title search queries, processes the scraped streams, and stores them. The use of minimum_run_interval and dramatiq.actor decorators is appropriate for a background task.

db/crud.py (7)

Line range hint 297-330: LGTM!

The changes to get_movie_streams function improve code modularity, performance, and functionality:

  • The addition of BackgroundTasks allows for asynchronous processing of new streams.
  • The run_scrapers function consolidates the scraping logic for movies.
  • Storing new streams asynchronously using a background task enhances performance.
  • The function correctly combines new and cached streams to provide a comprehensive set of streams.

Line range hint 333-370: LGTM!

The changes to get_series_streams function are similar to those in get_movie_streams and provide the same benefits:

  • The addition of BackgroundTasks allows for asynchronous processing of new streams.
  • The run_scrapers function consolidates the scraping logic for series.
  • Storing new streams asynchronously using a background task enhances performance.
  • The function correctly combines new and cached streams to provide a comprehensive set of streams.

374-414: LGTM!

The new store_new_torrent_streams function efficiently stores new torrent streams in the database:

  • The use of BulkWriter minimizes database round trips and improves performance.
  • The function correctly handles existing streams by updating them with new data.
  • It also handles adding new episodes to existing series streams.
  • The logging statements provide useful information for monitoring and debugging purposes.
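An in-memory sketch of the update-or-insert-then-flush pattern; a plain dict stands in for MongoDB, and the single print stands in for the one bulk round trip that Beanie's BulkWriter provides:

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class StreamDoc:
    id: str
    seeders: int
    updated_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def store_new_torrent_streams(db: Dict[str, StreamDoc], new_streams: List[StreamDoc]) -> None:
    """Update known streams, insert unknown ones, then flush once."""
    inserts, updates = [], []
    for stream in new_streams:
        (updates if stream.id in db else inserts).append(stream)
    for stream in updates:
        existing = db[stream.id]
        existing.seeders = stream.seeders  # refresh volatile fields
        existing.updated_at = datetime.now(timezone.utc)
    for stream in inserts:
        db[stream.id] = stream
    print(f"Bulk write: {len(inserts)} inserted, {len(updates)} updated")


db: Dict[str, StreamDoc] = {"abc": StreamDoc("abc", seeders=5)}
store_new_torrent_streams(db, [StreamDoc("abc", 42), StreamDoc("def", 7)])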

Line range hint 692-717: LGTM!

The new get_or_create_metadata function efficiently retrieves or creates metadata for a given media type:

  • The function first checks for existing metadata, avoiding unnecessary database operations.
  • If metadata doesn't exist, the function creates a new metadata object using the provided data and IMDB data (if available).
  • The function correctly handles duplicate key errors by waiting and re-fetching the metadata, ensuring data consistency.
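The duplicate-key handling can be sketched with in-memory stand-ins for the find/insert calls; DuplicateKeyError here mimics the unique-index violation another worker would trigger:

import asyncio
import random
from typing import Dict, Optional

_store: Dict[str, dict] = {}  # stands in for the metadata collection


class DuplicateKeyError(Exception):
    """Stand-in for pymongo's DuplicateKeyError on a unique index."""


async def find_metadata(meta_id: str) -> Optional[dict]:
    return _store.get(meta_id)


async def insert_metadata(meta_id: str, data: dict) -> None:
    if meta_id in _store:  # unique-index violation
        raise DuplicateKeyError(meta_id)
    _store[meta_id] = data


async def get_or_create_metadata(meta_id: str, data: dict) -> Optional[dict]:
    existing = await find_metadata(meta_id)
    if existing:
        return existing  # cheap path: nothing to insert
    try:
        await insert_metadata(meta_id, data)
        return data
    except DuplicateKeyError:
        # Another worker inserted it first; wait briefly and re-read.
        await asyncio.sleep(random.uniform(0.1, 0.5))
        return await find_metadata(meta_id)


print(asyncio.run(get_or_create_metadata("tt0111161", {"title": "Example"})))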

720-725: LGTM!

The new save_metadata function efficiently saves metadata and creates a new stream object:

  • The function avoids unnecessary operations by first checking if the stream already exists.
  • The function correctly retrieves or creates metadata using the get_or_create_metadata function.
  • The function creates a new stream object using the provided metadata and the create_stream_object function.

Line range hint 648-667: LGTM!

The create_metadata_object function correctly creates a new metadata object based on the provided metadata and IMDB data:

  • The function extracts relevant data from the provided metadata and IMDB data.
  • The function handles the case where the year is provided as a range, ensuring correct data format.
  • The function creates a new metadata object using the extracted data, ensuring data consistency.

Line range hint 669-689: LGTM!

The create_stream_object function correctly creates a new stream object based on the provided metadata:

  • The function extracts relevant data from the provided metadata.
  • The function handles the case where the stream is for a movie by including the filename and file index, ensuring correct data format.
  • The function creates a new TorrentStreams object using the extracted data, ensuring data consistency.

@@ -7,7 +7,7 @@
# import background actors
# noqa: F401
from mediafusion_scrapy import task
-from scrapers import tv, imdb_data, trackers, helpers, prowlarr
+from scrapers import tv, imdb_data, trackers, helpers, prowlarr, prowlarr_feed

Remove unused imports.

The static analysis tool correctly points out that several imports, including the newly added prowlarr_feed import, are unused in this file.

Unless there are plans to use these imports in the near future, it's best to remove them to keep the codebase clean and maintainable.

Apply this diff to remove the unused imports:

-from scrapers import tv, imdb_data, trackers, helpers, prowlarr, prowlarr_feed
+from scrapers import prowlarr_feed

If you intend to use these imports in upcoming commits, feel free to ignore this comment.


Comment on lines +115 to +132
def parse_response(
self,
response: Dict[str, Any],
metadata: MediaFusionMetaData,
catalog_type: str,
season: int = None,
episode: int = None,
) -> List[TorrentStreams]:
"""
Parse the response into TorrentStreams objects.
:param response: Response dictionary
:param metadata: MediaFusionMetaData object
:param catalog_type: Catalog type (movie, series)
:param season: Season number (for series)
:param episode: Episode number (for series)
:return: List of TorrentStreams objects
"""
pass

Add the @abc.abstractmethod decorator to the parse_response method.

Since the parse_response method is empty and is part of an abstract base class, it should be declared as an abstract method using the @abc.abstractmethod decorator. This will enforce the implementation of the method in the subclasses.

Apply this diff to add the @abc.abstractmethod decorator:

+    @abc.abstractmethod
     def parse_response(
         self,
         response: Dict[str, Any],
         metadata: MediaFusionMetaData,
         catalog_type: str,
         season: int = None,
         episode: int = None,
     ) -> List[TorrentStreams]:
         """
         Parse the response into TorrentStreams objects.
         :param response: Response dictionary
         :param metadata: MediaFusionMetaData object
         :param catalog_type: Catalog type (movie, series)
         :param season: Season number (for series)
         :param episode: Episode number (for series)
         :return: List of TorrentStreams objects
         """
         pass

Comment on lines +107 to +113
def validate_response(self, response: Dict[str, Any]) -> bool:
"""
Validate the response from the scraper.
:param response: Response dictionary
:return: True if valid, False otherwise
"""
pass

Add the @abc.abstractmethod decorator to the validate_response method.

Since the validate_response method is empty and is part of an abstract base class, it should be declared as an abstract method using the @abc.abstractmethod decorator. This will enforce the implementation of the method in the subclasses.

Apply this diff to add the @abc.abstractmethod decorator:

+    @abc.abstractmethod
     def validate_response(self, response: Dict[str, Any]) -> bool:
         """
         Validate the response from the scraper.
         :param response: Response dictionary
         :return: True if valid, False otherwise
         """
         pass

@@ -508,6 +509,7 @@ async def get_streams(
season: int = None,
episode: int = None,
user_data: schemas.UserData = Depends(get_user_data),
background_tasks: BackgroundTasks = BackgroundTasks(),

Move the BackgroundTasks call within the function.

Performing the BackgroundTasks call in the argument defaults can lead to unexpected behavior.

Apply this diff to fix the issue:

-    background_tasks: BackgroundTasks = BackgroundTasks(),
+    background_tasks: BackgroundTasks = None,
):
+    if background_tasks is None:
+        background_tasks = BackgroundTasks()
