Enhanced scraping capabilities with improved indexer search, health checks, and metrics tracking #367
Conversation
Renamed scrape_and_parse to _scrape_and_parse and added comprehensive error handling and logging metrics for scenarios such as timeouts, HTTP errors, and validation issues. Introduced a metrics class in base_scraper.py to summarize scraping performance statistics, which is now used across the prowlarr, torrentio, and zilean scrapers.
…thcheck management & comprehensive metrics summary
Added detailed health checks for indexers and included their statuses in logging. Enhanced background search to use circuit breakers and handle indexers in chunks, improving reliability and fault tolerance.
…for moving to imdb title
Reorganized Circuit Breaker class for better clarity and maintainability. Added methods to manage states (`is_closed`, `reset`, `record_failure`, `record_success`) and refactored the `call` method to utilize these state checks. Enhanced logging and status reporting with the `get_status` method. (A rough sketch of this interface follows the commit list below.)
Streamlined the `get_series_meta` function by removing complex filtering conditions and reducing the aggregation pipeline.
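For orientation, here is a hedged sketch of a circuit breaker exposing the interface named in the commit above (`is_closed`, `reset`, `record_failure`, `record_success`, `call`, `get_status`). It is not the repository's utils/network.py implementation; the failure-threshold and recovery-timeout defaults are assumptions taken from the review discussion further down.

```python
import time
from typing import Any, Awaitable, Callable


class CircuitBreakerSketch:
    """Illustrative circuit breaker; thresholds and state names are assumptions."""

    def __init__(self, failure_threshold: int = 2, recovery_timeout: float = 10.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def is_closed(self) -> bool:
        # HALF-OPEN still lets a trial call through.
        if self.state in ("CLOSED", "HALF-OPEN"):
            return True
        if time.monotonic() - self.opened_at >= self.recovery_timeout:
            self.state = "HALF-OPEN"
            return True
        return False

    def record_success(self) -> None:
        self.reset()

    def record_failure(self) -> None:
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic()

    def reset(self) -> None:
        self.failure_count = 0
        self.state = "CLOSED"

    async def call(self, func: Callable[..., Awaitable[Any]], *args, **kwargs) -> Any:
        # Refuse the call while the breaker is open, otherwise record the outcome.
        if not self.is_closed():
            raise RuntimeError("Circuit breaker is open")
        try:
            result = await func(*args, **kwargs)
        except Exception:
            self.record_failure()
            raise
        self.record_success()
        return result

    def get_status(self) -> dict:
        return {"state": self.state, "failures": self.failure_count}
```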
Caution: Review failed. The pull request is closed.
Walkthrough
This pull request introduces several modifications across multiple files, primarily focusing on enhancing error handling, simplifying logic for data retrieval, and improving metrics tracking in scraping processes. Key changes include a reduction in the default timeout for search queries, updates to the handling of duplicate keys in movie and series data, and the introduction of a new ScraperMetrics class.
Changes
Actionable comments posted: 7
🧹 Outside diff range and nitpick comments (18)
scrapers/helpers.py (1)
103-111: Consider enhancing error handling and configurability.
While the current implementation is solid, consider these improvements:
- Use specific exception types (e.g., TimeoutError, ConnectionError) for better error handling
- Make the timeout configurable via settings
- Add a retry mechanism for transient failures
- Consider versioning the Chrome impersonation string
Here's a suggested implementation:
```diff
+from typing import Optional
+from tenacity import retry, stop_after_attempt, wait_exponential
+
+DEFAULT_TIMEOUT = 30
+
-async def get_page_bs4(url: str):
+async def get_page_bs4(url: str, timeout: int = DEFAULT_TIMEOUT) -> Optional[BeautifulSoup]:
+    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
+    async def _fetch_with_retry(session: AsyncSession, url: str) -> BeautifulSoup:
+        response = await session.get(
+            url,
+            impersonate="chrome110",  # Specify version for better maintainability
+            timeout=timeout
+        )
+        if response.status_code != 200:
+            raise ValueError(f"HTTP {response.status_code}")
+        return BeautifulSoup(response.text, "html.parser")
+
     try:
         async with AsyncSession(proxies=PROXIES) as session:
-            response = await session.get(url, impersonate="chrome", timeout=30)
-            if response.status_code != 200:
-                return None
-            return BeautifulSoup(response.text, "html.parser")
-    except Exception as e:
+            return await _fetch_with_retry(session, url)
+    except ValueError as e:
+        logging.error(f"Invalid response for {url}: {e}")
+        return None
+    except TimeoutError as e:
+        logging.error(f"Timeout fetching {url}: {e}")
+        return None
+    except ConnectionError as e:
+        logging.error(f"Connection error for {url}: {e}")
+        return None
+    except Exception as e:  # Catch unexpected errors
         logging.error(f"Error fetching page: {url}, error: {e}")
         return None
```
db/config.py (1)
39-39: Add a comment explaining the timeout rationale
Consider adding a comment explaining why 15 seconds was chosen as the optimal timeout value. This helps future maintainers understand the reasoning behind this specific duration.
```diff
-    prowlarr_search_query_timeout: int = 15
+    # 15s timeout balances responsiveness with slower indexers' needs
+    prowlarr_search_query_timeout: int = 15
```
scrapers/zilean.py (6)
22-22: Consider maintaining consistency in logger naming
Changing the logger_name to __name__ will set the logger to use the module name rather than the class name. If the intention is to have logs categorized by class name for clarity, consider using self.__class__.__name__ instead.
Apply this diff to revert to the class name:
```diff
-        super().__init__(cache_key_prefix=self.cache_key_prefix, logger_name=__name__)
+        super().__init__(cache_key_prefix=self.cache_key_prefix, logger_name=self.__class__.__name__)
```
72-74: Correct grammatical error in log message
The log message should use "searching" instead of "search" for proper grammar.
Apply this diff to fix the log message:
self.logger.error( - f"Error occurred while search {metadata.title}: {search_response}" + f"Error occurred while searching {metadata.title}: {search_response}" )
130-130: Clarify function name for grammatical correctness
The function name is_contain_18_plus_keywords is grammatically incorrect. Consider renaming it to contains_18_plus_keywords for better readability.
Apply this diff if the function is defined in this module:
```diff
-        if is_contain_18_plus_keywords(stream["raw_title"]):
+        if contains_18_plus_keywords(stream["raw_title"]):
```
Ensure that the function definition and all its references are updated accordingly.
133-135: Avoid logging sensitive content in warnings
Logging the full title of adult content may expose sensitive information. Consider masking the title or reducing the log level to prevent potential privacy issues.
Apply this diff to modify the log message:
self.logger.warning( - f"Stream contains 18+ keywords: {stream['raw_title']}" + "Stream contains 18+ keywords in the title." )
166-168: Handle multiple seasons more gracefully
Currently, torrents with multiple seasons are skipped without processing. Depending on the use case, you might want to handle multi-season torrents instead of skipping them.
Consider implementing logic to process torrents that cover multiple seasons, or provide a more descriptive skip reason if this is intentional.
197-200: Handle specific exceptions where possible
Catching the broad Exception class can mask other issues. Consider catching more specific exceptions to improve error handling and debugging.
Apply this diff to catch specific exceptions:
```diff
             return torrent_stream
-        except Exception as e:
+        except KeyError as e:
             self.metrics.record_error("stream_processing_error")
             self.logger.exception(f"Error processing stream: {e}")
             return None
```
Adjust the exception type based on the specific errors you expect.
scrapers/prowlarr_feed.py (2)
52-52: Consider making chunk size configurable
To enhance flexibility and maintainability, consider making the chunk size parameter in split_indexers_into_chunks configurable instead of hardcoding it to 3, for example as sketched below.
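A minimal sketch of how the chunk size could be made configurable, assuming split_indexers_into_chunks simply partitions a list of indexer IDs (the parameter name and default value here are illustrative, not the function in scrapers/prowlarr_feed.py):

```python
from typing import Iterable, List


def split_indexers_into_chunks(
    indexer_ids: Iterable[int], chunk_size: int = 3
) -> List[List[int]]:
    """Partition indexer IDs into fixed-size chunks for background searches."""
    ids = list(indexer_ids)
    return [ids[i : i + chunk_size] for i in range(0, len(ids), chunk_size)]
```

Callers could then pass a value sourced from settings (e.g. a hypothetical settings.prowlarr_feed_chunk_size) instead of relying on the default.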
62-62: Use predefined constants for category IDs
To improve readability and maintainability, replace hardcoded category IDs with predefined constants like MOVIE_CATEGORY_IDS, SERIES_CATEGORY_IDS, and OTHER_CATEGORY_IDS.
Apply this diff to make the change:
"categories": [ - 2000, 5000, 8000 # Movies, TV, and Other categories + *MOVIE_CATEGORY_IDS, *SERIES_CATEGORY_IDS, *OTHER_CATEGORY_IDS # Movies, TV, and Other categories ],scrapers/torrentio.py (2)
25-25: Consider consistency in logger naming
The logger is now initialized with __name__ instead of self.__class__.__name__. Ensure this change aligns with the logging conventions used across other scrapers for consistency.
210-211: Consider handling parsing exceptions without re-raising
Currently, when an exception occurs in parse_stream_title, the exception is recorded and then re-raised. This could cause the entire scraping process to fail due to a single stream's parsing error. Consider returning None instead of raising the exception to allow the scraper to continue processing other streams.
Apply this diff to prevent the exception from halting the process:
```diff
         except Exception as e:
             self.metrics.record_error("title_parsing_error")
-            raise e
+            self.logger.warning(f"Failed to parse stream title: {e}")
+            return None
```
utils/network.py (1)
42-45: Simplify conditional statements in the is_closed method
The if and elif blocks at lines 42-45 can be combined using a logical or operator or the in keyword for improved readability.
Apply this diff to simplify the code:
- if self.state == "CLOSED": - return True - elif self.state == "HALF-OPEN": - return True + if self.state == "CLOSED" or self.state == "HALF-OPEN": + return TrueAlternatively, using the
in
keyword:- if self.state == "CLOSED": - return True - elif self.state == "HALF-OPEN": - return True + if self.state in ("CLOSED", "HALF-OPEN"): + return True🧰 Tools
🪛 Ruff
42-45: Combine if branches using logical or operator
Combine if branches
(SIM114)
scrapers/base_scraper.py (3)
244-244: Consider using a more descriptive scraper name
In the initialization of ScraperMetrics, the scraper_name is set to cache_key_prefix. Ensure that cache_key_prefix accurately represents the scraper's name for clarity in metrics reporting. If cache_key_prefix is not descriptive enough, consider providing a more explicit scraper name.
Line range hint 368-374: Ensure consistent typing for season and episode parameters
In the get_cache_key method, the parameters season and episode are typed as str = None, whereas in other methods like scrape_and_parse and parse_response they are typed as int = None. This inconsistency might lead to unexpected behavior or type errors when generating cache keys.
Apply this diff to correct the parameter types:
```diff
     def get_cache_key(
         self,
         metadata: MediaFusionMetaData,
         catalog_type: str,
-        season: str = None,
-        episode: str = None,
+        season: int = None,
+        episode: int = None,
         *_args,
         **_kwargs,
     ) -> str:
         """
         Generate a cache key for the given arguments.
         :return: Cache key string
         """
```
Line range hint 376-396: Mark parse_response and validate_response as abstract methods
The methods parse_response and validate_response are intended to be implemented by subclasses but are not marked with the @abc.abstractmethod decorator. This could lead to subclasses not implementing these methods, potentially causing unexpected behavior at runtime.
Apply this diff to mark these methods as abstract methods:
```diff
+    @abc.abstractmethod
     async def parse_response(
         self,
         response: Dict[str, Any],
         metadata: MediaFusionMetaData,
         catalog_type: str,
         season: int = None,
         episode: int = None,
     ) -> List[TorrentStreams]:
         """
         Parse the response into TorrentStreams objects.
         """
         pass

+    @abc.abstractmethod
     def validate_response(self, response: Dict[str, Any]) -> bool:
         """
         Validate the response from the scraper.
         :param response: Response dictionary
         :return: True if valid, False otherwise
         """
         pass
```
scrapers/prowlarr.py (2)
448-499: Consider fetching search results concurrently to improve performance
In the fetch_search_results method, search queries are sent to each indexer sequentially within a for-loop. This can lead to increased total processing time, especially when dealing with a large number of indexers.
Consider using asynchronous tasks to fetch search results from multiple indexers concurrently, utilizing asyncio.gather. This can significantly improve the performance of the scraping process.
Here's a suggested refactor:
```diff
-        for indexer_id in indexer_ids:
-            indexer_status = self.indexer_status.get(indexer_id, {})
-            if not indexer_status.get("is_healthy", False):
-                continue
-            circuit_breaker = self.get_circuit_breaker(indexer_id)
-            indexer_name = indexer_status.get("name", f"ID:{indexer_id}")
-            if circuit_breaker.is_closed():
-                try:
-                    search_params = {**params, "indexerIds": [indexer_id]}
-                    async with httpx.AsyncClient(headers=self.headers) as client:
-                        response = await client.get(
-                            self.search_url,
-                            params=search_params,
-                            timeout=timeout,
-                        )
-                        response.raise_for_status()
-                        indexer_results = response.json()
-                    # Record success
-                    circuit_breaker.record_success()
-                    self.metrics.record_indexer_success(
-                        indexer_name, len(indexer_results)
-                    )
-                    results.extend(indexer_results)
-                except Exception as e:
-                    # Exception handling as before
-                    pass
-            else:
-                # Handle circuit breaker open
-                pass
+        async def fetch_for_indexer(indexer_id):
+            indexer_status = self.indexer_status.get(indexer_id, {})
+            if not indexer_status.get("is_healthy", False):
+                return []
+            circuit_breaker = self.get_circuit_breaker(indexer_id)
+            indexer_name = indexer_status.get("name", f"ID:{indexer_id}")
+            if circuit_breaker.is_closed():
+                try:
+                    search_params = {**params, "indexerIds": [indexer_id]}
+                    async with httpx.AsyncClient(headers=self.headers) as client:
+                        response = await client.get(
+                            self.search_url,
+                            params=search_params,
+                            timeout=timeout,
+                        )
+                        response.raise_for_status()
+                        indexer_results = response.json()
+                    # Record success
+                    circuit_breaker.record_success()
+                    self.metrics.record_indexer_success(
+                        indexer_name, len(indexer_results)
+                    )
+                    return indexer_results
+                except Exception as e:
+                    # Exception handling as before
+                    return []
+            else:
+                # Handle circuit breaker open
+                return []
+
+        tasks = [fetch_for_indexer(indexer_id) for indexer_id in indexer_ids]
+        indexer_results_list = await asyncio.gather(*tasks)
+        # Flatten the list of results
+        results = [item for sublist in indexer_results_list for item in sublist]
```
986-1020: Reduce code duplication in background search tasks
The background_movie_title_search and background_series_title_search methods contain duplicated code for fetching healthy indexers, splitting them into chunks, initializing metrics, and creating search generators.
Consider refactoring the common logic into a shared helper function or method to promote code reuse and improve maintainability; a rough sketch follows this comment.
Also applies to: 1075-1108
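As a hedged sketch of the suggested refactor (the helper name, signature, and injected callable are assumptions, not the PR's code), the shared setup could be pulled into one coroutine that both background tasks await before driving their own search generators:

```python
from typing import Awaitable, Callable, List


async def prepare_indexer_chunks(
    fetch_healthy_indexers: Callable[[], Awaitable[List[int]]],
    chunk_size: int = 3,
) -> List[List[int]]:
    """Fetch healthy indexer IDs and split them into fixed-size chunks."""
    indexer_ids = await fetch_healthy_indexers()
    return [
        indexer_ids[i : i + chunk_size]
        for i in range(0, len(indexer_ids), chunk_size)
    ]
```

The movie and series tasks would then each call this helper, start their metrics, and only differ in how they build their title-search generators.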
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (9)
- db/config.py (1 hunks)
- db/crud.py (3 hunks)
- scrapers/base_scraper.py (8 hunks)
- scrapers/helpers.py (1 hunks)
- scrapers/prowlarr.py (23 hunks)
- scrapers/prowlarr_feed.py (3 hunks)
- scrapers/torrentio.py (5 hunks)
- scrapers/zilean.py (5 hunks)
- utils/network.py (3 hunks)
🧰 Additional context used
🪛 Ruff
scrapers/base_scraper.py
379-396: BaseScraper.parse_response is an empty method in an abstract base class, but has no abstract decorator (B027)
scrapers/prowlarr.py
419-419: Undefined name ExceptionGroup. Consider specifying requires-python = ">= 3.11" or tool.ruff.target-version = "py311" in your pyproject.toml file. (F821)
utils/network.py
42-45: Combine if branches using logical or operator. Combine if branches. (SIM114)
🔇 Additional comments (18)
scrapers/helpers.py (1)
103-111: LGTM! Error handling has been implemented correctly.
The implementation properly handles HTTP request failures and invalid responses, preventing crashes and providing good error visibility through logging.
db/config.py (1)
39-39: Verify the impact of reducing the timeout from 120s to 15s
While reducing the timeout can improve system responsiveness and prevent resource exhaustion, 15 seconds might be too aggressive for some slower indexers. Let's verify the impact:
✅ Verification successful
Based on the search results, I can now generate a final response:
The 15-second timeout is appropriate and well-aligned with the codebase
The reduced timeout value is consistent with existing timeout patterns in the codebase:
- Other Prowlarr-related operations use similar timeouts (e.g., 10s for indexer status checks)
- The system has robust error handling for timeouts in prowlarr.py with:
  - Circuit breaker pattern (2 failures threshold, 10s recovery)
  - Proper error logging and metrics collection
  - Retry mechanisms through background processing
- Similar services in the codebase use comparable timeouts:
  - Debrid client: 15s
  - Base scraper client: 30s
  - Most API calls: 10-15s
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for timeout-related error logs and any indexer performance metrics
# Search for timeout-related errors or warnings in the logs
rg -i "timeout|timed?.?out" --type py
# Look for any performance metrics or timing configurations
rg -i "prowlarr.*timeout|search.*timeout" --type py
# Check for any existing error handling related to timeouts
ast-grep --pattern 'try {
$$$
timeout
$$$
} catch {
$$$
}'
Length of output: 8649
scrapers/zilean.py (2)
87-90: Ensure the validate_response method handles all edge cases
After recording the fetched items, the validate_response method is called. Ensure that this method thoroughly checks for all possible invalid or unexpected responses to prevent errors downstream.
Run the following script to review the validate_response method implementation:
✅ Verification successful
The validate_response method is appropriately implemented for its use case
The validate_response method in zilean.py performs the necessary validation for its specific requirements:
- It checks if the response is a list type
- It verifies the list is not empty
- This validation is sufficient as the subsequent parse_response and process_stream methods handle the detailed validation of individual stream items
The error handling is comprehensive across the workflow:
- Failed requests are logged and tracked via metrics
- Invalid responses trigger "no_valid_streams" error
- Stream processing errors are caught and tracked
- Adult content is filtered
- Invalid title/year matches are filtered
- Missing season/episode information for series is handled
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Retrieve and display the `validate_response` method to ensure it handles all edge cases.
# Search for the `validate_response` method definition in the codebase.
ast-grep --pattern $'def validate_response($_):\n $$$' --lang python
Length of output: 71
Script:
#!/bin/bash
# Let's try a broader search to find the validate_response method
rg -A 10 "validate_response" --type python
Length of output: 73
Script:
#!/bin/bash
# Let's try a simpler search without file type specification
rg -A 10 "validate_response"
# Also let's check the content of the zilean.py file to understand the context better
fd "zilean.py" --exec cat {}
Length of output: 10185
174-183: Review episode handling logic for potential issues
The logic for handling episodes might not cover all scenarios, such as when episodes are missing but the season matches. Ensure that all edge cases are addressed to prevent skipping valid streams.
Run the following script to check for streams where episodes are missing but seasons match:
✅ Verification successful
Let me gather more context about the episode handling logic and related code.
Let me check one more thing to understand how seasons are handled in this context.
The episode handling logic is correct and handles edge cases appropriately
The code follows a clear and logical flow for handling episodes:
- First checks if explicit episode numbers are provided in torrent_data["episodes"]
- If no episodes but the season matches, it defaults to episode 1 (common for single-episode seasons)
- Otherwise, correctly skips the stream when episode info is missing
The season in seasons check is used after validating that seasons contains exactly one season number, making it a safe fallback for single-episode content; a sketch of this flow follows.
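A hypothetical helper mirroring that decision flow (the name, signature, and return convention are illustrative, not the code in scrapers/zilean.py):

```python
from typing import List, Optional


def resolve_episodes(
    torrent_data: dict, wanted_season: int, seasons: List[int]
) -> Optional[List[int]]:
    """Return the episode numbers to use, or None if the stream should be skipped."""
    if episodes := torrent_data.get("episodes"):
        return list(episodes)  # explicit episode numbers take precedence
    if len(seasons) == 1 and wanted_season in seasons:
        return [1]  # single matching season with no episode list: assume episode 1
    return None  # insufficient information: caller skips this stream
```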
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Find streams where episodes are missing but seasons match the metadata.
# Test: Search for such streams in the codebase.
# Note: Replace the placeholders with actual values if necessary.
rg -A 5 "if episodes := torrent_data.get\(\"episodes\"\)" --glob "**/*.py"
Length of output: 490
Script:
#!/bin/bash
# Get more context around the episode handling logic
rg -B 10 -A 10 "if episodes := torrent_data.get\(\"episodes\"\)" --glob "**/*.py"
# Find other episode-related logic in the codebase
rg "Episode\(" --glob "**/*.py"
# Look for season-related logic
rg "season in seasons" --glob "**/*.py"
Length of output: 3125
Script:
#!/bin/bash
# Get more context about how seasons are handled
rg -B 15 -A 5 "season in seasons" --glob "**/*.py"
# Look for the Season model/type definition
ast-grep --pattern 'class Season'
Length of output: 1479
scrapers/torrentio.py (9)
50-54: Approve addition of metrics for response validation
The added metrics for recording invalid responses and found items enhance monitoring and error tracking. This improvement is aligned with better error handling practices.
60-63: Approve enhanced error handling with specific metrics
Recording specific error metrics for request_failed and unexpected_error improves the ability to diagnose and address issues more effectively.
96-96: Approve skip metric for adult content streams
Recording skips for streams containing adult content enhances the granularity of metrics and assists in content filtering analysis.
145-145: Approve skip metric for multiple seasons torrents
Adding a skip metric for torrents with multiple seasons improves tracking of unprocessed items due to content structure, aiding in scraper performance analysis.
176-180: Approve recording of successful processing metrics
Recording metrics for processed items, quality, and source provides valuable insights into scraper performance and stream quality distribution.
184-184: Approve error handling in stream processing
Recording stream_processing_error enhances error tracking for individual streams and aids in debugging specific issues during processing.
196-199: Ensure convert_size_to_bytes handles empty or invalid size strings
In the code:
"size": convert_size_to_bytes(
self.extract_size_string(stream["title"])
),
If extract_size_string returns an empty string or an invalid size format, convert_size_to_bytes may raise an exception. Ensure that this scenario is handled gracefully, possibly by validating the size string before conversion or handling exceptions within convert_size_to_bytes; one possible approach is sketched below.
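One hedged way to guard the conversion, written here as a standalone parser rather than the repository's convert_size_to_bytes (the regex, unit table, and zero-byte fallback are assumptions):

```python
import logging
import re

# Illustrative unit table; adjust if the scraper reports other units.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def safe_size_to_bytes(size_str: str) -> int:
    """Return the size in bytes, or 0 when the string is empty or malformed."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([KMGT]?B)\s*", size_str or "", re.IGNORECASE)
    if not match:
        logging.warning(f"Could not parse size string {size_str!r}; defaulting to 0")
        return 0
    value, unit = match.groups()
    try:
        return int(float(value) * _UNITS[unit.upper()])
    except ValueError:
        logging.warning(f"Invalid numeric value in size string {size_str!r}")
        return 0
```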
31-31: Verify references to the renamed method _scrape_and_parse
The method scrape_and_parse has been renamed to _scrape_and_parse, indicating it's intended for internal use. Verify that all references to this method have been updated accordingly to prevent any potential issues.
191-191: Handle potential unpacking errors when splitting stream title
If stream["title"] does not contain at least two lines, the unpacking in torrent_name, file_name = stream["title"].splitlines()[:2] could raise a ValueError. Consider adding a check to ensure that stream["title"] contains the expected number of lines before unpacking.
Apply this diff to handle cases where stream["title"] may not have the expected format:
```diff
     def parse_stream_title(self, stream: dict) -> dict:
         try:
-            torrent_name, file_name = stream["title"].splitlines()[:2]
+            lines = stream["title"].splitlines()
+            if len(lines) >= 2:
+                torrent_name, file_name = lines[:2]
+            else:
+                self.metrics.record_error("title_format_error")
+                self.logger.warning(f"Unexpected title format: {stream['title']}")
+                return None
```
Likely invalid or redundant comment.
scrapers/base_scraper.py (4)
21-233: Well-structured implementation of the ScraperMetrics class
The ScraperMetrics class is well-designed to encapsulate metrics related to the scraping process. It effectively tracks various statistics and provides methods for starting, stopping, recording metrics, and formatting summaries. The use of @dataclass simplifies the class definition and enhances readability.
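For readers without the diff at hand, a minimal sketch of a metrics dataclass with the capabilities described above (start/stop, counters, error recording, summary formatting); the field and method names are illustrative, not the PR's ScraperMetrics:

```python
import time
from dataclasses import dataclass, field


@dataclass
class ScraperMetricsSketch:
    scraper_name: str
    started_at: float = 0.0
    finished_at: float = 0.0
    items_found: int = 0
    items_processed: int = 0
    errors: dict = field(default_factory=dict)

    def start(self) -> None:
        self.started_at = time.monotonic()

    def stop(self) -> None:
        self.finished_at = time.monotonic()

    def record_error(self, reason: str) -> None:
        self.errors[reason] = self.errors.get(reason, 0) + 1

    def summary(self) -> str:
        # Assumes stop() has been called; elapsed is reported in seconds.
        elapsed = self.finished_at - self.started_at
        return (
            f"{self.scraper_name}: {self.items_processed}/{self.items_found} "
            f"items processed in {elapsed:.2f}s, errors={self.errors}"
        )
```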
252-283: Robust implementation of the scrape_and_parse method
The scrape_and_parse method has been effectively updated to initialize metrics collection, handle scraping, record errors, and log the summary upon completion. The use of try-except-finally ensures that metrics are properly recorded even if exceptions occur.
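A hedged sketch of that try/except/finally orchestration; apart from the scrape_and_parse/_scrape_and_parse names, the structure and the injected metrics object are illustrative assumptions:

```python
import logging


class ScraperSketch:
    """Illustrative base class; metrics is any object with start/stop/record_error/summary."""

    def __init__(self, metrics):
        self.metrics = metrics
        self.logger = logging.getLogger(__name__)

    async def scrape_and_parse(self, *args, **kwargs):
        self.metrics.start()
        try:
            return await self._scrape_and_parse(*args, **kwargs)
        except Exception as exc:
            self.metrics.record_error("scraping_error")
            self.logger.exception(f"Scraping failed: {exc}")
            return []
        finally:
            # Metrics are finalized and logged whether or not scraping succeeded.
            self.metrics.stop()
            self.logger.info(self.metrics.summary())

    async def _scrape_and_parse(self, *args, **kwargs):
        raise NotImplementedError  # concrete scrapers implement the actual work
```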
285-288: Correct use of @abc.abstractmethod for _scrape_and_parse
Marking _scrape_and_parse as an abstract method correctly enforces that subclasses must implement this method.
311-314: Properly skipping recent scrapes based on cache
The implementation correctly skips scraping when the item has been scraped recently, and records this action using self.metrics.skip_scrape().
db/crud.py (1)
532-583: Simplified aggregation pipeline improves efficiency and readability
The changes to the aggregation pipeline in the get_series_meta function simplify the logic for retrieving unique season and episode combinations, enhancing clarity and maintainability.
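For context, a hedged sketch of what an aggregation returning unique (season, episode) pairs can look like in MongoDB; the collection layout, field names, and the example id are assumptions for illustration, not the actual pipeline in db/crud.py:

```python
# Hypothetical id and field names used only to make the pipeline concrete.
series_id = "tt1234567"

pipeline = [
    {"$match": {"meta_id": series_id}},
    {"$unwind": "$episode_files"},
    {
        "$group": {
            "_id": {
                "season": "$episode_files.season_number",
                "episode": "$episode_files.episode_number",
            }
        }
    },
    {"$sort": {"_id.season": 1, "_id.episode": 1}},
]
```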
Summary by CodeRabbit
Release Notes
New Features
- ScraperMetrics class for detailed insights into the scraping process.
Bug Fixes
Improvements
Documentation