Enhanced scraping capabilities with improved indexer search, health checks, and metrics tracking #367
Conversation
Renamed scrape_and_parse to _scrape_and_parse and added comprehensive error handling and logging metrics for scenarios such as timeouts, HTTP errors, and validation issues. Introduced a metrics class in base_scraper.py to summarize scraping performance statistics, which is now used across the prowlarr, torrentio, and zilean scrapers.
…thcheck management & comprehensive metrics summary
Added detailed health checks for indexers and included their statuses in logging. Enhanced background search to use circuit breakers and handle indexers in chunks, improving reliability and fault tolerance.
…for moving to imdb title
Reorganized Circuit Breaker class for better clarity and maintainability. Added methods to manage states (`is_closed`, `reset`, `record_failure`, `record_success`) and refactored the `call` method to utilize these state checks. Enhanced logging and status reporting with the `get_status` method. (A rough sketch of this interface follows the commit list below.)
Streamlined the `get_series_meta` function by removing complex filtering conditions and reducing the aggregation pipeline.
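For orientation, here is a hedged sketch of a circuit breaker exposing the interface named in the commit above (`is_closed`, `reset`, `record_failure`, `record_success`, `call`, `get_status`). It is not the repository's utils/network.py implementation; the failure-threshold and recovery-timeout defaults are assumptions taken from the review discussion further down.

```python
import time
from typing import Any, Awaitable, Callable


class CircuitBreakerSketch:
    """Illustrative circuit breaker; thresholds and state names are assumptions."""

    def __init__(self, failure_threshold: int = 2, recovery_timeout: float = 10.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def is_closed(self) -> bool:
        # HALF-OPEN still lets a trial call through.
        if self.state in ("CLOSED", "HALF-OPEN"):
            return True
        if time.monotonic() - self.opened_at >= self.recovery_timeout:
            self.state = "HALF-OPEN"
            return True
        return False

    def record_success(self) -> None:
        self.reset()

    def record_failure(self) -> None:
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic()

    def reset(self) -> None:
        self.failure_count = 0
        self.state = "CLOSED"

    async def call(self, func: Callable[..., Awaitable[Any]], *args, **kwargs) -> Any:
        # Refuse the call while the breaker is open, otherwise record the outcome.
        if not self.is_closed():
            raise RuntimeError("Circuit breaker is open")
        try:
            result = await func(*args, **kwargs)
        except Exception:
            self.record_failure()
            raise
        self.record_success()
        return result

    def get_status(self) -> dict:
        return {"state": self.state, "failures": self.failure_count}
```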
Caution: Review failed. The pull request is closed.
Walkthrough
This pull request introduces several modifications across multiple files, primarily focusing on enhancing error handling, simplifying logic for data retrieval, and improving metrics tracking in scraping processes. Key changes include a reduction in the default timeout for search queries, updates to the handling of duplicate keys in movie and series data, and the introduction of a new ScraperMetrics class.
Changes
Actionable comments posted: 7
🧹 Outside diff range and nitpick comments (18)
scrapers/helpers.py (1)
103-111: Consider enhancing error handling and configurability.
While the current implementation is solid, consider these improvements:
- Use specific exception types (e.g., TimeoutError, ConnectionError) for better error handling
- Make the timeout configurable via settings
- Add a retry mechanism for transient failures
- Consider versioning the Chrome impersonation string
Here's a suggested implementation:
```diff
+from typing import Optional
+from tenacity import retry, stop_after_attempt, wait_exponential
+
+DEFAULT_TIMEOUT = 30
+
-async def get_page_bs4(url: str):
+async def get_page_bs4(url: str, timeout: int = DEFAULT_TIMEOUT) -> Optional[BeautifulSoup]:
+    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
+    async def _fetch_with_retry(session: AsyncSession, url: str) -> BeautifulSoup:
+        response = await session.get(
+            url,
+            impersonate="chrome110",  # Specify version for better maintainability
+            timeout=timeout
+        )
+        if response.status_code != 200:
+            raise ValueError(f"HTTP {response.status_code}")
+        return BeautifulSoup(response.text, "html.parser")
+
     try:
         async with AsyncSession(proxies=PROXIES) as session:
-            response = await session.get(url, impersonate="chrome", timeout=30)
-            if response.status_code != 200:
-                return None
-            return BeautifulSoup(response.text, "html.parser")
-    except Exception as e:
+            return await _fetch_with_retry(session, url)
+    except ValueError as e:
+        logging.error(f"Invalid response for {url}: {e}")
+        return None
+    except TimeoutError as e:
+        logging.error(f"Timeout fetching {url}: {e}")
+        return None
+    except ConnectionError as e:
+        logging.error(f"Connection error for {url}: {e}")
+        return None
+    except Exception as e:  # Catch unexpected errors
         logging.error(f"Error fetching page: {url}, error: {e}")
         return None
```
db/config.py (1)
39-39: Add a comment explaining the timeout rationale
Consider adding a comment explaining why 15 seconds was chosen as the optimal timeout value. This helps future maintainers understand the reasoning behind this specific duration.
```diff
-    prowlarr_search_query_timeout: int = 15
+    # 15s timeout balances responsiveness with slower indexers' needs
+    prowlarr_search_query_timeout: int = 15
```
scrapers/zilean.py (6)
22-22: Consider maintaining consistency in logger naming
Changing the logger_name to __name__ will set the logger to use the module name rather than the class name. If the intention is to have logs categorized by class name for clarity, consider using self.__class__.__name__ instead.
Apply this diff to revert to the class name:
```diff
-        super().__init__(cache_key_prefix=self.cache_key_prefix, logger_name=__name__)
+        super().__init__(cache_key_prefix=self.cache_key_prefix, logger_name=self.__class__.__name__)
```
72-74: Correct grammatical error in log message
The log message should use "searching" instead of "search" for proper grammar.
Apply this diff to fix the log message:
self.logger.error( - f"Error occurred while search {metadata.title}: {search_response}" + f"Error occurred while searching {metadata.title}: {search_response}" )
130-130: Clarify function name for grammatical correctness
The function name is_contain_18_plus_keywords is grammatically incorrect. Consider renaming it to contains_18_plus_keywords for better readability.
Apply this diff if the function is defined in this module:
```diff
-        if is_contain_18_plus_keywords(stream["raw_title"]):
+        if contains_18_plus_keywords(stream["raw_title"]):
```
Ensure that the function definition and all its references are updated accordingly.
133-135: Avoid logging sensitive content in warnings
Logging the full title of adult content may expose sensitive information. Consider masking the title or reducing the log level to prevent potential privacy issues.
Apply this diff to modify the log message:
self.logger.warning( - f"Stream contains 18+ keywords: {stream['raw_title']}" + "Stream contains 18+ keywords in the title." )
166-168: Handle multiple seasons more gracefully
Currently, torrents with multiple seasons are skipped without processing. Depending on the use case, you might want to handle multi-season torrents instead of skipping them.
Consider implementing logic to process torrents that cover multiple seasons, or provide a more descriptive skip reason if this is intentional.
197-200: Handle specific exceptions where possible
Catching the broad Exception class can mask other issues. Consider catching more specific exceptions to improve error handling and debugging.
Apply this diff to catch specific exceptions:
```diff
             return torrent_stream
-        except Exception as e:
+        except KeyError as e:
             self.metrics.record_error("stream_processing_error")
             self.logger.exception(f"Error processing stream: {e}")
             return None
```
Adjust the exception type based on the specific errors you expect.
scrapers/prowlarr_feed.py (2)
52-52: Consider making chunk size configurable
To enhance flexibility and maintainability, consider making the chunk size parameter in split_indexers_into_chunks configurable instead of hardcoding it to 3, for example as sketched below.
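A minimal sketch of how the chunk size could be made configurable, assuming split_indexers_into_chunks simply partitions a list of indexer IDs (the parameter name and default value here are illustrative, not the function in scrapers/prowlarr_feed.py):

```python
from typing import Iterable, List


def split_indexers_into_chunks(
    indexer_ids: Iterable[int], chunk_size: int = 3
) -> List[List[int]]:
    """Partition indexer IDs into fixed-size chunks for background searches."""
    ids = list(indexer_ids)
    return [ids[i : i + chunk_size] for i in range(0, len(ids), chunk_size)]
```

Callers could then pass a value sourced from settings (e.g. a hypothetical settings.prowlarr_feed_chunk_size) instead of relying on the default.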
62-62: Use predefined constants for category IDs
To improve readability and maintainability, replace hardcoded category IDs with predefined constants like MOVIE_CATEGORY_IDS, SERIES_CATEGORY_IDS, and OTHER_CATEGORY_IDS.
Apply this diff to make the change:
"categories": [ - 2000, 5000, 8000 # Movies, TV, and Other categories + *MOVIE_CATEGORY_IDS, *SERIES_CATEGORY_IDS, *OTHER_CATEGORY_IDS # Movies, TV, and Other categories ],scrapers/torrentio.py (2)
25-25: Consider consistency in logger naming
The logger is now initialized with __name__ instead of self.__class__.__name__. Ensure this change aligns with the logging conventions used across other scrapers for consistency.
210-211: Consider handling parsing exceptions without re-raising
Currently, when an exception occurs in parse_stream_title, the exception is recorded and then re-raised. This could cause the entire scraping process to fail due to a single stream's parsing error. Consider returning None instead of raising the exception to allow the scraper to continue processing other streams.
Apply this diff to prevent the exception from halting the process:
```diff
         except Exception as e:
             self.metrics.record_error("title_parsing_error")
-            raise e
+            self.logger.warning(f"Failed to parse stream title: {e}")
+            return None
```
utils/network.py (1)
42-45: Simplify conditional statements in the is_closed method
The if and elif blocks at lines 42-45 can be combined using a logical or operator or the in keyword for improved readability.
Apply this diff to simplify the code:
- if self.state == "CLOSED": - return True - elif self.state == "HALF-OPEN": - return True + if self.state == "CLOSED" or self.state == "HALF-OPEN": + return TrueAlternatively, using the
in
keyword:- if self.state == "CLOSED": - return True - elif self.state == "HALF-OPEN": - return True + if self.state in ("CLOSED", "HALF-OPEN"): + return True🧰 Tools
🪛 Ruff
42-45: Combine if branches using logical or operator
Combine if branches
(SIM114)
scrapers/base_scraper.py (3)
244-244: Consider using a more descriptive scraper name
In the initialization of ScraperMetrics, the scraper_name is set to cache_key_prefix. Ensure that cache_key_prefix accurately represents the scraper's name for clarity in metrics reporting. If cache_key_prefix is not descriptive enough, consider providing a more explicit scraper name.
Line range hint 368-374: Ensure consistent typing for season and episode parameters
In the get_cache_key method, the parameters season and episode are typed as str = None, whereas in other methods like scrape_and_parse and parse_response they are typed as int = None. This inconsistency might lead to unexpected behavior or type errors when generating cache keys.
Apply this diff to correct the parameter types:
```diff
     def get_cache_key(
         self,
         metadata: MediaFusionMetaData,
         catalog_type: str,
-        season: str = None,
-        episode: str = None,
+        season: int = None,
+        episode: int = None,
         *_args,
         **_kwargs,
     ) -> str:
         """
         Generate a cache key for the given arguments.
         :return: Cache key string
         """
```
Line range hint 376-396: Mark parse_response and validate_response as abstract methods
The methods parse_response and validate_response are intended to be implemented by subclasses but are not marked with the @abc.abstractmethod decorator. This could lead to subclasses not implementing these methods, potentially causing unexpected behavior at runtime.
Apply this diff to mark these methods as abstract methods:
```diff
+    @abc.abstractmethod
     async def parse_response(
         self,
         response: Dict[str, Any],
         metadata: MediaFusionMetaData,
         catalog_type: str,
         season: int = None,
         episode: int = None,
     ) -> List[TorrentStreams]:
         """
         Parse the response into TorrentStreams objects.
         """
         pass

+    @abc.abstractmethod
     def validate_response(self, response: Dict[str, Any]) -> bool:
         """
         Validate the response from the scraper.
         :param response: Response dictionary
         :return: True if valid, False otherwise
         """
         pass
```
scrapers/prowlarr.py (2)
448-499: Consider fetching search results concurrently to improve performance
In the fetch_search_results method, search queries are sent to each indexer sequentially within a for-loop. This can lead to increased total processing time, especially when dealing with a large number of indexers.
Consider using asynchronous tasks to fetch search results from multiple indexers concurrently, utilizing asyncio.gather. This can significantly improve the performance of the scraping process.
Here's a suggested refactor:
```diff
-        for indexer_id in indexer_ids:
-            indexer_status = self.indexer_status.get(indexer_id, {})
-            if not indexer_status.get("is_healthy", False):
-                continue
-            circuit_breaker = self.get_circuit_breaker(indexer_id)
-            indexer_name = indexer_status.get("name", f"ID:{indexer_id}")
-            if circuit_breaker.is_closed():
-                try:
-                    search_params = {**params, "indexerIds": [indexer_id]}
-                    async with httpx.AsyncClient(headers=self.headers) as client:
-                        response = await client.get(
-                            self.search_url,
-                            params=search_params,
-                            timeout=timeout,
-                        )
-                        response.raise_for_status()
-                        indexer_results = response.json()
-                    # Record success
-                    circuit_breaker.record_success()
-                    self.metrics.record_indexer_success(
-                        indexer_name, len(indexer_results)
-                    )
-                    results.extend(indexer_results)
-                except Exception as e:
-                    # Exception handling as before
-                    pass
-            else:
-                # Handle circuit breaker open
-                pass
+        async def fetch_for_indexer(indexer_id):
+            indexer_status = self.indexer_status.get(indexer_id, {})
+            if not indexer_status.get("is_healthy", False):
+                return []
+            circuit_breaker = self.get_circuit_breaker(indexer_id)
+            indexer_name = indexer_status.get("name", f"ID:{indexer_id}")
+            if circuit_breaker.is_closed():
+                try:
+                    search_params = {**params, "indexerIds": [indexer_id]}
+                    async with httpx.AsyncClient(headers=self.headers) as client:
+                        response = await client.get(
+                            self.search_url,
+                            params=search_params,
+                            timeout=timeout,
+                        )
+                        response.raise_for_status()
+                        indexer_results = response.json()
+                    # Record success
+                    circuit_breaker.record_success()
+                    self.metrics.record_indexer_success(
+                        indexer_name, len(indexer_results)
+                    )
+                    return indexer_results
+                except Exception as e:
+                    # Exception handling as before
+                    return []
+            else:
+                # Handle circuit breaker open
+                return []
+
+        tasks = [fetch_for_indexer(indexer_id) for indexer_id in indexer_ids]
+        indexer_results_list = await asyncio.gather(*tasks)
+        # Flatten the list of results
+        results = [item for sublist in indexer_results_list for item in sublist]
```
986-1020: Reduce code duplication in background search tasks
The background_movie_title_search and background_series_title_search methods contain duplicated code for fetching healthy indexers, splitting them into chunks, initializing metrics, and creating search generators.
Consider refactoring the common logic into a shared helper function or method to promote code reuse and improve maintainability; a rough sketch follows this comment.
Also applies to: 1075-1108
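As a hedged sketch of the suggested refactor (the helper name, signature, and injected callable are assumptions, not the PR's code), the shared setup could be pulled into one coroutine that both background tasks await before driving their own search generators:

```python
from typing import Awaitable, Callable, List


async def prepare_indexer_chunks(
    fetch_healthy_indexers: Callable[[], Awaitable[List[int]]],
    chunk_size: int = 3,
) -> List[List[int]]:
    """Fetch healthy indexer IDs and split them into fixed-size chunks."""
    indexer_ids = await fetch_healthy_indexers()
    return [
        indexer_ids[i : i + chunk_size]
        for i in range(0, len(indexer_ids), chunk_size)
    ]
```

The movie and series tasks would then each call this helper, start their metrics, and only differ in how they build their title-search generators.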
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (9)
- db/config.py (1 hunks)
- db/crud.py (3 hunks)
- scrapers/base_scraper.py (8 hunks)
- scrapers/helpers.py (1 hunks)
- scrapers/prowlarr.py (23 hunks)
- scrapers/prowlarr_feed.py (3 hunks)
- scrapers/torrentio.py (5 hunks)
- scrapers/zilean.py (5 hunks)
- utils/network.py (3 hunks)
🧰 Additional context used
🪛 Ruff
scrapers/base_scraper.py
379-396: BaseScraper.parse_response is an empty method in an abstract base class, but has no abstract decorator (B027)
scrapers/prowlarr.py
419-419: Undefined name ExceptionGroup. Consider specifying requires-python = ">= 3.11" or tool.ruff.target-version = "py311" in your pyproject.toml file. (F821)
utils/network.py
42-45: Combine if branches using logical or operator. Combine if branches. (SIM114)
🔇 Additional comments (18)
scrapers/helpers.py (1)
103-111: LGTM! Error handling has been implemented correctly.
The implementation properly handles HTTP request failures and invalid responses, preventing crashes and providing good error visibility through logging.
db/config.py (1)
39-39: Verify the impact of reducing the timeout from 120s to 15s
While reducing the timeout can improve system responsiveness and prevent resource exhaustion, 15 seconds might be too aggressive for some slower indexers. Let's verify the impact:
✅ Verification successful
Based on the search results, I can now generate a final response:
The 15-second timeout is appropriate and well-aligned with the codebase
The reduced timeout value is consistent with existing timeout patterns in the codebase:
- Other Prowlarr-related operations use similar timeouts (e.g., 10s for indexer status checks)
- The system has robust error handling for timeouts in prowlarr.py with:
  - Circuit breaker pattern (2 failures threshold, 10s recovery)
  - Proper error logging and metrics collection
  - Retry mechanisms through background processing
- Similar services in the codebase use comparable timeouts:
  - Debrid client: 15s
  - Base scraper client: 30s
  - Most API calls: 10-15s
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for timeout-related error logs and any indexer performance metrics
# Search for timeout-related errors or warnings in the logs
rg -i "timeout|timed?.?out" --type py
# Look for any performance metrics or timing configurations
rg -i "prowlarr.*timeout|search.*timeout" --type py
# Check for any existing error handling related to timeouts
ast-grep --pattern 'try {
$$$
timeout
$$$
} catch {
$$$
}'
Length of output: 8649
scrapers/zilean.py (2)
87-90: Ensure the validate_response method handles all edge cases
After recording the fetched items, the validate_response method is called. Ensure that this method thoroughly checks for all possible invalid or unexpected responses to prevent errors downstream.
Run the following script to review the validate_response method implementation:
✅ Verification successful
The validate_response method is appropriately implemented for its use case
The validate_response method in zilean.py performs the necessary validation for its specific requirements:
- It checks if the response is a list type
- It verifies the list is not empty
- This validation is sufficient as the subsequent parse_response and process_stream methods handle the detailed validation of individual stream items
The error handling is comprehensive across the workflow:
- Failed requests are logged and tracked via metrics
- Invalid responses trigger "no_valid_streams" error
- Stream processing errors are caught and tracked
- Adult content is filtered
- Invalid title/year matches are filtered
- Missing season/episode information for series is handled
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Retrieve and display the `validate_response` method to ensure it handles all edge cases.
# Search for the `validate_response` method definition in the codebase.
ast-grep --pattern $'def validate_response($_):\n $$$' --lang python
Length of output: 71
Script:
#!/bin/bash
# Let's try a broader search to find the validate_response method
rg -A 10 "validate_response" --type python
Length of output: 73
Script:
#!/bin/bash
# Let's try a simpler search without file type specification
rg -A 10 "validate_response"
# Also let's check the content of the zilean.py file to understand the context better
fd "zilean.py" --exec cat {}
Length of output: 10185
174-183: Review episode handling logic for potential issues
The logic for handling episodes might not cover all scenarios, such as when episodes are missing but the season matches. Ensure that all edge cases are addressed to prevent skipping valid streams.
Run the following script to check for streams where episodes are missing but seasons match:
✅ Verification successful
Let me gather more context about the episode handling logic and related code.
Let me check one more thing to understand how seasons are handled in this context.
The episode handling logic is correct and handles edge cases appropriately
The code follows a clear and logical flow for handling episodes:
- First checks if explicit episode numbers are provided in torrent_data["episodes"]
- If no episodes but the season matches, it defaults to episode 1 (common for single-episode seasons)
- Otherwise, correctly skips the stream when episode info is missing
The season in seasons check is used after validating that seasons contains exactly one season number, making it a safe fallback for single-episode content; a sketch of this flow follows.
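A hypothetical helper mirroring that decision flow (the name, signature, and return convention are illustrative, not the code in scrapers/zilean.py):

```python
from typing import List, Optional


def resolve_episodes(
    torrent_data: dict, wanted_season: int, seasons: List[int]
) -> Optional[List[int]]:
    """Return the episode numbers to use, or None if the stream should be skipped."""
    if episodes := torrent_data.get("episodes"):
        return list(episodes)  # explicit episode numbers take precedence
    if len(seasons) == 1 and wanted_season in seasons:
        return [1]  # single matching season with no episode list: assume episode 1
    return None  # insufficient information: caller skips this stream
```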
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Find streams where episodes are missing but seasons match the metadata.
# Test: Search for such streams in the codebase.
# Note: Replace the placeholders with actual values if necessary.
rg -A 5 "if episodes := torrent_data.get\(\"episodes\"\)" --glob "**/*.py"
Length of output: 490
Script:
#!/bin/bash
# Get more context around the episode handling logic
rg -B 10 -A 10 "if episodes := torrent_data.get\(\"episodes\"\)" --glob "**/*.py"
# Find other episode-related logic in the codebase
rg "Episode\(" --glob "**/*.py"
# Look for season-related logic
rg "season in seasons" --glob "**/*.py"
Length of output: 3125
Script:
#!/bin/bash
# Get more context about how seasons are handled
rg -B 15 -A 5 "season in seasons" --glob "**/*.py"
# Look for the Season model/type definition
ast-grep --pattern 'class Season'
Length of output: 1479
scrapers/torrentio.py (9)
50-54: Approve addition of metrics for response validation
The added metrics for recording invalid responses and found items enhance monitoring and error tracking. This improvement is aligned with better error handling practices.
60-63: Approve enhanced error handling with specific metrics
Recording specific error metrics for request_failed and unexpected_error improves the ability to diagnose and address issues more effectively.
96-96: Approve skip metric for adult content streams
Recording skips for streams containing adult content enhances the granularity of metrics and assists in content filtering analysis.
145-145: Approve skip metric for multiple seasons torrents
Adding a skip metric for torrents with multiple seasons improves tracking of unprocessed items due to content structure, aiding in scraper performance analysis.
176-180: Approve recording of successful processing metrics
Recording metrics for processed items, quality, and source provides valuable insights into scraper performance and stream quality distribution.
184-184: Approve error handling in stream processing
Recording stream_processing_error enhances error tracking for individual streams and aids in debugging specific issues during processing.
196-199: Ensure convert_size_to_bytes handles empty or invalid size strings
In the code:
"size": convert_size_to_bytes(
self.extract_size_string(stream["title"])
),
If extract_size_string returns an empty string or an invalid size format, convert_size_to_bytes may raise an exception. Ensure that this scenario is handled gracefully, possibly by validating the size string before conversion or handling exceptions within convert_size_to_bytes; one possible approach is sketched below.
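One hedged way to guard the conversion, written here as a standalone parser rather than the repository's convert_size_to_bytes (the regex, unit table, and zero-byte fallback are assumptions):

```python
import logging
import re

# Illustrative unit table; adjust if the scraper reports other units.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def safe_size_to_bytes(size_str: str) -> int:
    """Return the size in bytes, or 0 when the string is empty or malformed."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([KMGT]?B)\s*", size_str or "", re.IGNORECASE)
    if not match:
        logging.warning(f"Could not parse size string {size_str!r}; defaulting to 0")
        return 0
    value, unit = match.groups()
    try:
        return int(float(value) * _UNITS[unit.upper()])
    except ValueError:
        logging.warning(f"Invalid numeric value in size string {size_str!r}")
        return 0
```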
31-31: Verify references to the renamed method _scrape_and_parse
The method scrape_and_parse has been renamed to _scrape_and_parse, indicating it's intended for internal use. Verify that all references to this method have been updated accordingly to prevent any potential issues.
191-191: Handle potential unpacking errors when splitting stream title
If stream["title"] does not contain at least two lines, the unpacking in torrent_name, file_name = stream["title"].splitlines()[:2] could raise a ValueError. Consider adding a check to ensure that stream["title"] contains the expected number of lines before unpacking.
Apply this diff to handle cases where stream["title"] may not have the expected format:
```diff
     def parse_stream_title(self, stream: dict) -> dict:
         try:
-            torrent_name, file_name = stream["title"].splitlines()[:2]
+            lines = stream["title"].splitlines()
+            if len(lines) >= 2:
+                torrent_name, file_name = lines[:2]
+            else:
+                self.metrics.record_error("title_format_error")
+                self.logger.warning(f"Unexpected title format: {stream['title']}")
+                return None
```
Likely invalid or redundant comment.
scrapers/base_scraper.py (4)
21-233: Well-structured implementation of the ScraperMetrics class
The ScraperMetrics class is well-designed to encapsulate metrics related to the scraping process. It effectively tracks various statistics and provides methods for starting, stopping, recording metrics, and formatting summaries. The use of @dataclass simplifies the class definition and enhances readability.
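For readers without the diff at hand, a minimal sketch of a metrics dataclass with the capabilities described above (start/stop, counters, error recording, summary formatting); the field and method names are illustrative, not the PR's ScraperMetrics:

```python
import time
from dataclasses import dataclass, field


@dataclass
class ScraperMetricsSketch:
    scraper_name: str
    started_at: float = 0.0
    finished_at: float = 0.0
    items_found: int = 0
    items_processed: int = 0
    errors: dict = field(default_factory=dict)

    def start(self) -> None:
        self.started_at = time.monotonic()

    def stop(self) -> None:
        self.finished_at = time.monotonic()

    def record_error(self, reason: str) -> None:
        self.errors[reason] = self.errors.get(reason, 0) + 1

    def summary(self) -> str:
        # Assumes stop() has been called; elapsed is reported in seconds.
        elapsed = self.finished_at - self.started_at
        return (
            f"{self.scraper_name}: {self.items_processed}/{self.items_found} "
            f"items processed in {elapsed:.2f}s, errors={self.errors}"
        )
```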
252-283: Robust implementation of the scrape_and_parse method
The scrape_and_parse method has been effectively updated to initialize metrics collection, handle scraping, record errors, and log the summary upon completion. The use of try-except-finally ensures that metrics are properly recorded even if exceptions occur.
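A hedged sketch of that try/except/finally orchestration; apart from the scrape_and_parse/_scrape_and_parse names, the structure and the injected metrics object are illustrative assumptions:

```python
import logging


class ScraperSketch:
    """Illustrative base class; metrics is any object with start/stop/record_error/summary."""

    def __init__(self, metrics):
        self.metrics = metrics
        self.logger = logging.getLogger(__name__)

    async def scrape_and_parse(self, *args, **kwargs):
        self.metrics.start()
        try:
            return await self._scrape_and_parse(*args, **kwargs)
        except Exception as exc:
            self.metrics.record_error("scraping_error")
            self.logger.exception(f"Scraping failed: {exc}")
            return []
        finally:
            # Metrics are finalized and logged whether or not scraping succeeded.
            self.metrics.stop()
            self.logger.info(self.metrics.summary())

    async def _scrape_and_parse(self, *args, **kwargs):
        raise NotImplementedError  # concrete scrapers implement the actual work
```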
285-288: Correct use of @abc.abstractmethod for _scrape_and_parse
Marking _scrape_and_parse as an abstract method correctly enforces that subclasses must implement this method.
311-314: Properly skipping recent scrapes based on cache
The implementation correctly skips scraping when the item has been scraped recently, and records this action using self.metrics.skip_scrape().
db/crud.py (1)
532-583: Simplified aggregation pipeline improves efficiency and readability
The changes to the aggregation pipeline in the get_series_meta function simplify the logic for retrieving unique season and episode combinations, enhancing clarity and maintainability.
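For context, a hedged sketch of what an aggregation returning unique (season, episode) pairs can look like in MongoDB; the collection layout, field names, and the example id are assumptions for illustration, not the actual pipeline in db/crud.py:

```python
# Hypothetical id and field names used only to make the pipeline concrete.
series_id = "tt1234567"

pipeline = [
    {"$match": {"meta_id": series_id}},
    {"$unwind": "$episode_files"},
    {
        "$group": {
            "_id": {
                "season": "$episode_files.season_number",
                "episode": "$episode_files.episode_number",
            }
        }
    },
    {"$sort": {"_id.season": 1, "_id.episode": 1}},
]
```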
Summary by CodeRabbit
Release Notes
New Features
- ScraperMetrics class for detailed insights into the scraping process.
Bug Fixes
Improvements
Documentation