Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove I/O in event loop for backup create and restore operations #5634

Merged
merged 9 commits into from
Feb 18, 2025

Conversation

agners
Copy link
Member

@agners agners commented Feb 14, 2025

Proposed change

This PR refactors the Supervisor backup create and restore operations to be asyncio event loop friendly. Specifically, all I/O is moved into executor threads, even "small" I/O like file existence checks. This is especially important since we support network storage, which may have significant delay for operations (e.g. when a NAS spins down HDDs).

There is still some I/O in the event loop, e.g. when loading backups. These will be addressed in follow-up PRs.

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New feature (which adds functionality to the supervisor)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • The code has been formatted using Ruff (ruff format supervisor tests)
  • Tests have been added to verify that the new code works.

If API endpoints or add-on configuration are added/changed:

Summary by CodeRabbit

  • Refactor

    • Improved backup creation and restoration workflows with enhanced error handling and streamlined folder management.
    • Simplified validation of backup locations to ensure operations run only on available mounts.
    • Redesigned folder removal to support exclusions and improve clarity.
    • Restructured restoration process for better readability and maintainability.
  • Bug Fixes

    • Addressed issues with missing backup files and inaccurate mount validations.
  • Tests

    • Updated test cases to align with the revised synchronous folder removal process.

@agners agners added bugfix A bug fix refactor A code change that neither fixes a bug nor adds a feature labels Feb 14, 2025
@agners agners changed the title Remove I/O in event loop for backup create and restore opreations Remove I/O in event loop for backup create and restore operations Feb 14, 2025
@agners agners force-pushed the remove-io-from-backup-create-and-restore branch from 2f462ad to 0be8ff4 Compare February 14, 2025 16:35
Make remove_folder and remove_folder_with_excludes synchronous
functions which need to be run in an executor thread to be safely used
in asyncio. This makes them better composable with other I/O operations
like checking for file existence etc.
@agners agners force-pushed the remove-io-from-backup-create-and-restore branch from 0be8ff4 to f75f9d6 Compare February 14, 2025 17:23
@agners agners marked this pull request as ready for review February 14, 2025 17:23
Copy link
Contributor

coderabbitai bot commented Feb 14, 2025

📝 Walkthrough

Walkthrough

This pull request refactors and enhances backup, restore, and folder removal routines across several modules. It updates type annotations and method signatures, encapsulates file handling logic into helper functions, and tightens error handling—particularly for mount validations and file existence checks. Additionally, synchronous execution is adopted for folder removal utilities with corresponding test updates. These changes affect the Backup, BackupManager, HomeAssistant restore process, GitRepo removal, and related utility functions.

Changes

File(s) Change Summary
supervisor/backups/backup.py
supervisor/backups/manager.py
Updated type annotations and method signatures; refactored backup creation, open, and folder save/restore methods; enhanced error handling and streamlined mount checks in backup operations.
supervisor/homeassistant/module.py Refactored the HomeAssistant restore process by replacing a nested extraction function with a dedicated _restore_home_assistant function, integrating tarfile extraction, metadata processing, and improved error handling.
supervisor/resolution/fixups/store_execute_reset.py
supervisor/store/git.py
Modified folder removal logic: moved to asynchronous execution via sys_run_in_executor with an inner function (_remove_git_dir) encapsulating directory checks in GitRepo, and updated fixup process accordingly.
supervisor/utils/__init__.py
tests/utils/test_remove_folder.py
Converted remove_folder from asynchronous to synchronous; introduced remove_folder_with_excludes for handling exclusions during deletion; updated tests to remove async constructs.

Sequence Diagram(s)

sequenceDiagram
    participant Caller as User/System
    participant HA as HomeAssistant Module
    participant RHA as _restore_home_assistant
    participant Tar as Tarfile Extraction Process

    Caller->>HA: invoke restore(tar_file, exclude_database)
    HA->>RHA: call _restore_home_assistant()
    RHA->>Tar: Extract tar file and process metadata
    Tar-->>RHA: Return extraction results
    RHA-->>HA: Return Home Assistant metadata
    HA->>Caller: Complete restoration with metadata validation
Loading
sequenceDiagram
    participant Caller as User/System
    participant GR as GitRepo Instance
    participant Inner as _remove_git_dir
    participant Exec as sys_run_in_executor
    participant RF as remove_folder

    Caller->>GR: Call _remove()
    GR->>Inner: Execute removal logic (_remove_git_dir)
    Inner->>Exec: Schedule removal in executor context
    Exec->>RF: Invoke remove_folder for directory removal
    RF-->>Exec: Return after folder removal
    Exec-->>Inner: Complete asynchronous execution
    Inner-->>GR: Removal process finished
Loading

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4253e27 and 1ab07ec.

📒 Files selected for processing (1)
  • supervisor/backups/backup.py (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • supervisor/backups/backup.py
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Build armv7 supervisor
  • GitHub Check: Build armhf supervisor
  • GitHub Check: Build aarch64 supervisor
  • GitHub Check: Run tests Python 3.13.2

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
supervisor/homeassistant/module.py (1)

12-12: Consider more precise typing for _restore_home_assistant
It looks like _restore_home_assistant returns either a dict or None. Using Any is functional but reduces clarity.

Here's a sample diff for the return type:

-from typing import Any
+from typing import Any, Optional, Dict
-def _restore_home_assistant() -> Any:
+def _restore_home_assistant() -> Optional[Dict[str, Any]]:
supervisor/backups/backup.py (1)

97-97: Enhance type annotation for clarity.

The type annotation for _tmp should be TemporaryDirectory | None since it can be None.

-    self._tmp: TemporaryDirectory = None
+    self._tmp: TemporaryDirectory | None = None
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4c108ee and f75f9d6.

📒 Files selected for processing (7)
  • supervisor/backups/backup.py (7 hunks)
  • supervisor/backups/manager.py (6 hunks)
  • supervisor/homeassistant/module.py (3 hunks)
  • supervisor/resolution/fixups/store_execute_reset.py (1 hunks)
  • supervisor/store/git.py (1 hunks)
  • supervisor/utils/__init__.py (2 hunks)
  • tests/utils/test_remove_folder.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Build armv7 supervisor
  • GitHub Check: Build armhf supervisor
  • GitHub Check: Build aarch64 supervisor
  • GitHub Check: Run tests Python 3.13.2
🔇 Additional comments (15)
supervisor/homeassistant/module.py (1)

50-50: Good import adjustment
Importing remove_folder_with_excludes aligns nicely with the new approach to handle excluded paths.

tests/utils/test_remove_folder.py (4)

9-9: Non-async test function looks good
Switching from async def to a regular function is appropriate now that remove_folder is synchronous.


17-17: Synchronous call is consistent with new remove_folder behavior
Removing await usage aligns with the updated non-async function signature.


21-21: Regular test function updated
Converting the second test function to a synchronous style is correct given the helper function changes.


37-37: Remove await for synchronous folder removal
Again, consistent change following the refactor of remove_folder.

supervisor/resolution/fixups/store_execute_reset.py (1)

43-43: Offload folder removal to an executor
This change properly shifts blocking I/O out of the event loop, aligning well with the PR objective.

supervisor/utils/__init__.py (2)

84-110: LGTM! The synchronous implementation aligns with the PR objectives.

The refactoring of remove_folder from async to sync is a good change as it:

  • Moves I/O operations out of the event loop.
  • Simplifies error handling by using subprocess.run.
  • Maintains the same functionality with cleaner code.

112-127: LGTM! Well-structured implementation of file exclusion handling.

The new remove_folder_with_excludes function:

  • Safely handles file exclusions using a temporary directory.
  • Properly restores excluded files after removal.
  • Uses context manager for automatic cleanup.
supervisor/store/git.py (1)

192-198: LGTM! Good encapsulation of removal logic.

The inner function _remove_git_dir effectively:

  • Encapsulates directory check and removal logic.
  • Maintains clean error handling.
  • Aligns with the PR's objective of moving I/O out of the event loop.
supervisor/backups/backup.py (4)

467-481: LGTM! Well-structured extraction of tarfile creation logic.

The new _open_outer_tarfile function:

  • Improves code organization.
  • Enhances error handling with clear messages.
  • Aligns with the PR's objective of moving I/O out of the event loop.

508-513: LGTM! Enhanced error handling for backup file access.

The improved error handling:

  • Checks file existence before proceeding.
  • Provides clear error messages.
  • Triggers backup reload when file is not found.

686-724: LGTM! Improved folder save operation with return value.

The changes:

  • Add return value to indicate success/failure.
  • Only append folder to list if save was successful.
  • Better error handling.

783-799: LGTM! Enhanced mount handling during folder restore.

The changes:

  • Properly handle bind mounts within the folder.
  • Ensure mounts are unmounted before restore.
  • Properly remount after restore.
supervisor/backups/manager.py (2)

473-484: LGTM! Enhanced mount state verification before backup.

The changes:

  • Verify mount state before starting backup.
  • Clear error messages for mount issues.
  • Aligns with PR's objective of better I/O handling.

352-360: LGTM! Improved error handling for mount operations.

The changes:

  • Check mount state before copy operations.
  • Clear error messages for mount issues.
  • Proper type handling with Mount objects.

supervisor/homeassistant/module.py Outdated Show resolved Hide resolved
Copy link
Contributor

@mdegat01 mdegat01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Have to fix the small impact on import and I'd really prefer we avoid setting data on the Backup instances in the executor since they are globals through BackupManager. Just want to be extra cautious to prevent some of the race condition issues we've had before refactoring stuff to the executor.

supervisor/backups/backup.py Outdated Show resolved Hide resolved
supervisor/backups/manager.py Show resolved Hide resolved
Comment on lines 473 to 484
# Before starting, check if the location is mounted, if a backup mount
if location == DEFAULT and self.sys_mounts.default_backup_mount:
location = self.sys_mounts.default_backup_mount

if location not in (DEFAULT, LOCATION_CLOUD_BACKUP, None):
location_mount: Mount = location
if not await self.sys_run_in_executor(location_mount.local_where.is_mount):
raise BackupMountDownError(
f"{location_mount.name} is down, cannot back-up to it",
_LOGGER.error,
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe instead of moving this logic into _do_backup we just make some kind of _check_mount method? Then we can call it from do_backup_partial, do_backup_full, import_backup and copy_to_additional_locations

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sounds like a good idea. There is already a is_mounted() function in the Mount class (which is overwritten by the NetworkMount class. So we can reuse code here and make it applicable for all types of mounts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, since I am using the is_mounted() async function from the Mount class, I can't use it in copy_to_additional_locations. I could by simply looping through locations first, but this would make executors than necessary. It is a tradeoff between code duplication and slightly more optimal code. I opted for the latter.

supervisor/backups/backup.py Outdated Show resolved Hide resolved
@home-assistant
Copy link

Please take a look at the requested changes, and use the Ready for review button when you are done, thanks 👍

Learn more about our pull request process.

@home-assistant home-assistant bot marked this pull request as draft February 14, 2025 21:19
@agners
Copy link
Member Author

agners commented Feb 18, 2025

Looks good. Have to fix the small impact on import and I'd really prefer we avoid setting data on the Backup instances in the executor since they are globals through BackupManager. Just want to be extra cautious to prevent some of the race condition issues we've had before refactoring stuff to the executor.

Completely agree, we should avoid touching state in the class from the executor as much as possible, this is good practice!

That said, I think in practice it is not a problem for two reasons: Firstly, we do allow only a single backup to run (via job group limits/Supervisor state freeze etc.). Secondly, in today's Python, the global interpreter lock only allows a single Thread to execute Python code. So as long as the state updated in the executor represents a coherent state, it should not be a problem.

But still, this is just good practice, and maybe there is even a case where this could be problematic. I'll change the code accordingly.

@agners agners marked this pull request as ready for review February 18, 2025 17:13
@home-assistant home-assistant bot requested a review from mdegat01 February 18, 2025 17:13
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
supervisor/homeassistant/module.py (1)

462-466: Improve type annotation for _restore_home_assistant.

The return type annotation Any is too generic. Consider using a more specific type to better document the expected metadata structure.

-def _restore_home_assistant() -> Any:
+def _restore_home_assistant() -> dict[str, Any] | None:
supervisor/backups/manager.py (1)

365-371: Consider using async mount check in _copy_to_additional_locations.

The function uses is_mount() which is a synchronous check. Consider using the async is_mounted() method for consistency with _check_location.

-                        if not location_mount.local_where.is_mount():
+                        if not await location_mount.is_mounted():
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f75f9d6 and 4253e27.

📒 Files selected for processing (3)
  • supervisor/backups/backup.py (7 hunks)
  • supervisor/backups/manager.py (5 hunks)
  • supervisor/homeassistant/module.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Run tests Python 3.13.2
🔇 Additional comments (9)
supervisor/homeassistant/module.py (3)

480-484: LGTM! Improved error handling for tar file extraction.

The error handling for tar file extraction is well-structured, providing clear error messages and proper logging.


490-510: LGTM! Robust cleanup and error handling.

The code properly handles cleanup of temporary files and provides clear error messages for file operations.


525-536: LGTM! Proper validation of metadata.

The code correctly validates the metadata against the schema and provides clear error messages.

supervisor/backups/backup.py (4)

467-482: LGTM! Well-structured tarfile handling.

The _open_outer_tarfile function properly handles file existence checks and provides clear error messages.


690-729: LGTM! Improved folder backup logic.

The _folder_save function now returns a boolean to indicate success/failure and properly handles file existence checks.


757-768: LGTM! Enhanced folder restore validation.

The _folder_restore function now includes proper validation of the backup file and cleanup of existing content.


787-796: LGTM! Proper handling of bind mounts.

The code correctly handles unmounting and remounting of bind mounts during folder restoration.

supervisor/backups/manager.py (2)

126-138: LGTM! Well-structured location validation.

The _check_location method properly validates backup locations and provides clear error messages.


563-564: LGTM! Consistent location validation.

The code properly validates backup locations before performing backup operations.

Also applies to: 613-614

Copy link
Contributor

@mdegat01 mdegat01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@agners agners merged commit 606db35 into main Feb 18, 2025
20 checks passed
@agners agners deleted the remove-io-from-backup-create-and-restore branch February 18, 2025 19:59
@hoppel118
Copy link

WOW! A lot of work have been done. Any idea, when we will see the fix in production?

Thanks for the time you investigated into fixing this. 👍

@agners
Copy link
Member Author

agners commented Feb 19, 2025

WOW! A lot of work have been done. Any idea, when we will see the fix in production?

Thanks for the time you investigated into fixing this. 👍

Unfortunately, there is more cases where I/O is done in the event loop in the backup create path. I am not sure if it is relevant for your case, but I'd like to remove all of it ideally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix A bug fix cla-signed refactor A code change that neither fixes a bug nor adds a feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

automatic backup doesn't work sometimes
3 participants