Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]: error writing to storage (while fsync) #902

Closed
1 task done
ali1234 opened this issue Aug 6, 2024 · 9 comments
Closed
1 task done

[BUG]: error writing to storage (while fsync) #902

ali1234 opened this issue Aug 6, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@ali1234
Copy link
Contributor

ali1234 commented Aug 6, 2024

What happened?

When attempting to write an image to any SD card, this error happens near the end of writing. The SD card is left without a valid filesystem.

Writing the same image with dd works correctly (verified matching checksum on read-back).

Version

1.8.5 (Default)

What host operating system were you using?

Debian and derivatives (eg Ubuntu)

Host OS Version

Ubuntu 24.04

Selected OS

Raspberry Pi OS Lite 64 bit

Which Raspberry Pi Device are you using?

Raspberry Pi 5

What kind of storage device are you using?

microSD Card in a USB reader

OS Customisation

  • Yes, I was using OS Customisation when the bug occurred.

Relevant log output

Available disk space for caching: 709 GB
Unmounting: "/dev/sdf"
Telemetry done. cURL status code = 0 info sent = "url=https%3A%2F%2Fdownloads.raspberrypi.com%2Fraspios_lite_arm64%2Fimages%2Fraspios_lite_arm64-2024-07-04%2F2024-07-04-raspios-bookworm-arm64-lite.img.xz&os=Raspberry%20Pi%20OS%20%28other%29&image=Raspberry%20Pi%20OS%20Lite%20%2864-bit%29&imagerVersion=1.8.5&imagerOsType=ubuntu&imagerOsVersion=24.04&imagerOsArch=x86_64&imagerLocale=en_GB"
BLKDISCARD not supported
Zeroing out first and last MB of drive
Done zeroing out start and end of drive. Took 0 seconds
Image URL: "https://downloads.raspberrypi.com/raspios_lite_arm64/images/raspios_lite_arm64-2024-07-04/2024-07-04-raspios-bookworm-arm64-lite.img.xz"
Received header: "HTTP/2 200"
Received header: "date: Tue, 06 Aug 2024 14:34:51 GMT"
Received header: "server: Apache"
Received header: "last-modified: Thu, 04 Jul 2024 00:29:39 GMT"
Received header: "etag: \"1afc1ccc-61c610ad8788c\""
Received header: "accept-ranges: bytes"
Received header: "content-length: 452730060"
Received header: "content-type: application/x-xz"
Received header: ""
Download done in 58 seconds
Hash of uncompressed image: "2746d9ff409c34de471f615a94a3f15917dca9829ddbc7a9528f019e97dc4f03"
Done writing cache file
@ali1234 ali1234 added the bug Something isn't working label Aug 6, 2024
@ali1234
Copy link
Contributor Author

ali1234 commented Aug 6, 2024

Here are some things I already tried to resolve this:

  1. Installing the deb from raspberrypi.com
  2. Installing the version from snap, which is also 1.8.5
  3. Installing the version from the Ubuntu repositories, which is also 1.8.5
  4. Using a different USB writer.
  5. Using a different SD card.
  6. Plugging the USB writer into a USB2 port.
  7. Plugging the USB writer into a USB3 port.
  8. Running the software as root.
  9. Turning off OS customization.

None of these steps had any effect.

Upon examining the SD card after the error I see this:

00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00400000  eb 58 90 6d 6b 66 73 2e  66 61 74 00 02 04 20 00  |.X.mkfs.fat... .|

This shows that the first megabyte of the disk has been zeroed, exactly as the log said it would be. Subsequently, no data from the OS image has been written at all, otherwise the beginning of the disk would not contain zeros.

This means that the software is capable of writing to the disk. There is no permission or configuration error preventing it.

@ali1234
Copy link
Contributor Author

ali1234 commented Aug 6, 2024

Examining the code shows that this is an error inside downloadthread.cpp, and happens because fsync fails. The fsync code is disabled on Windows, because it fails there (see https://forums.raspberrypi.com/viewtopic.php?t=267364) however the problems with this code are much, much deeper.

Firstly, this is what the code is supposed to do:

  1. The code first zeros the first and last megabyte of the device, then fsyncs.
  2. Then it writes all of the image except for the first megabyte, and fsyncs again. (This is where the error happens, and is why the first megabyte of the device is empty in my previous comment.)
  3. It then verifies everything written so far.
  4. It writes the first megabyte and fsyncs again.

Now there are a few problems with this:

Firstly during step 1., this happens:

        if (!_file.seek(knownsize-emptyMB.size())
                || !_file.write(emptyMB.data(), emptyMB.size())
                || !_file.flush()
                || !::fsync(_file.handle()))
        {
            emit error(tr("Write error while trying to zero out last part of card.<br>"
                          "Card could be advertising wrong capacity (possible counterfeit)."));
            return false;
        }

This step did not fail. Why? Because it succeeds if seek, write, flush, and fsync all return non-zero. But this is not correct: seek, write, flush are all methods of QFileDevice and return zero/false on error. But fsync is posix and it returns zero on success. So in order for this whole operation to succeed, the first three operations must succeed, and then the fsync must fail. Which it does, and so I don't see that error message.

Later on during step 2. we have this:

    if (::fsync(_file.handle()) != 0) {
        DownloadThread::_onDownloadError(tr("Error writing to storage (while fsync)"));
        _closeFiles();
        return;
    }

This time the fsync return value is correctly interpreted as non-zero == error, so if fsync fails, we get an error.

All of this code is disabled on windows by #ifndef Q_OS_WIN so the result is that successfully writing an image on non-Windows requires an incredibly unlikely sequence of events where calling fsync fails the first time, but succeeds the second and third time.

@ali1234
Copy link
Contributor Author

ali1234 commented Aug 6, 2024

Digging deeper, the reason why step 1. does not fail is because of yet another bug: QFile.size can't determine the size of a block device, so knownsize is always zero and the whole block of code, including the fsync, never runs.

@ali1234
Copy link
Contributor Author

ali1234 commented Aug 6, 2024

I commented out the fsyncs and the imaging completed successfully. However, the resulting SD card was extremely corrupted. No subdirectories of the root filesystem can be read due to filesystem errors.

I tried again with a different SD card and this time the rootfs doesn't show up as a mountable filesystem at all, even though the partition table shows two partitions.

Writing the image using dd worked correctly with both SD cards.

@ali1234
Copy link
Contributor Author

ali1234 commented Aug 6, 2024

So I asked on Qt IRC channel to confirm the behavior of QFile.size() and the answer was that currently it is incapable of returning the size of a block device. But then Thiago just went and implemented it. So that broken code branch will likely start causing a problem when that filters through to a release.

I also noticed that QFile documentation has a warning that it shouldn't be used to write Unix device files...

@ali1234
Copy link
Contributor Author

ali1234 commented Aug 7, 2024

I determined that this is something to do with the USB writers. What happens is that at some point a block is dropped. This doesn't cause a write error and from this point on all writes apparently succeed instantly. It then fails during fsync, so if that is is disabled it appears to complete successfully but the data is truncated. This happens reliably on both USB2 and USB3 ports. The odd thing is they all seem to work fine with dd, even when followed up with sync.

@tdewey-rpi
Copy link
Collaborator

Thanks for the report, @ali1234

You've gone above and beyond in terms of reporting, for which I am extremely thankful. I'm going to review your findings later today, but if you can't wait for me to address it I'm happy to review a PR.

@ali1234
Copy link
Contributor Author

ali1234 commented Aug 7, 2024

I'll send a PR for the reversed fsync check when zeroing. That currently is not harmful. edit: fsync PR is at #903.

The original errors I encountered are not really actionable. I suspect it could be heat related as the failing writers are very small and get really hot. I found a writer that works and it is a much larger USB3 one (tested on a USB3 port).

@ali1234
Copy link
Contributor Author

ali1234 commented Aug 7, 2024

After some more playing around I was able to reproduce the corruption with dd on the bad writers. It is much rarer, but this shows it's a hardware or kernel problem. So I'm going to close this.

@ali1234 ali1234 closed this as completed Aug 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants