Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ContentLength and Body length mismatch #8606

Closed
butonic opened this issue Mar 8, 2024 · 3 comments
Closed

ContentLength and Body length mismatch #8606

butonic opened this issue Mar 8, 2024 · 3 comments
Assignees
Labels

Comments

@butonic
Copy link
Member

butonic commented Mar 8, 2024

Describe the bug

TL;dr

Running ocis v4.05 I encountered this log line:

{
    "level": "error",
    "error": "failed to upload file to blostore:
              could not store object 'bf39a18f-43b5-4b85-a499-5eeee611c6c7/52/5c/a8/d7/-779c-4616-9dcb-ce0113415a4b' into bucket 'ocis-blobs':
              Put \"https://minio/ocis-blobs/bf39a18f-43b5-4b85-a499-5eeee611c6c7/52/5c/a8/d7/-779c-4616-9dcb-ce0113415a4b\":
              net/http: HTTP/1.x transport connection broken:
              http: ContentLength=5075072 with Body length 4548689",
    "uploadid": "b6a1b52a-ad4b-44d7-a19d-be9d92e2ee5c",
    "time": "2024-03-01t18:02:34.411367426Z",
    "caller": "github.com/cs3org/reva/[email protected]/pkg/storage/utils/decomposedfs/decomposedfs.go:272",
    "message": "could not finalize upload"
}

The upload session will fail with this error every time I restart the postprocessing with ocis postprocessing restart --upload-id b6a1b52a-ad4b-44d7-a19d-be9d92e2ee5c. The upload metadata state seems to corrupted. The blob size on disk (body length) does not match the ContentLength (The node size). I think reading the node size for the upload from the current node size is painfully wrong. 😞

We need to double check this can no longer happen with the upload session refactoring in v5.

Long version:

The could not store object error comes from the s3 blobstore:

// Upload stores some data in the blobstore under the given key
func (bs *Blobstore) Upload(node *node.Node, source string) error {
	reader, err := os.Open(source)
	if err != nil {
		return errors.Wrap(err, "can not open source file to upload")
	}
	defer reader.Close()

	_, err = bs.client.PutObject(context.Background(), bs.bucket, bs.path(node), reader, node.Blobsize, minio.PutObjectOptions{ContentType: "application/octet-stream", SendContentMd5: true})

	if err != nil {
		return errors.Wrapf(err, "could not store object '%s' into bucket '%s'", bs.path(node), bs.bucket)
	}
	return nil
}

Which is called by upload.Finalize() which logs the failed to upload file to blostore message

// Finalize finalizes the upload (eg moves the file to the internal destination)
func (upload *Upload) Finalize() (err error) {
	ctx, span := tracer.Start(upload.Ctx, "Finalize")
	defer span.End()
	n := upload.Node
	if n == nil {
		var err error
		n, err = node.ReadNode(ctx, upload.lu, upload.Info.Storage["SpaceRoot"], upload.Info.Storage["NodeId"], false, nil, false)
		if err != nil {
			return err
		}
		upload.Node = n
	}

	// upload the data to the blobstore
	_, subspan := tracer.Start(ctx, "WriteBlob")
	err = upload.tp.WriteBlob(n, upload.binPath)
	subspan.End()
	if err != nil {
		return errors.Wrap(err, "failed to upload file to blostore")
	}

	return nil
}

Called by the postprocessing of decomposedfs

			case events.PPOutcomeContinue:
				if err := up.Finalize(); err != nil {
					log.Error().Err(err).Str("uploadID", ev.UploadID).Msg("could not finalize upload")
					keepUpload = true // should we keep the upload when assembling failed?
					failed = true
				}

All the error messages from the log line are in this trace. The root cause is a mismatch between the size we think we want to store and the actual bytes in the local file:

net/http: HTTP/1.x transport connection broken: http: ContentLength=5075072 with Body length 4548689

🤔

The session is finished: size and offset in the senssion data are the same:

{
    "offset": 4548689,
    "size": 4548689,
}

So 4548689 is the correct length and we pass it as the node.Blobsize to the PutObject call. I think 👀

Since we have a "sizediff": "4548689" the file was empty or did not exist before.

I am thinking that maybe a subsequent upload with 5075072 bytes now interferes with this upload? Following the code in the minio s3 client we can see that the node.Blobsize is passed as size and becomes the ContentLength header ... but why would the node.Blobsize be different than what the upload session metadata says?!? 👀

ok the upload we see has

        "nodeexists": "false",
        "nodeid": "2d1366f9-09ed-4341-8f69-4fdccf451e68",

it tinks it is a new file... but there is a bug where we read the node.Blobsize from the existing node again and then use it as a parameter for the Upload. In decomposedfs.go, before making the Finalize() call

			case events.PPOutcomeContinue:
				if err := up.Finalize(); err != nil {
					log.Error().Err(err).Str("uploadID", ev.UploadID).Msg("could not finalize upload")
					keepUpload = true // should we keep the upload when assembling failed?
					failed = true
				}

We actually read the node from disk:

			n, err := node.ReadNode(ctx, fs.lu, up.Info.Storage["SpaceRoot"], up.Info.Storage["NodeId"], false, nil, true)
			if err != nil {
				log.Error().Err(err).Str("uploadID", ev.UploadID).Msg("could not read node")
				continue
			}
			up.Node = n

but this node might already have a different blobsize set when a second upload finished before the first.

I assume the file exists with a size of 5075072 bytes ... but it is stuck in postprocessing by the upload b6a1b52a-ad4b-44d7-a19d-be9d92e2ee5c. Now I need to understand how we can unlock that mess ...

🤦

Steps to reproduce

I'm not sure. The log line is a followup error of other things going wrong. I assumo two uploads were in process, but the second larger one (5075072 ) finished before the first (4548689).

Expected behavior

Uploads should become versions of a node regardless of the order in which the postprocessing finishes.

Actual behavior

The data used to finish the upload is broken and we pass the wrong node size into the putobject function.

Setup

ocis v4.0.5 with s3ng and postprocessing enabled

@butonic
Copy link
Member Author

butonic commented Mar 8, 2024

In my case I did not need the old upload. The more recent upload finished fine, so I removed the old upload.

Take this with a grain of salt as it only works when the filesize in the metadata is correct. There might be other problems, so if you encounter this bug be sure to post your error message and the upload .info content and give us a chance to look at it.

Rough steps to fix this manually:

  • We need to check if the current blob is the 5075072 byte version and if the metadata actually matches the size.
  • We can then delete the old upload session from the disk and unset the postprocessing flag manually.

Now for some CLI magic:

First dump the current metadata of the node 2d1366f9-09ed-4341-8f69-4fdccf451e68 in space bf39a18f-43b5-4b85-a499-5eeee611c6c7:

$ ocis decomposedfs metadata --root /var/lib/ocis/storage/users/ --node 'bf39a18f-43b5-4b85-a499-5eeee611c6c7!2d1366f9-09ed-4341-8f69-4fdccf451e68' dump

Should produce sth like this:

user.ocis.blobid="dc26be2e-03f4-4e6e-b57a-9dfb1f9b18fa"
user.ocis.blobsize="5075072 "
user.ocis.cs.adler32=0soDboGQ==
user.ocis.cs.md5=0sqG5MegnENibH7gzrfVKYKA==
user.ocis.cs.sha1=0svKACX7JN7/HoE7S8rAOBZ+fuNvI=
user.ocis.mtime="2024-03-01T22:32:32.000000006Z"
user.ocis.name="redacted.jpg"
user.ocis.parentid="b92fe6e2-dc45-4ca0-9f05-33f0f8006b02"
user.ocis.type="1"

Note: Make sure no user.ocis.nodestatus is present. If it is, you can restart that upload session. It will always be the youngest one. When you see user.ocis.nodestatus="processing:f0b6135f-18f8-49dc-81c8-0e75082200b4" you can restart ocis postporcessing restart --upload-id f0b6135f-18f8-49dc-81c8-0e75082200b4. When it finishes the user.ocis.nodestatusshould no longer be present.

If the user.ocis.blobsize matches the 5075072 bytes we can delete the blob and info file for the stuck upload (notice the * at the end):

rm /var/lib/ocis/storage/users/uploads/b6a1b52a-ad4b-44d7-a19d-be9d92e2ee5c*

and then unset the postprocessing flag on disk:

ocis decomposedfs metadata --root /home/jfd/.ocis/storage/users/ --node 'bf39a18f-43b5-4b85-a499-5eeee611c6c7!2d1366f9-09ed-4341-8f69-4fdccf451e68' set -a user.ocis.nodestatus -v ''

dumping the node metadata again we should see a user.ocis.nodestatus="" which is interpreted as not processing.

This leaves the postprocessing service. It keeps track of the status of uploads. If the postprocessing is restarted it would have no effect, because we deleted the upload session. To properly clean up the postprocessing we would have to remove the upload id from the nats store that holds all postprocessing events. Or we just leave it in there ...

@butonic
Copy link
Member Author

butonic commented Apr 5, 2024

an attempt to fix it is in cs3org/reva#4615

needs a proper review

@butonic
Copy link
Member Author

butonic commented Apr 12, 2024

cs3org/reva#4615 was merged, new test cases were added in cs3org/reva#4625 They are tracked in #8848

@butonic butonic closed this as completed Apr 12, 2024
@github-project-automation github-project-automation bot moved this from Prio 2 to Done in Infinite Scale Team Board Apr 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Archived in project
Development

No branches or pull requests

1 participant