Implement chunked uploading with lazy ops header #36
phoenix uses chunked uploading, and so does the sync client, so I am going to add a PoC for the owncloud storage driver, based on redis. Upstream issue is cs3org/reva#266. This issue serves to keep track of the PoC using redis.
We could use https://github.com/thoas/bokchoy, a library that implements queues on top of redis. But the question now is: where do we put things in the queue? In ocdavsvc? Or does the storage provider have to deal with that? The cs3 api has no notion of chunked uploads. It leaves the upload to the http data svc using the https://cs3org.github.io/cs3apis/#cs3.storageproviderv0alpha.InitiateFileUploadResponse

We could let the ocdav svc create a temporary folder inside the target directory, but we agreed that having a quarantine area makes sense. The quarantine area could be a distinct storage that keeps data from passing storage boundaries before it has been checked, e.g. by a virus scan. So an upload would be assembled in storage a, scanned, and, if successful, moved to storage b (the final destination). urgh ... we need to keep track of these events anyway, so the user can list the storage activities ... which includes renames, creates, deletes and should aggregate events on the same file ... which brings us back to https://github.com/owncloud/nexus/issues/69

There is a difference between the filesystem events and the activities the user can see ... if a file is updated at 1000Hz it does not make sense to trigger the workflow for the file that many times, at least for high ingress storages. For more secure storages it makes sense to allow uploads at 1000Hz but to only allow downloads after the file has passed postprocessing.

If you are editing a file then the etag will change and you can overwrite the file because you know the current etag. That would allow updating the file sequentially, in effect locking the updates to a single producer. The producer will have to wait for a response though. The client can choose not to send the If-Match header and just overwrite whatever is there. It actually does not matter if there is a workflow: the file content is not readable until postprocessing unlocks the file. What if a file upload is in progress ... would clients get the old version of the content or would we fail? And this is how write locks came into being ... with all the horrible lock starvation that comes with it.
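To make the etag idea above concrete, here is a minimal sketch of such a conditional overwrite from the client side; the URL and etag are placeholders and the error handling is simplified:

```go
package upload

import (
	"bytes"
	"fmt"
	"net/http"
)

// conditionalPut overwrites a file only if the server still has the revision
// the client last saw: the known etag goes into If-Match, and the server
// answers 412 Precondition Failed if the file changed in the meantime.
func conditionalPut(url, etag string, body []byte) error {
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("If-Match", etag) // only overwrite the revision we know

	res, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer res.Body.Close()

	if res.StatusCode == http.StatusPreconditionFailed {
		return fmt.Errorf("file changed on the server, refresh etag and retry")
	}
	return nil
}
```

Leaving the If-Match header out degrades this to a plain "last writer wins" overwrite, which is the other behavior described above.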
sidenote on filesystem journals as sources for the activity:
so ... neither of those makes sense as a source for the activity log ... hm ... well ... maybe in addition to inotify? to truly scale we need to push the events into the storage layer, where we can assume enough free storage, as it can be scaled better than ram. In conclusion I would argue it makes sense to keep the events as individual files on the storage. That allows geo-distributing them along with the file data and metadata, which is one of the core design principles for ocis.
hm, is the activity log more than a recursive list of files, folders, trash items and versions ordered by mtime? If not, it is just a cache and can be reconstructed from the existing metadata... for now, cs3 has neither an activity api nor an event bus or a queue. the implementation needs to happen in the ocs api anyway. the activity service can be implemented in many ways:
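One of those ways, assuming the "it is just a cache" view holds, would be to rebuild the list on demand from the tree itself. A naive sketch; all names below are made up for illustration, and trash items and versions would need their own walks:

```go
package activity

import (
	"os"
	"path/filepath"
	"sort"
	"time"
)

// Entry is a hypothetical activity record reconstructed from metadata only.
type Entry struct {
	Path  string
	MTime time.Time
}

// Reconstruct walks a storage root recursively and returns all entries
// ordered by mtime, newest first.
func Reconstruct(root string) ([]Entry, error) {
	var entries []Entry
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		entries = append(entries, Entry{Path: path, MTime: info.ModTime()})
		return nil
	})
	if err != nil {
		return nil, err
	}
	sort.Slice(entries, func(i, j int) bool { return entries[i].MTime.After(entries[j].MTime) })
	return entries, nil
}
```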
how do we translate chunked uploads from the ocs api into cs3?
given that the lazy ops header implementation already introduces a job queue, I will implement a queue outside of the cs3 api. we can add one to cs3 if it becomes necessary. however, I assume that becomes an issue when trying to implement chunked parallel uploads using the cs3 api on its own.
on the one hand, we want to avoid having to copy the file to the target destination when both reside on the same storage. A move would be faster in this case. On the other hand we want to prevent access while the file has not gone through postprocessing. The idea of the quarantine area was born out of the pain of having to copy data again after it has been uploaded. For small files this may not be an issue, but it becomes painful for large files. Thinking of the quarantine area as a high ingress storage with restricted access might make more sense.

First, data transfer between client and server should be fast ... preferably parallel, and it should allow other clients to download the chunks while they are being uploaded. But other clients should only be able to access the file when it has been processed. How can they download chunks if the file has not yet been processed? Clients could generate a random symmetric encryption key and send it to the server. All chunks are encrypted symmetrically. The clients can start downloading chunks, even if the file has not been processed. If processing finishes, the server releases the key to the clients and they can decrypt the chunks instead of waiting for the complete file. This would decrease latency ... but it might cause a redownload if postprocessing changes the file ... In any case ... I need to think about the upload and quarantine area as a high ingress storage ... the question is whether cs3 can detect that the underlying storage is the same and issue a move instead of a copy ...
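A rough sketch of the per-chunk encryption idea, using AES-GCM from the Go standard library; how the key is transported to the server and released to other clients is deliberately left out, and all names are made up:

```go
package chunks

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"io"
)

// EncryptChunk seals a single chunk with AES-GCM using the random per-upload
// key the client generated. Other clients can already fetch the sealed chunks;
// they only become readable once the server hands out the key after
// postprocessing. The nonce is prepended to the ciphertext.
func EncryptChunk(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key) // key must be 16, 24 or 32 bytes long
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}
```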
the best I can come up with for now is to put chunked uploads into the user's home storage, under a hidden
hm, the cs3 api does not deal with file up- or download. maybe this is more a question of the datasvc ...
aaand @evert had his say on tus.io as well: tus/tus-resumable-upload-protocol#14 (comment) Looked into it as a replacement for the datasvc ... it still does not handle multiple small files well ... any pointer in that regard is welcome.
Obviously, we can teach datasvc the http://sabre.io/dav/http-patch/ tricks ... but it would require keeping track of upload progress ... the HEAD request of tus.io is nice to resume uploads.
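For reference, resuming with tus.io boils down to a HEAD request that reports the current offset in the Upload-Offset header; a minimal client-side sketch (the upload URL is a placeholder):

```go
package resume

import (
	"net/http"
	"strconv"
)

// CurrentOffset asks a tus server how many bytes of an upload it already has.
// The client then continues with a PATCH request starting at that offset.
func CurrentOffset(uploadURL string) (int64, error) {
	req, err := http.NewRequest(http.MethodHead, uploadURL, nil)
	if err != nil {
		return 0, err
	}
	req.Header.Set("Tus-Resumable", "1.0.0")

	res, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer res.Body.Close()

	// the tus core protocol reports the progress in the Upload-Offset header
	return strconv.ParseInt(res.Header.Get("Upload-Offset"), 10, 64)
}
```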
link: https://blog.daftcode.pl/how-to-make-uploading-10x-faster-f5b3f9cfcd52 uses compression in the browser to reduce upload size. I wonder if normal PUT requests natively support compression. They should, shouldn't they?
maybe we can use https://github.com/nodeca/pako or https://github.com/photopea/UZIP.js to compress multiple small files into a single bytestream and upload that. all of this in an incremental way ... oh well ... maybe as an extension to tus.io?
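On the receiving side, such a bundle could be consumed incrementally; a sketch assuming a gzip-compressed tar stream as the container (pako/UZIP on the browser side would have to produce something equivalent, and the store callback is hypothetical):

```go
package bundle

import (
	"archive/tar"
	"compress/gzip"
	"io"
)

// Unbundle reads a gzip-compressed tar stream (e.g. the request body of a
// single upload that bundles many small files) and hands every contained file
// to the store callback as soon as its bytes arrive, without buffering the
// whole bundle first.
func Unbundle(body io.Reader, store func(name string, content io.Reader) error) error {
	gz, err := gzip.NewReader(body)
	if err != nil {
		return err
	}
	defer gz.Close()

	tr := tar.NewReader(gz)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return nil // end of bundle
		}
		if err != nil {
			return err
		}
		if hdr.Typeflag != tar.TypeReg {
			continue // skip directories etc.
		}
		if err := store(hdr.Name, tr); err != nil {
			return err
		}
	}
}
```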
HTTP requests should just be able to use
@evert I only know
Yes that's exactly what I meant =)
regarding capabilities we currently have:
we could indicate the new capabilities with new properties or increase the version number:
In theory it would be possible to model the upload/quarantine area as a dedicated storage whose last workflow step is the MOVE (or, if necessary, COPY) operation to the target storage. Unfortunately, this would need special handling of the fileid, because a cross storage move changes the file id. So when a client initiates an upload it will not get the correct file id until the file has reached the target storage ... what would this look like?
Intermezzo ... The storage drivers only have an Upload function ... it currently does not allow append or range requests. This is a problem if we want to implement resumable uploads for CS3 and use HTTP/2, because without ranges we cannot resume an interrupted upload. Furthermore, the API does not allow implementing an assembly approach: as it stands, we need to have the file complete before we can call Upload.

If we assume the datasvc sits next to the storageprovidersvc, they should have access to the storage. In the case of the owncloud and eos storage drivers we could in theory bypass the Upload and write directly to the storage, including seeking to offsets etc. But for s3 this is not possible. S3 however does have multipart uploads. Does that map to our current chunked uploading? Well, again, if we get the target in the initial MKCOL we can start the S3 multipart upload. So, this is more a question of how to implement a datasvc that is storage driver specific. ... end intermezzo
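To make the gap concrete: a resumable upload would need something like the hypothetical driver interface below, which a POSIX-like driver could back with seeks on a temp file and an S3 driver could map onto its multipart upload calls. All names are made up; this is not part of the current CS3 or storage driver API:

```go
package storage

import (
	"context"
	"io"
)

// Uploader is roughly what the storage drivers offer today:
// the whole file has to be available before the call.
type Uploader interface {
	Upload(ctx context.Context, path string, content io.Reader) error
}

// ResumableUploader is a hypothetical extension that would let the datasvc
// write a file in several calls. A POSIX-like driver could implement WriteAt
// with a seek on a temp file; an S3 driver could map InitiateUpload to
// CreateMultipartUpload, WriteAt to UploadPart and Finish to
// CompleteMultipartUpload.
type ResumableUploader interface {
	InitiateUpload(ctx context.Context, path string, length int64) (uploadID string, err error)
	WriteAt(ctx context.Context, uploadID string, offset int64, chunk io.Reader) error
	Finish(ctx context.Context, uploadID string) error
	Abort(ctx context.Context, uploadID string) error
}
```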
I recommend we replace the datasvc with a tusd based implementation. But since the current chunking does not send the target in advance, and the CS3 api as well as the storage driver API only support uploading as a single operation, we need to defer this and describe a proper solution before going further in that direction. The current eos driver uses a temp file as well, so the owncloud driver, or any other for that matter, has to suffer the penalty of doing a full copy of every file upload .... urgh, this is 🐮 💩
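For the record, a tusd based datasvc could start out roughly like this sketch; it uses the tusd Go packages with a plain file store, and the package paths and config options should be double checked against the tusd version we would actually pull in:

```go
package main

import (
	"log"
	"net/http"

	"github.com/tus/tusd/pkg/filestore"
	"github.com/tus/tusd/pkg/handler"
)

func main() {
	// keep partial and finished uploads on local disk for now; a storage
	// driver specific store would replace this later.
	store := filestore.New("./uploads")
	composer := handler.NewStoreComposer()
	store.UseIn(composer)

	h, err := handler.NewHandler(handler.Config{
		BasePath:      "/data/",
		StoreComposer: composer,
	})
	if err != nil {
		log.Fatalf("unable to create tusd handler: %v", err)
	}

	http.Handle("/data/", http.StripPrefix("/data/", h))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```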
Ok, so on the other half of the implementation we have an asynchronous chunk assembly. Normally I would use a proper task queue like https://github.com/RichardKnop/machinery/ to make the queue persistent and prevent multiple processes from starting the assembly (locking, or rather task deduplication). However, let us remember the idea to push metadata as close to the storage as possible? S3, e.g., has its own multipart upload. I don't know if it allows uploads to the same upload id from different regions ... I don't see a use case for it, because a client is likely not going to switch the region between chunks. So, do we need to transport chunks over geo boundaries? If we want to allow asymmetric download of chunks while the file is not fully uploaded, that might be a good idea. But for this the chunks would need to be stored on the storage, not in the ocdav svc that currently implements chunking. So this needs to be postponed as well, until we discuss chunking v3?

Which leaves the question of how to prevent multiple ocdavsvc processes from initiating an assembly at the same time. In a HA scenario we have at least two instances. Do they share a queue or do we lock the upload to prevent an additional assembly? Actually, what if a client (or a proxy in between) repeats the MOVE? We need to lock the chunked upload dir, not only to the process but to the go routine, and deal with timeouts and errors ... tomorrow ... tusd has an in memory or file based locking provider. redis is mentioned, but the only other implementation I found uses etcd. Another pointer towards tusd ... they thought about this ... Anyway, we can just create a lockfile. whoever gets the lock creates a uuid for the
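The lockfile idea as a minimal sketch: an atomic create-if-not-exists is enough to let exactly one process (and one goroutine) win the assembly; stale-lock and timeout handling is still missing, and path and names are made up:

```go
package assembly

import (
	"fmt"
	"os"
	"path/filepath"
)

// Lock tries to take the assembly lock for a chunked upload directory.
// O_CREATE|O_EXCL is atomic on a POSIX filesystem, so only one caller wins;
// everyone else gets an error and must either wait or treat the MOVE as
// already being handled.
func Lock(uploadDir string) (release func(), err error) {
	lock := filepath.Join(uploadDir, ".assembly.lock")
	f, err := os.OpenFile(lock, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
	if err != nil {
		if os.IsExist(err) {
			return nil, fmt.Errorf("assembly already in progress for %s", uploadDir)
		}
		return nil, err
	}
	f.Close()
	return func() { os.Remove(lock) }, nil
}
```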
blocked. see #36 (comment)
lazy ops got removed from the clients: owncloud/client#8398
We need to bring this to reva, as clients want to drop support for the old chunking algorithm.