Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add LazyBlob helper class to lazy load file from filesystem #104

Merged
merged 16 commits into from
Mar 20, 2023
Merged

Add LazyBlob helper class to lazy load file from filesystem #104

merged 16 commits into from
Mar 20, 2023

Conversation

Aschen
Copy link
Contributor

@Aschen Aschen commented Mar 17, 2023

Description

This PR add a new helper class LazyBlob which aim to prevent loading huge file into RAM.

This new class can be used instead of a Blob with the commit function.

Fix #68

const lazyBlob = await LazyBlob.create("./package.json");

await commit({
  repo,
  title: "Some commit",
  credentials: {
    accessToken: "<access token>",
  },
  operations: [
    {
      operation: "addOrUpdate",
      content: lazyBlob,
      path: "package.json",
    },
  ],
});

Copy link
Member

@coyotte508 coyotte508 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be really cool when ready!

My main request is that from the outside, it should behave exactly as a Blob. No need for .init() or dispose(), and instanceof Blob should work.

packages/hub/src/lib/LazyBlob.ts Outdated Show resolved Hide resolved
packages/hub/src/lib/LazyBlob.ts Outdated Show resolved Hide resolved
packages/hub/src/lib/LazyBlob.ts Outdated Show resolved Hide resolved
packages/hub/src/lib/LazyBlob.ts Outdated Show resolved Hide resolved
packages/hub/src/lib/LazyBlob.ts Outdated Show resolved Hide resolved
packages/hub/src/lib/LazyBlob.ts Outdated Show resolved Hide resolved
@Aschen
Copy link
Contributor Author

Aschen commented Mar 17, 2023

@coyotte508 I can get ride of init with the private constructor and static create method as you suggest but for the dispose the only option would be to open the file when reading and slice and then close it.

Here it's DX vs performances so I will let you choose 👍

@coyotte508
Copy link
Member

coyotte508 commented Mar 17, 2023

@coyotte508 I can get ride of init with the private constructor and static create method as you suggest but for the dispose the only option would be to open the file when reading and slice and then close it.

Here it's DX vs performances so I will let you choose 👍

The most calls will be three I think, on large files:

  • a .slice(0, 512) to get a sample to send to /preupload
  • a .stream() to compute the Sha
  • Then whatever fetch will call (probably .stream())

On smaller files, it will be two calls: one .slice(0, 512) and one .arrayBuffer().

Given that, I'd rather prioritize DX on this one ! The overhead should be low. Maybe an optimization would be to directly load small files in memory, if we really care about perf.

@Aschen
Copy link
Contributor Author

Aschen commented Mar 17, 2023

Given that, I'd rather prioritize DX on this one ! The overhead should be low. Maybe an optimization would be to directly load small files in memory, if we really care about perf.

I agree with that 👍 Also, better have more syscall to open/close than unused file descriptor 🙈

Copy link
Member

@coyotte508 coyotte508 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting close!

packages/hub/src/utils/sha256.ts Outdated Show resolved Hide resolved
packages/hub/src/lib/commit.ts Outdated Show resolved Hide resolved
packages/hub/src/lib/LazyBlob.ts Outdated Show resolved Hide resolved
packages/hub/src/lib/LazyBlob.ts Outdated Show resolved Hide resolved
packages/hub/src/lib/LazyBlob.ts Outdated Show resolved Hide resolved
packages/hub/test.js Outdated Show resolved Hide resolved
@coyotte508
Copy link
Member

For the TS Error: maybe as ReturnType<Blob["stream"]>

Copy link
Member

@coyotte508 coyotte508 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ready after the changes.

I just can't believe this module doesn't already exist on npm (only versions I found loaded everything in memory)

Comment on lines 72 to 76
private async init(): Promise<void> {
const { size } = await this.execute((file) => file.stat());

this.totalSize = size;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed

}

stream(): ReadableStream {
return Readable.toWeb(createReadStream(this.path));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return Readable.toWeb(createReadStream(this.path));
return Readable.toWeb(createReadStream(this.path, {start: this.start, end: this.end}));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this one on my way back from the beach today 😁

private path: string;
private start: number;
private end: number;
private totalSize: number;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed as well

Comment on lines 31 to 39
if (this.start !== null) {
if (this.end !== null) {
return Math.abs(this.end - this.start);
}

return this.totalSize - this.start;
}

return this.totalSize;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (this.start !== null) {
if (this.end !== null) {
return Math.abs(this.end - this.start);
}
return this.totalSize - this.start;
}
return this.totalSize;
return this.end - this.start;

Copy link
Member

@coyotte508 coyotte508 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alirght, thanks a lot!

@Aschen
Copy link
Contributor Author

Aschen commented Mar 20, 2023

@coyotte508 What about the documentation? If I understood well it's in the same repo, I can give it a try

@coyotte508
Copy link
Member

coyotte508 commented Mar 20, 2023

@coyotte508 What about the documentation? If I understood well it's in the same repo, I can give it a try

I intend to keep it internal.

A file with require("node:fs") is directly exposed in the exports, it will cause trouble for the use in browsers, at least for the CJS bundle. See how sha256-node is currently dynamically imported and only outside the browser.

So I think maybe a smart way to expose it is to allow passing a URL as an alternative to a Blob - see https://www.geeksforgeeks.org/node-js-url-pathtofileurl-api/ urls and files paths can be equivalent.

It also opens the door for supporting web urls later on. (using Range header, or HEAD calls appropriately to get a slice / the size)

@Aschen
Copy link
Contributor Author

Aschen commented Mar 20, 2023

Yes I saw, I'm familiar with this kind of dual behavior when it's needed to provide a SDK for both node and the browser (See Kuzzle Javascript SDK)

If I understand well, you might want to allows a string representing a path to a local file for ContentSource if it's used in Node.js ?

await commit({
  repo,
  title: "Some commit",
  credentials: {
    accessToken: TEST_ACCESS_TOKEN,
  },
  operations: [
    {
      operation: "addOrUpdate",
      // 1) Only a path to a local file, if it's used from Node.js, a LazyBlob will be used
      path: "test.txt",
    },
    {
      operation: "addOrUpdate",
      // 2) URL to a local file, if it's used from Node.js, a LazyBlob will be used
      path: "file://test.txt",
    },
  ],
});

@coyotte508
Copy link
Member

coyotte508 commented Mar 20, 2023

If I understand well, you might want to allows a string representing a path to a local file for ContentSource if it's used in Node.js ?

When I was mentioning URL, it was literally :)

Basically, change:

type ContentSource = Blob;

To

type ContentSource = Blob | URL;

Btw path is the path inside the repo on HF, it's different

@Aschen
Copy link
Contributor Author

Aschen commented Mar 20, 2023

Ok understood 👍

So if the URL is a local file (starting with file://) then the LazyBlob will be used, am I right?

@coyotte508
Copy link
Member

coyotte508 commented Mar 20, 2023

Ok understood +1

So if the URL is a local file (starting with file://) then the LazyBlob will be used, am I right?

Yea, if url.protocol === 'file:' use LazyBlob (except in browser of course), otherwise throw a TypeError. And later support http: and https:

@Aschen
Copy link
Contributor Author

Aschen commented Mar 20, 2023

Do you want me to do this feature, supporting URL as content, in a separate PR?

@coyotte508
Copy link
Member

Do you want me to do this feature, supporting URL as content, in a separate PR?

Sure! Then maybe you can remove the export in this one and keep the LazyBlob private ? (I was going to do it later today before merging 😁 )

@Aschen
Copy link
Contributor Author

Aschen commented Mar 20, 2023

Do you want me to do this feature, supporting URL as content, in a separate PR?

Sure! Then maybe you can remove the export in this one and keep the LazyBlob private ? (I was going to do it later today before merging grin )

No problem ;-)

@coyotte508 coyotte508 merged commit 7fac846 into huggingface:main Mar 20, 2023
@Aschen Aschen deleted the feat/add-lazy-blob branch March 20, 2023 21:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a helper to generate Blob (lazily read) from a file path
2 participants