Native git support: lsRefs(), sparseCheckout(), GitPathControl #1764

Merged
merged 29 commits into from
Oct 7, 2024

@adamziel adamziel commented Sep 14, 2024

Motivation

Related to #1787

Adds a set of TypeScript functions that support the native git protocol and can power a sparse checkout feature. This is the basis for a faster, more user-friendly git integration. No more guessing repository paths. Just provide the repo URL, browse the files, and tell Playground which directories are plugins, themes, etc.

Technically, this PR performs git sparse checkout using just JavaScript and a generic CORS proxy.

This PR doesn't provide any user-facing feature yet. However, it paves the way to features like:

  • Checkout any git repo, even non-GitHub ones, without going through the OAuth flow
  • Retrieve a subset of the files directly from the repo and without going through zipballs.
  • Provide a visual git repo browser (instead of asking the user to manually type the path)
  • Introduce a new Blueprint resource type: git repo
  • Fetch the names of all the repository branches (or just the branches with the specified prefix)
  • (future) commit and push to any git repo, even non-GitHub ones

Notable points of this PR

  • Exposes the sparseCheckout(), lsRefs(), and listFiles() functions from the @wp-playground/storage package. I'm not yet sure whether we need a dedicated @wp-playground/git package or not.
  • Ships basic unit test coverage for those functions.
  • Silences a few warnings in the CORS proxy. CC @brandonpayton we may not want to do that in the production release.
  • Adds isomorphic-git as a git submodule in the /isomorphic-git path. We can't rely on the published npm package because it doesn't export the internal APIs we need to use here.
  • Adds a bunch of WIP components in @wp-playground/components. They're not used anywhere on the website yet and I'd rather keep them moving with the project than isolate them in a PR until they're perfect. We'll need some accessibility and mobile testing before using them in the webapp, though.

How does it even work?

Let me quote my own article:

Running a Git Client in the browser

The good news was isomorphic-git, wasm-git, and a few other projects were already running Git in the browser. The bad news was none of them supported fetching a subset of files via sparse checkout. You’d still have to download 20MB of data even if you only wanted 100KB.

However, everything the desktop Git client does, including sparse checkouts, can be done via HTTP by requesting URLs like https://github.com/WordPress/wordpress-playground.git.

Git documentation was… less than helpful, but eventually it worked! A few hours later I was running Git commands by sending GET and POST requests to the repository-URLs.

Fetching a hash of the branch

The first command I needed was ls-refs to get the SHA1 hash of the right git branch. Here’s how you can get it with fetch() for the HEAD of the WordPress/gutenberg repo:

const response = await fetch(
  'https://github.com/WordPress/gutenberg.git/git-upload-pack',
  {
    method: 'POST',
    headers: {
        'Accept': 'application/x-git-upload-pack-advertisement',
        'content-type': 'application/x-git-upload-pack-request',
        'Git-Protocol': 'version=2'
    },
    body: [
        `0014command=ls-refs\n`,
      // ^^^^ line length in hex
        `0015agent=git/2.37.3\n`,
        `0017object-format=sha1\n`,
        '0001',
      // ^^^^ command separator
        // Filter the results to only contain the HEAD branch,
        // otherwise it will return all the branches and
        // tags which may require downloading many 
        // megabytes of data:
        `0009peel\n`,
        `0014ref-prefix HEAD\n`,
        '0000',
      // ^^^^ end of request
    ].join(""),
  }
);

I won’t go into details of the Git protocol – the point is with a few special headers and lines you can be a Git client. If you paste that fetch() in your devtools while on GitHub.com, it would return a response similar to this:

0032950f5c8239b6e78e9051ec5e845bac5aa863c4cb HEAD
0000

Good! That’s our commit hash.
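
The four hex digits in front of every request line are the pkt-line length prefix, which counts the four prefix bytes as well as the payload. The sketch below shows one way to compute those prefixes and to read the response back. The helper names (`pktLine`, `buildLsRefsBody`, `parseLsRefs`) are hypothetical and are not the functions shipped in this PR; the parser assumes an ASCII-only response.

```typescript
// Encodes one pkt-line: a 4-digit hex length (which includes the 4 prefix
// bytes themselves) followed by the payload.
function pktLine(payload: string): string {
	const byteLength = new TextEncoder().encode(payload).length + 4;
	return byteLength.toString(16).padStart(4, '0') + payload;
}

const FLUSH = '0000'; // end of request
const DELIM = '0001'; // separates the command section from its arguments

// Hypothetical helper: builds the same ls-refs body as the fetch() above.
function buildLsRefsBody(refPrefix: string): string {
	return [
		pktLine('command=ls-refs\n'),
		pktLine('agent=git/2.37.3\n'),
		pktLine('object-format=sha1\n'),
		DELIM,
		pktLine('peel\n'),
		pktLine(`ref-prefix ${refPrefix}\n`),
		FLUSH,
	].join('');
}

// Hypothetical helper: parses a text ls-refs response into a
// { refName: hash } map. Assumes ASCII, so characters equal bytes.
function parseLsRefs(response: string): Record<string, string> {
	const refs: Record<string, string> = {};
	let offset = 0;
	while (offset + 4 <= response.length) {
		const length = parseInt(response.slice(offset, offset + 4), 16);
		if (Number.isNaN(length)) {
			throw new Error(`Invalid pkt-line length at offset ${offset}`);
		}
		if (length < 4) {
			// 0000 (flush) and 0001 (delimiter) carry no payload.
			offset += 4;
			continue;
		}
		const line = response.slice(offset + 4, offset + length).trimEnd();
		const [hash, name] = line.split(' ');
		refs[name] = hash;
		offset += length;
	}
	return refs;
}
```

Calling `parseLsRefs()` on the two-line response above yields an object with a single `HEAD` key holding the commit hash.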

Fetching a list of objects at a specific commit

With this, we can fetch the list of objects in that branch:

fetch("https://github.com/wordpress/gutenberg/git-upload-pack", {
  "headers": {
    "accept": "application/x-git-upload-pack-advertisement",
    "content-type": "application/x-git-upload-pack-request",
  },
  "referrer": "http://localhost:8000/",
  "referrerPolicy": "strict-origin-when-cross-origin",
  "body": [
      `0088want 950f5c8239b6e78e9051ec5e845bac5aa863c4cb multi_ack_detailed no-done side-band-64k thin-pack ofs-delta agent=git/2.37.3 filter \n`,
      `0015filter blob:none\n`,
      // ^ the sparse checkout secret sauce:
      // only fetches a list of objects without
      // their content
      `0035shallow 950f5c8239b6e78e9051ec5e845bac5aa863c4cb\n`,
      `000ddeepen 1\n`,
      `0000`,
      `0009done\n`,
      `0009done\n`,
  ].join(""),
  "method": "POST"
});

And here’s the response:

00000008NAK
0026�Enumerating objects: 2189, done.
0025�Counting objects:   0% (1/2189)
...
0032�Compressing objects: 100% (1568/1568), done.
2004�PACK��(binary data)
0040 Total 2189 (delta 1), reused 1550 (delta 0), pack-reused 0
0006��0000

The binary data after PACK is a compressed list of all objects the repository had at commit 950f5c8239b6e78e9051ec5e845bac5aa863c4cb. It is not a list of files that were committed in 950f5c. It’s all files.
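
The response above is itself a stream of pkt-lines. Because the request asked for side-band-64k, the first payload byte of each line selects a channel: 1 carries pack data, 2 carries progress messages, and 3 carries errors. Those channel bytes are most likely what shows up as � in the listing. Below is a minimal sketch of splitting the channels apart. `demuxSideBand` is a hypothetical helper; the PR itself relies on isomorphic-git's parser for this.

```typescript
interface SideBandResult {
	packData: Uint8Array; // channel 1: the packfile bytes
	progress: string[]; // channel 2: human-readable progress
	errors: string[]; // channel 3: fatal error messages
}

// Hypothetical helper: walks the pkt-lines of an upload-pack response and
// routes each payload by its side-band channel byte.
function demuxSideBand(body: Uint8Array): SideBandResult {
	const decoder = new TextDecoder();
	const packChunks: Uint8Array[] = [];
	const progress: string[] = [];
	const errors: string[] = [];
	let offset = 0;
	while (offset + 4 <= body.length) {
		const prefix = decoder.decode(body.subarray(offset, offset + 4));
		const length = parseInt(prefix, 16);
		if (Number.isNaN(length)) {
			throw new Error(`Invalid pkt-line length at byte ${offset}`);
		}
		if (length < 4) {
			offset += 4; // flush or delimiter packet, no payload
			continue;
		}
		const payload = body.subarray(offset + 4, offset + length);
		offset += length;
		const channel = payload[0];
		const data = payload.subarray(1);
		if (channel === 1) packChunks.push(data);
		else if (channel === 2) progress.push(decoder.decode(data));
		else if (channel === 3) errors.push(decoder.decode(data));
		// Anything else (e.g. the "NAK" line) is not side-band data; skip it.
	}
	// Concatenate all channel 1 chunks into a single packfile buffer.
	const total = packChunks.reduce((sum, chunk) => sum + chunk.length, 0);
	const packData = new Uint8Array(total);
	let written = 0;
	for (const chunk of packChunks) {
		packData.set(chunk, written);
		written += chunk.length;
	}
	return { packData, progress, errors };
}
```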

The pack format is a binary blob. It’s similar to ZIP in that it consists of a series of objects, each encoded as a binary header followed by binary data. Here’s an approximate visual to help grok the idea:

PACK format – inaccurate explanation,
Pack consists of the string "PACK" and binary data structured roughly as follows:

 ___________________________________
|                                   |
|        ASCII string "PACK"        |
|        Binary data starts         |
|           Pack Header             |
|___________________________________|
|                                   |
|        Offset 0x0010              |
|          Object 1 Header          |  (Object type, hash,
|                                   |   data length, etc.)
|        ________________           |
|       |                |          |
|       |  Object 1 Data |          |  (Gzipped data)
|       |________________|          |
|                                   |
|        Offset 0x0050              |
|          Object 2 Header          |  
|                                   | 
|        ________________           |
|       |                |          |
|       |  Object 2 Data |          |  (Gzipped data)
|       |________________|          |
|___________________________________|
|                                   |
|           Pack Footer             |
|         Binary data ends          |
|___________________________________|
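
To make the top box of the diagram concrete: a packfile starts with a fixed 12-byte header made of the ASCII signature "PACK", a 4-byte big-endian version number, and a 4-byte big-endian object count. That is consistent with the first object offset (12) in the index printed further down. Here is a minimal sketch of reading that header; `parsePackHeader` is a hypothetical helper, not an isomorphic-git API.

```typescript
interface PackHeader {
	version: number;
	objectCount: number;
}

// Hypothetical helper: reads the fixed 12-byte header of a packfile.
function parsePackHeader(pack: Uint8Array): PackHeader {
	if (pack.length < 12) {
		throw new Error('Packfile is shorter than its 12-byte header');
	}
	const signature = String.fromCharCode(pack[0], pack[1], pack[2], pack[3]);
	if (signature !== 'PACK') {
		throw new Error(`Unexpected pack signature: ${signature}`);
	}
	// DataView reads big-endian by default, which is what git uses here.
	const view = new DataView(pack.buffer, pack.byteOffset, pack.byteLength);
	return {
		version: view.getUint32(4),
		objectCount: view.getUint32(8),
	};
}
```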

The decoding is tedious, so I used the decoder provided by the isomorphic-git package:

const iterator = streamToIterator(await response.body);
const parsed = await parseUploadPackResponse(iterator);
const packfile = Buffer.from(await collect(parsed.packfile));

const index = await GitPackIndex.fromPack({
    pack: packfile
});

The parsed index object provides information about all the objects encoded in the received packfile. Let’s peek inside:

{
  // ...
  "hashes": [
    "5f4f0a5367476fdb7c98ffa5fa35300ec4c3f48b",
    "950f5c8239b6e78e9051ec5e845bac5aa863c4cb",
    // ...
  ],
  "offsets": {
    "5f4f0a5367476fdb7c98ffa5fa35300ec4c3f48b": 12,
    "950f5c8239b6e78e9051ec5e845bac5aa863c4cb": 181,
    // ...
  },
  "offsetCache": {
    "12": {
      "type": "tree",
      "object": "100644 async-http-download.php\u0000��p4��\u0014�g\u0015i��\u0004��\\���100644 async-http.php\u0000�\n�8K�RT������F\u001b8�� (more binary data)"
    },
    // ...
  },
  "readDepth": 4,
  "externalReadDepth": 0
}

Each object has a type and some data. The decoder stored some objects in the offsetCache, and kept track of the others in the form of a hash => packfile offset mapping.

Let’s read the details of the commit from our parsed index:

> const commit = await index.read({
    oid: '950f5c8239b6e78e9051ec5e845bac5aa863c4cb'
  });

{
  "type": "commit",
  "object": "tree c7b8440c83b8c987895f9a1949650eb60bccd2ec\nparent b6132f2d381865353e09edf88aa64a0dd042811a\nauthor Adam Zieliński <[email protected]> 1717689108 +0200\ncommitter Adam Zieliński <[email protected]> 1717689108 +0200\n\nUpdate rebuild workflow\n"
}

It’s the object type, the hash, and the uncompressed object bytes which, in this case, give us the commit details in a specific microformat. From here, we can get the tree hash and look up its details in the same index we’ve already downloaded:

> const tree = await index.read({ oid: "c7b8440c83b8c987895f9a1949650eb60bccd2ec" })

{
  "type": "tree",
  "object": "40000 .github\u0000_O\nSgGo�|����50\u000e���40000 (... binary data ...)"
}
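
Getting from the commit to its tree relies on the commit text format: header lines (`tree`, `parent`, `author`, `committer`), then a blank line, then the message. Below is a minimal sketch of extracting the tree hash from the `object` string of a commit. `parseCommit` is a hypothetical helper written for illustration; isomorphic-git has its own commit decoder.

```typescript
interface ParsedCommit {
	tree: string;
	parents: string[];
	message: string;
}

// Hypothetical helper: parses the text of a git commit object.
function parseCommit(text: string): ParsedCommit {
	// Headers and message are separated by the first blank line.
	const separator = text.indexOf('\n\n');
	const headerText = separator === -1 ? text : text.slice(0, separator);
	const message = separator === -1 ? '' : text.slice(separator + 2);

	let tree = '';
	const parents: string[] = [];
	for (const line of headerText.split('\n')) {
		const space = line.indexOf(' ');
		const key = line.slice(0, space);
		const value = line.slice(space + 1);
		if (key === 'tree') tree = value;
		else if (key === 'parent') parents.push(value);
		// author, committer, and continuation lines are ignored here.
	}
	if (!tree) {
		throw new Error('Commit object has no tree header');
	}
	return { tree, parents, message };
}
```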

The content of the tree object is a list of files in the repository. Just like with the commit, tree details are encoded in their own microformat. Luckily, isomorphic-git ships relevant decoders:

> GitTree.from(tree.object).entries()
[
  {
    "mode": "040000",
    "path": ".github",
    "oid": "ece277ec006eb517d5c5399d7a5c00b7e61018f1",
    "type": "tree"
  },
  {
    "mode": "100644",
    "path": "readme.txt",
    "oid": "3fe6e3aaf1dc4df204be575041383fc8e2e1e070",
    "type": "blob"
  },
  {
    "mode": "040000",
    "path": "src",
    "oid": "dbc84f20ee64fbd924617b41ee0e66128c9a8d97",
    "type": "tree"
  },
  // ...
]

Yay! That’s the list of files and directories in the repository root with their hashes! From here we can recursively retrieve the ones relevant for our sparse checkout.
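
The recursive lookup can be sketched as a walk over path segments, reading one tree per directory level. The code below is an illustration under assumptions: `TreeEntry` mirrors the entries printed above, and `ReadTree` is a hypothetical lookup that would, in practice, wrap the asynchronous `index.read()` plus `GitTree.from(...)` calls. It is kept synchronous here for brevity.

```typescript
interface TreeEntry {
	mode: string;
	path: string;
	oid: string;
	type: 'blob' | 'tree' | 'commit';
}

// Hypothetical lookup: returns the decoded entries of the tree with this oid.
type ReadTree = (oid: string) => TreeEntry[];

// Resolves a slash-separated path to its tree entry, or null when any
// segment is missing or a non-final segment is not a directory.
function resolvePath(
	readTree: ReadTree,
	rootTreeOid: string,
	path: string
): TreeEntry | null {
	const segments = path.split('/').filter(Boolean);
	let currentOid = rootTreeOid;
	let entry: TreeEntry | null = null;
	for (let i = 0; i < segments.length; i++) {
		const entries = readTree(currentOid);
		entry = entries.find((e) => e.path === segments[i]) ?? null;
		if (!entry) return null;
		const isLast = i === segments.length - 1;
		if (!isLast && entry.type !== 'tree') return null;
		currentOid = entry.oid;
	}
	return entry;
}
```

The oids returned for the requested paths are the ones to send as `want` lines in the next request.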

Fetching full files from specific paths

We’re finally ready to check out a few particular paths. Let’s ask for a blob at readme.txt and a tree at docs/tool:

const response = await fetch("https://github.com/wordpress/gutenberg/git-upload-pack", {
  "headers": {
    "accept": "application/x-git-upload-pack-advertisement",
    "content-type": "application/x-git-upload-pack-request",
  },
  "body": [
      `0081want 28facb763312f40c9ab3251fb91edb87c8476cf9 multi_ack_detailed no-done side-band-64k thin-pack ofs-delta agent=git/2.37.3\n`,
      `0081want 3fe6e3aaf1dc4df204be575041383fc8e2e1e070 multi_ack_detailed no-done side-band-64k thin-pack ofs-delta agent=git/2.37.3\n`,
      `00000009done`
  ].join(""),
  "method": "POST"
});

The response is another index, but this time each blob comes with binary contents. Some decoding and recursive processing later, we finally get this:

{
    "readme.txt": "=== Gutenberg ===\nContri (...)",
    "docs/tool": {
        "index.js": "/**\n * External depe (...)",
        "manifest.js": "/* eslint no-console (...)"
    }
}

Yay! It took some effort, but it was worth it!

CORS proxy and other notes

You’ll still need to run a CORS proxy. The fetch() examples above will work if you try them in devtools on github.com, but you won’t be able to just use them on your site. Git servers typically do not send the Access-Control-* headers the browser requires to run these requests.

So we need a server after all. Was this a failure, then? No! A CORS proxy is cheaper, simpler, and safer to maintain than a Git service. Also, it can fetch all the files in 3 fetch() requests instead of two requests per file like the GitHub REST API requires.
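
In practice, routing the requests through a proxy means rewriting the upload-pack URL before calling fetch(). The sketch below assumes a proxy that accepts the target URL verbatim after a `?`; that convention, the proxy address, and the helper names are all assumptions for illustration and may not match the proxy shipped in this repository.

```typescript
// Hypothetical helper: derives the smart-HTTP upload-pack endpoint
// from a repository URL.
function gitUploadPackUrl(repoUrl: string): string {
	return repoUrl.replace(/\/+$/, '') + '/git-upload-pack';
}

// Hypothetical helper: prefixes the target URL with a CORS proxy URL.
// Assumes the proxy takes the target verbatim after "?". When no proxy
// is configured (e.g. outside the browser), the URL is returned as-is.
function withCorsProxy(corsProxyUrl: string | undefined, targetUrl: string): string {
	if (!corsProxyUrl) {
		return targetUrl;
	}
	return `${corsProxyUrl}?${targetUrl}`;
}

const proxiedUrl = withCorsProxy(
	'https://proxy.example/cors-proxy.php', // placeholder address
	gitUploadPackUrl('https://github.com/WordPress/gutenberg.git')
);
```

Making the proxy optional keeps the same client code usable in environments that are not subject to CORS.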

Try it yourself

I’ve shared a functional demo that includes a CORS proxy in this repository on GitHub: https://github.com/adamziel/git-sparse-checkout-in-js

Testing instructions

  • Start two terminals
  • Run nx dev playground-components in the first one
  • Run nx start playground-php-cors-proxy in the second one to start the PHP CORS proxy
  • Go to http://localhost:5173/ and play with the UI
  • Play with an early demo of git repository browser shipped in this PR:
CleanShot.2024-09-17.at.21.36.37.mp4

@adamziel adamziel changed the title from "Git sparseCheckout function" to "Native git support: lsRefs(), sparseCheckout(), GitPathControl" Sep 17, 2024
@adamziel adamziel marked this pull request as ready for review October 7, 2024 19:07
@adamziel adamziel requested a review from a team as a code owner October 7, 2024 19:07
adamziel added a commit that referenced this pull request Oct 7, 2024
## Description

Adds a Directory resource type to enable loading file trees from git
repositories, local hard drive, deeply nested zip archives etc:

```json
{
	"steps": [
		{
			"step": "installPlugin",
			"pluginData": {
				"resource": "git:directory",
				"repositoryUrl": "https://github.com/WordPress/wordpress-playground.git",
				"ref": "HEAD",
				"path": "packages/docs"
			}
		},
		{
			"step": "installPlugin",
			"pluginData": {
				"resource": "literal:directory",
				"name": "hello-world",
				"files": {
					"README.md": "Hello, World!",
					"index.php": "<?php\n/**\n* Plugin Name: Hello World\n* Description: A simple plugin that says hello world.\n*/"
				}
			}
		}
	]
}
```

## Motivation

This PR opens the door to:

* Seamless Git integration. Import path mapping from git to Playground
is now just a few steps referencing specific git directories. No more
custom logic required!
* Blueprint-based site imports and exports without any Playground
webapp-specific logic.
* Runtime-specific "resource overrides", e.g.
`--resource-override=GUTENBERG:./gutenberg.zip` in CLI to test a
Blueprint with my local version of Gutenberg. The same logic would be
used by the Blueprints builder to use files selected via `<input
type="file">` controls.

### Schema 

Every step can declare which kinds of resources it accepts (file-based
resources vs directory-based resources). Using a single `pluginData`
property in the `installPlugin` step means fewer choices for the
developer. It also makes local resource overrides easy, e.g. we could
tell Playground CLI to load a local Gutenberg directory instead of a
remote Gutenberg zip. This wouldn't be as easy had we used separate
options for passing ZIP-based and directory-based resources.

On one hand, `pluginData` is less informative than `pluginZipFile`. On
the other, the name accommodates non-zip resources such as
directories.

## Developer notes about specific API changes introduced in this PR

This PR introduces a new `literal:directory` resource that can
be used in Blueprints as follows:

```json
{
	"steps": [
		{
			"step": "installPlugin",
			"pluginData": {
				"resource": "literal:directory",
				"name": "hello-world",
				"files": {
					"README.md": "Hello, World!",
					"index.php": "<?php\n/**\n* Plugin Name: Hello World\n* Description: A simple plugin that says hello world.\n*/"
				}
			}
		}
	]
}
```

Or via the JS API:

```ts
await installTheme(php, {
	themeData: {
		name: 'test-theme',
		files: {
			'index.php': `/**\n * Theme Name: Test Theme`,
		},
	},
	ifAlreadyInstalled: 'overwrite',
	options: {
		activate: false,
	},
});
```

It also introduces a new `writeFiles` step:

```ts
{
	"steps": [
		{
			"step": "writeFiles",
			"writeToPath": "/wordpress/wp-content/plugins/my-plugin",
			"filesTree": {
				"name": "my-plugin",
				"files": {
					"index.php": "<?php echo '<a>Hello World!</a>'; ?>",
					"public": {
						"style.css": "a { color: red; }"
					}
				}
			}
		}
	]
}
```

Specific changes:

* Adds a `Resource<Directory>` resource type that provides a `name:
string` and `files: FileTree`.
* Renames `pluginZipFile` to `pluginData` in the `installPlugin` step
* Renames `themeZipFile` to `themeData` in the `installTheme` step
* Adds a new `writeFiles` step for writing entire directory trees
* Adds a new `literal:directory` resource type where an entire file tree
can be specified inline
* Adds a new `git:directory` resource type that throws an error for now,
but will load arbitrary directories from git repositories once
#1764 lands
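
To illustrate the `name` plus `files` shape described above, here is a minimal sketch of a directory value and of flattening it into the absolute paths a step like `writeFiles` would have to create. The type definitions and the `flattenFileTree` helper are assumptions modeled on this description; the actual types in `@wp-playground/blueprints` may differ.

```typescript
// Assumed shapes, modeled on the description above.
type FileTree = { [name: string]: string | Uint8Array | FileTree };

interface Directory {
	name: string;
	files: FileTree;
}

// Hypothetical helper: turns a nested FileTree into a flat
// path => content map rooted at basePath.
function flattenFileTree(
	files: FileTree,
	basePath = ''
): Map<string, string | Uint8Array> {
	const result = new Map<string, string | Uint8Array>();
	for (const name of Object.keys(files)) {
		const value = files[name];
		const path = basePath ? `${basePath}/${name}` : name;
		if (typeof value === 'string' || value instanceof Uint8Array) {
			result.set(path, value);
		} else {
			// Nested directory: recurse and merge the child paths.
			flattenFileTree(value, path).forEach((content, childPath) => {
				result.set(childPath, content);
			});
		}
	}
	return result;
}

const myPlugin: Directory = {
	name: 'my-plugin',
	files: {
		'index.php': "<?php echo '<a>Hello World!</a>'; ?>",
		public: { 'style.css': 'a { color: red; }' },
	},
};
const flatFiles = flattenFileTree(
	myPlugin.files,
	'/wordpress/wp-content/plugins/my-plugin'
);
```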

## Remaining work

- [x] Discuss the scope and the ideas
- [x] Add unit tests
- [x] Update the documentation
- [x] Adjust the `installPlugin` and `installTheme` step for
compatibility with its former signature. Ensure the existing packages
consuming those functions from the `@wp-playground/blueprints` package
will continue to work.
- [x] Confirm we can safely omit streaming from the system design at
this point without setting ourselves up for a grand refactor a few
months down the road.
* I think we can! Streaming support could be an addition to the system,
not a change in how the system works. For example, there could be a new
`DirectoryStream` resource type producing an `AsyncDirectoryIterator`
with streamable `File` or `Blob` objects as its leaves. It would work
nicely with remote APIs or the ZIP streaming plumbing in
`@php-wasm/stream-compression`. Any existing code expecting a
`DirectoryResource` should be relatively easily adaptable to use these
async iterators instead.

## Follow-up work

- [ ] Include actual git support once the [Git sparse checkout
PR](#1764) lands
- [ ] Ship a Playground CORS proxy to enable using git checkout in the
webapp
- [ ] Once we have a usable `git:directory` resource, expand the
developer notes from this PR and other related PRs and write a post on
https://make.wp.org/playground

## Tangent – Streaming and a shorthand URL notation

Without streaming, the entire directory must be loaded into memory. Our
git sparse checkout implementation buffers everything anyway, but we
will want to stream-read directory resources in the future. For example:

```js
{
	"steps": [
		// Stream plugin files directly from a doubly zipped Git artifact
		{
			"step": "installPlugin",
			"pluginData": {
				"resource": "zip:github-artifact",
				"zipFile": {
					"resource": "url",
					"url": "https://github.com/WordPress/gutenberg/pr/54713/artifacts/build.zip"
				}
			}
		}
	]
}
```

That's extremely verbose, I'd love to explore a shorthand notation. One
idea would be to make it a valid
[URI](https://en.wikipedia.org/wiki/Uniform_Resource_Identifier) shaped
after the [data URL
syntax](https://developer.mozilla.org/en-US/docs/Web/URI/Schemes/data):

```js
const dataUri = `data:text/html,%3Cscript%3Ealert%28%27hi%27%29%3B%3C%2Fscript%3E`;
const githubArtifactUri = `zip-github-artifact+url:https://github.com/WordPress/gutenberg/pr/54713/artifacts/build.zip`;
const gitResourceUri = `git:branch=HEAD;path=src,https://github.com/WordPress/hello-dolly.git`;
```

It wouldn't allow easy composition of the resources, e.g. a directory
inside a zip sourced from a GitHub repo. Maybe that's for the best,
though, since such a string would be extremely dense and difficult for
humans to read. The object-based syntax might still be the most
convenient way to declare those.
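
For a sense of what consuming such a shorthand could involve, here is a minimal sketch of a parser for the `git:` form shown above: a scheme, then `;`-separated `key=value` parameters, then a comma, then the target URL. This is purely exploratory. `parseResourceUri` is a hypothetical helper, the grammar is an assumption inferred from that single example, and the `zip-github-artifact+url:` form (which has no comma) is not handled.

```ts
interface ParsedResourceUri {
	scheme: string;
	params: Record<string, string>;
	target: string;
}

// Hypothetical parser for "scheme:key=value;key=value,target" strings.
function parseResourceUri(uri: string): ParsedResourceUri {
	const colon = uri.indexOf(':');
	// The first comma after the scheme ends the parameter list, similar
	// to how a data URL separates its metadata from its data.
	const comma = uri.indexOf(',', colon + 1);
	if (colon === -1 || comma === -1) {
		throw new Error(`Not a parameterized resource URI: ${uri}`);
	}
	const params: Record<string, string> = {};
	const pairs = uri.slice(colon + 1, comma).split(';').filter(Boolean);
	for (const pair of pairs) {
		const eq = pair.indexOf('=');
		if (eq === -1) {
			params[pair] = ''; // bare flag without a value
		} else {
			params[pair.slice(0, eq)] = pair.slice(eq + 1);
		}
	}
	return {
		scheme: uri.slice(0, colon),
		params,
		target: uri.slice(comma + 1),
	};
}
```

One limitation this makes visible: parameter values could not contain `,` or `;` without some escaping rule, which the object-based syntax never has to worry about.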
@adamziel adamziel merged commit 8639616 into trunk Oct 7, 2024
8 of 9 checks passed
@adamziel adamziel deleted the git-sparse-checkout branch October 7, 2024 22:15

adamziel commented Oct 7, 2024

🥁 This took three months to build and merge, but we got there!

adamziel added a commit that referenced this pull request Oct 7, 2024
Related to #1787, Follows up on #1793

Implements GitDirectoryResource to enable loading files directly from
git repositories as follows:

```ts
{
	"landingPage": "/guides/for-plugin-developers.md",
	"steps": [
		{
			"step": "writeFiles",
			"writeToPath": "/wordpress/guides",
			"filesTree": {
				"resource": "git:directory",
				"url": "https://github.com/WordPress/wordpress-playground.git",
				"ref": "trunk",
				"path": "packages/docs/site/docs/main/guides"
			}
		}
	]
}
```

 ## Implementation details

Uses git client functions merged in
#1764 to sparse
checkout the requested files. It also leans on the PHP CORS proxy which
is now started as a part of the `npm run dev` command.

The CORS proxy URL is configurable per `compileBlueprint()` call so that each
Playground runtime may choose to either use it or not. For example, it
wouldn't be very useful in the CLI version of Playground.

 ## Testing plan

Go to
`http://localhost:5400/website-server/#{%20%22landingPage%22:%20%22/guides/for-plugin-developers.md%22,%20%22steps%22:%20[%20{%20%22step%22:%20%22writeFiles%22,%20%22writeToPath%22:%20%22/wordpress/guides%22,%20%22filesTree%22:%20{%20%22resource%22:%20%22git:directory%22,%20%22url%22:%20%22https://github.com/WordPress/wordpress-playground.git%22,%20%22ref%22:%20%22trunk%22,%20%22path%22:%20%22packages/docs/site/docs/main/guides%22%20}%20}%20]%20}`
and confirm Playground loads a markdown file.
@adamziel adamziel mentioned this pull request Oct 7, 2024

bgrgicak commented Oct 8, 2024

@adamziel after pulling trunk, the isomorphic-git code was missing.
I'm not sure if this should happen automatically or not, I just cloned https://github.com/adamziel/isomorphic-git.git and it started working.

Should we update our local dev instructions or is there something we can automate to make this code available to new developers?

adamziel added a commit that referenced this pull request Oct 8, 2024

adamziel commented Oct 8, 2024

@bgrgicak oh, good point that existing clones won't work without pulling the submodules. You can do that with git pull --recurse-submodules. New developers are already covered by this PR – all the dev instructions are updated to include the --recurse-submodules CLI flag.
