feat: enable ishard to use regex-based external key map through dsort
Signed-off-by: Tony Chen <[email protected]>
Nahemah1022 committed Aug 6, 2024
1 parent d0510d8 commit 5580f06
Showing 8 changed files with 297 additions and 133 deletions.
32 changes: 31 additions & 1 deletion cmd/ishard/README.md
@@ -39,6 +39,9 @@ To give a quick example, `a/b/c/toyota.jpeg` and `a/b/c/toyota.json` from an ori
- `-shard_template="prefix-{0000..4096..8}-suffix"`: Generate output shards `prefix-0000-suffix`, `prefix-0008-suffix`, `prefix-0016-suffix`, and so on.
- `-shard_template="prefix-%06d-suffix"`: Generate output shards `prefix-000000-suffix`, `prefix-000001-suffix`, `prefix-000002-suffix`, and so on.
- `-shard_template="prefix-@00001-gap-@100-suffix"`: Generate output shards `prefix-00001-gap-001-suffix`, `prefix-00001-gap-002-suffix`, and so on.
- `-ekm`: Specify an external key map (EKM) that packs samples into shards by user-defined regex categories. Accepts either an inline JSON string or a path to a JSON file.
- `-ekm="/path/to/ekm.json"`: Specify the EKM as a path to a JSON file.
- `-ekm="{\"fish-%d.tar\": [\"train/n01440764.*\", \"train/n01443537.*\"], \"dog-%d.tar\": [\"train/n02084071.*\", \"train/n02085782.*\"]}"`: Specify the EKM as an inline JSON string.
- `-ext`: The extension used for generating output shards. Supports `.tar`, `.tgz`, `.tar.gz`, `.zip`, and `.tar.lz4` formats.
- `-sample_exts`: A comma-separated list of extensions that every sample in the dataset is required to have. See `-missing_extension_action` for how missing extensions are handled.
- `-missing_extension_action`: Specifies the action to take when an expected extension is missing from a sample. Options are: `abort` | `warn` | `ignore` | `exclude`.
@@ -281,7 +284,7 @@ ImageNet/Data/val/n00000333/ILSVRC2012_val_00007175.JPEG 30.00KiB
5. **Generate output shard names using a template:** You can name output shards with various templates via `-shard_template`. For example:

```sh
$ ./ishard-cli -src_bck=ais://ImageNet -dst_bck=ais://ImageNet-out -shard_template="pre-{0000..8192..8}-suf"
$ ./ishard -src_bck=ais://ImageNet -dst_bck=ais://ImageNet-out -shard_template="pre-{0000..8192..8}-suf"

NAME SIZE
pre-0000-suf.tar 1.07MiB
@@ -298,6 +301,33 @@ ImageNet/Data/val/n00000333/ILSVRC2012_val_00007175.JPEG 30.00KiB
...
```

6. **Pack samples into shards by category using an external key map (EKM):** You can pack samples into shards by custom categories using `-ekm`. For example, the following EKM file packs every sample whose name matches one of the listed regex patterns into the corresponding category. The `//` comments and `...` entries are for illustration only and are not valid JSON.

```json
{
"fish-%d.tar": [
"train/n01440764.*", // tench
"train/n01443537.*", // goldfish
...
],
"dog-%d.tar": [
"train/n02084071.*", // toy terrier
"train/n02085782.*", // Japanese spaniel
"train/n02085936.*", // Maltese dog
...
],
"bird-%d.tar": [
"train/n01514668.*", // cock
"train/n01514859.*", // hen
...
]
}
```

```sh
$ ./ishard -src_bck=ais://ImageNet -dst_bck=ais://ImageNet-out -ekm="/path/to/category.json"
```
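
The matching step that an EKM implies can be sketched in Go as follows. This is a minimal illustration using only the standard library, not ishard's actual implementation (which, per the commit title, routes the EKM through dsort). The names `category`, `parseEKM`, and `lookup`, the sample object names, and the assumption that `%d` in a key is later filled with a shard index are all hypothetical. Patterns are matched unanchored, and categories are checked in sorted key order so that the result is deterministic when several patterns could match.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"sort"
)

// category pairs a shard-name template (an EKM key) with its compiled patterns.
// Hypothetical type for illustration only.
type category struct {
	template string
	patterns []*regexp.Regexp
}

// parseEKM decodes an EKM JSON document (template -> list of regex strings)
// and compiles every pattern. Categories are returned sorted by template.
func parseEKM(raw []byte) ([]category, error) {
	var m map[string][]string
	if err := json.Unmarshal(raw, &m); err != nil {
		return nil, fmt.Errorf("invalid EKM JSON: %w", err)
	}
	templates := make([]string, 0, len(m))
	for t := range m {
		templates = append(templates, t)
	}
	sort.Strings(templates)

	cats := make([]category, 0, len(m))
	for _, t := range templates {
		c := category{template: t}
		for _, p := range m[t] {
			re, err := regexp.Compile(p)
			if err != nil {
				return nil, fmt.Errorf("category %q: %w", t, err)
			}
			c.patterns = append(c.patterns, re)
		}
		cats = append(cats, c)
	}
	return cats, nil
}

// lookup returns the template of the first category with a pattern that
// matches key anywhere in the string (unanchored match).
func lookup(cats []category, key string) (string, bool) {
	for _, c := range cats {
		for _, re := range c.patterns {
			if re.MatchString(key) {
				return c.template, true
			}
		}
	}
	return "", false
}

func main() {
	raw := []byte(`{
		"fish-%d.tar": ["train/n01440764.*", "train/n01443537.*"],
		"dog-%d.tar":  ["train/n02084071.*", "train/n02085782.*"]
	}`)
	cats, err := parseEKM(raw)
	if err != nil {
		panic(err)
	}
	// Hypothetical object names, chosen only to exercise the patterns.
	keys := []string{
		"train/n01440764/sample_a.JPEG",
		"train/n02084071/sample_b.JPEG",
		"val/n00000333/sample_c.JPEG",
	}
	for _, k := range keys {
		if tmpl, ok := lookup(cats, k); ok {
			// Assumption: %d stands for a shard index; 0 is used here.
			fmt.Printf("%s -> %s\n", k, fmt.Sprintf(tmpl, 0))
		} else {
			fmt.Printf("%s -> (no category)\n", k)
		}
	}
}
```

Note that this sketch accepts strict JSON only, so an EKM file containing `//` comments, `...` placeholders, or trailing commas would be rejected by `json.Unmarshal`.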

### Incorrect Usages

1. The number of generated output shards can't fit into the specified `-shard_template`.
63 changes: 31 additions & 32 deletions cmd/ishard/go.mod
@@ -3,7 +3,8 @@ module github.com/NVIDIA/aistore/cmd/ishard
go 1.22.3

require (
github.com/NVIDIA/aistore v1.3.23
github.com/NVIDIA/aistore v1.3.24-0.20240803001017-7a15bb331ebe
github.com/json-iterator/go v1.1.12
github.com/vbauerster/mpb/v4 v4.12.2
)

@@ -13,44 +14,43 @@ require (
github.com/acarl005/stripansi v0.0.0-20180116102854-5a71ef0e047d // indirect
github.com/andybalholm/brotli v1.1.0 // indirect
github.com/beorn7/perks v1.0.1 // indirect
github.com/cespare/xxhash/v2 v2.2.0 // indirect
github.com/cespare/xxhash/v2 v2.3.0 // indirect
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/emicklei/go-restful/v3 v3.12.0 // indirect
github.com/go-logr/logr v1.4.1 // indirect
github.com/emicklei/go-restful/v3 v3.12.1 // indirect
github.com/go-logr/logr v1.4.2 // indirect
github.com/go-openapi/jsonpointer v0.21.0 // indirect
github.com/go-openapi/jsonreference v0.21.0 // indirect
github.com/go-openapi/swag v0.23.0 // indirect
github.com/go-task/slim-sprig v0.0.0-20230315185526-52ccab3ef572 // indirect
github.com/go-task/slim-sprig/v3 v3.0.0 // indirect
github.com/gogo/protobuf v1.3.2 // indirect
github.com/golang/protobuf v1.5.4 // indirect
github.com/google/gnostic-models v0.6.8 // indirect
github.com/google/go-cmp v0.6.0 // indirect
github.com/google/gofuzz v1.2.0 // indirect
github.com/google/pprof v0.0.0-20210720184732-4bb14d4b1be1 // indirect
github.com/google/pprof v0.0.0-20240528025155-186aa0362fba // indirect
github.com/google/uuid v1.6.0 // indirect
github.com/imdario/mergo v0.3.16 // indirect
github.com/josharian/intern v1.0.0 // indirect
github.com/json-iterator/go v1.1.12 // indirect
github.com/karrick/godirwalk v1.17.0 // indirect
github.com/klauspost/compress v1.17.7 // indirect
github.com/klauspost/compress v1.17.9 // indirect
github.com/lufia/iostat v1.2.1 // indirect
github.com/mailru/easyjson v0.7.7 // indirect
github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd // indirect
github.com/modern-go/reflect2 v1.0.2 // indirect
github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 // indirect
github.com/onsi/ginkgo/v2 v2.17.1 // indirect
github.com/onsi/gomega v1.32.0 // indirect
github.com/onsi/ginkgo/v2 v2.19.0 // indirect
github.com/onsi/gomega v1.33.1 // indirect
github.com/philhofer/fwd v1.1.2 // indirect
github.com/pierrec/lz4/v3 v3.3.5 // indirect
github.com/pkg/errors v0.9.1 // indirect
github.com/prometheus/client_golang v1.19.0 // indirect
github.com/prometheus/client_model v0.6.0 // indirect
github.com/prometheus/common v0.51.1 // indirect
github.com/prometheus/procfs v0.13.0 // indirect
github.com/prometheus/client_golang v1.19.1 // indirect
github.com/prometheus/client_model v0.6.1 // indirect
github.com/prometheus/common v0.54.0 // indirect
github.com/prometheus/procfs v0.15.1 // indirect
github.com/spf13/pflag v1.0.5 // indirect
github.com/teris-io/shortid v0.0.0-20220617161101-71ec9f2aa569 // indirect
github.com/tidwall/btree v1.7.0 // indirect
github.com/tidwall/buntdb v1.3.0 // indirect
github.com/tidwall/buntdb v1.3.1 // indirect
github.com/tidwall/gjson v1.17.1 // indirect
github.com/tidwall/grect v0.1.4 // indirect
github.com/tidwall/match v1.1.1 // indirect
@@ -59,28 +59,27 @@ require (
github.com/tidwall/tinyqueue v0.1.1 // indirect
github.com/tinylib/msgp v1.1.9 // indirect
github.com/valyala/bytebufferpool v1.0.0 // indirect
github.com/valyala/fasthttp v1.52.0 // indirect
golang.org/x/crypto v0.22.0 // indirect
golang.org/x/net v0.24.0 // indirect
golang.org/x/oauth2 v0.18.0 // indirect
golang.org/x/sync v0.6.0 // indirect
golang.org/x/sys v0.19.0 // indirect
golang.org/x/term v0.19.0 // indirect
golang.org/x/text v0.14.0 // indirect
github.com/valyala/fasthttp v1.54.0 // indirect
golang.org/x/crypto v0.24.0 // indirect
golang.org/x/net v0.26.0 // indirect
golang.org/x/oauth2 v0.21.0 // indirect
golang.org/x/sync v0.7.0 // indirect
golang.org/x/sys v0.21.0 // indirect
golang.org/x/term v0.21.0 // indirect
golang.org/x/text v0.16.0 // indirect
golang.org/x/time v0.5.0 // indirect
golang.org/x/tools v0.18.0 // indirect
google.golang.org/appengine v1.6.8 // indirect
google.golang.org/protobuf v1.33.0 // indirect
golang.org/x/tools v0.22.0 // indirect
google.golang.org/protobuf v1.34.2 // indirect
gopkg.in/inf.v0 v0.9.1 // indirect
gopkg.in/yaml.v2 v2.4.0 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
k8s.io/api v0.29.3 // indirect
k8s.io/apimachinery v0.29.3 // indirect
k8s.io/client-go v0.29.3 // indirect
k8s.io/api v0.30.2 // indirect
k8s.io/apimachinery v0.30.2 // indirect
k8s.io/client-go v0.30.2 // indirect
k8s.io/klog/v2 v2.120.1 // indirect
k8s.io/kube-openapi v0.0.0-20240322212309-b815d8309940 // indirect
k8s.io/metrics v0.29.3 // indirect
k8s.io/utils v0.0.0-20240310230437-4693a0247e57 // indirect
k8s.io/kube-openapi v0.0.0-20240521193020-835d969ad83a // indirect
k8s.io/metrics v0.30.2 // indirect
k8s.io/utils v0.0.0-20240502163921-fe8a2dddb1d0 // indirect
sigs.k8s.io/json v0.0.0-20221116044647-bc3834ca7abd // indirect
sigs.k8s.io/structured-merge-diff/v4 v4.4.1 // indirect
sigs.k8s.io/yaml v1.4.0 // indirect