Merge pull request #2 from whysage/develop
Update README
whysage authored Jul 28, 2022
2 parents 4e2376e + 1279cb9 commit 383508e
Showing 2 changed files with 30 additions and 1 deletion.
29 changes: 29 additions & 0 deletions README.md
@@ -1 +1,30 @@
# Hash Chunker

Generator that yields hash chunks for distributed data processing.

### TLDR

```
pip install hash-chunker
```

```
from hash_chunker import HashChunker
chunks = list(HashChunker().get_chunks(chunk_size=1, all_items_count=2))
assert chunks == [("0000000000", "8000000000"), ("8000000000", "ffffffffff")]
```
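To make the shape of these ranges concrete, here is a minimal standard-library sketch that reproduces the two ranges above by splitting the 10-digit hex space evenly. The helper `even_hash_ranges` and its parameters are hypothetical and are not part of `hash_chunker`; how the library itself derives its boundaries is not shown here.

```python
def even_hash_ranges(chunk_count, hash_length=10):
    """Split the hex hash space into chunk_count contiguous (start, end) pairs.

    Hypothetical helper for illustration only, not the hash_chunker API.
    """
    if chunk_count < 1:
        raise ValueError("chunk_count must be at least 1")
    limit = 16 ** hash_length      # number of distinct hashes of this length
    max_hash = limit - 1           # largest hash value, e.g. ffffffffff
    width = f"0{hash_length}x"     # zero-padded lowercase hex format
    ranges = []
    for i in range(chunk_count):
        start = limit * i // chunk_count
        # The last range ends at the largest possible hash value.
        end = max_hash if i == chunk_count - 1 else limit * (i + 1) // chunk_count
        ranges.append((format(start, width), format(end, width)))
    return ranges


ranges = even_hash_ranges(2)
assert ranges == [("0000000000", "8000000000"), ("8000000000", "ffffffffff")]
```

Adjacent ranges share a boundary value, so a consumer has to decide which side owns it.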

### Description

Imagine you need to process a huge number of data rows in parallel.
Each data row has a hash field and the task is to use it for chunking.
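As a rough illustration of chunking by a hash field, the sketch below assigns in-memory rows to the two ranges from the TLDR example. The row layout, the `row_hash` and `rows_in_chunk` helpers, and the boundary rule (start inclusive, end exclusive, except that the last range also includes its end) are all assumptions made for this example, not behavior documented by the library.

```python
import hashlib


def row_hash(key, hash_length=10):
    """Hypothetical hash field: first hash_length hex digits of an MD5 digest."""
    return hashlib.md5(key.encode("utf-8")).hexdigest()[:hash_length]


def rows_in_chunk(rows, start, end, include_end=False):
    """Select rows with start <= hash < end (or <= end when include_end is set).

    Fixed-length lowercase hex strings compare in the same order as their values.
    """
    return [
        row for row in rows
        if start <= row["hash"] < end or (include_end and row["hash"] == end)
    ]


rows = [{"id": i, "hash": row_hash(f"row-{i}")} for i in range(100)]
chunks = [("0000000000", "8000000000"), ("8000000000", "ffffffffff")]

# Each part could be handed to a separate worker.
parts = [
    rows_in_chunk(rows, start, end, include_end=(i == len(chunks) - 1))
    for i, (start, end) in enumerate(chunks)
]
```

Under that boundary rule every row lands in exactly one part, so the parts can be processed independently.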

Possible reasons for using a hash field rather than an integer id field:
- There is no auto-increment id field.
- The id field has many gaps (1, 2, 3, 100500, 100501, 1000000).
- Chunking by id would split data that must stay in one chunk across several chunks
(in user behavioral analytics the id can auto-increment across all users' actions, while the
user_session hash is tied to a specific user, so chunking by id may spread one user session
over several chunks).
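The last point can be checked with a small, self-contained sketch. The event data, the `chunks_per_session` helper, and the two chunking rules are made up for illustration; the hash rule simply splits the hash space at `8000000000`, as in the TLDR example.

```python
from collections import defaultdict

# Hypothetical events: ids auto-increment across all users' actions,
# while the session hash stays the same for one user session.
events = [
    {"id": 1, "session": "1a2b3c4d5e"},
    {"id": 2, "session": "9f8e7d6c5b"},
    {"id": 3, "session": "1a2b3c4d5e"},
    {"id": 4, "session": "9f8e7d6c5b"},
]


def chunks_per_session(events, chunk_of):
    """Map each session to the set of chunk indexes its events fall into."""
    seen = defaultdict(set)
    for event in events:
        seen[event["session"]].add(chunk_of(event))
    return dict(seen)


# Chunking by id, two ids per chunk: ids 1-2 go to chunk 0, ids 3-4 to chunk 1.
by_id = chunks_per_session(events, lambda e: (e["id"] - 1) // 2)

# Chunking by session hash: the hash space is split in half at 8000000000.
by_hash = chunks_per_session(events, lambda e: 0 if e["session"] < "8000000000" else 1)

assert any(len(chunk_ids) > 1 for chunk_ids in by_id.values())     # sessions are split
assert all(len(chunk_ids) == 1 for chunk_ids in by_hash.values())  # sessions stay whole
```

With id-based chunking both sessions end up in two chunks, while hash-based chunking keeps each session in a single chunk.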
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -2,7 +2,7 @@
name = "hash_chunker"
homepage = "https://github.com/whysage/hash_chunker"
repository = "https://github.com/whysage/hash_chunker"
-version = "0.1.1"
+version = "0.1.2"
description = "Generator that yields hash chunks for distributed data processing."
authors = ["Volodymyr Kochetkov <[email protected]>"]
license = "MIT"