Merge pull request #2 from whysage/develop
Update README
Showing 2 changed files with 30 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,30 @@ | ||
# Hash Chunker | ||
|
||
Generator that yields hash chunks for distributed data processing. | ||
|
||
### TLDR | ||
|
||
``` | ||
pip install hash-chunker | ||
``` | ||
|
||
``` | ||
from hash_chunker import HashChunker | ||
chunks = list(HashChunker().get_chunks(chunk_size=1, all_items_count=2)) | ||
assert chunks == [("0000000000", "8000000000"), ("8000000000", "ffffffffff")] | ||
``` | ||
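One way the ranges from the TLDR example might be consumed is sketched below: each row is assigned to the chunk whose range contains its hash. This is an illustration, not part of hash_chunker. The helpers `row_hash` and `chunk_index` are hypothetical, and so is the boundary rule (ranges are treated as half-open, with the last range also including its upper bound), which the README does not specify.

```python
import hashlib

# Chunk ranges copied from the TLDR example above.
chunks = [("0000000000", "8000000000"), ("8000000000", "ffffffffff")]


def row_hash(value: str) -> str:
    """Hypothetical helper: first 10 lowercase hex digits of an MD5 digest."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:10]


def chunk_index(hash_value: str, chunk_ranges) -> int:
    """Return the index of the range that contains hash_value.

    Assumption: ranges are half-open [start, stop), except the last one,
    which also includes its upper bound. Equal-length lowercase hex strings
    compare in the same order as the numbers they encode.
    """
    last = len(chunk_ranges) - 1
    for i, (start, stop) in enumerate(chunk_ranges):
        if start <= hash_value < stop or (i == last and hash_value == stop):
            return i
    raise ValueError(f"hash {hash_value!r} is outside all chunk ranges")


# Group some sample rows by chunk; each bucket could go to its own worker.
rows = ["alice", "bob", "carol", "dave"]
buckets = {i: [] for i in range(len(chunks))}
for row in rows:
    buckets[chunk_index(row_hash(row), chunks)].append(row)
```

In a database setting the same idea would usually be expressed as a range filter on the hash column, so that each worker reads only its own slice.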
|
||
### Description | ||
|
||
Imagine a situation where you need to process a huge number of data rows in parallel. | ||
Each data row has a hash field, and the task is to use it for chunking. | ||
|
||
Possible reasons for using a hash field instead of an integer id field: | ||
- There is no auto-increment id field. | ||
- The id field has many gaps (1, 2, 3, 100500, 100501, 1000000). | ||
- Chunking by id would split data that must stay in one chunk across different chunks | ||
(in user behavioral analytics, the id may auto-increment across all users' actions while | ||
the user_session hash is tied to a specific user, so chunking by id may spread one user | ||
session over several chunks). |
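To make the idea of chunking by hash concrete, here is a minimal sketch that splits a fixed-width hexadecimal hash space into equal ranges. It is not taken from the hash_chunker source: the function name `hex_ranges` and the `hash_length` parameter are assumptions made for illustration, and the even-split arithmetic is just one plausible way to produce ranges like those in the TLDR example.

```python
import math
from typing import Iterator, Tuple


def hex_ranges(
    chunk_size: int, all_items_count: int, hash_length: int = 10
) -> Iterator[Tuple[str, str]]:
    """Yield (start, stop) hex bounds that split the hash space evenly.

    Hypothetical sketch: assumes hashes are uniformly distributed lowercase
    hex strings of length hash_length, so equal ranges hold roughly equal
    numbers of rows.
    """
    if chunk_size <= 0 or all_items_count <= 0:
        return
    n_chunks = math.ceil(all_items_count / chunk_size)
    space = 16 ** hash_length  # number of distinct hash values
    max_value = space - 1      # largest value that fits in hash_length digits

    def fmt(value: int) -> str:
        # Clip to the largest representable hash and zero-pad to fixed width.
        return f"{min(value, max_value):0{hash_length}x}"

    for i in range(n_chunks):
        yield fmt(i * space // n_chunks), fmt((i + 1) * space // n_chunks)


# With the TLDR arguments this sketch yields the same two ranges.
ranges = list(hex_ranges(chunk_size=1, all_items_count=2))
```

Because the ranges depend only on the item count and chunk size, gaps in an id sequence do not affect them, and all rows sharing one hash value always land in the same range.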
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,7 +2,7 @@ | |
name = "hash_chunker" | ||
homepage = "https://github.com/whysage/hash_chunker" | ||
repository = "https://github.com/whysage/hash_chunker" | ||
version = "0.1.1" | ||
version = "0.1.2" | ||
description = "Generator that yields hash chunks for distributed data processing." | ||
authors = ["Volodymyr Kochetkov <[email protected]>"] | ||
license = "MIT" | ||
|