forked from dieforfree/md2zhihu
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
1 parent 286a946, commit 703cb8f
Showing 11 changed files with 374 additions and 357 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions | ||
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions | ||
|
||
name: Python package | ||
|
||
on: | ||
push: | ||
pull_request: | ||
|
||
jobs: | ||
build: | ||
|
||
runs-on: ubuntu-latest | ||
strategy: | ||
matrix: | ||
python-version: [3.6, 3.7, 3.8] | ||
|
||
steps: | ||
- uses: actions/checkout@v2 | ||
- name: Set up Python ${{ matrix.python-version }} | ||
uses: actions/setup-python@v2 | ||
with: | ||
python-version: ${{ matrix.python-version }} | ||
- name: Install dependencies | ||
run: | | ||
python -m pip install --upgrade pip | ||
pip install flake8 pytest | ||
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi | ||
if [ -f test-requirements.txt ]; then pip install -r test-requirements.txt; fi | ||
- name: Install apt dependencies | ||
run: | | ||
if [ -f packages.txt ]; then cat packages.txt | xargs sudo apt-get install; fi | ||
- name: Lint with flake8 | ||
run: | | ||
# stop the build if there are Python syntax errors or undefined names | ||
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics | ||
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide | ||
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics | ||
- name: Test with pytest | ||
run: | | ||
pip install setuptools wheel twine | ||
cp setup.py .. | ||
( | ||
cd .. | ||
python setup.py sdist bdist_wheel | ||
pip install dist/*.tar.gz | ||
) | ||
PYTHONPATH="$(cd ..; pwd)" pytest |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
pandoc |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
scikit-image |
Empty file.
Binary file not shown.
File renamed without changes
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,108 @@ | ||
--- | ||
|
||
refs: | ||
- "slim": https://github.com/openacid/slim "slim" | ||
- "slimarray": https://github.com/openacid/slimarray "slimarray" | ||
|
||
--- | ||
|
||
# Scenario and problem | ||
|
||
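In time-series databases, or in systems built on columnar storage, a very common pattern is storing an array of integers, | ||
for example the daily star count of the [slim] project: | ||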
|
||
 | ||
 | ||
|
||
|
||
We can exploit the characteristics of the data distribution to compress the overall data to **a fraction** of its original size. | ||
|
||
| Data size | Data Set | gzip size | slimarray size | avg size | ratio | | ||
| --: | :-- | --: | :-- | --: | --: | | ||
| 1,000 | rand u32: [0, 1000] | x | 824 byte | 6 bit/elt | 18% | | ||
| 1,000,000 | rand u32: [0, 1,000,000] | x | 702 KB | 5 bit/elt | 15% | | ||
| 1,000,000 | IPv4 DB | 2 MB | 2 MB | 16 bit/elt | 50% | | ||
| 600 | [slim][] star count | 602 byte | 832 byte | 10 bit/elt | 26% | | ||
|
||
While reaching a compression ratio comparable to gzip, slimarray is also very fast to build and to access: | ||
- when building a slimarray, about 6 million array elements are compressed per second on average; | ||
- reading one array element takes `7 ns/op` on average. | ||
|
||
🤔!!! | ||
|
||
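Following this idea, **find a curve over the given array that describes the trend of the points,** | ||
**then use a small delta array to correct the distance from the curve to the actual points and recover the original values. This compresses the data substantially, and any single element can be read directly without decompressing all of the data.** | ||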
|
||
# Finding the trend function | ||
|
||
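Such a curve is found with linear regression. | ||
For example, [slimarray] uses a quadratic curve `f(x) = β₁ + β₂x + β₃x²`; the task is to determine each βᵢ | ||
so that the mean squared error of `f(xⱼ) - yⱼ` is minimized. xⱼ is the array index 0, 1, 2, ...; yⱼ is the value of each array element. | ||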
|
||
$$ | ||
X = \begin{bmatrix} | ||
1 & x_1 & x_1^2 \\ | ||
1 & x_2 & x_2^2 \\ | ||
\vdots & \vdots & \vdots \\ | ||
1 & x_n & x_n^2 | ||
\end{bmatrix} | ||
, | ||
\vec{\beta} = | ||
\begin{bmatrix} | ||
\beta_1 \\ | ||
\beta_2 \\ | ||
\beta_3 \\ | ||
\end{bmatrix} | ||
, | ||
Y = | ||
\begin{bmatrix} | ||
y_1 \\ | ||
y_2 \\ | ||
\vdots \\ | ||
y_n | ||
\end{bmatrix} | ||
$$ | ||
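The fit described above can be sketched in a few lines. This is a minimal illustration, assuming the coefficients are obtained from the normal equations `XᵀX·β = XᵀY`; the function name `fit_quadratic` is hypothetical, and this is not the slimarray implementation.

```python
def fit_quadratic(ys):
    """Least-squares fit of f(x) = b1 + b2*x + b3*x^2 to the points (j, ys[j]).

    Needs at least 3 points, otherwise the system is singular.
    """
    n = len(ys)
    # Power sums of x form the 3x3 matrix X^T X: entry (r, c) is sum(x^(r+c)).
    s = [sum(float(x) ** k for x in range(n)) for k in range(5)]
    # Right-hand side X^T Y: entry r is sum(x^r * y).
    t = [sum((float(x) ** k) * y for x, y in enumerate(ys)) for k in range(3)]
    # Augmented matrix [X^T X | X^T Y].
    a = [[s[r + c] for c in range(3)] + [t[r]] for r in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [u - factor * v for u, v in zip(a[r], a[col])]
    return [a[r][3] / a[r][r] for r in range(3)]
```

Given the coefficients, the per-element delta is the difference between `ys[j]` and the curve evaluated at `j`; the better the curve follows the data, the fewer bits each delta needs.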
|
||
|
||
`spanIndex = OnesCount(bitmap & (1<<(i/16) - 1))` | ||
|
||
## Read process | ||
|
||
Reading takes a few steps: find the span, read the span's config, and restore the original value. Assume the slimarray object is `sa`: | ||
|
||
- Get spanIndex from the index `i`: `spanIndex = OnesCount(sa.bitmap & (1<<(i/16) - 1))`; | ||
- Get the 3 polynomial coefficients from spanIndex: `[b₀, b₁, b₂] = sa.polynomials[spanIndex*3: spanIndex*3 + 3]`; | ||
- Read the start position of the delta array and the bit width of each delta in it: `config=sa.configs[spanIndex]`; | ||
- The delta is stored at position `config.offset + i*config.width` of the delta array; read `width` bits from there to get the delta value. | ||
- Compute `nums[i]`: `b₀ + b₁*i + b₂*i²` plus the delta value. | ||
|
||
The simplified read logic is as follows: | ||
|
||
```go | ||
func (sm *SlimArray) Get(i int32) uint32 { | ||
|
||
x := float64(i) | ||
|
||
bm := sm.spansBitmap & bitmap.Mask[i>>4] | ||
spanIdx := bits.OnesCount64(bm) | ||
|
||
j := spanIdx * polyCoefCnt | ||
p := sm.Polynomials | ||
v := int64(p[j] + p[j+1]*x + p[j+2]*x*x) | ||
|
||
config := sm.Configs[spanIdx] | ||
deltaWidth := config & 0xff | ||
offset := config >> 8 | ||
|
||
bitIdx := offset + int64(i)*deltaWidth | ||
|
||
d := sm.Deltas[bitIdx>>6] | ||
d = d >> uint(bitIdx&63) | ||
|
||
return uint32(v + int64(d&bitmap.Mask[deltaWidth])) | ||
} | ||
``` |
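To make the lookup steps concrete, here is a toy end-to-end model of the encode and read path. It is a sketch under stated assumptions, not the slimarray code: `build`, `get`, `poly`, and the `span_segs` argument are hypothetical names; a set bitmap bit is assumed to mark the last 16-element segment of a span; the trend is a crude straight line through each span's end points rather than a regression fit; and the delta bits live in one Python integer instead of a `[]uint64`.

```python
import math

SEG = 16  # elements per segment, matching the `i/16` in the lookup formula


def poly(coefs, x):
    b0, b1, b2 = coefs
    return math.floor(b0 + b1 * x + b2 * x * x)


def build(nums, span_segs):
    """Toy encoder; span_segs (hypothetical) lists the segments covered by each span."""
    sa = {"bitmap": 0, "polys": [], "configs": [], "deltas": 0}
    start, next_bit = 0, 0
    for segs in span_segs:
        end = min(start + segs * SEG, len(nums))
        # Crude trend: a line through the span's first and last points (b2 = 0).
        slope = (nums[end - 1] - nums[start]) / max(end - 1 - start, 1)
        coefs = [nums[start] - slope * start, slope, 0.0]
        # Shift the curve below the data so every delta is non-negative;
        # the extra 1 absorbs floating-point rounding in the shifted curve.
        coefs[0] += min(nums[i] - poly(coefs, i) for i in range(start, end)) - 1
        deltas = [nums[i] - poly(coefs, i) for i in range(start, end)]
        width = max(deltas).bit_length()
        # Bias the offset so `offset + i*width` works with the global index i.
        offset = next_bit - start * width
        for i, d in zip(range(start, end), deltas):
            sa["deltas"] |= d << (offset + i * width)
        next_bit += (end - start) * width
        sa["polys"] += coefs
        sa["configs"].append((offset, width))
        # Assumption: the bit of a span's last segment is set.
        sa["bitmap"] |= 1 << ((end - 1) // SEG)
        start = end
    return sa


def get(sa, i):
    # Spans that end before segment i/16 give the index of the span holding i.
    span_idx = bin(sa["bitmap"] & ((1 << (i // SEG)) - 1)).count("1")
    coefs = sa["polys"][span_idx * 3: span_idx * 3 + 3]
    offset, width = sa["configs"][span_idx]
    delta = (sa["deltas"] >> (offset + i * width)) & ((1 << width) - 1)
    return poly(coefs, i) + delta
```

A call such as `build(nums, [2, 1, 4])` splits 100 elements into spans of 2, 1, and 4 segments. After that, `get(sa, i)` touches only one span's coefficients, one config entry, and `width` bits of the delta stream, so no other element is decoded.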
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,116 @@ | ||
|
||
# Scenario and problem | ||
|
||
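In time-series databases, or in systems built on columnar storage, a very common pattern is storing an array of integers, | ||
for example the daily star count of the [slim](https://github.com/openacid/slim) project: | ||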
|
||
 | ||
 | ||
|
||
We can exploit the characteristics of the data distribution to compress the overall data to **a fraction** of its original size. | ||
|
||
<table> | ||
<tr class="header"> | ||
<th style="text-align: right;">Data size</th> | ||
<th style="text-align: left;">Data Set</th> | ||
<th style="text-align: right;">gzip size</th> | ||
<th style="text-align: left;">slimarray size</th> | ||
<th style="text-align: right;">avg size</th> | ||
<th style="text-align: right;">ratio</th> | ||
</tr> | ||
<tr class="odd"> | ||
<td style="text-align: right;">1,000</td> | ||
<td style="text-align: left;">rand u32: [0, 1000]</td> | ||
<td style="text-align: right;">x</td> | ||
<td style="text-align: left;">824 byte</td> | ||
<td style="text-align: right;">6 bit/elt</td> | ||
<td style="text-align: right;">18%</td> | ||
</tr> | ||
<tr class="even"> | ||
<td style="text-align: right;">1,000,000</td> | ||
<td style="text-align: left;">rand u32: [0, 1,000,000]</td> | ||
<td style="text-align: right;">x</td> | ||
<td style="text-align: left;">702 KB</td> | ||
<td style="text-align: right;">5 bit/elt</td> | ||
<td style="text-align: right;">15%</td> | ||
</tr> | ||
<tr class="odd"> | ||
<td style="text-align: right;">1,000,000</td> | ||
<td style="text-align: left;">IPv4 DB</td> | ||
<td style="text-align: right;">2 MB</td> | ||
<td style="text-align: left;">2 MB</td> | ||
<td style="text-align: right;">16 bit/elt</td> | ||
<td style="text-align: right;">50%</td> | ||
</tr> | ||
<tr class="even"> | ||
<td style="text-align: right;">600</td> | ||
<td style="text-align: left;"><a href="https://github.com/openacid/slim">slim</a> star count</td> | ||
<td style="text-align: right;">602 byte</td> | ||
<td style="text-align: left;">832 byte</td> | ||
<td style="text-align: right;">10 bit/elt</td> | ||
<td style="text-align: right;">26%</td> | ||
</tr> | ||
</table> | ||
|
||
While reaching a compression ratio comparable to gzip, slimarray is also very fast to build and to access: | ||
|
||
- when building a slimarray, about 6 million array elements are compressed per second on average; | ||
- reading one array element takes `7 ns/op` on average. | ||
|
||
🤔!!! | ||
|
||
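Following this idea, **find a curve over the given array that describes the trend of the points,** | ||
**then use a small delta array to correct the distance from the curve to the actual points and recover the original values. This compresses the data substantially, and any single element can be read directly without decompressing all of the data.** | ||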
|
||
# Finding the trend function | ||
|
||
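Such a curve is found with linear regression. | ||
For example, [slimarray](https://github.com/openacid/slimarray) uses a quadratic curve `f(x) = β₁ + β₂x + β₃x²`; the task is to determine each βᵢ | ||
so that the mean squared error of `f(xⱼ) - yⱼ` is minimized. xⱼ is the array index 0, 1, 2, ...; yⱼ is the value of each array element. | ||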
|
||
<img src="https://www.zhihu.com/equation?tex=X%20%3D%20%5Cbegin%7Bbmatrix%7D1%20%20%20%20%20%20%26%20x_1%20%20%20%20%26%20x_1%5E2%20%5C%5C1%20%20%20%20%20%20%26%20x_2%20%20%20%20%26%20x_2%5E2%20%5C%5C%5Cvdots%20%26%20%5Cvdots%20%26%20%5Cvdots%20%20%20%20%5C%5C1%20%20%20%20%20%20%26%20x_n%20%20%20%20%26%20x_n%5E2%5Cend%7Bbmatrix%7D%2C%5Cvec%7B%5Cbeta%7D%20%3D%5Cbegin%7Bbmatrix%7D%5Cbeta_1%20%5C%5C%5Cbeta_2%20%5C%5C%5Cbeta_3%20%5C%5C%5Cend%7Bbmatrix%7D%2CY%20%3D%5Cbegin%7Bbmatrix%7Dy_1%20%5C%5Cy_2%20%5C%5C%5Cvdots%20%5C%5Cy_n%5Cend%7Bbmatrix%7D%5C%5C" alt="X = \begin{bmatrix}1 & x_1 & x_1^2 \\1 & x_2 & x_2^2 \\\vdots & \vdots & \vdots \\1 & x_n & x_n^2\end{bmatrix},\vec{\beta} =\begin{bmatrix}\beta_1 \\\beta_2 \\\beta_3 \\\end{bmatrix},Y =\begin{bmatrix}y_1 \\y_2 \\\vdots \\y_n\end{bmatrix}\\" class="ee_img tr_noresize" eeimg="1"> | ||
|
||
`spanIndex = OnesCount(bitmap & (1<<(i/16) - 1))` | ||
|
||
## Read process | ||
|
||
Reading takes a few steps: find the span, read the span's config, and restore the original value. Assume the slimarray object is `sa`: | ||
|
||
- Get spanIndex from the index `i`: `spanIndex = OnesCount(sa.bitmap & (1<<(i/16) - 1))`; | ||
- Get the 3 polynomial coefficients from spanIndex: `[b₀, b₁, b₂] = sa.polynomials[spanIndex*3: spanIndex*3 + 3]`; | ||
- Read the start position of the delta array and the bit width of each delta in it: `config=sa.configs[spanIndex]`; | ||
- The delta is stored at position `config.offset + i*config.width` of the delta array; read `width` bits from there to get the delta value. | ||
- Compute `nums[i]`: `b₀ + b₁*i + b₂*i²` plus the delta value. | ||
|
||
The simplified read logic is as follows: | ||
|
||
```go | ||
func (sm *SlimArray) Get(i int32) uint32 { | ||
|
||
x := float64(i) | ||
|
||
bm := sm.spansBitmap & bitmap.Mask[i>>4] | ||
spanIdx := bits.OnesCount64(bm) | ||
|
||
j := spanIdx * polyCoefCnt | ||
p := sm.Polynomials | ||
v := int64(p[j] + p[j+1]*x + p[j+2]*x*x) | ||
|
||
config := sm.Configs[spanIdx] | ||
deltaWidth := config & 0xff | ||
offset := config >> 8 | ||
|
||
bitIdx := offset + int64(i)*deltaWidth | ||
|
||
d := sm.Deltas[bitIdx>>6] | ||
d = d >> uint(bitIdx&63) | ||
|
||
return uint32(v + int64(d&bitmap.Mask[deltaWidth])) | ||
} | ||
``` | ||
|
||
|
||
[slim]: https://github.com/openacid/slim "slim" | ||
[slimarray]: https://github.com/openacid/slimarray "slimarray" |