update lz4 flex #33

Closed
wants to merge 650 commits into from
f8686ab
improve comments
PSeitz Sep 30, 2022
b50e4b7
Merge pull request #1566 from quickwit-oss/fix_docstore_sorting
PSeitz Sep 30, 2022
b4b4f3f
Removing default features for zstd (#1574)
fulmicoton Sep 30, 2022
a695edc
remove dead indexing code
PSeitz Oct 3, 2022
927dff5
Merge pull request #1578 from quickwit-oss/dead_code
PSeitz Oct 3, 2022
a24ae8d
clippy: Fix needless-borrow warnings. (#1581)
waywardmonkeys Oct 3, 2022
2d23763
Use u64::from boolean more. (#1580)
waywardmonkeys Oct 3, 2022
44e0379
Fix warnings when doc'ing private items. (#1579)
waywardmonkeys Oct 3, 2022
a9d2f3d
Tantivy requires Rust 1.62 or later. (#1583)
waywardmonkeys Oct 3, 2022
b062ab2
use groupby instead of vec allocation
PSeitz Oct 3, 2022
0f4a478
Merge pull request #1582 from quickwit-oss/faster_sorted_field_values
PSeitz Oct 4, 2022
6d9a123
remove get_val in serialization
PSeitz Oct 4, 2022
0f5cff7
move enumerate and remove computation
PSeitz Oct 4, 2022
4cf911d
Merge pull request #1587 from quickwit-oss/no_get_val_in_serialize
PSeitz Oct 4, 2022
5945dbf
change format for store to make it faster with small documents (#1569)
trinity-1686a Oct 4, 2022
6d0bb82
Fix issue 1576: serialize bytes as base64 strings
nigel-andrews Oct 4, 2022
e5043d7
added a couple of tests + make fmt
nigel-andrews Oct 4, 2022
0dc8c45
Flaky unit test. (#1592)
fulmicoton Oct 5, 2022
b3bf9a5
Documentation improvements.
waywardmonkeys Oct 5, 2022
2100ec5
Merge pull request #1593 from waywardmonkeys/doc-improvements
PSeitz Oct 5, 2022
7baa6e3
Removing alloc on all .next() in MultiValueColumn
fulmicoton Oct 5, 2022
f60a551
add flat_map_with_buffer to Iterator trait
PSeitz Oct 5, 2022
7905965
Merge pull request #1594 from quickwit-oss/flat_map_with_buffer
PSeitz Oct 5, 2022
8b42c4c
disable linear codec for multivalue value index
PSeitz Oct 5, 2022
b9f06bc
Update src/fastfield/multivalued/mod.rs
PSeitz Oct 5, 2022
d742275
renames
PSeitz Oct 5, 2022
2063f17
Merge pull request #1591 from quickwit-oss/ff_refact
PSeitz Oct 5, 2022
c694bc0
Fix missing doc warnings when enabling feature "quickwit".
waywardmonkeys Oct 4, 2022
c2f1c25
doc: Remove reference to `Searcher` pool. (#1598)
waywardmonkeys Oct 5, 2022
36e1c79
replace cbor with cborium
PSeitz Oct 7, 2022
516e609
remove unwrap
PSeitz Oct 7, 2022
5f565e7
Merge pull request #1604 from quickwit-oss/replace_cbor
PSeitz Oct 7, 2022
400a20b
add ip field
PSeitz Sep 20, 2022
6113e04
remove comment
PSeitz Sep 26, 2022
c8713a0
use iter api
PSeitz Sep 27, 2022
5a76e6c
fix get_between_vals forwarding
PSeitz Sep 28, 2022
309449d
rename to IpAddr
PSeitz Sep 30, 2022
087beaf
remove null handling
PSeitz Sep 30, 2022
eeb1f19
rename to iter_gen
PSeitz Sep 30, 2022
f5039f1
remove roaring
PSeitz Sep 30, 2022
787a37b
expect instead of unwrap
PSeitz Sep 30, 2022
67f453b
rename to iter_gen
PSeitz Sep 30, 2022
cdc8e3a
group montonic mapping and inverse
PSeitz Oct 4, 2022
4d29ff4
finalize ip addr rename
PSeitz Oct 6, 2022
5d6602a
mark null handling TODO
PSeitz Oct 6, 2022
0b86658
rename ip addr, use buffer
PSeitz Oct 6, 2022
e50e74a
remove u128 type
PSeitz Oct 6, 2022
5171ff6
serialize ip as u128, add test for positions_to_docid
PSeitz Oct 6, 2022
2864bf7
use serializer for u128
PSeitz Oct 6, 2022
226a493
add StrictlyMonotonicFn
PSeitz Oct 6, 2022
a8a36b6
enable test
PSeitz Oct 7, 2022
39f4e58
improve comment
PSeitz Oct 7, 2022
9a1609d
add test
PSeitz Oct 7, 2022
96315df
use idx part only for positions_to_docid
PSeitz Oct 7, 2022
f465173
Apply suggestions from code review
PSeitz Oct 7, 2022
534b1d3
use ipv6
PSeitz Oct 7, 2022
b9b9135
fmt
PSeitz Oct 7, 2022
00a6586
Replaced String::serialize for serializer.serialize_str
nigel-andrews Oct 7, 2022
3b18908
Use raw string literals in tests
nigel-andrews Oct 7, 2022
b2ca83a
switch to ipv6, add monotonic_mapping tests
PSeitz Oct 7, 2022
5c9cbee
handle IpV4 serialization case
PSeitz Oct 7, 2022
e443ca6
Merge pull request #1608 from quickwit-oss/nigel/serialise-bytes-as-b…
fmassot Oct 10, 2022
2efebdb
remove tokenstream vec alloc
PSeitz Oct 11, 2022
3650d1f
Merge pull request #1553 from quickwit-oss/ip_field
PSeitz Oct 11, 2022
8b69aab
avoid prepare_doc allocation (#1610)
PSeitz Oct 11, 2022
9cb8cfb
return Error instead panic in fastfields
PSeitz Oct 11, 2022
11d3409
add missing docs for fastfield_codecs crate (#1613)
PSeitz Oct 11, 2022
4b4c231
Merge pull request #1612 from quickwit-oss/no_panic_please
PSeitz Oct 11, 2022
77a415c
rename NothingRecorder to DocIdRecorder (#1615)
PSeitz Oct 13, 2022
07393c2
Attempt to fix race condition in test. (#1619)
fulmicoton Oct 14, 2022
63bc390
Fix missing fieldnorm indexing
PSeitz Oct 14, 2022
4b9d1fe
Merge pull request #1620 from quickwit-oss/fix_fieldnorms_indexing
PSeitz Oct 14, 2022
a602c24
Merge pull request #1590 from waywardmonkeys/fix-doc-warnings-quickwit
PSeitz Oct 14, 2022
84f9e77
update CHANGELOG
PSeitz Oct 14, 2022
80f9596
Merge pull request #1611 from quickwit-oss/remove_token_stream_alloc
PSeitz Oct 14, 2022
952b048
add term aggregation clarification
PSeitz Oct 14, 2022
d2478fa
Merge pull request #1621 from quickwit-oss/changelog
PSeitz Oct 14, 2022
f39cce2
Merge pull request #1622 from quickwit-oss/term_aggregation
PSeitz Oct 14, 2022
129f742
remove unused buffer
PSeitz Oct 14, 2022
6b7b1cc
Merge pull request #1623 from quickwit-oss/remove_unused_buffer
PSeitz Oct 14, 2022
fcfd76e
refactor Term
PSeitz Oct 11, 2022
8d75e45
fix truncate, remove mutable access from term
PSeitz Oct 14, 2022
024e53a
remove truncate
PSeitz Oct 14, 2022
c9cf9c9
Merge pull request #1614 from quickwit-oss/remove_superfluous_steps
PSeitz Oct 17, 2022
6800fde
add indexing for ip field
PSeitz Oct 14, 2022
96c3d54
fix: Fix power of two computation on 32bit architectures (#1624)
theduke Oct 18, 2022
4918541
Merge pull request #1625 from quickwit-oss/index_ip_field
PSeitz Oct 18, 2022
1082ff6
add range query handling for ip via term dictionary
PSeitz Oct 18, 2022
a4485f7
faster skipindex deserialization, larger blocksize on sort
PSeitz Oct 18, 2022
c9235df
Merge pull request #1627 from quickwit-oss/ip_field_range_query
PSeitz Oct 19, 2022
449f595
Merge pull request #1628 from quickwit-oss/skip_index_deser
PSeitz Oct 19, 2022
f2b2628
add test for phrase search on multi text field
PSeitz Oct 19, 2022
94313b6
Hotfix issue/1629 - position broken (#1633)
fulmicoton Oct 20, 2022
8de7fa9
Merge pull request #1631 from quickwit-oss/high_positions
PSeitz Oct 20, 2022
483b1d1
Added unit test for long tokens (#1635)
fulmicoton Oct 20, 2022
7913500
switch num_vals() to u32
PSeitz Oct 20, 2022
873382c
Merge pull request #1639 from quickwit-oss/num_vals_u32
PSeitz Oct 21, 2022
c24157f
Bumping version format. (#1640)
fulmicoton Oct 21, 2022
f2e5135
allow more characters in range query
PSeitz Oct 21, 2022
03885d0
Merge pull request #1643 from quickwit-oss/range_query_parser
PSeitz Oct 24, 2022
6bb73a5
add range query via ip fast field
PSeitz Oct 20, 2022
9b6b6be
Apply suggestions from code review
PSeitz Oct 21, 2022
07b40f8
add proptest
PSeitz Oct 24, 2022
7cc7752
add comments, rename
PSeitz Oct 24, 2022
02328b0
fix proptest
PSeitz Oct 24, 2022
8c2ba7b
Merge pull request #1637 from quickwit-oss/ip_field_range_query
PSeitz Oct 24, 2022
e772d31
switch get_val() to u32
PSeitz Oct 21, 2022
a5e59ab
Merge pull request #1644 from quickwit-oss/get_val_u32
PSeitz Oct 24, 2022
5e159c2
add ip range query benchmark, add seek behaviour
PSeitz Oct 25, 2022
6213ea4
pass positions parameter
PSeitz Oct 25, 2022
fec2b63
improve bench by adding more blanks in compact space
PSeitz Oct 25, 2022
af83975
No score calls if score is not requested
PSeitz Oct 26, 2022
0c2bd36
Panic on duplicate field names (#1647)
PSeitz Oct 26, 2022
dfab201
for_each_docset to iterate without score
PSeitz Oct 26, 2022
5f7d027
Avoid unconditional allocation in StemmerTokenStream.
adamreichold Oct 26, 2022
bbb058d
Replace FNV by rustc-hash
adamreichold Oct 26, 2022
d777c96
Merge pull request #1650 from adamreichold/fnv-rustc-hash
PSeitz Oct 27, 2022
cd95242
Add dictionary-based SplitCompoundWords token filter.
adamreichold Oct 26, 2022
7a80851
Merge pull request #1645 from quickwit-oss/ip_field_range_query
PSeitz Oct 27, 2022
279b1b2
switch to fx hashmap
PSeitz Oct 27, 2022
6647362
Merge pull request #1648 from adamreichold/stemmer-todo-alloc
PSeitz Oct 27, 2022
43df356
rename to docset
PSeitz Oct 27, 2022
4e46f4f
Merge pull request #1649 from adamreichold/split-compound-words
PSeitz Oct 27, 2022
83325d8
move multivalue index to own file
PSeitz Oct 31, 2022
3f3a6f9
Merge pull request #1653 from quickwit-oss/faster_hash
PSeitz Nov 1, 2022
c32ab66
Small improvements to StopWorldFilter (#1657)
adamreichold Nov 1, 2022
2af6b01
Update src/query/boolean_query/boolean_weight.rs
PSeitz Nov 1, 2022
0f98d91
Merge pull request #1646 from quickwit-oss/no_score_calls
PSeitz Nov 1, 2022
a5a80ff
Update fastfield_codecs/src/column.rs
PSeitz Nov 2, 2022
5b2cea1
Merge pull request #1656 from quickwit-oss/multival_offset_index
PSeitz Nov 2, 2022
509a265
add docstore version (#1652)
PSeitz Nov 4, 2022
500a0d5
update criterion to 0.4
PSeitz Nov 4, 2022
5a610ef
Merge pull request #1661 from quickwit-oss/upgrade_criterion
PSeitz Nov 4, 2022
6e636c9
fix num_vals in multivalue index after merge
PSeitz Nov 7, 2022
e948889
Merge pull request #1662 from quickwit-oss/fix_num_vals
PSeitz Nov 7, 2022
38ad46e
fix clippy
PSeitz Nov 7, 2022
666afcf
Merge pull request #1663 from PSeitz/fix_clippy
PSeitz Nov 7, 2022
c69a873
fix num_vals on value index after merge
PSeitz Nov 7, 2022
3e9c806
Merge pull request #1665 from quickwit-oss/fix_num_vals
PSeitz Nov 7, 2022
a4b759d
Include stop word lists from Lucene and the Snowball project (#1666)
adamreichold Nov 9, 2022
8ca12a5
Added stop word filter to CHANGELOG.md
fulmicoton Nov 9, 2022
3edf0a2
Using the manual reload policy in IndexWriter. (#1667)
fulmicoton Nov 9, 2022
9e8a0c2
Allow range query on fastfield without INDEXED
PSeitz Nov 10, 2022
e6acf8f
add header with codec type for u128
PSeitz Nov 10, 2022
3216668
add header deser test
PSeitz Nov 11, 2022
55a9d80
Merge pull request #1674 from quickwit-oss/u128_codec_header
PSeitz Nov 11, 2022
fb9f031
switch total_num_val to u32
PSeitz Nov 11, 2022
5765c26
allow warming up of the full posting list (#1673)
trinity-1686a Nov 14, 2022
3b5f810
Merge pull request #1677 from quickwit-oss/switch_to_u32
PSeitz Nov 14, 2022
c665b16
Merge pull request #1672 from quickwit-oss/allow_range_without_indexed
PSeitz Nov 14, 2022
f811d16
add support for ip range query on multivalue fastfields
PSeitz Nov 2, 2022
e034328
Improve position_to_docid, refactor, add tests
PSeitz Nov 7, 2022
ce10fab
Apply suggestions from code review
PSeitz Nov 14, 2022
b7d0dd1
fmt
PSeitz Nov 14, 2022
9a090ed
Merge pull request #1659 from quickwit-oss/ip_range_query_multi
PSeitz Nov 14, 2022
8641155
remove column from MultiValuedU128FastFieldReader
PSeitz Nov 14, 2022
eda6e5a
Merge pull request #1681 from quickwit-oss/ip_range_query_multi
PSeitz Nov 15, 2022
ca62311
Make the built-in stop word lists selectable via the Language enum al…
adamreichold Nov 15, 2022
2a39289
Handle escaped dot in json path in the QueryParser. (#1682)
fulmicoton Nov 15, 2022
e758080
add support for TermSetQuery in query parser (#1683)
trinity-1686a Nov 17, 2022
0b40a7f
Added a `expand_dots` JsonObjectOptions. (#1687)
fulmicoton Nov 21, 2022
a05c184
Update zstd requirement from 0.11 to 0.12
dependabot[bot] Nov 23, 2022
0281b22
update create_in_ram docs (#1695)
PSeitz Nov 24, 2022
f53e656
Update env_logger requirement from 0.9.0 to 0.10.0
dependabot[bot] Nov 24, 2022
9929c0c
Merge pull request #1696 from quickwit-oss/dependabot/cargo/env_logge…
PSeitz Nov 25, 2022
600548f
Merge pull request #1694 from quickwit-oss/dependabot/cargo/zstd-0.12
PSeitz Nov 25, 2022
ee1f2c1
add aggregation support for date type (#1693)
PSeitz Nov 28, 2022
1119e59
prepare fastfield format for null index (#1691)
PSeitz Nov 28, 2022
485a8f5
Update CHANGELOG.md
PSeitz Nov 28, 2022
4958243
Move `split_full_path` to `Schema` (#1692)
boraarslan Nov 29, 2022
96c93a6
Merge pull request #1700 from quickwit-oss/PSeitz-patch-1
PSeitz Dec 2, 2022
509adab
Bump version (#1715)
PSeitz Dec 12, 2022
2c50b02
Fix max bucket limit in histogram (#1703)
PSeitz Dec 12, 2022
5d4535d
Changelog fix (#1717)
PSeitz Dec 12, 2022
136a8f4
Isolating sstable and stacker in independant crates. (#1718)
fulmicoton Dec 13, 2022
fbb0f8b
Update base64 requirement from 0.13.0 to 0.20.0 (#1720)
dependabot[bot] Dec 13, 2022
3cdc8e7
pass index info to serialize (#1719)
PSeitz Dec 13, 2022
f9971e1
Fixing unit test with sstable test.
fulmicoton Dec 13, 2022
f6e87a5
Cargo fmt
fulmicoton Dec 13, 2022
a2cf6a7
Sparse dense index (#1716)
PSeitz Dec 13, 2022
f9171a3
fix clippy (#1725)
PSeitz Dec 20, 2022
2ac1cc2
add sparse codec (#1723)
PSeitz Dec 20, 2022
4a6bf50
Clippy
fulmicoton Dec 21, 2022
32cb1d2
Removed AsyncIoResult. (#1728)
fulmicoton Dec 21, 2022
f39165e
Moving FileSlice to tantivy-common (#1729)
fulmicoton Dec 21, 2022
3339a3e
Removed feature(quickwit) in tantivy-common.
fulmicoton Dec 22, 2022
bb48c3e
Refactoring to prepare for the addition of dynamic fast field (#1730)
fulmicoton Dec 22, 2022
540a997
Support for NotNaN in fast fields
fulmicoton Dec 21, 2022
2a6d1ea
Added missing license.
fulmicoton Dec 22, 2022
f4804ce
Adjust spelling of "returns" in docs for DisjunctionMaxQuery (#1733)
mhlakhani Dec 22, 2022
13b89cb
Adding inlines.
fulmicoton Dec 22, 2022
7385a8f
Supporting PartialCmp in VectorColumn. (#1735)
fulmicoton Dec 22, 2022
bc95900
Ooops. Removing ordered_floats.
fulmicoton Dec 22, 2022
45156fd
use group_by in translate_codec_idx_to_original_id (#1736)
PSeitz Dec 26, 2022
9948a84
Simplifies the count_ones definition. (#1742)
fulmicoton Dec 26, 2022
9c5fef5
Fixing sstable proptest (#1743)
fulmicoton Dec 26, 2022
3f91592
Fixing unit tests
fulmicoton Dec 27, 2022
b78dc5e
Bump prettytables (#1746)
pinkforest Dec 31, 2022
b22f966
doc: update comments in the faceted search example (#1737)
DawChihLiou Jan 2, 2023
2080c37
Enable usage of FuzzyTermQuery for specific fields via QueryParser (#…
adamreichold Jan 4, 2023
07a51eb
refactor multivalue fastfield, refactor range query (#1749)
PSeitz Jan 5, 2023
1afa5bf
Make construction of LevenshteinAutomatonBuilder for FuzzyTermQuery …
adamreichold Jan 6, 2023
4f9efe6
Support for columnar (#1734)
fulmicoton Jan 7, 2023
514d23a
move tokenizer API to seperate crate (#1767)
PSeitz Jan 9, 2023
7c6cc81
enable range query on fast field for u64 compatible types (#1762)
PSeitz Jan 10, 2023
3090d49
Update base64 requirement from 0.20.0 to 0.21.0 (#1769)
dependabot[bot] Jan 10, 2023
82a183b
Bump dependency on lru to from version 0.7.5 to version 0.9.0. (#1755)
adamreichold Jan 10, 2023
196e42f
Add regex tokenizer (#1759)
mkleen Jan 10, 2023
7a8fce0
Minor mini fixes
fulmicoton Jan 10, 2023
8312c88
More cosmetic fixes for upcoming Clippy lints. (#1771)
adamreichold Jan 10, 2023
14222a4
Fix typo (#1776)
guilload Jan 10, 2023
f3621c0
Add license to tokenizer-api crate (#1778)
guilload Jan 11, 2023
e17996f
Allow range queries via fast fields on non-indexed fields
guilload Jan 10, 2023
f8d111a
Merge pull request #1777 from quickwit-oss/guilload/ff-range-query-on…
guilload Jan 11, 2023
1176555
handle user input on get_docid_for_value_range (#1760)
PSeitz Jan 12, 2023
2650111
EnableScoring::Disabled - optional Searcher (#1780)
shikhar Jan 12, 2023
6ca9a47
reuse stats for average (#1785)
PSeitz Jan 13, 2023
16b704e
make file_slice_for_range on sstable public (#1784)
trinity-1686a Jan 16, 2023
4bac945
add ip field example (#1775)
PSeitz Jan 16, 2023
25bad78
Integrated fastfield codecs into columnar. (#1782)
fulmicoton Jan 16, 2023
f2dad19
Add count, min, max, and sum aggregations
guilload Jan 13, 2023
a59bd96
Merge pull request #1794 from quickwit-oss/guilload/count-min-max-sum…
guilload Jan 17, 2023
0caaf13
Remove standard deviation from stats aggregation
guilload Jan 13, 2023
c9cb3d0
Merge pull request #1788 from quickwit-oss/guilload/remove-std-dev-fr…
guilload Jan 17, 2023
c51d9f9
Fix some Clippy warnings
guilload Jan 17, 2023
4b343b3
Merge pull request #1802 from quickwit-oss/guilload/clippy-fixes
guilload Jan 17, 2023
c4af63e
add rename (#1797)
PSeitz Jan 18, 2023
f687b3a
start migrate Field to &str (#1772)
PSeitz Jan 18, 2023
5180b61
Removing the demuxer code (#1799)
fulmicoton Jan 18, 2023
d72ea7d
modify getters for sstable metadata (#1793)
trinity-1686a Jan 18, 2023
c723ed3
Columnar merge (#1806)
fulmicoton Jan 19, 2023
9f42b64
Completed unit test for dictionary encoded column
fulmicoton Jan 19, 2023
f9abd25
add ip addr to columnar (#1805)
PSeitz Jan 19, 2023
a86b104
Differentiating between str and bytes, + unit test
fulmicoton Jan 19, 2023
5a42c5a
Add support for multivalues (#1809)
fulmicoton Jan 19, 2023
e3d504d
Minor code cleanup (#1810)
fulmicoton Jan 19, 2023
a2ca129
update aggregation docs (#1807)
PSeitz Jan 19, 2023
8ba333f
Typo fix (#1803)
lonre Jan 19, 2023
08919a2
Improvement on the scalar / random bitpacker code. (#1781)
fulmicoton Jan 19, 2023
50d8a8b
Update README (#1804)
PSeitz Jan 19, 2023
d09d91a
fix tests (#1813)
PSeitz Jan 19, 2023
89cec79
Make it possible to force a column type and intricate bugfix. (#1815)
fulmicoton Jan 20, 2023
b31fd38
collect columns for merge (#1812)
PSeitz Jan 20, 2023
9a296b2
Renamed dense file to dense.rs
fulmicoton Jan 20, 2023
9548570
Fixing broken test build
fulmicoton Jan 20, 2023
226d0f8
add columnar to workspace (#1808)
PSeitz Jan 20, 2023
cbc70a9
Cargo.toml cleanup (#1817)
PSeitz Jan 20, 2023
2874554
Removed the sorting logic that forced column type to be sorted like (…
fulmicoton Jan 20, 2023
0f20787
fix doc store cache docs (#1821)
PSeitz Jan 23, 2023
33f6f34
fmt code, update lz4_flex
PSeitz Jan 31, 2023
1 change: 0 additions & 1 deletion .gitattributes

This file was deleted.

11 changes: 6 additions & 5 deletions .github/workflows/coverage.yml
@@ -12,13 +12,14 @@ jobs:
steps:
- uses: actions/checkout@v3
- name: Install Rust
run: rustup toolchain install nightly --component llvm-tools-preview
- name: Install cargo-llvm-cov
run: curl -LsSf https://github.com/taiki-e/cargo-llvm-cov/releases/latest/download/cargo-llvm-cov-x86_64-unknown-linux-gnu.tar.gz | tar xzf - -C ~/.cargo/bin
run: rustup toolchain install nightly --profile minimal --component llvm-tools-preview
- uses: Swatinem/rust-cache@v2
- uses: taiki-e/install-action@cargo-llvm-cov
- name: Generate code coverage
run: cargo llvm-cov --all-features --workspace --lcov --output-path lcov.info
run: cargo +nightly llvm-cov --all-features --workspace --lcov --output-path lcov.info
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v2
uses: codecov/codecov-action@v3
continue-on-error: true
with:
token: ${{ secrets.CODECOV_TOKEN }} # not required for public repos
files: lcov.info
16 changes: 10 additions & 6 deletions .github/workflows/long_running.yml
@@ -9,16 +9,20 @@ env:
NUM_FUNCTIONAL_TEST_ITERATIONS: 20000

jobs:
functional_test_unsorted:
test:

runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v3
- name: Install stable
uses: actions-rs/toolchain@v1
with:
toolchain: stable
profile: minimal
override: true

- name: Run indexing_unsorted
run: cargo test indexing_unsorted -- --ignored
functional_test_sorted:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Run indexing_sorted
run: cargo test indexing_sorted -- --ignored

53 changes: 39 additions & 14 deletions .github/workflows/test.yml
@@ -10,33 +10,27 @@ env:
CARGO_TERM_COLOR: always

jobs:
test:
check:

runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v3
- name: Build
run: cargo build --verbose --workspace
- name: Install latest nightly to test also against unstable feature flag

- name: Install nightly
uses: actions-rs/toolchain@v1
with:
toolchain: nightly
override: true
profile: minimal
components: rustfmt

- name: Install latest nightly to test also against unstable feature flag
- name: Install stable
uses: actions-rs/toolchain@v1
with:
toolchain: stable
override: true
components: rustfmt, clippy

- name: Run tests
run: cargo +stable test --features mmap,brotli-compression,lz4-compression,snappy-compression,failpoints --verbose --workspace
profile: minimal
components: clippy

- name: Run tests quickwit feature
run: cargo +stable test --features mmap,quickwit,failpoints --verbose --workspace
- uses: Swatinem/rust-cache@v2

- name: Check Formatting
run: cargo +nightly fmt --all -- --check
@@ -47,3 +41,34 @@ jobs:
token: ${{ secrets.GITHUB_TOKEN }}
args: --tests

test:

runs-on: ubuntu-latest

strategy:
matrix:
features: [
{ label: "all", flags: "mmap,stopwords,brotli-compression,lz4-compression,snappy-compression,zstd-compression,failpoints" },
{ label: "quickwit", flags: "mmap,quickwit,failpoints" }
]

name: test-${{ matrix.features.label}}

steps:
- uses: actions/checkout@v3

- name: Install stable
uses: actions-rs/toolchain@v1
with:
toolchain: stable
profile: minimal
override: true

- uses: taiki-e/install-action@nextest
- uses: Swatinem/rust-cache@v2

- name: Run tests
run: cargo +stable nextest run --features ${{ matrix.features.flags }} --verbose --workspace

- name: Run doctests
run: cargo +stable test --doc --features ${{ matrix.features.flags }} --verbose --workspace
1 change: 0 additions & 1 deletion .gitignore
@@ -9,7 +9,6 @@ target/release
Cargo.lock
benchmark
.DS_Store
cpp/simdcomp/bitpackingbenchmark
*.bk
.idea
trace.dat
44 changes: 22 additions & 22 deletions ARCHITECTURE.md
@@ -10,6 +10,7 @@ Tantivy's bread and butter is to address the problem of full-text search :
Given a large set of textual documents, and a text query, return the K-most relevant documents in a very efficient way. To execute these queries rapidly, the tantivy needs to build an index beforehand. The relevance score implemented in the tantivy is not configurable. Tantivy uses the same score as the default similarity used in Lucene / Elasticsearch, called [BM25](https://en.wikipedia.org/wiki/Okapi_BM25).

But tantivy's scope does not stop there. Numerous features are required to power rich-search applications. For instance, one may want to:

- compute the count of documents matching a query in the different section of an e-commerce website,
- display an average price per meter square for a real estate search engine,
- take into account historical user data to rank documents in a specific way,
@@ -22,27 +23,28 @@ rapidly select all documents matching a given predicate (also known as a query)
collect some information about them ([See collector](#collector-define-what-to-do-with-matched-documents)).

Roughly speaking the design is following these guiding principles:

- Search should be O(1) in memory.
- Indexing should be O(1) in memory. (In practice it is just sublinear)
- Search should be as fast as possible

This comes at the cost of the dynamicity of the index: while it is possible to add, and delete documents from our corpus, the tantivy is designed to handle these updates in large batches.

## [core/](src/core): Index, segments, searchers.
## [core/](src/core): Index, segments, searchers

Core contains all of the high-level code to make it possible to create an index, add documents, delete documents and commit.

This is both the most high-level part of tantivy, the least performance-sensitive one, the seemingly most mundane code... And paradoxically the most complicated part.

### Index and Segments...
### Index and Segments

A tantivy index is a collection of smaller independent immutable segments.
A tantivy index is a collection of smaller independent immutable segments.
Each segment contains its own independent set of data structures.

A segment is identified by a segment id that is in fact a UUID.
The file of a segment has the format

```segment-id . ext ```
```segment-id . ext```

The extension signals which data structure (or [`SegmentComponent`](src/core/segment_component.rs)) is stored in the file.

@@ -52,17 +54,15 @@ On commit, one segment per indexing thread is written to disk, and the `meta.jso

For a better idea of how indexing works, you may read the [following blog post](https://fulmicoton.com/posts/behold-tantivy-part2/).


### Deletes

Deletes happen by deleting a "term". Tantivy does not offer any notion of primary id, so it is up to the user to use a field in their schema as if it was a primary id, and delete the associated term if they want to delete only one specific document.

On commit, tantivy will find all of the segments with documents matching this existing term and create a [tombstone file](src/fastfield/delete.rs) that represents the bitset of the document that are deleted.
Like all segment files, this file is immutable. Because it is possible to have more than one tombstone file at a given instant, the tombstone filename has the format ``` segment_id . commit_opstamp . del```.
On commit, tantivy will find all of the segments with documents matching this existing term and remove from [alive bitset file](src/fastfield/alive_bitset.rs) that represents the bitset of the alive document ids.
Like all segment files, this file is immutable. Because it is possible to have more than one alive bitset file at a given instant, the alive bitset filename has the format ```segment_id . commit_opstamp . del```.

An opstamp is simply an incremental id that identifies any operation applied to the index. For instance, performing a commit or adding a document.
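
The alive-bitset idea described above can be illustrated with a minimal sketch. This is not tantivy's implementation in `src/fastfield/alive_bitset.rs`: the `AliveBitset` type and its methods are hypothetical names, and the only assumption is one bit per DocId with 1 meaning "alive".

```rust
/// Hypothetical alive bitset: one bit per DocId, 1 = alive, 0 = deleted.
/// Illustrative only; not tantivy's actual on-disk format or API.
struct AliveBitset {
    words: Vec<u64>,
    max_doc: u32,
}

impl AliveBitset {
    /// Start with every document in `[0, max_doc)` alive.
    fn all_alive(max_doc: u32) -> Self {
        let num_words = (max_doc as usize + 63) / 64;
        let mut words = vec![u64::MAX; num_words];
        // Clear the padding bits past `max_doc` in the last word.
        let rem = max_doc % 64;
        if rem != 0 {
            if let Some(last) = words.last_mut() {
                *last = (1u64 << rem) - 1;
            }
        }
        Self { words, max_doc }
    }

    /// Mark one document as deleted by clearing its bit.
    fn delete(&mut self, doc_id: u32) {
        assert!(doc_id < self.max_doc);
        self.words[(doc_id / 64) as usize] &= !(1u64 << (doc_id % 64));
    }

    fn is_alive(&self, doc_id: u32) -> bool {
        doc_id < self.max_doc
            && (self.words[(doc_id / 64) as usize] >> (doc_id % 64)) & 1 == 1
    }

    fn num_alive(&self) -> u32 {
        self.words.iter().map(|w| w.count_ones()).sum()
    }
}

fn main() {
    let mut alive = AliveBitset::all_alive(70);
    // Pretend DocIds 3 and 65 matched the deleted term.
    alive.delete(3);
    alive.delete(65);
    println!("alive documents: {}", alive.num_alive());
}
```

Because segment files are immutable, a real implementation would write a fresh bitset file on each commit rather than mutate one in place; the in-memory mutation here only models how the bits change.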


### DocId

Within a segment, all documents are identified by a DocId that ranges within `[0, max_doc)`.
@@ -74,6 +74,7 @@ The DocIds are simply allocated in the order documents are added to the index.

In separate threads, tantivy's index writer search for opportunities to merge segments.
The point of segment merge is to:

- eventually get rid of tombstoned documents
- reduce the otherwise ever-growing number of segments.

@@ -94,7 +95,7 @@ called [`Directory`](src/directory/directory.rs).
Contrary to Lucene however, "files" are quite different from some kind of `io::Read` object.
Check out [`src/directory/directory.rs`](src/directory/directory.rs) trait for more details.

Tantivy ships two main directory implementation: the `MMapDirectory` and the `RAMDirectory`,
Tantivy ships two main directory implementation: the `MmapDirectory` and the `RamDirectory`,
but users can extend tantivy with their own implementation.

## [schema/](src/schema): What are documents?
@@ -104,6 +105,7 @@ Tantivy's document follows a very strict schema, decided before building any ind
The schema defines all of the fields that the indexes [`Document`](src/schema/document.rs) may and should contain, their types (`text`, `i64`, `u64`, `Date`, ...) as well as how it should be indexed / represented in tantivy.

Depending on the type of the field, you can decide to

- put it in the docstore
- store it as a fast field
- index it
@@ -117,9 +119,10 @@ As of today, tantivy's schema imposes a 1:1 relationship between a field that is

This is not something tantivy supports, and it is up to the user to duplicate field / concatenate fields before feeding them to tantivy.

## General information about these data structures.
## General information about these data structures

All data structures in tantivy, have:

- a writer
- a serializer
- a reader
@@ -132,7 +135,7 @@ This conversion is done by the serializer.
Finally, the reader is in charge of offering an API to read on this on-disk read-only representation.
In tantivy, readers are designed to require very little anonymous memory. The data is read straight from an mmapped file, and loading an index is as fast as mmapping its files.

## [store/](src/store): Here is my DocId, Gimme my document!
## [store/](src/store): Here is my DocId, Gimme my document

The docstore is a row-oriented storage that, for each document, stores a subset of the fields
that are marked as stored in the schema. The docstore is compressed using a general-purpose algorithm
@@ -146,6 +149,7 @@ Once the top 10 documents have been identified, we fetch them from the store, an
**Not useful for**

Fetching a document from the store is typically a "slow" operation. It usually consists in

- searching into a compact tree-like data structure to find the position of the right block.
- decompressing a small block
- returning the document from this block.
@@ -154,16 +158,15 @@ It is NOT meant to be called for every document matching a query.

As a rule of thumb, if you hit the docstore more than 100 times per search query, you are probably misusing tantivy.


## [fastfield/](src/fastfield): Here is my DocId, Gimme my value!
## [fastfield/](src/fastfield): Here is my DocId, Gimme my value

Fast fields are stored in a column-oriented storage that allows for random access.
The only compression applied is bitpacking. The column comes with two meta data.
The minimum value in the column and the number of bits per doc.

Fetching a value for a `DocId` is then as simple as computing

```
```rust
min_value + fetch_bits(num_bits * doc_id..num_bits * (doc_id+1))
```
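
To make the formula above concrete, here is a minimal sketch of a bitpacked column. It assumes a simple little-endian bit layout and a bit-by-bit loop for clarity; `BitpackedColumn`, `from_values` and `get` are hypothetical names, not tantivy's fast field reader, which uses a tuned bitpacker.

```rust
/// Hypothetical bitpacked column storing `value - min_value` on `num_bits` bits per doc.
/// Illustrative only; the layout is an assumption, not tantivy's format.
struct BitpackedColumn {
    min_value: u64,
    num_bits: u32,
    data: Vec<u8>,
}

impl BitpackedColumn {
    fn from_values(values: &[u64]) -> Self {
        let min_value = values.iter().copied().min().unwrap_or(0);
        let max_delta = values.iter().map(|v| v - min_value).max().unwrap_or(0);
        // Number of bits needed to represent the largest delta.
        let num_bits = 64 - max_delta.leading_zeros();
        let total_bits = num_bits as usize * values.len();
        let mut data = vec![0u8; (total_bits + 7) / 8];
        for (doc_id, value) in values.iter().enumerate() {
            let delta = value - min_value;
            let start = doc_id * num_bits as usize;
            for i in 0..num_bits as usize {
                if (delta >> i) & 1 == 1 {
                    let bit = start + i;
                    data[bit / 8] |= 1 << (bit % 8);
                }
            }
        }
        Self { min_value, num_bits, data }
    }

    /// min_value + fetch_bits(num_bits * doc_id..num_bits * (doc_id + 1))
    fn get(&self, doc_id: u32) -> u64 {
        let start = doc_id as usize * self.num_bits as usize;
        let mut delta = 0u64;
        for i in 0..self.num_bits as usize {
            let bit = start + i;
            if (self.data[bit / 8] >> (bit % 8)) & 1 == 1 {
                delta |= 1u64 << i;
            }
        }
        self.min_value + delta
    }
}

fn main() {
    let column = BitpackedColumn::from_values(&[10, 13, 11, 17]);
    // Deltas are 0, 3, 1, 7, so three bits per document are enough.
    println!("num_bits = {}, value of doc 3 = {}", column.num_bits, column.get(3));
}
```

Random access costs the same for every `doc_id` because the bit range is computed directly from the id, which is the property the column-oriented layout is after.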

@@ -190,7 +193,7 @@ For advanced search engine, it is possible to store all of the features required

Finally facets are a specific kind of fast field, and the associated source code is in [`fastfield/facet_reader.rs`](src/fastfield/facet_reader.rs).

# The inverted search index.
# The inverted search index

The inverted index is the core part of full-text search.
When presented a new document with the text field "Hello, happy tax payer!", tantivy breaks it into a list of so-called tokens. In addition to just splitting these strings into tokens, it might also do different kinds of operations like dropping the punctuation, converting the character to lowercase, apply stemming, etc. Tantivy makes it possible to configure the operations to be applied in the schema (tokenizer/ is the place where these operations are implemented).
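
As a rough illustration of the tokenization step described above, the sketch below splits on non-alphanumeric characters, lowercases the tokens, and builds a toy inverted index. It deliberately does not use tantivy's tokenizer API: `simple_tokenize` and `build_inverted_index` are hypothetical helpers, and stemming is left out.

```rust
use std::collections::BTreeMap;

/// Hypothetical tokenizer: split on anything that is not alphanumeric,
/// drop empty pieces (this removes punctuation), and lowercase.
fn simple_tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|token| !token.is_empty())
        .map(|token| token.to_lowercase())
        .collect()
}

/// Toy inverted index: token -> sorted list of doc ids containing it.
fn build_inverted_index(docs: &[&str]) -> BTreeMap<String, Vec<u32>> {
    let mut index: BTreeMap<String, Vec<u32>> = BTreeMap::new();
    for (doc_id, text) in docs.iter().enumerate() {
        for token in simple_tokenize(text) {
            let postings = index.entry(token).or_default();
            // Doc ids arrive in increasing order, so checking the last entry
            // is enough to keep each list sorted and free of duplicates.
            if postings.last() != Some(&(doc_id as u32)) {
                postings.push(doc_id as u32);
            }
        }
    }
    index
}

fn main() {
    let docs = ["Hello, happy tax payer!", "Happy new year"];
    let index = build_inverted_index(&docs);
    println!("{:?}", index.get("happy"));
}
```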
@@ -215,19 +218,18 @@ The inverted index actually consists of two data structures chained together.

Where [TermInfo](src/postings/term_info.rs) is an object containing some meta data about a term.


## [termdict/](src/termdict): Here is a term, give me the [TermInfo](src/postings/term_info.rs)!
## [termdict/](src/termdict): Here is a term, give me the [TermInfo](src/postings/term_info.rs)

Tantivy's term dictionary is mainly in charge of supplying the function

[Term](src/schema/term.rs) ⟶ [TermInfo](src/postings/term_info.rs)

It is itself broken into two parts.

- [Term](src/schema/term.rs) ⟶ [TermOrdinal](src/termdict/mod.rs) is addressed by a finite state transducer, implemented by the fst crate.
- [TermOrdinal](src/termdict/mod.rs) ⟶ [TermInfo](src/postings/term_info.rs) is addressed by the term info store.
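
The two-step lookup can be sketched as follows. To stay self-contained, a sorted `Vec` with binary search stands in for the finite state transducer, and the `TermInfo` fields shown are hypothetical; neither reflects the actual types in `src/termdict` or `src/postings/term_info.rs`.

```rust
/// Hypothetical metadata about a term; the field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct TermInfo {
    doc_freq: u32,
    postings_offset: u64,
}

/// Toy term dictionary. A sorted Vec plays the role of the FST
/// (Term -> TermOrdinal) and a second Vec plays the term info store
/// (TermOrdinal -> TermInfo).
struct ToyTermDictionary {
    sorted_terms: Vec<String>,
    term_infos: Vec<TermInfo>,
}

impl ToyTermDictionary {
    fn new(mut entries: Vec<(&str, TermInfo)>) -> Self {
        entries.sort_by(|a, b| a.0.cmp(b.0));
        let (sorted_terms, term_infos): (Vec<String>, Vec<TermInfo>) = entries
            .into_iter()
            .map(|(term, info)| (term.to_string(), info))
            .unzip();
        Self { sorted_terms, term_infos }
    }

    /// Step 1: Term -> TermOrdinal (the term's rank in sorted order).
    fn term_ord(&self, term: &str) -> Option<u64> {
        self.sorted_terms
            .binary_search_by(|candidate| candidate.as_str().cmp(term))
            .ok()
            .map(|idx| idx as u64)
    }

    /// Step 2: TermOrdinal -> TermInfo.
    fn term_info_from_ord(&self, ord: u64) -> Option<&TermInfo> {
        self.term_infos.get(ord as usize)
    }

    /// Term -> TermInfo, chaining the two steps.
    fn get(&self, term: &str) -> Option<&TermInfo> {
        self.term_ord(term).and_then(|ord| self.term_info_from_ord(ord))
    }
}

fn main() {
    let dict = ToyTermDictionary::new(vec![
        ("tax", TermInfo { doc_freq: 3, postings_offset: 40 }),
        ("happy", TermInfo { doc_freq: 7, postings_offset: 0 }),
    ]);
    println!("{:?}", dict.get("tax"));
}
```

Splitting the lookup this way keeps the first structure small (it only has to map a term to an integer), while the fixed-size records can be addressed by ordinal.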


## [postings/](src/postings): Iterate over documents... very fast!
## [postings/](src/postings): Iterate over documents... very fast

A posting list makes it possible to store a sorted list of doc ids and for each doc store
a term frequency as well.
@@ -249,15 +251,14 @@ For instance, when the phrase query "the art of war" does not match "the war of
To make it possible, it is possible to specify in the schema that a field should store positions in addition to being indexed.

The token positions of all of the terms are then stored in a separate file with the extension `.pos`.
The [TermInfo](src/postings/term_info.rs) gives an offset (expressed in position this time) in this file. As we iterate throught the docset,
The [TermInfo](src/postings/term_info.rs) gives an offset (expressed in position this time) in this file. As we iterate through the docset,
we advance the position reader by the number of term frequencies of the current document.
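
The way the position reader advances can be sketched with plain slices. This assumes an uncompressed, in-memory list of positions for a single term and postings given as `(doc_id, term_freq)` pairs; `positions_per_doc` is a hypothetical helper, not part of tantivy, whose `.pos` file is encoded differently.

```rust
/// For one term, pair each document with its slice of token positions.
/// `postings` holds (doc_id, term_freq); `positions` is the flat list of
/// positions for that term, starting at the offset given by the TermInfo.
fn positions_per_doc<'a>(
    postings: &[(u32, u32)],
    positions: &'a [u32],
) -> Vec<(u32, &'a [u32])> {
    let mut cursor = 0usize;
    let mut out = Vec::with_capacity(postings.len());
    for &(doc_id, term_freq) in postings {
        // Advance the reader by exactly `term_freq` positions for this doc.
        let end = cursor + term_freq as usize;
        out.push((doc_id, &positions[cursor..end]));
        cursor = end;
    }
    out
}

fn main() {
    let postings = [(2, 2), (5, 1), (9, 3)];
    let positions = [0, 7, 4, 1, 3, 8];
    for (doc_id, doc_positions) in positions_per_doc(&postings, &positions) {
        println!("doc {doc_id}: {doc_positions:?}");
    }
}
```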

## [fieldnorms/](src/fieldnorms): Here is my doc, how many tokens in this field?

The [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) formula also requires to know the number of tokens stored in a specific field for a given document. We store this information on one byte per document in the fieldnorm.
The fieldnorm is therefore compressed. Values up to 40 are encoded unchanged.


## [tokenizer/](src/tokenizer): How should we process text?

Text processing is key to a good search experience.
@@ -268,7 +269,6 @@ Text processing can be configured by selecting an off-the-shelf [`Tokenizer`](./

Tantivy's comes with few tokenizers, but external crates are offering advanced tokenizers, such as [Lindera](https://crates.io/crates/lindera) for Japanese.


## [query/](src/query): Define and compose queries

The [Query](src/query/query.rs) trait defines what a query is.