builtins: add to_tsvector {phrase,plain,}to_tsquery #92966

jordanlewis · 2022-12-02T22:38:52Z

Release note (sql change): add the to_tsvector, to_tsquery, phraseto_tsquery, and plainto_tsquery builtins which parse input documents into tsvectors and tsqueries respectively.

cockroach-teamcity · 2022-12-02T22:39:00Z

This change is

rafiss

cool how we're growing this functionality!

can we adapt more tests from PG? see the following files

src/test/regress/expected/json.out
src/test/regress/expected/tsdicts.out
src/test/regress/expected/tstypes.out
src/test/regress/expected/tsearch.out
src/test/regress/expected/jsonb.out

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis)

pkg/sql/sem/builtins/tsearch_builtins.go line 51 at r1 (raw file):

			Info: "Converts text to a tsvector, normalizing words according to the specified or default configuration. " +
				"Position information is included in the result.",
			Volatility: volatility.Immutable,

in postgres, this is stable (as are all the other ones you added). any reason to make ours different?

pkg/sql/sem/builtins/tsearch_builtins.go line 89 at r1 (raw file):

			},
			Info: "Converts text to a tsquery, normalizing words according to the specified or default configuration." +
				" The & operator inserted between each token in the input.",

missing word in description?

pkg/sql/sem/builtins/tsearch_builtins.go line 108 at r1 (raw file):

			},
			Info: "Converts text to a tsquery, normalizing words according to the specified or default configuration." +
				" The <-> operator inserted between each token in the input.",

missing word?

pkg/util/tsearch/tsquery.go line 375 at r1 (raw file):

		switch len(lexemeTokens) {
		case 0:
			// TODO(jordan): we need to do different handling here. We found a stop

does this mean there's a correctness issue? i don't think we should merge it then. or at least, make a loud error rather than a silent correctness bug

Release note (sql change): add the to_tsvector, to_tsquery, phraseto_tsquery, plainto_tsquery builtins which parse input documents into tsvectors and tsqueries respectively, and the ts_parse builtin which is used to debug the text search parser.

jordanlewis

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)

pkg/sql/sem/builtins/tsearch_builtins.go line 51 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

in postgres, this is stable (as are all the other ones you added). any reason to make ours different?

The particular overloads of the builtins in this file all should be immutable (and are actually so on Postgres). There are a few overloads for each builtin - the immutable one is the one that does have the config passed in as the first parameter. There are also overloads (that we'll add later) that are stable, these are the ones that pick up the configuration from the session variable)

pkg/sql/sem/builtins/tsearch_builtins.go line 89 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

missing word in description?

Done.

pkg/sql/sem/builtins/tsearch_builtins.go line 108 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

missing word?

Done.

pkg/util/tsearch/tsquery.go line 375 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

does this mean there's a correctness issue? i don't think we should merge it then. or at least, make a loud error rather than a silent correctness bug

No, this doesn't come into play until we add configurations that have stop words. That will come soon but we'll fix it then!

jordanlewis · 2023-02-25T04:35:51Z

One thing to note - this PR doesn't get us full compatibility with the "Postgres text search tokenizer", the thing that is exposed via ts_parse, which is essentially the first pass of text search document processing. For reference, the phases are:

Parse the input document into "tokens" (this phase separates text by whitespace and so on)
"Lexize" each token into lexemes (this phase does stemming + stopword elimination which will come soon in another PR)
Put all of the lexemes into a tsvector or tsquery for storage or retrieval

The Postgres parser is really complex and detailed, and I don't believe it's worth it to support the full parser at this time. Here is the list of tokens the parser can output: https://www.postgresql.org/docs/current/textsearch-parsers.html This PR only outputs one token type, ascii.

It may sound like sacrilege to not merge with full compatibility, but I think in this case it's okay. Reasons:

Text search isn't expected to be "precise" like other database facilities
Many of the token types seem fairly exotic, like XML tag / entity
The main functionality of breaking up text and stemming it will be preserved
The token types are exposed for use with configurable / plugin dicitionaries, which we won't support

If you disagree, I still want to merge this as-is because I want to get going merging the rest of this functionality. We can come back and add the missing token types and so on later. I've filed the missing token types issue here: #97669

rafiss

that sounds good to me. there's lots of good functionality in this

jordanlewis · 2023-02-28T16:02:57Z

Thanks for the review!

bors r+

craig · 2023-02-28T16:38:00Z

Build failed (retrying...):

Bazel Essential CI (Cockroach)

craig · 2023-02-28T16:44:35Z

Build failed (retrying...):

Bazel Essential CI (Cockroach)

craig · 2023-02-28T17:34:26Z

Build failed (retrying...):

Bazel Essential CI (Cockroach)

craig · 2023-02-28T18:41:02Z

Build succeeded:

Bazel Essential CI (Cockroach)

97677: tsearch: add stemming and stopword elimination for several languages r=jordanlewis a=jordanlewis Updates: #41288 Epic: CRDB-22357 First commit is #92966. This commit adds stopword elimination for text search. The languages supported are the same ones that Postgres does. The stopword lists were copied from Postgres commit e757080e041214cf6983e3e77ef01e83f1371d72. Also, add snowball stemming provided by the blevesearch snowball stemming library. Release note (sql change): add stemming and stopword eliminating text search configurations for English, Danish, Dutch, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish, and Turkish. 98778: cli: unskip test tenant zip test r=dhartunian a=aadityasondhi This patch unskips and re-records the datadriven `TestTenantZip` as it was fixed in #96553, but was not unskipped or recorded. The test was run locally using `--stress` and did not flake: ``` 101 runs so far, 0 failures, over 5m0s ``` Fixes #87141 Release note: None 98830: sqlinstance: add `binary_version` column to instances table r=knz,JeffSwenson a=healthy-pod This code change adds a `binary_version` column to the instances table. This is done by adding the column to the bootstrap schema for system.sql_instances, and piggy-backing on the existing code for the V23_1_SystemRbrReadNew migration that overwrites the live schema for this table by the bootstrap copy. This redefinition of the meaning of the V23_1_SystemRbrReadNew is backward-incompatible and is only possible because this commit is merged before the v23.1 branch is cut. Release note: None Epic: [CRDB-20860](https://cockroachlabs.atlassian.net/browse/CRDB-20860) 99100: kvserver: skip `TestReplicateQueueExpirationLeasesOnly` under deadlock r=erikgrinaker a=erikgrinaker I give up. Resolves #99015. Epic: none Release note: None 99106: server: avoid some log spam r=erikgrinaker a=knz This change removes the following log spam: ``` could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER ``` Epic: CRDB-23559 Release note: None Co-authored-by: Jordan Lewis <[email protected]> Co-authored-by: Aaditya Sondhi <[email protected]> Co-authored-by: healthy-pod <[email protected]> Co-authored-by: Erik Grinaker <[email protected]> Co-authored-by: Raphael 'kena' Poss <[email protected]>

97685: sql: add default_text_search_config r=jordanlewis a=jordanlewis Updates: #41288 Epic: CRDB-22357 All but the last commit are from #92966 and #97677. This commit adds the default_text_search_config variable for the tsearch package, which allows the user to set a default configuration for the text search builtin functions that take configurations, such as to_tsvector and to_tsquery. The default for this configuration variable is 'english', as it is in Postgres. Release note (sql change): add the default_text_search_config variable for compatibility with the single-argument variants of the text search functions to_tsvector, to_tsquery, phraseto_tsquery, and plainto_tsquery, which use the value of default_text_search_config instead of expecting one to be included as in the two-argument variants. The default value of this setting is 'english'. 99045: roachtest: set 30m timeout for all disk stall roachtests r=nicktrav a=jbowens This commit sets a new 30m timeout for all disk stall roachtests. Previously, the FUSE filesystem variants had no timeout and inherited the default 10h timeout. The other variants had a 20m timeout, which has been observed to be too short due to upreplication latency. Informs #98904. Informs #98886. Epic: None Release note: None 99057: sql: check replace view columns earlier r=rharding6373 a=rharding6373 Before this change, we could encounter internal errors while attempting to add result columns during a `CREATE OR REPLACE VIEW` if the number of columns in the new view was less than the number of columns in the old view. This led to an inconsistency with postgres, which would only return the error `cannot drop columns from view`. This PR moves the check comparing the number of columns before and after the view replacement earlier so that the correct error returns. Co-authored-by: [email protected] Fixes: #99000 Epic: None Release note (bug fix): Fixes an internal error that can occur when `CREATE OR REPLACE VIEW` replaces a view with fewer columns and another entity depended on the view. Co-authored-by: Jordan Lewis <[email protected]> Co-authored-by: Jackson Owens <[email protected]> Co-authored-by: craig[bot] <[email protected]>

97697: tsearch: add ts_rank functionality r=jordanlewis a=jordanlewis Updates: #41288 Epic: CRDB-22357 All but the last commit are from #92966, #97677, and #97685 This commit adds ts_rank, the family of builtins that allow ranking of text search results. The function takes a tsquery and a tsvector and returns a float that indicates how good the match is. The function can be modified by passing in a custom array of weights that matches the text search weights A, B, C, and D, and a bitmask that controls the ranking behavior in various detailed ways. See the excellent Postgres documentation here for details: https://www.postgresql.org/docs/current/textsearch-controls.html Release note (sql change): add the ts_rank function for ranking text search query results 98776: storage: unify storage/fs.FS and pebble/vfs.FS r=jbowens a=jbowens The storage/fs.FS had largely the same interface as vfs.FS. The storage/fs.FS interface was intended as a temporary stepping stone to using pebble's vfs.FS interface throughout Cockroach for all filesystem access. This commit unifies the two. Epic: None Release note: None 99114: kvserver: fix and unskip TestCheckConsistencyInconsistent r=erikgrinaker a=pavelkalinnikov This PR unskips `TestCheckConsistencyInconsistent` which was skipped for a reason that no longer holds. It also fixes the race possible in `TestCheckConsistencyInconsistent`: - Node 2 is corrupted. - The second phase of `runConsistency` check times out on node 1, and returns early when only nodes 2 and 3 have created the storage checkpoint. - Node 1 haven't created the checkpoint, but has started doing so. - Node 2 "fatals" (mocked out in the test) shortly after the check is complete. - Node 1 is still creating its checkpoint, but has probably created the directory by now. - Hence, the test assumes that the checkpoint has been created, and proceeds to open it and compute the checksum of the range. The test now waits for the moment when all the checkpoint are known to be fully populated. Fixes #81819 Epic: none Release note: none Co-authored-by: Jordan Lewis <[email protected]> Co-authored-by: Jackson Owens <[email protected]> Co-authored-by: Pavel Kalinnikov <[email protected]>

jordanlewis requested a review from a team December 2, 2022 22:38

jordanlewis force-pushed the to-tsvector branch from 83cd51a to 3446576 Compare December 3, 2022 00:09

rafiss requested changes Dec 5, 2022

View reviewed changes

jordanlewis force-pushed the to-tsvector branch 2 times, most recently from 3740129 to dc2b668 Compare December 29, 2022 21:07

jordanlewis requested a review from rafiss February 25, 2023 04:24

jordanlewis commented Feb 25, 2023

View reviewed changes

jordanlewis force-pushed the to-tsvector branch from dc2b668 to 67179af Compare February 25, 2023 04:24

This was referenced Feb 25, 2023

tsearch: add stemming and stopword elimination for several languages #97677

Merged

sql: add default_text_search_config #97685

Merged

tsearch: add ts_rank functionality #97697

Merged

rafiss approved these changes Feb 28, 2023

View reviewed changes

craig bot merged commit 3ed2c8f into cockroachdb:master Feb 28, 2023

cockroach-teamcity mentioned this pull request Mar 1, 2023

PR #92966 - builtins: add to_tsvector, {phrase,plain,}to_tsquery, ts_parse cockroachdb/docs#16395

Open

jordanlewis deleted the to-tsvector branch March 14, 2023 20:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

builtins: add to_tsvector {phrase,plain,}to_tsquery #92966

builtins: add to_tsvector {phrase,plain,}to_tsquery #92966

jordanlewis commented Dec 2, 2022 •

edited

Loading

cockroach-teamcity commented Dec 2, 2022

rafiss left a comment

jordanlewis left a comment

jordanlewis commented Feb 25, 2023

rafiss left a comment

jordanlewis commented Feb 28, 2023

craig bot commented Feb 28, 2023

craig bot commented Feb 28, 2023

craig bot commented Feb 28, 2023

craig bot commented Feb 28, 2023

builtins: add to_tsvector {phrase,plain,}to_tsquery #92966

builtins: add to_tsvector {phrase,plain,}to_tsquery #92966

Conversation

jordanlewis commented Dec 2, 2022 • edited Loading

cockroach-teamcity commented Dec 2, 2022

rafiss left a comment

Choose a reason for hiding this comment

jordanlewis left a comment

Choose a reason for hiding this comment

jordanlewis commented Feb 25, 2023

rafiss left a comment

Choose a reason for hiding this comment

jordanlewis commented Feb 28, 2023

craig bot commented Feb 28, 2023

craig bot commented Feb 28, 2023

craig bot commented Feb 28, 2023

craig bot commented Feb 28, 2023

jordanlewis commented Dec 2, 2022 •

edited

Loading