-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
builtins: add to_tsvector {phrase,plain,}to_tsquery #92966
Conversation
83cd51a
to
3446576
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool how we're growing this functionality!
can we adapt more tests from PG? see the following files
src/test/regress/expected/json.out
src/test/regress/expected/tsdicts.out
src/test/regress/expected/tstypes.out
src/test/regress/expected/tsearch.out
src/test/regress/expected/jsonb.out
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis)
pkg/sql/sem/builtins/tsearch_builtins.go
line 51 at r1 (raw file):
Info: "Converts text to a tsvector, normalizing words according to the specified or default configuration. " + "Position information is included in the result.", Volatility: volatility.Immutable,
in postgres, this is stable (as are all the other ones you added). any reason to make ours different?
pkg/sql/sem/builtins/tsearch_builtins.go
line 89 at r1 (raw file):
}, Info: "Converts text to a tsquery, normalizing words according to the specified or default configuration." + " The & operator inserted between each token in the input.",
missing word in description?
pkg/sql/sem/builtins/tsearch_builtins.go
line 108 at r1 (raw file):
}, Info: "Converts text to a tsquery, normalizing words according to the specified or default configuration." + " The <-> operator inserted between each token in the input.",
missing word?
pkg/util/tsearch/tsquery.go
line 375 at r1 (raw file):
switch len(lexemeTokens) { case 0: // TODO(jordan): we need to do different handling here. We found a stop
does this mean there's a correctness issue? i don't think we should merge it then. or at least, make a loud error rather than a silent correctness bug
3740129
to
dc2b668
Compare
Release note (sql change): add the to_tsvector, to_tsquery, phraseto_tsquery, plainto_tsquery builtins which parse input documents into tsvectors and tsqueries respectively, and the ts_parse builtin which is used to debug the text search parser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)
pkg/sql/sem/builtins/tsearch_builtins.go
line 51 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
in postgres, this is stable (as are all the other ones you added). any reason to make ours different?
The particular overloads of the builtins in this file all should be immutable (and are actually so on Postgres). There are a few overloads for each builtin - the immutable one is the one that does have the config passed in as the first parameter. There are also overloads (that we'll add later) that are stable, these are the ones that pick up the configuration from the session variable)
pkg/sql/sem/builtins/tsearch_builtins.go
line 89 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
missing word in description?
Done.
pkg/sql/sem/builtins/tsearch_builtins.go
line 108 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
missing word?
Done.
pkg/util/tsearch/tsquery.go
line 375 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
does this mean there's a correctness issue? i don't think we should merge it then. or at least, make a loud error rather than a silent correctness bug
No, this doesn't come into play until we add configurations that have stop words. That will come soon but we'll fix it then!
dc2b668
to
67179af
Compare
One thing to note - this PR doesn't get us full compatibility with the "Postgres text search tokenizer", the thing that is exposed via
The Postgres parser is really complex and detailed, and I don't believe it's worth it to support the full parser at this time. Here is the list of tokens the parser can output: https://www.postgresql.org/docs/current/textsearch-parsers.html This PR only outputs one token type, It may sound like sacrilege to not merge with full compatibility, but I think in this case it's okay. Reasons:
If you disagree, I still want to merge this as-is because I want to get going merging the rest of this functionality. We can come back and add the missing token types and so on later. I've filed the missing token types issue here: #97669 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that sounds good to me. there's lots of good functionality in this
Thanks for the review! bors r+ |
Build failed (retrying...): |
Build failed (retrying...): |
Build failed (retrying...): |
Build succeeded: |
97677: tsearch: add stemming and stopword elimination for several languages r=jordanlewis a=jordanlewis Updates: #41288 Epic: CRDB-22357 First commit is #92966. This commit adds stopword elimination for text search. The languages supported are the same ones that Postgres does. The stopword lists were copied from Postgres commit e757080e041214cf6983e3e77ef01e83f1371d72. Also, add snowball stemming provided by the blevesearch snowball stemming library. Release note (sql change): add stemming and stopword eliminating text search configurations for English, Danish, Dutch, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish, and Turkish. 98778: cli: unskip test tenant zip test r=dhartunian a=aadityasondhi This patch unskips and re-records the datadriven `TestTenantZip` as it was fixed in #96553, but was not unskipped or recorded. The test was run locally using `--stress` and did not flake: ``` 101 runs so far, 0 failures, over 5m0s ``` Fixes #87141 Release note: None 98830: sqlinstance: add `binary_version` column to instances table r=knz,JeffSwenson a=healthy-pod This code change adds a `binary_version` column to the instances table. This is done by adding the column to the bootstrap schema for system.sql_instances, and piggy-backing on the existing code for the V23_1_SystemRbrReadNew migration that overwrites the live schema for this table by the bootstrap copy. This redefinition of the meaning of the V23_1_SystemRbrReadNew is backward-incompatible and is only possible because this commit is merged before the v23.1 branch is cut. Release note: None Epic: [CRDB-20860](https://cockroachlabs.atlassian.net/browse/CRDB-20860) 99100: kvserver: skip `TestReplicateQueueExpirationLeasesOnly` under deadlock r=erikgrinaker a=erikgrinaker I give up. Resolves #99015. Epic: none Release note: None 99106: server: avoid some log spam r=erikgrinaker a=knz This change removes the following log spam: ``` could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER ``` Epic: CRDB-23559 Release note: None Co-authored-by: Jordan Lewis <[email protected]> Co-authored-by: Aaditya Sondhi <[email protected]> Co-authored-by: healthy-pod <[email protected]> Co-authored-by: Erik Grinaker <[email protected]> Co-authored-by: Raphael 'kena' Poss <[email protected]>
97685: sql: add default_text_search_config r=jordanlewis a=jordanlewis Updates: #41288 Epic: CRDB-22357 All but the last commit are from #92966 and #97677. This commit adds the default_text_search_config variable for the tsearch package, which allows the user to set a default configuration for the text search builtin functions that take configurations, such as to_tsvector and to_tsquery. The default for this configuration variable is 'english', as it is in Postgres. Release note (sql change): add the default_text_search_config variable for compatibility with the single-argument variants of the text search functions to_tsvector, to_tsquery, phraseto_tsquery, and plainto_tsquery, which use the value of default_text_search_config instead of expecting one to be included as in the two-argument variants. The default value of this setting is 'english'. 99045: roachtest: set 30m timeout for all disk stall roachtests r=nicktrav a=jbowens This commit sets a new 30m timeout for all disk stall roachtests. Previously, the FUSE filesystem variants had no timeout and inherited the default 10h timeout. The other variants had a 20m timeout, which has been observed to be too short due to upreplication latency. Informs #98904. Informs #98886. Epic: None Release note: None 99057: sql: check replace view columns earlier r=rharding6373 a=rharding6373 Before this change, we could encounter internal errors while attempting to add result columns during a `CREATE OR REPLACE VIEW` if the number of columns in the new view was less than the number of columns in the old view. This led to an inconsistency with postgres, which would only return the error `cannot drop columns from view`. This PR moves the check comparing the number of columns before and after the view replacement earlier so that the correct error returns. Co-authored-by: [email protected] Fixes: #99000 Epic: None Release note (bug fix): Fixes an internal error that can occur when `CREATE OR REPLACE VIEW` replaces a view with fewer columns and another entity depended on the view. Co-authored-by: Jordan Lewis <[email protected]> Co-authored-by: Jackson Owens <[email protected]> Co-authored-by: craig[bot] <[email protected]>
97697: tsearch: add ts_rank functionality r=jordanlewis a=jordanlewis Updates: #41288 Epic: CRDB-22357 All but the last commit are from #92966, #97677, and #97685 This commit adds ts_rank, the family of builtins that allow ranking of text search results. The function takes a tsquery and a tsvector and returns a float that indicates how good the match is. The function can be modified by passing in a custom array of weights that matches the text search weights A, B, C, and D, and a bitmask that controls the ranking behavior in various detailed ways. See the excellent Postgres documentation here for details: https://www.postgresql.org/docs/current/textsearch-controls.html Release note (sql change): add the ts_rank function for ranking text search query results 98776: storage: unify storage/fs.FS and pebble/vfs.FS r=jbowens a=jbowens The storage/fs.FS had largely the same interface as vfs.FS. The storage/fs.FS interface was intended as a temporary stepping stone to using pebble's vfs.FS interface throughout Cockroach for all filesystem access. This commit unifies the two. Epic: None Release note: None 99114: kvserver: fix and unskip TestCheckConsistencyInconsistent r=erikgrinaker a=pavelkalinnikov This PR unskips `TestCheckConsistencyInconsistent` which was skipped for a reason that no longer holds. It also fixes the race possible in `TestCheckConsistencyInconsistent`: - Node 2 is corrupted. - The second phase of `runConsistency` check times out on node 1, and returns early when only nodes 2 and 3 have created the storage checkpoint. - Node 1 haven't created the checkpoint, but has started doing so. - Node 2 "fatals" (mocked out in the test) shortly after the check is complete. - Node 1 is still creating its checkpoint, but has probably created the directory by now. - Hence, the test assumes that the checkpoint has been created, and proceeds to open it and compute the checksum of the range. The test now waits for the moment when all the checkpoint are known to be fully populated. Fixes #81819 Epic: none Release note: none Co-authored-by: Jordan Lewis <[email protected]> Co-authored-by: Jackson Owens <[email protected]> Co-authored-by: Pavel Kalinnikov <[email protected]>
Updates: #41288
Epic: CRDB-22357
Release note (sql change): add the to_tsvector, to_tsquery, phraseto_tsquery, and plainto_tsquery builtins which parse input documents into tsvectors and tsqueries respectively.