Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

builtins: add to_tsvector {phrase,plain,}to_tsquery #92966

Merged
merged 1 commit into from
Feb 28, 2023

Conversation

jordanlewis
Copy link
Member

@jordanlewis jordanlewis commented Dec 2, 2022

Updates: #41288
Epic: CRDB-22357

Release note (sql change): add the to_tsvector, to_tsquery, phraseto_tsquery, and plainto_tsquery builtins which parse input documents into tsvectors and tsqueries respectively.

@jordanlewis jordanlewis requested a review from a team December 2, 2022 22:38
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool how we're growing this functionality!

can we adapt more tests from PG? see the following files

src/test/regress/expected/json.out
src/test/regress/expected/tsdicts.out
src/test/regress/expected/tstypes.out
src/test/regress/expected/tsearch.out
src/test/regress/expected/jsonb.out

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis)


pkg/sql/sem/builtins/tsearch_builtins.go line 51 at r1 (raw file):

			Info: "Converts text to a tsvector, normalizing words according to the specified or default configuration. " +
				"Position information is included in the result.",
			Volatility: volatility.Immutable,

in postgres, this is stable (as are all the other ones you added). any reason to make ours different?


pkg/sql/sem/builtins/tsearch_builtins.go line 89 at r1 (raw file):

			},
			Info: "Converts text to a tsquery, normalizing words according to the specified or default configuration." +
				" The & operator inserted between each token in the input.",

missing word in description?


pkg/sql/sem/builtins/tsearch_builtins.go line 108 at r1 (raw file):

			},
			Info: "Converts text to a tsquery, normalizing words according to the specified or default configuration." +
				" The <-> operator inserted between each token in the input.",

missing word?


pkg/util/tsearch/tsquery.go line 375 at r1 (raw file):

		switch len(lexemeTokens) {
		case 0:
			// TODO(jordan): we need to do different handling here. We found a stop

does this mean there's a correctness issue? i don't think we should merge it then. or at least, make a loud error rather than a silent correctness bug

@jordanlewis jordanlewis force-pushed the to-tsvector branch 2 times, most recently from 3740129 to dc2b668 Compare December 29, 2022 21:07
Release note (sql change): add the to_tsvector, to_tsquery,
phraseto_tsquery, plainto_tsquery builtins which parse input
documents into tsvectors and tsqueries respectively, and the ts_parse
builtin which is used to debug the text search parser.
@jordanlewis jordanlewis requested a review from rafiss February 25, 2023 04:24
Copy link
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)


pkg/sql/sem/builtins/tsearch_builtins.go line 51 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

in postgres, this is stable (as are all the other ones you added). any reason to make ours different?

The particular overloads of the builtins in this file all should be immutable (and are actually so on Postgres). There are a few overloads for each builtin - the immutable one is the one that does have the config passed in as the first parameter. There are also overloads (that we'll add later) that are stable, these are the ones that pick up the configuration from the session variable)


pkg/sql/sem/builtins/tsearch_builtins.go line 89 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

missing word in description?

Done.


pkg/sql/sem/builtins/tsearch_builtins.go line 108 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

missing word?

Done.


pkg/util/tsearch/tsquery.go line 375 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

does this mean there's a correctness issue? i don't think we should merge it then. or at least, make a loud error rather than a silent correctness bug

No, this doesn't come into play until we add configurations that have stop words. That will come soon but we'll fix it then!

@jordanlewis
Copy link
Member Author

One thing to note - this PR doesn't get us full compatibility with the "Postgres text search tokenizer", the thing that is exposed via ts_parse, which is essentially the first pass of text search document processing. For reference, the phases are:

  1. Parse the input document into "tokens" (this phase separates text by whitespace and so on)
  2. "Lexize" each token into lexemes (this phase does stemming + stopword elimination which will come soon in another PR)
  3. Put all of the lexemes into a tsvector or tsquery for storage or retrieval

The Postgres parser is really complex and detailed, and I don't believe it's worth it to support the full parser at this time. Here is the list of tokens the parser can output: https://www.postgresql.org/docs/current/textsearch-parsers.html This PR only outputs one token type, ascii.

It may sound like sacrilege to not merge with full compatibility, but I think in this case it's okay. Reasons:

  1. Text search isn't expected to be "precise" like other database facilities
  2. Many of the token types seem fairly exotic, like XML tag / entity
  3. The main functionality of breaking up text and stemming it will be preserved
  4. The token types are exposed for use with configurable / plugin dicitionaries, which we won't support

If you disagree, I still want to merge this as-is because I want to get going merging the rest of this functionality. We can come back and add the missing token types and so on later. I've filed the missing token types issue here: #97669

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that sounds good to me. there's lots of good functionality in this

@jordanlewis
Copy link
Member Author

Thanks for the review!

bors r+

@craig
Copy link
Contributor

craig bot commented Feb 28, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Feb 28, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Feb 28, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Feb 28, 2023

Build succeeded:

craig bot pushed a commit that referenced this pull request Mar 21, 2023
97677: tsearch: add stemming and stopword elimination for several languages r=jordanlewis a=jordanlewis

Updates: #41288
Epic: CRDB-22357
First commit is #92966.

    This commit adds stopword elimination for text search. The languages
    supported are the same ones that Postgres does. The stopword lists were
    copied from Postgres commit e757080e041214cf6983e3e77ef01e83f1371d72.

    Also, add snowball stemming provided by the blevesearch snowball
    stemming library.

    Release note (sql change): add stemming and stopword eliminating text
    search configurations for English, Danish, Dutch, Finnish, French,
    German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish,
    Swedish, and Turkish.

98778: cli: unskip test tenant zip test r=dhartunian a=aadityasondhi

This patch unskips and re-records the datadriven `TestTenantZip` as it was fixed in #96553, but was not unskipped or recorded. The test was run locally using `--stress` and did not flake:
```
101 runs so far, 0 failures, over 5m0s
```

Fixes #87141

Release note: None

98830: sqlinstance: add `binary_version` column to instances table r=knz,JeffSwenson a=healthy-pod

This code change adds a `binary_version` column to the instances table.

This is done by adding the column to the bootstrap schema for system.sql_instances,
and piggy-backing on the existing code for the V23_1_SystemRbrReadNew migration
that overwrites the live schema for this table by the bootstrap copy.

This redefinition of the meaning of the V23_1_SystemRbrReadNew is backward-incompatible
and is only possible because this commit is merged before the v23.1 branch is cut.

Release note: None
Epic: [CRDB-20860](https://cockroachlabs.atlassian.net/browse/CRDB-20860)

99100: kvserver: skip `TestReplicateQueueExpirationLeasesOnly` under deadlock r=erikgrinaker a=erikgrinaker

I give up.

Resolves #99015.

Epic: none
Release note: None

99106: server: avoid some log spam r=erikgrinaker a=knz

This change removes the following log spam:
```
could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
```

Epic: CRDB-23559
Release note: None

Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Aaditya Sondhi <[email protected]>
Co-authored-by: healthy-pod <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
craig bot pushed a commit that referenced this pull request Mar 21, 2023
97685: sql: add default_text_search_config r=jordanlewis a=jordanlewis

Updates: #41288
Epic: CRDB-22357

All but the last commit are from #92966 and #97677.


    This commit adds the default_text_search_config variable for the tsearch
    package, which allows the user to set a default configuration for the
    text search builtin functions that take configurations, such as
    to_tsvector and to_tsquery. The default for this configuration variable
    is 'english', as it is in Postgres.

    Release note (sql change): add the default_text_search_config variable
    for compatibility with the single-argument variants of the text search
    functions to_tsvector, to_tsquery, phraseto_tsquery, and
    plainto_tsquery, which use the value of default_text_search_config
    instead of expecting one to be included as in the two-argument variants.
    The default value of this setting is 'english'.

99045: roachtest: set 30m timeout for all disk stall roachtests r=nicktrav a=jbowens

This commit sets a new 30m timeout for all disk stall roachtests. Previously,
the FUSE filesystem variants had no timeout and inherited the default 10h
timeout. The other variants had a 20m timeout, which has been observed to be
too short due to upreplication latency.

Informs #98904.
Informs #98886.
Epic: None
Release note: None


99057: sql: check replace view columns earlier r=rharding6373 a=rharding6373

Before this change, we could encounter internal errors while attempting to add result columns during a `CREATE OR REPLACE VIEW` if the number of columns in the new view was less than the number of columns in the old view. This led to an inconsistency with postgres, which would only return the error `cannot drop columns from view`.

This PR moves the check comparing the number of columns before and after the view replacement earlier so that the correct error returns.

Co-authored-by: [email protected]

Fixes: #99000
Epic: None

Release note (bug fix): Fixes an internal error that can occur when `CREATE OR REPLACE VIEW` replaces a view with fewer columns and another entity depended on the view.

Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: craig[bot] <[email protected]>
craig bot pushed a commit that referenced this pull request Mar 22, 2023
97697: tsearch: add ts_rank functionality r=jordanlewis a=jordanlewis

Updates: #41288
Epic: CRDB-22357
All but the last commit are from #92966, #97677, and #97685

    This commit adds ts_rank, the family of builtins that allow ranking of
    text search results. The function takes a tsquery and a tsvector and
    returns a float that indicates how good the match is. The function can
    be modified by passing in a custom array of weights that matches the
    text search weights A, B, C, and D, and a bitmask that controls the
    ranking behavior in various detailed ways.

    See the excellent Postgres documentation here for details:
    https://www.postgresql.org/docs/current/textsearch-controls.html

    Release note (sql change): add the ts_rank function for ranking text
    search query results

98776: storage: unify storage/fs.FS and pebble/vfs.FS r=jbowens a=jbowens

The storage/fs.FS had largely the same interface as vfs.FS. The storage/fs.FS interface was intended as a temporary stepping stone to using pebble's vfs.FS interface throughout Cockroach for all filesystem access. This commit unifies the two.

Epic: None
Release note: None

99114: kvserver: fix and unskip TestCheckConsistencyInconsistent r=erikgrinaker a=pavelkalinnikov

This PR unskips `TestCheckConsistencyInconsistent` which was skipped for a reason that no longer holds.

It also fixes the race possible in `TestCheckConsistencyInconsistent`:
- Node 2 is corrupted.
- The second phase of `runConsistency` check times out on node 1, and returns early when only nodes 2 and 3 have created the storage checkpoint.
- Node 1 haven't created the checkpoint, but has started doing so.
- Node 2 "fatals" (mocked out in the test) shortly after the check is complete.
- Node 1 is still creating its checkpoint, but has probably created the directory by now.
- Hence, the test assumes that the checkpoint has been created, and proceeds to open it and compute the checksum of the range.

The test now waits for the moment when all the checkpoint are known to be fully populated.

Fixes #81819
Epic: none
Release note: none

Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants