Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tsearch: add ts_rank functionality #97697

Merged
merged 1 commit into from
Mar 22, 2023
Merged

Conversation

jordanlewis
Copy link
Member

Updates: #41288
Epic: CRDB-22357
All but the last commit are from #92966, #97677, and #97685

This commit adds ts_rank, the family of builtins that allow ranking of
text search results. The function takes a tsquery and a tsvector and
returns a float that indicates how good the match is. The function can
be modified by passing in a custom array of weights that matches the
text search weights A, B, C, and D, and a bitmask that controls the
ranking behavior in various detailed ways.

See the excellent Postgres documentation here for details:
https://www.postgresql.org/docs/current/textsearch-controls.html

Release note (sql change): add the ts_rank function for ranking text
search query results

@jordanlewis jordanlewis requested a review from a team February 27, 2023 05:51
@jordanlewis jordanlewis requested review from a team as code owners February 27, 2023 05:51
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jordanlewis jordanlewis requested review from rafiss and rytaft February 27, 2023 06:01
Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 9 of 9 files at r1, 2 of 2 files at r2, 25 of 25 files at r3, 14 of 14 files at r4, 9 of 9 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @jordanlewis, and @rafiss)


pkg/sql/sem/builtins/tsearch_builtins.go line 338 at r5 (raw file):

	ret := make([]float32, 4)
	if arr.Len() < len(ret) {
		return ret, pgerror.New(pgcode.ArraySubscript, "array of weight is too short")

do you want to give a hint that length 4 is expected?


pkg/util/tsearch/rank.go line 24 at r5 (raw file):

// Bitmask for the normalization integer. These define different ranking
// behaviors. They're defined in Postgres in tsrank.c.

I looked at that file, but it doesn't make things much clearer. Can you provide any high level comments about what these are?


pkg/util/tsearch/rank.go line 38 at r5 (raw file):

)

func cntLen(v TSVector) int {

nit: add a function comment


pkg/util/tsearch/rank.go line 55 at r5 (raw file):

// tsvector lexeme weights D, C, B, and A. The method parameter is a bitmask
// defining different ranking behaviors, defined in the rankBehavior type. See
// the Postgres documentation for more details.

is there a link?


pkg/util/tsearch/rank.go line 71 at r5 (raw file):

	}
	if res < 0 {
		res = 1e-20

nit: what does this constant mean?


pkg/util/tsearch/rank.go line 83 at r5 (raw file):

		}
	}
	// RANK_NORM_EXTDIST not applicable

why not?


pkg/util/tsearch/rank.go line 120 at r5 (raw file):

		return leafNodes[i].term.lexeme < leafNodes[j].term.lexeme
	})
	// Then distinct: (wouldn't it be nice if Go had generics?)

It does now... does that change anything here?


pkg/util/tsearch/rank.go line 135 at r5 (raw file):

}

func findRankMatches(

nit: add a function comment. What are the parameters?


pkg/util/tsearch/rank.go line 158 at r5 (raw file):

}

func rankOr(weights [4]float32, v TSVector, q TSQuery) float32 {

add a function comment


pkg/util/tsearch/rank.go line 187 at r5 (raw file):

		// wi - should be sorted desc,
		// don't sort for now, just choose maximum weight. This should be corrected
		// Oleg Bartunov

Does this make sense to you? I'm scratching my head...


pkg/util/tsearch/rank.go line 196 at r5 (raw file):

}

func rankAnd(weights [4]float32, v TSVector, q TSQuery) float32 {

add a function comment


pkg/sql/logictest/testdata/logic_test/tsvector line 323 at r5 (raw file):

----
'ari':6 'base':3,13 'concept':17 'data':12,21 'form':10 'introduc':24 'model':2 'n':5 'normal':9 'relat':7,14 'sublanguag':22 'univers':20                 0.075990885
'2':3 'appli':15 'certain':4 'consist':22 'discuss':13 'infer':11 'logic':10 'model':27 'oper':5 'problem':18 'redund':20 'relat':7 'section':2 'user':25  0.06079271

How about testing the other overloads of ts_rank?

@jordanlewis jordanlewis force-pushed the ts_rank branch 2 times, most recently from 7ffd880 to 1c11d5c Compare March 15, 2023 04:51
@jordanlewis jordanlewis requested a review from rytaft March 15, 2023 04:52
Copy link
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @rafiss, and @rytaft)


pkg/sql/sem/builtins/tsearch_builtins.go line 338 at r5 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

do you want to give a hint that length 4 is expected?

I'm copying the inadequate Postgres message here. I suppose we can do better, it probably won't break compatibility... Done


pkg/util/tsearch/rank.go line 24 at r5 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I looked at that file, but it doesn't make things much clearer. Can you provide any high level comments about what these are?

Done.


pkg/util/tsearch/rank.go line 38 at r5 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: add a function comment

Done.


pkg/util/tsearch/rank.go line 55 at r5 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

is there a link?

Done.


pkg/util/tsearch/rank.go line 71 at r5 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: what does this constant mean?

I have no idea, it's not documented in the Postgres source. I copied this verbatim, I really have no idea what it does. I'll clarify this fact in the code, I wish I had something better to say here.


pkg/util/tsearch/rank.go line 83 at r5 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

why not?

Done.


pkg/util/tsearch/rank.go line 120 at r5 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

It does now... does that change anything here?

We still don't have access to x/exp/slices at this Go version which implements slices.Compact, once that's added we can replace a whole bunch of copypasta that looks just like this throughout the codebase. I'll file an issue.


pkg/util/tsearch/rank.go line 158 at r5 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

add a function comment

Done.


pkg/util/tsearch/rank.go line 187 at r5 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Does this make sense to you? I'm scratching my head...

No, I unfortunately have no idea what this means. I figured I'd just include it in case it helps somebody, lol.


pkg/util/tsearch/rank.go line 196 at r5 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

add a function comment

Done.


pkg/sql/logictest/testdata/logic_test/tsvector line 323 at r5 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

How about testing the other overloads of ts_rank?

Done.

Copy link
Collaborator

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Only have a couple nits.

Reviewed 1 of 2 files at r2, 42 of 42 files at r6, 31 of 31 files at r7, 16 of 16 files at r8, 9 of 10 files at r9, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis, @rafiss, and @rytaft)


pkg/util/tsearch/tsvector.go line 339 at r9 (raw file):

// routines like to_tsvector and to_tsquery.
// It can return false in the second parameter to indicate a stopword was found.
func TSLexize(config string, token string) (lexeme string, notAStopWord bool, err error) {

[super nit] The negative in notAStopWord feels confusing, why not just stopWord and return false by default?


pkg/util/tsearch/tsvector.go line 376 at r9 (raw file):

		pos := i + 1
		if i > maxTSVectorPosition {
			pos = maxTSVectorPosition

[nit] Could you add a comment

// Postgres silently truncates positions larger than 16383 to 16383.

to mirror tsVectorLexer.lex?

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 42 of 42 files at r6, 31 of 31 files at r7, 16 of 16 files at r8, 5 of 10 files at r9, 5 of 5 files at r10, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis and @rafiss)

Copy link
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @DrewKimball, @rafiss, and @rytaft)


pkg/util/tsearch/tsvector.go line 339 at r9 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

[super nit] The negative in notAStopWord feels confusing, why not just stopWord and return false by default?

Done.


pkg/util/tsearch/tsvector.go line 376 at r9 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

[nit] Could you add a comment

// Postgres silently truncates positions larger than 16383 to 16383.

to mirror tsVectorLexer.lex?

Done.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 38 of 38 files at r11, 33 of 33 files at r12, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @DrewKimball and @rafiss)

This commit adds ts_rank, the family of builtins that allow ranking of
text search results. The function takes a tsquery and a tsvector and
returns a float that indicates how good the match is. The function can
be modified by passing in a custom array of weights that matches the
text search weights A, B, C, and D, and a bitmask that controls the
ranking behavior in various detailed ways.

See the excellent Postgres documentation here for details:
https://www.postgresql.org/docs/current/textsearch-controls.html

Release note (sql change): add the ts_rank function for ranking text
search query results
@jordanlewis
Copy link
Member Author

TFTRs!

bors r+

@jordanlewis jordanlewis added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Mar 22, 2023
@craig
Copy link
Contributor

craig bot commented Mar 22, 2023

Build succeeded:

@craig craig bot merged commit f592975 into cockroachdb:master Mar 22, 2023
@blathers-crl
Copy link

blathers-crl bot commented Mar 22, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from f1ce8cf to blathers/backport-release-23.1-97697: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

jordanlewis added a commit to jordanlewis/cockroach that referenced this pull request Mar 23, 2023
Inadvertent leftover from cockroachdb#97697.

Release note: None
craig bot pushed a commit that referenced this pull request Mar 23, 2023
99089: backup: use locality filter to restrict processes to locality r=dt a=dt

This extends the previous locality filter option that pinned the backup job to a specific
locality filter to also pin the processors which backup the actual row data as well, allowing
assignment of a backup job to only the instances running in a particular locality.

Those instances will attempt to read from the closest replica they can, however a potential
optimization would be to pick an instace closest to a replica that also matches the locality
filter if there is one -- currently it will likely just pick the instance closest to the
leaseholder which matches the filter, however there may be an instance closer to a usable
replica that is not the leaseholder. To make use of these replicas, a replica oracle that
ranks replicas matching the filter would need to be passed when locality filtering placement
is in-use; this optimization is left to a follow-up.

Release note (sql change): the nodes involved in the execution of backups, including processing of row data and the job coordination, can now be controlled using an execution-locality filter.
Epic: CRDB-9547.




99324: tsearch: remove commented out code r=jordanlewis a=jordanlewis

Inadvertent leftover from #97697.

Epic: None
Release note: None

Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Jordan Lewis <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants