Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

4432 Fixed phrase search queries with § #4877

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

albertisfu
Copy link
Contributor

@albertisfu albertisfu commented Jan 1, 2025

This PR fixes #4432

The issue originated in the custom_word_delimiter_filter, which was splitting terms like §247 into 247 and §247. Initially, I removed custom_word_delimiter_filter from search_analyzer_exact as suggested in #4432.

However, this change caused docketNumber proximity queries to fail. For example, a query like 21-1234 could not match 1:21-bk-1234. The problem occurs because this query is rewritten as docketNumber:"21-1234"~1. Due to the custom_word_delimiter_filter, the - is treated as a separator, so internally ES transforms this query to docketNumber:"21 1234"~1, meaning a proximity query between tokens 21 and 1234 with a max distance of 1.

Meanwhile, due to this same filter, 1:21-bk-1234 is indexed as ["1:21-bk-1234", "1", "21", "bk", "1234"], allowing the query 21-1234 to match 1:21-bk-1234.

Removing this filter completely caused the docketNumber query to fail because it was internally interpreted as docketNumber:"21-1234"~1, which is a fuzzy query for a single term with max distance of 1. The problem is that the maximum edit distance allowed in fuzzy queries is 2, and matching 1:21-bk-1234 requires more than 2 changes.

Therefore, removing custom_word_delimiter_filter completely is not an option, either at search or indexing time. The alternative solution for the § issue is to add an exception so § is not considered a character to split tokens over. This can be done using the word_delimiter type_table parameter and mapping § as an Alphanum character to avoid splitting.

This change resolves the issue described in #4432 at search time. Queries like "§247" will match documents containing exactly §247 and ignore documents containing only 247.

However, I also tested the reverse case:

  • A query for "247" will still match currently indexed documents containing only §247 since these documents were split and indexed as ["§247", "247"].
  • New documents indexed after applying the custom_word_delimiter_filter change won't be split on §, but existing indexed documents would require a full re-index to fix such queries.

I also noticed that non-phrase queries like §247 still don't match documents containing §247 after the filter tweak. This was a different issue: within cleanup_main_query, terms are split on special characters (with some exceptions). For non-phrase queries, §247 was converted to § "247". To fix this, I added § as an exception in the cleanup_main_query regex.

More Exceptions to add?

  • Since we're relying on adding special terms as exceptions during splitting, are there other special terms common in legal documents that we should add to the exception list?

  • Would it make sense to avoid splitting on $ and %? For example, keeping $1000 or %10 as single terms for searching?

To apply this fix in production, the following filter modifications should be applied to each index:

  • opinion_index
  • oral_arguments_percolator_vectors
  • oral_arguments_vectors
  • people_vectors
  • recap_percolator
  • recap_vectors
POST
https://localhost:9200/{index_name}/_close
PUT
https://localhost:9200/{index_name}/_settings

{
   "settings":{
      "analysis":{
         "filter":{
            "custom_word_delimiter_filter":{
               "type":"word_delimiter",
               "type_table": [
                    "§ => ALPHANUM",
                    "$ => ALPHANUM",
                    "% => ALPHANUM",
                    "¶ => ALPHANUM",
                ],
               "split_on_numerics":false,
               "preserve_original":true
            }
         }
      }
   }
}
POST
https://localhost:9200/{index_name}/_open

@albertisfu albertisfu force-pushed the 4432-fix-search-query-with-section-mark branch from 5684f6a to 9b37bec Compare January 1, 2025 18:25
@albertisfu albertisfu marked this pull request as ready for review January 1, 2025 20:12
@albertisfu albertisfu requested a review from mlissner January 1, 2025 20:12
@mlissner
Copy link
Member

mlissner commented Jan 1, 2025

Fun, fun, fun. Sounds like you got the right solution.

Yes, let's add $ and % to the list too. I think the other one to add might be the pilcrow: thank you!

Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with the other couple of punctuation marks added (and tests as well).

Thanks!

@albertisfu
Copy link
Contributor Author

Great!
Added $ % ¶ as exceptions as well.
Tests have been updated.

Copy link
Contributor

@ERosendo ERosendo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, @albertisfu!.

@mlissner
Copy link
Member

mlissner commented Jan 3, 2025

Alberto, can you spin off an infrastructure issue for Diego to do on the index changes, please?

@albertisfu
Copy link
Contributor Author

Infrastructure Issue here:
https://github.com/freelawproject/infrastructure/issues/221

This PR can be merged without risk; however, the fix won't take effect until the infrastructure issue is resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In review
Development

Successfully merging this pull request may close these issues.

Can't search with "§" in query
3 participants