[Spike] Compile examples of Lucene queries based on filtering params #30911

Closed
Tracked by #30495
jcastro-dotcms opened this issue Dec 10, 2024 · 1 comment

Comments

@jcastro-dotcms
Contributor

Parent Issue

Task

After meeting with @oidacra and @nicobytes, we agreed that I would provide a list of sample Lucene queries that must be sent to the back-end based on the different User Searchable fields that can be displayed when looking for related content in the Relationships Field.

Proposed Objective

Core Features

Proposed Priority

Priority 2 - Important

Acceptance Criteria

The value of almost every type of User Searchable field must be carefully formatted in order to be passed down to a Lucene query. This must be taken into consideration for dotCMS to correctly return the expected data.

External Links... Slack Conversations, Support Tickets, Figma Designs, etc.

No response

Assumptions & Initiation Needs

No response

Quality Assurance Notes & Workarounds

QA is NOT needed.

Sub-Tasks & Estimates

No response

@jcastro-dotcms
Contributor Author

jcastro-dotcms commented Jan 8, 2025

Basic Rules in the Search functionality

  • IMPORTANT: The Java code that handles the content retrieval in the Search portlet and the Relationships field dialog is the same. It's also extremely generic, and handles lots of cases with specific input parameters and value types:

public List searchContentletsByUser(List<BaseContentType> types, String structureInode,
        List<String> fields, List<String> categories, boolean showDeleted, boolean filterSystemHost,
        boolean filterUnpublish, boolean filterLocked, int page, String orderBy, int perPage,
        final User currentUser, HttpSession sess, String modDateFrom, String modDateTo, final String variantName)
        throws DotStateException, DotDataException, DotSecurityException {

  • When you select a specific Site, the dialog will ALWAYS add the System Host to the Lucene query as well.
  • When looking for all languages or all Sites, the dialog will just NOT add those search terms to the Lucene query. This makes the API look for content everywhere and/or in every language.
  • In the Search portlet, using the global search field -- at the top of the portlet -- automatically changes the Lucene query to exclude the following Content Types:
    • All the ones marked as System types.
    • Forms.
    • Host.
  • There are specific character escaping operations depending on specific field types. For instance, for metadata fields, we're escaping a specific set of characters.
String specialCharsToEscapeForMetaData = "([+\\-!\\(\\){}\\[\\]^\"~?:/\\\\]{2})";

But for fields that are NOT of type text, system, date, or language ID, or when the "catchall" term is used in the query, we escape their values with a different set:

String specialCharsToEscape = "([+\\-!\\(\\){}\\[\\]^\"~*?:\\\\]|[&\\|]{2})";
  • The Search portlet exposes options that are NOT available in the Relationships field dialog, such as:
    • Filter by a selectable Content Type.
    • Exclude contents living under System Host.
    • Filter by Workflow and Workflow Step.
    • Filter by All, Locked, Unpublished, and Archived.
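As a minimal sketch of how the two escaping patterns quoted above behave, the snippet below applies each one with String.replaceAll and prefixes every match with a backslash. The class name, the method name, and the backslash-prefix replacement are assumptions made for illustration; the issue only gives the patterns, not the code that applies them.

```java
final class LuceneEscapeSketch {

    // Patterns copied verbatim from the issue text.
    static final String SPECIAL_CHARS_TO_ESCAPE_FOR_METADATA =
            "([+\\-!\\(\\){}\\[\\]^\"~?:/\\\\]{2})";
    static final String SPECIAL_CHARS_TO_ESCAPE =
            "([+\\-!\\(\\){}\\[\\]^\"~*?:\\\\]|[&\\|]{2})";

    // Assumption: every match is kept and prefixed with one backslash.
    static String escape(String value, String pattern) {
        return value.replaceAll(pattern, "\\\\$1");
    }

    public static void main(String[] args) {
        // Non-metadata set: each single special character is escaped.
        System.out.println(escape("C++ (intro)", SPECIAL_CHARS_TO_ESCAPE));
        // Metadata set: only runs of two special characters are matched.
        System.out.println(escape("key::value", SPECIAL_CHARS_TO_ESCAPE_FOR_METADATA));
    }
}
```

Note that the metadata pattern ends in {2}, so a single special character such as the colon in a:b is left untouched, while the other pattern escapes single characters and treats && and || as pairs.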

Always-Visible fields in the Relationships dynamic dialog

  • Search: Performs both a 'catchall' and a title search. This is a global search and the BE code adds specific score values to the query:
+contentType:MyTestCT +catchall:global*  title:'global'^15  title_dotraw:*global*^5 +languageId:1 +deleted:false  +working:true +variant:default

With multiple words, it'd be:

+contentType:MyTestCT +catchall:global search*  title:'global search'^15  title:global^5  title:search^5  title_dotraw:*global search*^5 +languageId:1 +deleted:false  +working:true +variant:default
  • Site or Folder: Cannot be flagged as User Searchable, but it always shows up in the UI:
+contentType:MyTestCT +languageId:1 +(conhost:1b70add9497b4d3f6e61c1de16b3cd04 conhost:SYSTEM_HOST) +deleted:false  +working:true +variant:default

When looking in all Sites, the term is just not included in the query so that all contents can be retrieved.

  • Language: It's included like any other Lucene term:
+contentType:MyTestCT +languageId:1  +deleted:false  +working:true +variant:default

When looking in all languages, just don't include the term in the query.
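The rules for the three always-visible fields can be sketched as one small builder. This is only an illustration under assumptions: the class and method names are hypothetical, terms are joined with single spaces, the term order is arbitrary, and no character escaping is applied. A null site or language stands for "all Sites" or "all languages" and simply leaves the term out.

```java
import java.util.ArrayList;
import java.util.List;

final class RelationshipDialogQuerySketch {

    // Terms for the Search box: catchall plus boosted title terms.
    // Per-word title terms are only added when more than one word is typed.
    static String globalSearchTerms(String text) {
        String trimmed = text.trim();
        String[] words = trimmed.split("\\s+");
        StringBuilder sb = new StringBuilder();
        sb.append("+catchall:").append(trimmed).append("* ");
        sb.append("title:'").append(trimmed).append("'^15 ");
        if (words.length > 1) {
            for (String word : words) {
                sb.append("title:").append(word).append("^5 ");
            }
        }
        sb.append("title_dotraw:*").append(trimmed).append("*^5");
        return sb.toString();
    }

    // search, siteId and languageId are optional; null means "do not filter".
    static String build(String contentType, String search, String siteId, Long languageId) {
        List<String> terms = new ArrayList<>();
        terms.add("+contentType:" + contentType);
        if (search != null && !search.trim().isEmpty()) {
            terms.add(globalSearchTerms(search));
        }
        if (siteId != null) {
            // A selected Site always brings the System Host along.
            terms.add("+(conhost:" + siteId + " conhost:SYSTEM_HOST)");
        }
        if (languageId != null) {
            terms.add("+languageId:" + languageId);
        }
        terms.add("+deleted:false");
        terms.add("+working:true");
        terms.add("+variant:default");
        return String.join(" ", terms);
    }
}
```

For example, build("MyTestCT", "global", null, 1L) yields the single-word search query shown above, with single spaces between terms.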

User Searchable Fields that can be added to the dynamic dialog

  • Binary: Looks for part of or the whole file name. For instance, if your file name is test-file.txt, you can pass it down like this:
+contentType:MyTestCT +languageId:1 +MyTestCT.binary:filename.txt  +deleted:false  +working:true +variant:default

However, the part of the file name right before the extension is not being indexed correctly. You can match the file by typing test, but NOT by typing file. You need to type file.txt for the Lucene query to match it.

SUGGESTION: The query should include a wildcard at the end, at least. For instance, when looking for a filename such as password-for-wifi.jpeg the query will match it when set like this:

+contentType:MyTestCT +(conhost:48190c8c-42c4-46af-8d1a-0cd5db894797 conhost:SYSTEM_HOST) +MyTestCT.binary:file*  +languageId:1 +deleted:false  +working:true +variant:default

This code change is NOT present yet. It'd need to be added.
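A minimal sketch of the suggested change, assuming a hypothetical helper that appends the trailing wildcard when the user has not typed one. The names are made up for illustration and no character escaping is applied here.

```java
final class BinaryTermSketch {

    // Builds the Binary field term and makes sure it ends with a wildcard.
    static String binaryTerm(String contentTypeVar, String fieldVar, String typed) {
        String value = typed.trim();
        if (!value.endsWith("*")) {
            value = value + "*";
        }
        return "+" + contentTypeVar + "." + fieldVar + ":" + value;
    }
}
```

With this helper, binaryTerm("MyTestCT", "binary", "file") returns +MyTestCT.binary:file*.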

  • Block Editor: Generates a simple query:
+contentType:MyTestCT +(MyTestCT.blockEditor:*content*  MyTestCT.blockEditor_dotraw:*content*) +languageId:1 +deleted:false  +working:true +variant:default

When looking for more words, such as "Content 2", the query includes them as separate terms:

+contentType:MyTestCT +(MyTestCT.blockEditor:*content*  MyTestCT.blockEditor_dotraw:*content*) +(MyTestCT.blockEditor:*2*  MyTestCT.blockEditor_dotraw:*2*) +languageId:1 +deleted:false  +working:true +variant:default
  • Category: WARNING: For some reason, the UI displays this at the bottom of the field list! It matches contents that have at least one selected Category. When you select one, it looks like this:
+contentType:MyTestCT +languageId:1 +(categories:mens) +deleted:false  +working:true +variant:default

For more than one, it just adds it like this:

+contentType:MyTestCT +languageId:1 +(categories:mens categories:boys) +deleted:false  +working:true +variant:default
  • Checkbox: WARNING: This one ONLY works in the Search portlet. In the Relationships search dialog, the query looks like this:
+contentType:MyTestCT +languageId:1 +.checkbox0.822209772531758:2* +deleted:false  +working:true +variant:default

Notice the way the checkbox term is added to the query, which is wrong. It must look like this:

+contentType:MyTestCT +(conhost:48190c8c-42c4-46af-8d1a-0cd5db894797 conhost:SYSTEM_HOST) +(MyTestCT.checkbox:*2*  MyTestCT.checkbox_dotraw:*2*) +languageId:1 +deleted:false  +working:true +variant:default

The term should be {contentTypeVarName}.{fieldVarName}, and it should include the _dotraw version as well. For multiple values, the Lucene query just includes the same terms for each of them:

+contentType:MyTestCT +(conhost:48190c8c-42c4-46af-8d1a-0cd5db894797 conhost:SYSTEM_HOST) +(MyTestCT.checkbox:*1*  MyTestCT.checkbox_dotraw:*1*) +(MyTestCT.checkbox:*2*  MyTestCT.checkbox_dotraw:*2*) +languageId:1 +deleted:false  +working:true +variant:default
  • Constant: This field is NOT user searchable.

  • Custom: Generates a Lucene query that appends the wildcard at the beginning and end of the word, and searches for the _dotraw as well:

+contentType:MyTestCT +(conhost:48190c8c-42c4-46af-8d1a-0cd5db894797 conhost:SYSTEM_HOST) +(MyTestCT.custom:*fren*  MyTestCT.custom_dotraw:*fren*) +languageId:1 +deleted:false  +working:true +variant:default
  • Date: You just select the date from the UI widget. The query must ALWAYS repeat the selected date as both ends of a range, because it needs the range to match the data:
+contentType:MyTestCT +languageId:1 +MyTestCT.date:[01/07/2025 TO 01/07/2025] +deleted:false  +working:true +variant:default
  • Date and Time: This works just like the Date field. However, you cannot select the time in the UI:
+contentType:MyTestCT +languageId:1 +MyTestCT.dateAndTime:[01/07/2025 TO 01/07/2025] +deleted:false  +working:true +variant:default

Outside the UI, you can pass down the time using, for instance, a 24-hour format. So, when looking for a content with a value of "1:00PM", you can match it using a date and time range like this one:

+contentType:MyTestCT +languageId:1 +MyTestCT.dateAndTime:[01/06/2025 12:00:00 TO 01/06/2025 13:50:00] +deleted:false  +working:true +variant:default

You can use any of the following formats for specifying both date and time: yyyyMMddHHmmss, MM/dd/yyyy hh:mm:ss[AM|PM], or MM/dd/yyyy HH:mm:ss.
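The range terms for Date and Date and Time fields can be produced with java.time, as sketched below. The class and method names are hypothetical, and only the MM/dd/yyyy and MM/dd/yyyy HH:mm:ss formats from the list above are covered.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class DateRangeTermSketch {

    private static final DateTimeFormatter DATE =
            DateTimeFormatter.ofPattern("MM/dd/yyyy");
    private static final DateTimeFormatter DATE_TIME =
            DateTimeFormatter.ofPattern("MM/dd/yyyy HH:mm:ss");

    // A single day is sent as a range whose two ends are the same date.
    static String dateTerm(String fieldPath, LocalDate day) {
        String formatted = DATE.format(day);
        return "+" + fieldPath + ":[" + formatted + " TO " + formatted + "]";
    }

    // A date-and-time range using the 24-hour format.
    static String dateTimeTerm(String fieldPath, LocalDateTime from, LocalDateTime to) {
        return "+" + fieldPath + ":[" + DATE_TIME.format(from)
                + " TO " + DATE_TIME.format(to) + "]";
    }
}
```

Here fieldPath is the {contentTypeVarName}.{fieldVarName} pair, for instance MyTestCT.dateAndTime.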

  • File: This field is NOT searchable, and NOT showing up in the UI.

  • Hidden: This field is NOT searchable, and NOT showing up in the UI.

  • Image: This field is NOT searchable, and NOT showing up in the UI.

  • JSON: It matches any part of the JSON value. It splits the words you type into separate search terms, so if you look for "content", it looks like this:

+contentType:MyTestCT +languageId:1 +(MyTestCT.json:*content*  MyTestCT.json_dotraw:*content*) +deleted:false  +working:true +variant:default

But if you look for "Content 2", it looks like this:

+contentType:MyTestCT +languageId:1 +(MyTestCT.json:*content*  MyTestCT.json_dotraw:*content*) +(MyTestCT.json:*2*  MyTestCT.json_dotraw:*2*) +deleted:false  +working:true +variant:default
  • Key/value: The search matches either the key or the value. However, the Lucene query generated from the Relationships search dialog is wrong:
+contentType:MyTestCT +languageId:1 +MyTestCT.keyValue.:*first* +deleted:false  +working:true +variant:default

Notice that the term +MyTestCT.keyValue. ends with a period and nothing after it. The String "key_value" must go after that period. Also, if you look for more than one word, it must look like this:

+contentType:MyTestCT +languageId:1 +(MyTestCT.keyValue.key_value:*the* MyTestCT.keyValue.key_value:*first*)
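The corrected Key/value term can be sketched as follows. The helper name is hypothetical, the input is assumed to be non-empty, and wrapping a single word in the same parenthesized group is an assumption made to keep the code uniform.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

final class KeyValueTermSketch {

    // Appends the "key_value" suffix and emits one wildcard term per word.
    static String keyValueTerm(String contentTypeVar, String fieldVar, String typed) {
        String field = contentTypeVar + "." + fieldVar + ".key_value";
        return Arrays.stream(typed.trim().split("\\s+"))
                .map(word -> field + ":*" + word + "*")
                .collect(Collectors.joining(" ", "+(", ")"));
    }
}
```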
  • Multi-select: Matches the value of the selected option, always looking for both the field and its _dotraw version:
+contentType:MyTestCT +languageId:1 +(MyTestCT.multiSelect:*1*  MyTestCT.multiSelect_dotraw:*1*) +deleted:false  +working:true +variant:default

If you select more than one entry, the query looks like this:

+contentType:MyTestCT +languageId:1 +(MyTestCT.multiSelect:*1*  MyTestCT.multiSelect_dotraw:*1*) +(MyTestCT.multiSelect:*2*  MyTestCT.multiSelect_dotraw:*2*) +deleted:false  +working:true +variant:default
  • Radio: Matches the value of the selected option, always looking for both the field and its _dotraw version:
+contentType:MyTestCT +languageId:1 +(MyTestCT.radio:*2*  MyTestCT.radio_dotraw:*2*) +deleted:false  +working:true +variant:default
  • Relationships: This dropdown lets you look up the related content by its title, or type in the Identifier of the content that is related to the one you want to find. That is, this field takes the child content of the relationship.

IMPORTANT: The dropdown ONLY shows up in the Search portlet, not in the Relationships field search dialog.

The backend takes such an Identifier and retrieves the parents that include it. Let's consider the following example:

You have the following Identifiers:

  • Parent 1 with ID 34c69545258d89ba9237abf783f96db9
  • Child 1 with ID 79d278f559e40dbc3fc6f2948b2031f2

In the Relationships dialog, you want to look for contents that are referencing the child content 79d278f559e40dbc3fc6f2948b2031f2, so you enter such a value. Internally, the current DWR class will use it to look for all Contentlets that are referencing it, and will produce a Lucene query like this one:

+contentType:MyTestCT +languageId:1 +deleted:false  +working:true +variant:default +identifier:(34c69545258d89ba9237abf783f96db9) 

If you have one more parent Contentlet, say Parent 2 with ID cef5d3f3c53de3afa7eb0e263e290f66, the Lucene query will be updated to include both parents, and will look like this:

+contentType:MyTestCT +languageId:1 +deleted:false  +working:true +variant:default +identifier:(34c69545258d89ba9237abf783f96db9 OR cef5d3f3c53de3afa7eb0e263e290f66) 

This is something that might not be doable from the Angular layer, so we might need to have a new REST Endpoint for it.
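Once the parent Identifiers are known, the last part of the query is simple to assemble. The sketch below covers only that step; how the parents are resolved from the child Identifier is left to the back-end, and the class and method names are hypothetical. What should happen when no parent references the child is not described here, so the sketch just rejects an empty list.

```java
import java.util.List;

final class RelatedParentsTermSketch {

    // Joins the parent Identifiers with OR inside one identifier clause.
    static String identifierTerm(List<String> parentIds) {
        if (parentIds.isEmpty()) {
            // Assumption: the behavior for "no parents" is not specified.
            throw new IllegalArgumentException("At least one parent Identifier is required");
        }
        return "+identifier:(" + String.join(" OR ", parentIds) + ")";
    }
}
```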

  • Select: Matches the value of the selected option, always looking for both the field and its _dotraw version:
+contentType:MyTestCT +languageId:1 +(MyTestCT.select:*2* MyTestCT.select_dotraw:*2*) +deleted:false +working:true +variant:default

IMPORTANT: The UI is automatically adding a "None" option to the select. This is also happening in the Search portlet, but the "None" word is not showing up.

  • Tag: It takes the whole name of the Tag, NO wildcards allowed:
+contentType:MyTestCT +languageId:1 +MyTestCT.tag:"beach" +deleted:false +working:true +variant:default

For more than one Tag, the term is repeated like this:

+contentType:MyTestCT +MyTestCT.tag:"beach" +MyTestCT.tag:"snow sports" +variant:default

IMPORTANT: The Search portlet provides a Tag predictor after typing 3 characters. This is NOT happening in the Relationships search dialog.
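The Tag terms can be sketched as one quoted, required term per tag. The helper name is hypothetical, and tag names are assumed not to contain double quotes.

```java
import java.util.List;
import java.util.stream.Collectors;

final class TagTermSketch {

    // Each tag becomes its own required, quoted term; no wildcards are added.
    static String tagTerms(String contentTypeVar, String fieldVar, List<String> tags) {
        return tags.stream()
                .map(tag -> "+" + contentTypeVar + "." + fieldVar + ":\"" + tag + "\"")
                .collect(Collectors.joining(" "));
    }
}
```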

  • Text: This is a usual text search. When looking for one word such as "content":
+contentType:MyTestCT +languageId:1 +(MyTestCT.title:*content*  MyTestCT.title_dotraw:*content*) +deleted:false  +working:true +variant:default

When looking for "content 1", it just repeats the search terms:

+contentType:MyTestCT +languageId:1 +(MyTestCT.title:*content*  MyTestCT.title_dotraw:*content*) +(MyTestCT.title:*1*  MyTestCT.title_dotraw:*1*) +deleted:false  +working:true +variant:default
  • Text Area: Works the same as the Text field, repeating the search terms when looking for more than 1 word, like this:
+contentType:MyTestCT +(MyTestCT.textArea:*area* MyTestCT.textArea_dotraw:*area*) +(MyTestCT.textArea:*2* MyTestCT.textArea_dotraw:*2*) +variant:default

  • Time: Takes a time value, such as 1:00PM or 12:45PM. The times are set as ranges, like the Date fields:

+contentType:MyTestCT +languageId:1 +MyTestCT.time:[1:15PM TO 1:15PM] +deleted:false +working:true +variant:default

It also takes time ranges. So, if you're looking for a content with "time" between 1:13PM and 1:16PM, the query will look like this:

+contentType:MyTestCT +languageId:1 +MyTestCT.time:[1:13PM TO 1:16PM] +deleted:false  +working:true +variant:default

The range uses square brackets, with the word to or TO between the two values.
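A minimal sketch of the Time range term, assuming the 12-hour h:mmAM/PM shape used in the examples above. The class and method names are hypothetical; the time is formatted by hand so the AM/PM marker does not depend on the JVM locale.

```java
import java.time.LocalTime;
import java.util.Locale;

final class TimeRangeTermSketch {

    // Formats a time as, for example, 1:15PM or 12:05AM.
    static String format(LocalTime time) {
        int hour12 = time.getHour() % 12 == 0 ? 12 : time.getHour() % 12;
        String marker = time.getHour() < 12 ? "AM" : "PM";
        return String.format(Locale.ROOT, "%d:%02d%s", hour12, time.getMinute(), marker);
    }

    // A single time is sent as a range with identical ends.
    static String timeTerm(String fieldPath, LocalTime from, LocalTime to) {
        return "+" + fieldPath + ":[" + format(from) + " TO " + format(to) + "]";
    }
}
```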

  • WYSIWYG: It matches any part of the content in the field. Just like the Text and Text Area fields, it splits several words into several search terms. When looking for Content 2, it looks like this:

+contentType:MyTestCT +(MyTestCT.wysiwyg:*content* MyTestCT.wysiwyg_dotraw:*content*) +(MyTestCT.wysiwyg:*2* MyTestCT.wysiwyg_dotraw:*2*) +deleted:false +working:true +variant:default
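Most of the fields above (Block Editor, Checkbox, Custom, JSON, Multi-select, Radio, Select, Text, Text Area, WYSIWYG) share one pattern: each word becomes a required group that checks both the field and its _dotraw version with wildcards on both sides. A sketch of that pattern follows. The names are hypothetical, lowercasing is inferred from the "Content 2" examples rather than confirmed, and no character escaping is applied.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.stream.Collectors;

final class WildcardFieldTermSketch {

    // One "+(field:*word* field_dotraw:*word*)" group per typed word.
    static String wildcardTerms(String contentTypeVar, String fieldVar, String typed) {
        String field = contentTypeVar + "." + fieldVar;
        return Arrays.stream(typed.trim().toLowerCase(Locale.ROOT).split("\\s+"))
                .map(word -> "+(" + field + ":*" + word + "* "
                        + field + "_dotraw:*" + word + "*)")
                .collect(Collectors.joining(" "));
    }
}
```

The same helper would produce the corrected Checkbox term when called with the Content Type and field variable names, for example wildcardTerms("MyTestCT", "checkbox", "2").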
