doubt about mangleSearchableText/isSimpleSearch functions #181

Open
mamico opened this issue Nov 24, 2017 · 4 comments

Comments

@mamico
Contributor

mamico commented Nov 24, 2017

here:

https://github.com/collective/collective.solr/blob/master/src/collective/solr/mangler.py#L74

mangleSearchableText returns the raw SearchableText value if isSimpleSearch is not True, and that happens when the value ends with a digit:

https://github.com/collective/collective.solr/blob/master/src/collective/solr/utils.py#L116

So if you search for "something about Plone5", the "mangle" machinery behaves very differently (in a wrong way, IMHO) than it does for "something about Plone".
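
To make the difference concrete, here is a minimal sketch of the behavior described above. It is not the actual collective.solr code: `is_simple_search_sketch` and `mangle_sketch` are hypothetical stand-ins, the word-only regex is an assumption, and the real mangler may build a more elaborate expression than a plain trailing `*`.

```python
import re

# Assumption: a "simple" search consists only of word characters and spaces.
_WORDS_ONLY = re.compile(r"^[\w\s]+$")


def is_simple_search_sketch(term):
    """Hypothetical stand-in for isSimpleSearch, including the digit check."""
    term = term.strip()
    if not term or not _WORDS_ONLY.match(term):
        return False
    # The check in question: a value ending with a digit is not "simple".
    return not term[-1].isdigit()


def mangle_sketch(term):
    """Hypothetical stand-in for mangleSearchableText."""
    if not is_simple_search_sketch(term):
        # Non-simple searches are passed through as the raw value.
        return term
    # Simple searches get an automatic wildcard on every word.
    return " ".join(word + "*" for word in term.split())


print(mangle_sketch("something about Plone"))
print(mangle_sketch("something about Plone5"))
```

Under these assumptions the first query gets a wildcard on every word, while the second is passed through untouched only because of its trailing `5`.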

@mamico
Contributor Author

mamico commented Dec 7, 2017

@hannosch this is the commit where you introduced the check against searches ending with numbers:

4bbc1a7

do you have a real use case for your change?

@hannosch

hannosch commented Dec 8, 2017

I’ve tried to dig into the history of this. IIRC I introduced this because searches for something like Plone5 got translated into Plone 5*. The default schema at the time was splitting Plone5 into two separate terms, Plone and the number 5 on its own. And at query time the isSimpleSearch function was used to decide whether or not to automatically append the * and turn the query into a wildcard query.

This change was a workaround so single words containing numbers weren’t considered simple anymore and as a result didn’t get the automatic wildcard treatment. Instead they were used literally as-is. At least at the time of this change, a Solr query for 5* resulted in a Solr error. The commit message at least hints at this.

Looking at the history of the default schema generated by collective.recipe.solr, I made some related changes a couple of months after this workaround. Of specific note are collective/collective.recipe.solrinstance@a8361b0#diff-e9704fdb9716ae57883502dd0b393d72 and collective/collective.recipe.solrinstance@5a57413#diff-e9704fdb9716ae57883502dd0b393d72.

Before those changes the text field used the WordDelimiterFilterFactory without specifying a value for its splitOnNumerics attribute. The default value for this is 1, meaning true. This is what led to the splitting of Plone5 into two separate terms.

After those two commits, the WordDelimiterFilterFactory is used again, but with an explicit splitOnNumerics=0 - meaning it no longer splits on numbers.
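
As a rough illustration of the effect described above, the following toy model mimics only the letter/digit split, not the real WordDelimiterFilterFactory (which does much more). The function names are hypothetical, and appending `*` to the last part is an assumption based on the "Plone 5*" translation mentioned earlier.

```python
import re


def split_on_numerics_sketch(token, split_on_numerics=True):
    """Toy model: split a token at letter/digit boundaries when enabled."""
    if not split_on_numerics:
        # splitOnNumerics=0: the token is kept as a single term.
        return [token]
    # splitOnNumerics=1: runs of digits and runs of non-digits become terms.
    return re.findall(r"\d+|\D+", token)


def wildcard_after_split_sketch(token, split_on_numerics=True):
    """Assumed query shape once the automatic wildcard is appended."""
    parts = split_on_numerics_sketch(token, split_on_numerics)
    parts[-1] += "*"
    return " ".join(parts)


print(wildcard_after_split_sketch("Plone5", split_on_numerics=True))
print(wildcard_after_split_sketch("Plone5", split_on_numerics=False))
```

With splitting enabled the wildcard ends up on the bare number, which is the problematic 5* case; with splitting disabled the term stays whole.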

With all this said, I think the real fix here was the schema change. The workaround in isSimpleSearch can probably be removed again. But I’m not involved in this project anymore, so I can’t make a decision on that.

As a side note the automatic wildcard treatment for simple searches was always questionable. It was behavior we preserved from the old ZCTextIndex implementation. But I cannot say whether or not this actually matches user expectations or if there isn’t a better way to deal with this in modern Solr.

@mamico
Contributor Author

mamico commented Dec 17, 2017

@tisto what is your opinion on this?

@tisto
Member

tisto commented Dec 18, 2017

@mamico my feeling is that this is something we could solve in a better way at the Solr level. However, I currently don't have time to dig deeper into it.
