SOLR: anybody out there with (technical) experience? #327
GerHobbelt asked this question in Q&A (unanswered)
Googling is fun and reading manuals is swell, but I think it would be very handy indeed if someone could whisper in my ear how to go about it on my first try 😉, shortening my learning curve.
Qiqqa currently uses an antique Lucene.NET library and I intend to move Qiqqa to SOLR instead.
The idea being that:
SOLR, on the other hand, has had its ups & downs too (slow maintenance), but it has at least remained reasonably close to Lucene, which is the core bit that we need and which has not died yet, despite the ES marketing. SOLR has also picked up the pace quite a bit over the last few years -- I guesstimate that's in part due to the ES folks' shenanigans with their licensing.
I'm sure users will come up with search queries that I haven't ever thought of, and giving access to the search database might be very useful for some of you.
If you've set up, configured and/or tweaked your own SOLR instance(s), I'm looking for you!
There are plenty of details where I could use your advice:
I'm changing the file format for the 'extracted text' of any PDF anyway (as part of the migration from the old `pdfdraw -tt` output to the new mupdf `mutool` output) and would love to hear from folks who've fed stuff like HTML, hOCR (an HTML extension for OCR outputs) or similar file formats to SOLR for indexing.

There's also the Qiqqa metadata to consider: title, author, etc. As that collection will probably grow a little, including BibTeX-to-JSON translations, etc., the question becomes: how do I optimally feed that to SOLR so I can find my "document fingerprints" (which identify the document related to that text / metadata) when I go and search for, say, author only, or title and publisher only, and so on?
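To make that question a bit more concrete, here's a minimal sketch of the direction I'm thinking in: declare the metadata fields plus a full-text field through Solr's Schema API, then push one JSON document per PDF, keyed by the Qiqqa fingerprint. The core name `qiqqa`, the field names/types and the use of Python + `requests` are purely illustrative assumptions on my part, not a settled design:

```python
# Rough sketch only: core name "qiqqa", the field names/types and the example
# document are placeholders, not Qiqqa's actual schema.  Needs a running Solr
# instance and the third-party `requests` package.
import requests

SOLR = "http://localhost:8983/solr/qiqqa"

# 1. Declare the fields via Solr's Schema API (managed schema).
fields = [
    # intended as the unique key; the default schema uses "id", so that
    # would need changing (or the fingerprint gets copied into "id")
    {"name": "fingerprint", "type": "string",       "stored": True},
    {"name": "title",       "type": "text_general", "stored": True},
    {"name": "authors",     "type": "text_general", "stored": True, "multiValued": True},
    {"name": "publisher",   "type": "text_general", "stored": True},
    # stored so a highlighter can produce snippets from it later
    {"name": "fulltext",    "type": "text_general", "stored": True},
]
for f in fields:
    requests.post(f"{SOLR}/schema", json={"add-field": f}).raise_for_status()

# 2. Feed one JSON document per PDF, keyed by the Qiqqa fingerprint.
doc = {
    "fingerprint": "0123456789ABCDEF",   # hypothetical fingerprint
    "title": "An Example Paper",
    "authors": ["A. Author", "B. Author"],
    "publisher": "Example Press",
    "fulltext": "plain text extracted from the PDF ...",
}
requests.post(f"{SOLR}/update?commit=true", json=[doc]).raise_for_status()
```

With something along those lines in place, a metadata-only search (e.g. `q=authors:"..." AND publisher:"..."` with `fl=fingerprint`) would hand back exactly the fingerprints Qiqqa needs -- but I'd like to hear whether that is how experienced SOLR folks would actually lay it out.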
What's the size of this thing going to be, i.e. how many GBytes does it take when you feed it a GB of text to index? Any (rough) estimation rules? (My google-fu on that subject has been sub-par, I guess. Or is nobody bothering about storage space any more?)
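Failing a good rule of thumb, I suppose the honest answer is to measure: feed a known amount of text and then ask Solr's CoreAdmin API how big the index directory actually is. Again just a sketch, against the same hypothetical local core:

```python
# Sketch: ask Solr's CoreAdmin API how large the index directory actually is,
# so the GB-of-text-in vs. GB-on-disk ratio can be measured instead of guessed.
# Core name "qiqqa" is a placeholder; needs the `requests` package.
import requests

r = requests.get(
    "http://localhost:8983/solr/admin/cores",
    params={"action": "STATUS", "core": "qiqqa", "wt": "json"},
)
r.raise_for_status()
index = r.json()["status"]["qiqqa"]["index"]
print(index["numDocs"], index["sizeInBytes"], index["size"])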
Qiqqa currently only uses Lucene (Lucene.NET) to produce a hit for a search, i.e. the search has Lucene.NET spit out a list of document fingerprints that 'match' the current search; the search then continues with a second phase, where the cached 'extracted text' is loaded and matched against the search criteria to give you highlighting, etc. Nice, but this approach is right out once we allow SOLR advanced search queries and the like: I do not want to re-implement that search query parser, so I'll have to find a way to let Lucene/SOLR tell me exactly where each hit is on each page.
I know SOLR has highlighting output, but AFAICT only at the visual (HTML page output) level, where I would have to scrape and interpret that page to determine the spot within the document. If I want hit quality scores with that, the very limited info I found tells me that I'll have to write my own SOLR add-on to extract that sort of info in a "machine-readable fashion" -- which is what Qiqqa needs.
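For reference, this is roughly what I mean: a stock highlighting query (against the same placeholder core and fields as in the sketch above) returns per-document snippet strings with `<em>` markers, not positions:

```python
# Sketch: what stock Solr highlighting hands back -- snippet strings with
# <em>...</em> markers, keyed by document, but no character offsets into the
# page text.  Core/field names match the placeholder schema sketched above.
import requests

params = {
    "q": "fulltext:entropy",     # whatever the user typed
    "fl": "fingerprint",         # only the fingerprints are needed back
    "hl": "true",                # switch highlighting on
    "hl.fl": "fulltext",         # highlight inside the extracted text
    "hl.method": "unified",      # the unified highlighter (Solr 6.4+)
    "hl.snippets": 5,
    "wt": "json",
}
r = requests.get("http://localhost:8983/solr/qiqqa/select", params=params)
r.raise_for_status()
for key, fields in r.json()["highlighting"].items():
    # each value is a list of snippet strings; to pin down the exact spot on a
    # page I'd still have to match these back against the cached extracted text
    print(key, fields.get("fulltext", []))
```

Which is precisely the scraping-and-re-matching exercise I'd like to avoid -- hence the question about a machine-readable offsets/scores output.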
If you've done this sort of thing, I'd like a ping. (And if I can get it done without having to code or re-compile any Java, that'd be super! 🤡)
Anyhow. If you've got coding and/or admin experience with SOLR and are willing to share, teach or do, please holler 🛎️ 🛎️ 🛎️ 👍