SOLR: anybody out there with (technical) experience? #327
GerHobbelt asked this question in Q&A (unanswered)
Googling is fun and reading manuals is swell, but I think it would be very handy indeed if someone could whisper in my ear how to go about it on my first try 😉, shortening my learning curve.
Qiqqa currently uses an antique Lucene.NET library and I intend to move Qiqqa to SOLR instead.
The idea being that:
SOLR, on the other hand, has had its ups & downs too (slow maintenance), but it has at least remained reasonably close to Lucene, which is the core bit that we need and which has not died yet, despite the ES marketing. SOLR has also picked up the pace quite a bit over the last few years -- I guesstimate that's in part due to the ES folks' shenanigans with their licensing.
I'm sure users will come up with search queries that I haven't ever thought of, and giving access to the search database might be very useful for some of you.
If you've set up, configured and/or tweaked your own SOLR instance(s), I'm looking for you!
There are plenty of details where I could use your advice:
I'm changing the file format for the 'extracted text' of any PDF anyway (as part of the migration from the old `pdfdraw -tt` output to the new mupdf `mutool` output) and would love to hear from folks who've fed stuff like HTML, hOCR (an HTML extension for OCR outputs) or similar file formats to SOLR for indexing.

There's also the Qiqqa metadata to consider: title, author, etc. As that collection will probably grow a little, including BibTeX-to-JSON translations, etc., the question becomes: how do I optimally feed that to SOLR so I can find my "document fingerprints" (which identify the document related to that text / metadata) when I go and search for, say, author only, or title and publisher only, and so on?
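To make that question a bit more concrete, here's a minimal sketch of the direction I'm thinking in: declare the metadata fields plus a full-text field through Solr's Schema API, then push one JSON document per PDF, keyed by the Qiqqa fingerprint. The core name `qiqqa`, the field names/types and the use of Python + `requests` are purely illustrative assumptions on my part, not a settled design:

```python
# Rough sketch only: core name "qiqqa", the field names/types and the example
# document are placeholders, not Qiqqa's actual schema.  Needs a running Solr
# instance and the third-party `requests` package.
import requests

SOLR = "http://localhost:8983/solr/qiqqa"

# 1. Declare the fields via Solr's Schema API (managed schema).
fields = [
    # intended as the unique key; the default schema uses "id", so that
    # would need changing (or the fingerprint gets copied into "id")
    {"name": "fingerprint", "type": "string",       "stored": True},
    {"name": "title",       "type": "text_general", "stored": True},
    {"name": "authors",     "type": "text_general", "stored": True, "multiValued": True},
    {"name": "publisher",   "type": "text_general", "stored": True},
    # stored so a highlighter can produce snippets from it later
    {"name": "fulltext",    "type": "text_general", "stored": True},
]
for f in fields:
    requests.post(f"{SOLR}/schema", json={"add-field": f}).raise_for_status()

# 2. Feed one JSON document per PDF, keyed by the Qiqqa fingerprint.
doc = {
    "fingerprint": "0123456789ABCDEF",   # hypothetical fingerprint
    "title": "An Example Paper",
    "authors": ["A. Author", "B. Author"],
    "publisher": "Example Press",
    "fulltext": "plain text extracted from the PDF ...",
}
requests.post(f"{SOLR}/update?commit=true", json=[doc]).raise_for_status()
```

With something along those lines in place, a metadata-only search (e.g. `q=authors:"..." AND publisher:"..."` with `fl=fingerprint`) would hand back exactly the fingerprints Qiqqa needs -- but I'd like to hear whether that is how experienced SOLR folks would actually lay it out.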
What's the size of this thing going to be, i.e. how many GBytes does it take when you feed it a GB of text to index? Any (rough) estimation rules? (My google-fu on that subject has been sub-par, I guess. Or is nobody bothering about storage space any more?)
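Failing a good rule of thumb, I suppose the honest answer is to measure: feed a known amount of text and then ask Solr's CoreAdmin API how big the index directory actually is. Again just a sketch, against the same hypothetical local core:

```python
# Sketch: ask Solr's CoreAdmin API how large the index directory actually is,
# so the GB-of-text-in vs. GB-on-disk ratio can be measured instead of guessed.
# Core name "qiqqa" is a placeholder; needs the `requests` package.
import requests

r = requests.get(
    "http://localhost:8983/solr/admin/cores",
    params={"action": "STATUS", "core": "qiqqa", "wt": "json"},
)
r.raise_for_status()
index = r.json()["status"]["qiqqa"]["index"]
print(index["numDocs"], index["sizeInBytes"], index["size"])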
Qiqqa currently only uses Lucene (Lucene.NET) to produce a hit for a search, i.e. the search has Lucene.NET spit out a list of document fingerprints that 'match' the current search; the search then continues with a second phase, where the cached 'extracted text' is loaded and matched against the search criteria to give you highlighting, etc. Nice, but this approach is right out once we allow SOLR advanced search queries and the like: I do not want to re-implement that search query parser, so I'll have to find a way to let Lucene/SOLR tell me exactly where each hit is on each page.
I know SOLR has highlighting output, but AFAICT only at the visual (HTML page output) level, where I would have to scrape and interpret that page to determine the spot within the document. If I want hit quality scores with that, the very limited info I found tells me that I'll have to write my own SOLR add-on to extract that sort of info in a "machine-readable fashion" -- which is what Qiqqa needs.
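For reference, this is roughly what I mean: a stock highlighting query (against the same placeholder core and fields as in the sketch above) returns per-document snippet strings with `<em>` markers, not positions:

```python
# Sketch: what stock Solr highlighting hands back -- snippet strings with
# <em>...</em> markers, keyed by document, but no character offsets into the
# page text.  Core/field names match the placeholder schema sketched above.
import requests

params = {
    "q": "fulltext:entropy",     # whatever the user typed
    "fl": "fingerprint",         # only the fingerprints are needed back
    "hl": "true",                # switch highlighting on
    "hl.fl": "fulltext",         # highlight inside the extracted text
    "hl.method": "unified",      # the unified highlighter (Solr 6.4+)
    "hl.snippets": 5,
    "wt": "json",
}
r = requests.get("http://localhost:8983/solr/qiqqa/select", params=params)
r.raise_for_status()
for key, fields in r.json()["highlighting"].items():
    # each value is a list of snippet strings; to pin down the exact spot on a
    # page I'd still have to match these back against the cached extracted text
    print(key, fields.get("fulltext", []))
```

Which is precisely the scraping-and-re-matching exercise I'd like to avoid -- hence the question about a machine-readable offsets/scores output.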
If you've done this sort of thing, I'd like a ping. (And if I can get it done without having to code or re-compile any Java, that'd be super! 🤡)
Anyhow. If you've got coding and/or admin experience with SOLR and are willing to share, teach or do, please holler 🛎️ 🛎️ 🛎️ 👍