Question
I'm building a web application where users can search for PDF documents and view them with pdf.js. I would like to display each search result with a short snippet of the paragraph where the search term was found, plus a link that opens the document at the right page.
So what I need is the page number and a short text snippet for every search result.
I'm using Solr 4.1 to index the PDF documents. Indexing itself works fine, but I don't know how to get the page number and paragraph of a search result.
I found the question "Indexing PDF with page numbers with Solr", but it wasn't really helpful.
Answer 1:
I'm now splitting each PDF and sending every page to Solr separately.
Each page is its own document with the id <id_of_document>_<page_number>
and an additional field doc_id, which contains only the <id_of_document>
and is used for grouping the results.
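
As a rough illustration of this scheme, here is a minimal Python sketch. It assumes the page texts have already been extracted from the PDF by some other tool. Only id and doc_id come from the answer; the field names page and text and both function names are hypothetical. The query uses Solr's result grouping (group, group.field) and highlighting (hl, hl.fl) parameters, and the search term is not escaped for Solr's query syntax.

```python
from urllib.parse import urlencode


def build_page_docs(document_id, page_texts):
    """Return one Solr document (as a dict) per page of a PDF."""
    docs = []
    for page_number, text in enumerate(page_texts, start=1):
        docs.append({
            "id": f"{document_id}_{page_number}",  # <id_of_document>_<page_number>
            "doc_id": document_id,                 # shared by all pages of one PDF
            "page": page_number,                   # hypothetical convenience field
            "text": text,                          # hypothetical field name
        })
    return docs


def build_grouped_query(term):
    """Build a query string that groups page hits by doc_id and asks for snippets."""
    return urlencode({
        "q": f"text:{term}",       # assumes the page text is indexed in "text"
        "group": "true",           # collapse page-level hits ...
        "group.field": "doc_id",   # ... into one group per original PDF
        "hl": "true",              # request highlighted snippets
        "hl.fl": "text",
        "wt": "json",
    })
```

Each hit is then a single page, so the page number can be read from the page field (or from the suffix of the id) and the snippet from the highlighting section of the response.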
Answer 2:
There is a JIRA issue, SOLR-380, with a patch that you can take a look at.
Answer 3:
I also tried to get results with page numbers but could not make it work. I used Apache PDFBox to split all the PDFs in a directory and sent the resulting files to the Solr server.
Answer 4:
I have not tried this myself, but here is an approach:
- Write a custom Solr connector that integrates with the Apache Tika parser for indexing PDFs
- Create multiple attributes in Solr such as page1, page2, page3, …, pageN; alternatively, use Solr's dynamic fields
- In the custom connector, read each PDF page by page and index every page into its respective page attribute/dynamic field
- Enable search on all the “page” attributes
- When a user searches, use the “highlighter/Summary/Teaser” component to retrieve only the “page” attributes that have hits
- The “page” attributes that have a hit (as reported by the highlighter/Summary/Teaser) for a given record are the pages that contain the searched phrase
- Link to the PDF with the “#PageNumber” appended and open that page on click
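
The last three steps above could be sketched roughly as follows in Python. This is an untested illustration, not a definitive implementation. It assumes the page attributes are named page1, page2, … as in the list, and that the highlighting section of Solr's JSON response maps each record id to its matching fields and snippets. The function names are hypothetical, and the `#page=N` fragment is an assumption about what the PDF viewer (for example pdf.js) accepts.

```python
import re

# Matches the per-page attribute names from the list above: page1, page2, ...
PAGE_FIELD = re.compile(r"^page(\d+)$")


def pages_with_hits(highlighting, record_id):
    """Return sorted (page_number, first_snippet) pairs for one record.

    `highlighting` is the "highlighting" section of a Solr JSON response:
    {record_id: {field_name: [snippet, ...], ...}, ...}
    """
    hits = []
    for field, snippets in highlighting.get(record_id, {}).items():
        match = PAGE_FIELD.match(field)
        if match and snippets:  # ignore non-page fields and empty results
            hits.append((int(match.group(1)), snippets[0]))
    return sorted(hits)


def page_link(pdf_url, page_number):
    """Build a link that asks the viewer to open the PDF at a given page."""
    return f"{pdf_url}#page={page_number}"
```

For each record in the result list, the UI would call pages_with_hits with the record's id and render one snippet plus one page_link per returned page.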
In my view, this is a far better approach than splitting the PDFs and indexing the pages as separate Solr documents.
If you find a flaw in this design, reply to this thread and I will try to resolve it.
Source: https://stackoverflow.com/questions/15116160/get-page-numbers-of-searchresult-of-a-pdf-in-solr