SOLR/LUCENE Experts, please help me design a simple keyword search from PDF index?

杀马特。学长 韩版系。学妹 submitted on 2019-12-23 02:51:07


I dabbled with Solr but couldn't figure out a way to tailor it to my requirement.

What I have :

A bunch of PDF files. A set of keywords.

What I am trying to achieve :

Index the PDF files (solrcell - done)

Search for a keyword (works ok)

Tailor the output to spit out the names of the PDF files, an excerpt where the keyword occurred (No clue/idea how to)

Tried manipulating ResponseHandler/Schema.xml/Solrconfig.xml to no avail.

Lucene/solr experts, do you think what I am trying to achieve is possible?

I put my existing code on GitHub @ (which is mostly Solr's default example with minor modifications to the fields; all the content is stored in one content field).

Notable changes in schema.xml being :

Schema.xml :

<solrQueryParser defaultOperator="AND"/>

   <field name="id" type="string" indexed="true" stored="true" required="true" />

   <field name="content" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>

   <dynamicField name="*" type="string"    indexed="true"  stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>

<copyField source="*" dest="content"/>

Current Output :

(query) http://localhost:8983/solr/select/?q=Java+Servlet&version=2.2&start=0&rows=10&indent=on

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">13</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">Java Servlet</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <arr name="content_type"><str>application/pdf</str></arr>
      <str name="id">tutorial.pdf</str>
      <str name="subject">Solr</str>
      <arr name="title"><str>Solr tutorial</str></arr>
    </doc>
  </result>
</response>

What I am looking for is 'extracted fragment (line) where the keyword was found'.

In the query provided, I search for 'Java Servlet' and it returned the document. I am interested in the context 'Solr can run in any Java Servlet Container of your choice' to be returned in the output xml.


To get snippets of text around the matched keywords, see Solr's highlighting parameters (hl=true together with hl.fl naming the stored field to excerpt).
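
As a rough sketch of what a highlighting request could look like from a client, the Python snippet below builds a select URL with hl=true and hl.fl=content, then pulls the snippets out of the "highlighting" section of the XML response. The host, the /solr/select path, and the content field come from the question. The helper names build_highlight_url and parse_highlights, the hl.snippets and hl.fragsize values, and the sample response are illustrative assumptions, not output from the asker's index. Highlighting only works on stored fields, which the posted schema already satisfies for content.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_highlight_url(base_url, query, field="content", snippets=3, fragsize=100):
    """Return a select URL that asks Solr to highlight matches in `field`."""
    params = {
        "q": query,
        "hl": "true",             # switch highlighting on
        "hl.fl": field,           # field to take the excerpts from (must be stored)
        "hl.snippets": snippets,  # assumed value: max excerpts per document
        "hl.fragsize": fragsize,  # assumed value: approximate excerpt length in characters
        "indent": "on",
    }
    return base_url + "?" + urlencode(params)


def parse_highlights(xml_text, field="content"):
    """Map each document id to its list of highlighted snippets for `field`."""
    root = ET.fromstring(xml_text)
    result = {}
    for section in root.findall("lst"):
        if section.get("name") != "highlighting":
            continue
        # Each child <lst> is named after the uniqueKey (id) of one document.
        for doc in section.findall("lst"):
            snippets = []
            for arr in doc.findall("arr"):
                if arr.get("name") == field:
                    snippets = [s.text or "" for s in arr.findall("str")]
            result[doc.get("name")] = snippets
    return result


# Hand-written sample in the shape of Solr's XML highlighting section (an assumption,
# not real output). The <em> markers arrive XML-escaped inside the <str> element.
SAMPLE_RESPONSE = """<response>
  <lst name="highlighting">
    <lst name="tutorial.pdf">
      <arr name="content">
        <str>Solr can run in any &lt;em&gt;Java&lt;/em&gt; &lt;em&gt;Servlet&lt;/em&gt; Container of your choice</str>
      </arr>
    </lst>
  </lst>
</response>"""

url = build_highlight_url("http://localhost:8983/solr/select", "Java Servlet")
highlights = parse_highlights(SAMPLE_RESPONSE)
```

Because the highlighting section is keyed by document id, and the id in the question is already the PDF name (tutorial.pdf), the parsed dictionary pairs each file name with its excerpts.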

To get the filename of the indexed PDF as part of the response, simply add a field with that information (it should be a string field, non-indexed, stored). Of course, you have to populate this new field at index-time.
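
One possible way to populate such a field at index time, assuming the PDFs are sent through Solr Cell's extract handler, is to pass the value as a literal.<fieldname> request parameter. The sketch below only builds the request URL. The field name filename, the helper build_extract_url, the example path /docs/tutorial.pdf, and the /solr/update/extract handler path are assumptions for illustration; the field would still need to be declared in schema.xml as a stored string field.

```python
import os
from urllib.parse import urlencode


def build_extract_url(base_url, pdf_path, filename_field="filename"):
    """Return an extract-handler URL that stores the PDF's file name with the document."""
    name = os.path.basename(pdf_path)
    params = {
        "literal.id": name,                  # unique key, as in the question's output
        "literal." + filename_field: name,   # hypothetical stored field for the file name
        "commit": "true",
    }
    return base_url + "?" + urlencode(params)


# The PDF itself would then be POSTed to this URL as the request body.
extract_url = build_extract_url(
    "http://localhost:8983/solr/update/extract", "/docs/tutorial.pdf"
)
```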


A standalone solution using PDFBox and Apache Lucene is available at: * It will create an HTML file with links to the corresponding pages in the PDF file where the keywords were found.

