Question
I dabbled with Solr but couldn't figure out a way to tailor it to my requirement.
What I have:
A bunch of PDF files and a set of keywords.
What I am trying to achieve:
1. Index the PDF files (Solr Cell, done).
2. Search for a keyword (works OK).
3. Tailor the output to return the names of the PDF files and an excerpt where the keyword occurred (no idea how to do this).
I tried modifying the ResponseHandler, schema.xml, and solrconfig.xml, to no avail.
Lucene/Solr experts, do you think what I am trying to achieve is possible?
I put my existing code on GitHub at https://github.com/ThinkCode/solr_search. It is mostly Solr's default example with minor modifications to the fields (all the content is stored in one content field).
Notable changes in schema.xml:
<solrQueryParser defaultOperator="AND"/>
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
<dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
<solrQueryParser defaultOperator="AND"/>
<copyField source="*" dest="content"/>
Current output:
Query: http://localhost:8983/solr/select/?q=Java+Servlet&version=2.2&start=0&rows=10&indent=on
<response><lst name="responseHeader"><int name="status">0</int><int name="QTime">13</int><lst name="params"><str name="indent">on</str><str name="start">0</str><str name="q">Java Servlet</str><str name="version">2.2</str><str name="rows">10</str></lst></lst>
<result name="response" numFound="1" start="0"><doc><arr name="content_type"><str>application/pdf</str></arr><str name="id">tutorial.pdf</str><str name="subject">Solr</str><arr name="title"><str>Solr tutorial</str></arr></doc></result></response>
What I am looking for is the extracted fragment (line) where the keyword was found.
In the query above, I searched for 'Java Servlet' and it returned the document. I would like the surrounding context, 'Solr can run in any Java Servlet Container of your choice', to be returned in the output XML.
Answer 1:
To get snippets of text around the matched keywords, see http://wiki.apache.org/solr/HighlightingParameters
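As a rough illustration of the highlighting approach, the sketch below builds a select URL with the standard `hl`, `hl.fl`, `hl.snippets`, and `hl.fragsize` parameters and pulls the snippets out of the `highlighting` section of an XML response. The helper names and `SAMPLE_RESPONSE` are made up for this example; the sample is a hand-written approximation of what Solr returns, not captured output. It assumes the `content` field is stored, as in the question's schema.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_highlight_query(base_url, keywords, field="content"):
    """Return a select URL that asks Solr for highlighted snippets."""
    params = {
        "q": keywords,
        "fl": "id",          # main result list: only return the document id
        "hl": "true",        # turn highlighting on
        "hl.fl": field,      # field to build snippets from (must be stored)
        "hl.snippets": 3,    # at most three snippets per document
        "hl.fragsize": 100,  # approximate snippet size in characters
    }
    return base_url + "?" + urlencode(params)


def extract_snippets(xml_text):
    """Map each document id to its list of highlighted snippets."""
    root = ET.fromstring(xml_text)
    snippets = {}
    for doc in root.findall("./lst[@name='highlighting']/lst"):
        snippets[doc.get("name")] = [s.text for s in doc.findall("./arr/str")]
    return snippets


# Hand-written sample (an assumption, not real Solr output) in the
# shape of Solr's XML response with highlighting enabled.
SAMPLE_RESPONSE = """<response>
<result name="response" numFound="1" start="0">
<doc><str name="id">tutorial.pdf</str></doc>
</result>
<lst name="highlighting"><lst name="tutorial.pdf"><arr name="content">
<str>Solr can run in any &lt;em&gt;Java&lt;/em&gt; &lt;em&gt;Servlet&lt;/em&gt; Container of your choice</str>
</arr></lst></lst>
</response>"""

if __name__ == "__main__":
    print(build_highlight_query("http://localhost:8983/solr/select", "Java Servlet"))
    print(extract_snippets(SAMPLE_RESPONSE))
```

The snippets are keyed by the document's unique key (`id`), so they can be joined back to the documents in the main result list.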
To get the filename of the indexed PDF as part of the response, simply add a field with that information (it should be a string field, non-indexed, stored). Of course, you have to populate this new field at index-time.
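One way to populate such a field at index time is the `literal.<fieldname>` request parameter of the extracting request handler (Solr Cell). The sketch below only builds the request URL; `filename` is a hypothetical field name that would need a matching stored, non-indexed string field in schema.xml, and the `/update/extract` path assumes the handler is registered as in Solr's example configuration.

```python
import os
from urllib.parse import urlencode


def build_extract_url(base_url, pdf_path):
    """Return a Solr Cell URL that stores the PDF's file name with the document."""
    name = os.path.basename(pdf_path)
    params = {
        "literal.id": name,        # unique key, as in the question's output
        "literal.filename": name,  # hypothetical stored field for the file name
        "commit": "true",          # make the document searchable right away
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_extract_url("http://localhost:8983/solr/update/extract",
                            "/docs/tutorial.pdf"))
```

The PDF itself would then be sent as the body of a POST request to that URL.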
Answer 2:
A standalone solution using PDFBox and Apache Lucene is available at https://github.com/WolfgangFahl/pdfindexer. It creates an HTML file with links to the corresponding pages in the PDF file where the keywords were found.
Source: https://stackoverflow.com/questions/6916369/solr-lucene-experts-please-help-me-design-a-simple-keyword-search-from-pdf-inde