How to index Word 2003, 2007 and 2010 documents using Lucene.NET

天涯浪人 2021-02-06 12:28

I am writing a custom Lucene.NET indexer to enable indexing of MS Word documents. The indexer must be capable of handling the last three releases of MS Word: 2010, 2007 and 2003.

1 Answer
  • 2021-02-06 13:01

    You could use the IFilter plugins to retrieve the text contents of the documents and then index them. The IFilter interface was originally part of Microsoft Indexing Service, but it is generally available for indexing documents.

    I looked into the technology a couple of years ago and seem to remember that the filters for Office documents were either built into Windows or could be installed separately from the complete Office package, but I may be wrong here.

    More about the IFilter technology can be found at IFilter at Wikipedia and IFilter at MSDN. You will have to look into P/Invoke, and you might get some inspiration from IFilter at pinvoke.net.

    A sample in C# can be found at MSDN Code Gallery.
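
    To make the "extract the text, then index it" flow above concrete, here is a rough sketch of the Lucene.NET side only. It assumes the Lucene.NET 3.0.x API (later versions differ), and it treats text extraction as a hypothetical callback, extractText, which you would back with your own IFilter wrapper. The class name, the method name and the field names "path" and "content" are made up for illustration.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

public static class WordIndexer
{
    // extractText is a hypothetical callback (for example, a wrapper around
    // an IFilter loaded via P/Invoke) that returns the plain text of a file.
    public static void IndexFolder(string folder, string indexPath,
                                   Func<string, string> extractText)
    {
        // Assumption: Lucene.NET 3.0.x, where IndexWriter takes a MaxFieldLength.
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);

        using (var directory = FSDirectory.Open(new DirectoryInfo(indexPath)))
        using (var writer = new IndexWriter(directory, analyzer,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            foreach (var path in System.IO.Directory.EnumerateFiles(folder))
            {
                // Word 2003 saves .doc by default; Word 2007 and 2010 save .docx.
                var ext = Path.GetExtension(path).ToLowerInvariant();
                if (ext != ".doc" && ext != ".docx")
                    continue;

                var doc = new Document();
                // Store the path so a search hit can be mapped back to the file.
                doc.Add(new Field("path", path,
                                  Field.Store.YES, Field.Index.NOT_ANALYZED));
                // Analyze the extracted text for full-text search; do not store it.
                doc.Add(new Field("content", extractText(path),
                                  Field.Store.NO, Field.Index.ANALYZED));
                writer.AddDocument(doc);
            }
        }
    }
}
```

    Keeping the extraction behind a callback means the indexing code does not care which filter handled the file, so the same loop covers .doc and .docx as long as a suitable IFilter is installed for each.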
