Searching for a string in PDF files


I am working on a school project that has several PDF files. There should be a search-by-name feature where I just type in the student's name and all the pdf files wit

3 Answers

忘了有多久

2020-12-21 00:10

Use iTextSharp. It's free and you only need the "itextsharp.dll".

http://sourceforge.net/projects/itextsharp/

Here is a simple function for reading the text out of a PDF.

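Public Shared Function GetTextFromPDF(PdfFileName As String) As String
    Dim oReader As New iTextSharp.text.pdf.PdfReader(PdfFileName)
    Dim sOut As New System.Text.StringBuilder()

    Try
        ' iTextSharp page numbers start at 1.
        For i As Integer = 1 To oReader.NumberOfPages
            ' Use a fresh extraction strategy for every page.
            Dim its As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy()

            ' AppendLine keeps the last word of one page apart from the first word of the next.
            sOut.AppendLine(iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, its))
        Next
    Finally
        ' Release the file handle held by the reader, even if extraction fails.
        oReader.Close()
    End Try

    Return sOut.ToString()
End Function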

Now you can search through those files with ease.
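
As a rough sketch of that search step, the helper below walks a folder, extracts the text of every PDF, and keeps the files that mention the name. The names PdfNameSearch, FindPdfsContaining and extractText are made up for this illustration; the extractor is passed in as a delegate so the snippet compiles on its own, and it assumes you would hand it a function such as GetTextFromPDF above.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO

Public Module PdfNameSearch
    ' Returns the paths of all *.pdf files under folderPath whose extracted
    ' text contains studentName, ignoring case.
    ' extractText is a hypothetical hook: any function that turns a file path
    ' into the text of that file (for example GetTextFromPDF).
    Public Function FindPdfsContaining(folderPath As String,
                                       studentName As String,
                                       extractText As Func(Of String, String)) As List(Of String)
        Dim matches As New List(Of String)()

        For Each pdfPath As String In Directory.EnumerateFiles(folderPath, "*.pdf", SearchOption.AllDirectories)
            Dim content As String
            Try
                content = extractText(pdfPath)
            Catch ex As Exception
                ' Skip files the extractor cannot read instead of aborting the whole search.
                Continue For
            End Try

            If content IsNot Nothing AndAlso
               content.IndexOf(studentName, StringComparison.OrdinalIgnoreCase) >= 0 Then
                matches.Add(pdfPath)
            End If
        Next

        Return matches
    End Function
End Module
```

A call would look something like FindPdfsContaining("C:\Reports", "Jane Doe", AddressOf GetTextFromPDF), where the folder and name are placeholders. This re-reads every PDF on each search, which is probably fine for a handful of files but slow for many.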


时光取名叫无心

2020-12-21 00:27

PDF is a very complex specification, and it allows so many variants that files are impossible to parse reliably unless you read them with the same tools that created them (and often not even then). Several tools flatten a PDF to a text string (e.g. pdf2text), and it may be possible to search that output, but the results are unreliable.

Many PDF tools only implement some of the spec. Some people suggest that the best way to search PDF is to reduce it to an image and then OCR that.

不思量自难忘°

2020-12-21 00:34
I think your task may be split as follows:
- Build an index of the PDF files
- Write some code that uses the index to locate the relevant PDFs whenever a search is performed
- Write some code that opens the found PDF or shows a warning if nothing was found
To build the index you may use an integrated solution like Apache Lucene or Lucene.Net, or convert each PDF into text and build an index from the text yourself.
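
If you go the build-it-yourself route, a minimal sketch of such an index could look like the class below. It is only an illustration under assumptions: PdfTextIndex, AddDocument and Search are invented names, the text is assumed to have been extracted from each PDF already, and a query matches a file when every word of the query occurs somewhere in it (not necessarily side by side).

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Public Class PdfTextIndex
    ' Maps each word to the set of files containing it; keys are compared case-insensitively.
    Private ReadOnly _index As New Dictionary(Of String, HashSet(Of String))(StringComparer.OrdinalIgnoreCase)

    ' Adds the already-extracted text of one file to the index.
    Public Sub AddDocument(filePath As String, content As String)
        For Each m As Match In Regex.Matches(content, "\w+")
            Dim files As HashSet(Of String) = Nothing
            If Not _index.TryGetValue(m.Value, files) Then
                files = New HashSet(Of String)()
                _index(m.Value) = files
            End If
            files.Add(filePath)
        Next
    End Sub

    ' Returns the files that contain every word of the query, e.g. a full name.
    Public Function Search(query As String) As List(Of String)
        Dim result As HashSet(Of String) = Nothing

        For Each m As Match In Regex.Matches(query, "\w+")
            Dim files As HashSet(Of String) = Nothing
            If Not _index.TryGetValue(m.Value, files) Then
                ' One query word is missing everywhere, so nothing can match.
                Return New List(Of String)()
            End If

            If result Is Nothing Then
                result = New HashSet(Of String)(files)
            Else
                result.IntersectWith(files)
            End If
        Next

        If result Is Nothing Then Return New List(Of String)()
        Return New List(Of String)(result)
    End Function
End Class
```

The index is built once and each search is then a few dictionary lookups. An empty result from Search is the case where the third step would show its "nothing found" warning.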

The other two steps are fairly trivial and depend on the language/technology used in the first step.

Your question is tagged as related to .NET, so you may try the Docotic.Pdf library for index building (disclaimer: I work for Bit Miracle).

Docotic.Pdf may be used to extract text from PDF files as plain text or as a collection of text chunks with coordinates for each chunk.