Extract PDF text by coordinates

半阙折子戏 2021-02-04 20:03

I'd like to know if there is a PDF library for Microsoft .NET that can extract text from a region specified by coordinates.

For example (in pseudo-code):

<         


        
6 Answers
  • 2021-02-04 20:42

    You may want to look at this sample. It uses iTextSharp:

    var pdfFilename = @"PathToYourPDF\random.pdf";
    var textToFind = "Lombok";
    var pageNumber = 1;
    var point = PdfTools.GetTextCoordinate(textToFind, pdfFilename, pageNumber);
    Console.WriteLine($"{point.X},{point.Y}");
    
  • 2021-02-04 20:43

    Well, thank you all for your effort.

    I got it working with Apache PDFBox, compiled for .NET with IKVM. This is the final code:

    PDDocument doc = PDDocument.load(@"c:\invoice.pdf");
    
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    // java.awt.Rectangle takes (x, y, width, height)
    stripper.addRegion("testRegion", new java.awt.Rectangle(0, 10, 100, 100));
    // Extract the registered regions from the first page
    stripper.extractRegions((PDPage)doc.getDocumentCatalog().getAllPages().get(0));
    
    string text = stripper.getTextForRegion("testRegion");
    doc.close();
    

    And it works like a charm.

    Thank you anyway. I hope my own answer will help others. If you need further details, just leave a comment here and I'll update this answer.

  • 2021-02-04 20:43

    This code works with iText 7:

    PdfReader reader = new PdfReader("D:/Sample2.pdf");
    PdfDocument pdfDoc = new PdfDocument(reader);
    // iText 7's Rectangle constructor takes (x, y, width, height), so the region with
    // lower-left corner (208, 508) and upper-right corner (235, 519) is 27 wide and 11 high.
    Rectangle rect = new Rectangle(208, 508, 27, 11);
    TextRegionEventFilter regionFilter = new TextRegionEventFilter(rect);
    FilteredEventListener listener = new FilteredEventListener();
    // Only text events that pass the region filter reach the extraction strategy.
    LocationTextExtractionStrategy extractionStrategy = listener.AttachEventListener(new LocationTextExtractionStrategy(), regionFilter);
    new PdfCanvasProcessor(listener).ProcessPageContent(pdfDoc.GetPage(1));
    String text = extractionStrategy.GetResultantText();
    pdfDoc.Close();
    
  • 2021-02-04 20:44

    This should work:

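    // llx, lly, urx, ury: lower-left and upper-right corners of the region to extract
    RenderFilter[] filters = new RenderFilter[1];
    LocationTextExtractionStrategy extractionStrategy = new LocationTextExtractionStrategy();
    filters[0] = new RegionTextRenderFilter(new Rectangle(llx,lly,urx,ury));
    // Wrap the extraction strategy so it only receives text inside the region
    FilteredTextRenderListener strategy = new FilteredTextRenderListener(extractionStrategy, filters);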
    
    String result = PdfTextExtractor.GetTextFromPage(pdfReader, i, strategy);
    Console.WriteLine(result);
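
The snippets on this page describe the region in different ways: the iTextSharp `Rectangle` above takes two corners (llx, lly, urx, ury), while `java.awt.Rectangle` takes an origin plus a size (x, y, width, height). The following is a minimal sketch in plain Java, with no PDF library involved, of converting between the two forms. The class `PdfRegion` and its methods are hypothetical helper names, the page height of 792 points (US Letter) is an assumed value, and the vertical flip is only needed if your library measures y from the top of the page, which I believe is the case for PDFBox's `PDFTextStripperByArea` but you should verify against your version.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of iText or PDFBox.
final class PdfRegion {

    /** Converts corner form (llx, lly, urx, ury) to origin-plus-size form (x, y, width, height). */
    static Rectangle2D.Float fromCorners(float llx, float lly, float urx, float ury) {
        return new Rectangle2D.Float(llx, lly, urx - llx, ury - lly);
    }

    /**
     * Converts a rectangle measured from the bottom-left of the page (native PDF user space)
     * into one measured from the top-left, for a page of the given height.
     */
    static Rectangle2D.Float toTopLeftOrigin(Rectangle2D.Float r, float pageHeight) {
        float top = pageHeight - (r.y + r.height);
        return new Rectangle2D.Float(r.x, top, r.width, r.height);
    }

    public static void main(String[] args) {
        // Corners taken from the iText 7 answer on this page.
        Rectangle2D.Float region = fromCorners(208f, 508f, 235f, 519f);
        System.out.println(region.width + " x " + region.height); // prints "27.0 x 11.0"

        // Assumed page height: 792 points (US Letter).
        Rectangle2D.Float flipped = toTopLeftOrigin(region, 792f);
        System.out.println(flipped.y); // prints "273.0"
    }
}
```

If extraction returns an empty string, a mismatch between these conventions is one of the first things worth checking.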
    
  • 2021-02-04 20:48

    iText's RegionTextRenderFilter is precisely what you're looking for.

    So you want something like this (forgive my Java, but it should be trivial to translate):

    PdfReader reader = new PdfReader(path);
    
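    // FilteredTextRenderListener passes only text inside the region to the wrapped strategy
    FilteredTextRenderListener regionFilter = 
      new FilteredTextRenderListener( new SimpleTextExtractionStrategy(), 
                                      new RegionTextRenderFilter( someRect ) );
    // Page numbers in iText start at 1
    String regionText = PdfTextExtractor.getTextFromPage(reader, 1, regionFilter );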
    
  • 2021-02-04 20:52

    ABCpdf is not open source, but hopefully this helps you (and potentially anyone else using it).

    I did this earlier today by looping over the available fields in the PDF. This means that the PDF you are using needs to be created properly and you need to know the field name that you want to get the text for (you could work this out by adding a breakpoint and looping through the available fields).

    WebSupergoo.ABCpdf6.Doc newPDF = new WebSupergoo.ABCpdf6.Doc();
    newPDF.Read("existing_file.pdf");
    
    foreach ( WebSupergoo.ABCpdf6.Objects.Field field in newPDF.Form.Fields )
    {
        if ( field.Name == "Text1" )
        {
            // update "Text1"
            field.Value = "new value for Text1";
        }
    }
    
    newPDF.Save("new_file.pdf");
    
    newPDF.Clear();
    

    In the example, "Text1" is the name of the field being updated. To read the text instead, use field.Value for the matching field. Note that the example also shows how to save the updated field(s).

    Hopefully that at least gives you an idea of how to approach this problem.
