How to find the location of an image or picture while parsing an MS Word .doc in Java using Apache POI

暗喜 2021-01-16 23:57
HWPFDocument wordDoc = new HWPFDocument(new FileInputStream(fileName));
List<Picture> picturesList = wordDoc.getPicturesTable().getAllPictures();
2 Answers
  • 2021-01-17 00:26

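    You should add a PicturesSource class:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.model.PicturesTable;
    import org.apache.poi.hwpf.usermodel.CharacterRun;
    import org.apache.poi.hwpf.usermodel.Picture;
    import org.apache.poi.hwpf.usermodel.Range;

    public class PicturesSource {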

    private PicturesTable picturesTable;
    private Set<Picture> output = new HashSet<Picture>();
    private Map<Integer, Picture> lookup;
    private List<Picture> nonU1based;
    private List<Picture> all;
    private int pn = 0;
    
    public PicturesSource(HWPFDocument doc) {
        picturesTable = doc.getPicturesTable();
        all = picturesTable.getAllPictures();
    
    
        lookup = new HashMap<Integer, Picture>();
        for (Picture p : all) {
            lookup.put(p.getStartOffset(), p);
        }
    
    
        nonU1based = new ArrayList<Picture>();
        nonU1based.addAll(all);
        Range r = doc.getRange();
        for (int i = 0; i < r.numCharacterRuns(); i++) {
            CharacterRun cr = r.getCharacterRun(i);
            if (picturesTable.hasPicture(cr)) {
                Picture p = getFor(cr);
                int at = nonU1based.indexOf(p);
                // indexOf returns -1 when the picture is not in the list; skip that case
                if (at >= 0) {
                    nonU1based.set(at, null);
                }
            }
        }
    }
    
    
    private boolean hasPicture(CharacterRun cr) {
        return picturesTable.hasPicture(cr);
    }
    
    private void recordOutput(Picture picture) {
        output.add(picture);
    }
    
    private boolean hasOutput(Picture picture) {
        return output.contains(picture);
    }
    
    private int pictureNumber(Picture picture) {
        return all.indexOf(picture) + 1;
    }
    
    public Picture getFor(CharacterRun cr) {
        return lookup.get(cr.getPicOffset());
    }
    
    
    private Picture nextUnclaimed() {
        Picture p = null;
        while (pn < nonU1based.size()) {
            p = nonU1based.get(pn);
            pn++;
            if (p != null) return p;
        }
        return null;
    }
    

    }
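
Read on its own, the class above appears to do two things: index every picture by its start offset so that a run's picture offset can be resolved, and keep a list of pictures that no character run claims. The following is a minimal sketch of that bookkeeping using plain strings and integers instead of POI types, so it runs without a .doc file. `UnclaimedPicturesSketch`, `unclaimed`, and the sample offsets are hypothetical names and values chosen for illustration, not part of Apache POI.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in model, NOT the POI API: pictures are plain names keyed by offset.
class UnclaimedPicturesSketch {

    // Returns the pictures that none of the run offsets refers to.
    // The result follows the iteration order of the table's values.
    static List<String> unclaimed(Map<Integer, String> pictureTable,
                                  List<Integer> runPicOffsets) {
        List<String> remaining = new ArrayList<>(pictureTable.values());
        for (int offset : runPicOffsets) {
            String claimed = pictureTable.get(offset);
            // A run may carry an offset that matches no picture; ignore it.
            if (claimed != null) {
                remaining.remove(claimed);
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        Map<Integer, String> table = new LinkedHashMap<>();
        table.put(100, "a.png");
        table.put(200, "b.png");
        table.put(300, "c.png");
        // One run points at offset 200; offset 999 matches nothing.
        System.out.println(unclaimed(table, Arrays.asList(200, 999)));
    }
}
```

Pictures left in the returned list have no known position in the text, which is presumably why the original class tracks them separately.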

  • 2021-01-17 00:27

    You're getting at the pictures the wrong way, which is why you're not finding any positions!

    What you need to do is process each CharacterRun of the document in turn. Pass each run to the PicturesTable and check whether it contains a picture. If it does, fetch the picture from the table; you then know where in the document it belongs, because you have the run it came from.

    At the simplest, it'd be something like:

    PicturesSource pictures = new PicturesSource(document);
    PicturesTable pictureTable = document.getPicturesTable();
    
    Range r = document.getRange();
    for(int i=0; i<r.numParagraphs(); i++) {
        Paragraph p = r.getParagraph(i);
        for(int j=0; j<p.numCharacterRuns(); j++) {
          CharacterRun cr = p.getCharacterRun(j);
          if (pictureTable.hasPicture(cr)) {
             Picture picture = pictures.getFor(cr);
             // Do something useful with the picture
          }
        }
    }
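
To make "you know where it belongs because you have the run" concrete, the nested loop above can be modeled without POI. This is only a sketch under stated assumptions: `Run` and `Position` are hypothetical stand-ins for a character run and a recorded location, and `locate` and the sample offsets are invented for illustration. With real POI objects, the paragraph index, the run index, and the run's start offset would play the same roles.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in model, NOT the POI API: Run and Position are hypothetical types.
class PicturePositionSketch {

    // A run's start offset in the text and the offset of the picture it
    // references (-1 when the run has no picture).
    static final class Run {
        final int startOffset;
        final int picOffset;

        Run(int startOffset, int picOffset) {
            this.startOffset = startOffset;
            this.picOffset = picOffset;
        }
    }

    // Where a picture was found: paragraph index, run index, text offset.
    static final class Position {
        final int paragraph;
        final int run;
        final int textOffset;

        Position(int paragraph, int run, int textOffset) {
            this.paragraph = paragraph;
            this.run = run;
            this.textOffset = textOffset;
        }

        @Override
        public String toString() {
            return "p" + paragraph + "/r" + run + "@" + textOffset;
        }
    }

    // Walk paragraphs and runs in document order and record, for each
    // picture a run refers to, the position of that run.
    static Map<String, Position> locate(List<List<Run>> paragraphs,
                                        Map<Integer, String> picturesByOffset) {
        Map<String, Position> found = new LinkedHashMap<>();
        for (int i = 0; i < paragraphs.size(); i++) {
            List<Run> runs = paragraphs.get(i);
            for (int j = 0; j < runs.size(); j++) {
                Run run = runs.get(j);
                String picture = picturesByOffset.get(run.picOffset);
                if (picture != null) {
                    found.put(picture, new Position(i, j, run.startOffset));
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<Integer, String> pictures = new HashMap<>();
        pictures.put(512, "logo.png");
        pictures.put(2048, "chart.jpg");
        List<List<Run>> doc = Arrays.asList(
            Arrays.asList(new Run(0, -1), new Run(12, 512)),
            Arrays.asList(new Run(40, -1)),
            Arrays.asList(new Run(75, 2048)));
        System.out.println(locate(doc, pictures));
    }
}
```

The position is a property of the run, not of the picture, which is why calling getAllPictures() alone gives no location information.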
    

    You can find a good example of this in the Apache Tika parser for Microsoft Word .doc files, which is powered by Apache POI.
