highlighting words in an docx file using python-docx gives incorrect results

问题

I would like to highlight specific words in an MS word document (here given as negativeList) and leave the rest of the document as it was before. I have tried to adopt from this one but I can not get it running as it should:

from docx.enum.text import WD_COLOR_INDEX
from docx import Document
import pandas as pd
import copy
import re

doc = Document(docxFileName)

negativList = ["king", "children", "lived", "fire"]  # some examples

for paragraph in doc.paragraphs:
    for target in negativList:
        if target in paragraph.text:  # it is worth checking in detail ...

            currRuns = copy.copy(paragraph.runs)   # deep copy as we delete/clear the object
            paragraph.runs.clear()

            for run in currRuns:
                if target in run.text:
                    words = re.split('(\W)', run.text)  # split into words in order to be able to color only one
                    for word in words:
                        if word == target:
                            newRun = paragraph.add_run(word)
                            newRun.font.highlight_color = WD_COLOR_INDEX.PINK
                        else:
                            newRun = paragraph.add_run(word)
                            newRun.font.highlight_color = None
                else: # our target is not in it so we add it unchanged
                    paragraph.runs.append(run)

doc.save('output.docx')

As example I am using this text (in a word docx file):

CHAPTER 1

Centuries ago there lived --

"A king!" my little readers will say immediately.

No, children, you are mistaken. Once upon a time there was a piece of wood. It was not an expensive piece of wood. Far from it. Just a common block of firewood, one of those thick, solid logs that are put on the fire in winter to make cold rooms cozy and warm.

There are multiple problems with my code:

1) The first sentence works but the second sentence is in twice. Why?

2) The format gets somehow lost in the part where I highlight. I would possibly need to copy the properties of the original run into the newly created ones but how do I do this?

3) I loose the terminal "--"

4) In the highlighted last paragraph the "cozy and warm" is missing ...

What I would need is a eighter a fix for these problems or maybe I am overthinking it and there is a much easier way to do the highlighting? (something like doc.highlight({"king": "pink"} but I haven't found anything in the documentation)?

回答1:

You're not overthinking it, this is a challenging problem; it is a form of the search-and-replace problem.

The target text can be located fairly easily by searching Paragraph.text, but replacing it (or in your case adding formatting) while retaining other formatting requires access at the Run level, both of which you've discovered.

There are some complications though, which is what makes it challenging:

There is no guarantee that your "find" target string is located entirely in a single run. So you will need to find the run containing the start of your target string and the run containing the end of your target string, as well as any in-between.

This might be aided by using character offsets, like "King" appears at character offset 3 in '"A king!" ...', and has a length of 4, then identifying which run contains character 3 and which contains character (3+4).
Related to the first complication, there is no guarantee that all the runs in which the target string partly appears are formatted the same. For example, if your target string was "a bold word", the updated version (after adding highlighting) would require at least three runs, one for "a ", one for "bold", and one for " word" (btw, which run each of the two space characters appear in won't change how they appear).

If you accept the simplification that the target string will always be a single word, you can consider the simplification of giving the replacement run the formatting of the first character (first run) of the found target runs, which is probably the usual approach.

So I suppose there are a few possible approaches, but one would be to "normalize" the runs of each paragraph containing the target string, such that the target string appeared within a distinct run. Then you could just apply highlighting to that run and you'd get the result you wanted.

To be of more help, you'll need to narrow down the problem areas and provide specific inputs and outputs. I'd start with the first one (perhaps losing the "--") (in a separate question, perhaps linked from here) and then proceed one by one until it all works. It's asking too much for a respondent to produce their own test case :)

Then you'd have a question like: "I run the string: 'Centuries ago ... --' through this code and the trailing "--" disappears ...", which is a lot easier for folks to reason through.

Another good next step might be to print out the text of each run, just so you get a sense of how they're broken up. That may give you insight into where it's not working.

回答2:

I faced a similar issue where I was supposed to highlight a set of words in a document. I modified certain parts of the OP's code and now I am able to highlight the selected words correctly.

As OP said in the comments: paragraph.runs.clear() was changed to paragraph.clear(). And I added a few lines to the following part of the code:

 else:
    paragraph.runs.append(run)

to get this:

 else:
    oldRun = paragraph.add_run(run.text)
    if oldRun.text in spell_errors:
        oldRun.font.highlight_color = WD_COLOR_INDEX.YELLOW

While iterating over the currRuns, we extract the text content of the run and add it to the paragraph, so we need to highlight those words again.

来源：https://stackoverflow.com/questions/55298385/highlighting-words-in-an-docx-file-using-python-docx-gives-incorrect-results

标签

python

ms-word

python-docx