How to extract the URL in hyperlinks from a docx file using Python

Submitted by 拥有回忆 on 2019-12-23 02:59:04


I've been trying to find out how to get URLs from a docx file using Python, but haven't found anything. I've tried python-docx and python-docx2txt, but python-docx only seems to extract the text, while python-docx2txt can extract the text of the hyperlink but not the URL itself.


I am a beginner in Python and had an assignment to use Python to change each hyperlink in a .docx document. Thanks to Kiran's code, which gave me enough hints to get it working after some guessing and trial and error. Here is the code I ended up with, which I'd like to share with other beginners.

# python to change docx URL hyperlinks:
### see:

from docx import Document
from docx.opc.constants import RELATIONSHIP_TYPE as RT

print(" This program changes the hyperlinks detected in a word .docx file \n")

docx_file=input(" Pls input docx filename (without .docx): ")

document = Document(docx_file + ".docx")

rels = document.part.rels

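for rel in rels:
   if rels[rel].reltype == RT.HYPERLINK:
      print("\n Original link id -", rel, "with detected URL: ", rels[rel]._target)
      new_url=input(" Pls input new URL: ")
      # Point the relationship at the new URL (_target is a private attribute)
      rels[rel]._target=new_url

out_file=docx_file + "-out.docx"
# Write the modified document to a new file
document.save(out_file)

print("\n File saved to: ", out_file)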

Thank you, Lapyiu Ho
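
For anyone who wants to do the same replacement without prompts, here is a rough sketch that uses only the standard library instead of python-docx. It rewrites the Target attribute of hyperlink relationships in word/_rels/document.xml.rels and copies every other part of the archive unchanged. The function name replace_hyperlink_targets and the mapping argument are my own illustration, not part of any library, and it assumes Python 3.8+ and that the links live in the main document part (headers and footers keep their relationships in separate parts).

```python
import zipfile
import xml.etree.ElementTree as ET

PKG_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)
RELS_PART = "word/_rels/document.xml.rels"


def replace_hyperlink_targets(src, dst, mapping):
    """Copy the docx at src to dst, swapping hyperlink URLs found in mapping.

    Hypothetical helper: src and dst may be paths or binary file objects.
    Returns the number of relationships that were changed.
    """
    # Keep the relationships namespace as the default namespace on output
    # (note: this registration is global to ElementTree).
    ET.register_namespace("", PKG_NS)
    changed = 0
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == RELS_PART:
                root = ET.fromstring(data)
                for rel in root.iter(f"{{{PKG_NS}}}Relationship"):
                    old = rel.get("Target")
                    if rel.get("Type") == HYPERLINK_TYPE and old in mapping:
                        rel.set("Target", mapping[old])
                        changed += 1
                data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
            # Reuse the original entry metadata (name, compression type).
            zout.writestr(item, data)
    return changed
```

Calling replace_hyperlink_targets("in.docx", "in-out.docx", {"http://old.example/": "http://new.example/"}) would write a copy with that one URL swapped; the file names and URLs here are placeholders.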


I solved it with the following code, which prints the hyperlink targets from the docx file:

from docx import Document
from docx.opc.constants import RELATIONSHIP_TYPE as RT

document = Document('test.docx')
rels = document.part.rels

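def iter_hyperlink_rels(rels):
    # Yield the target URL of every hyperlink relationship in the document part.
    for rel in rels:
        if rels[rel].reltype == RT.HYPERLINK:
            yield rels[rel]._target

# Print each URL found in the document.
for url in iter_hyperlink_rels(rels):
    print(url)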
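
If python-docx is not available, the same lookup can be sketched with the standard library alone, since a .docx file is a zip archive and the hyperlink targets are stored in word/_rels/document.xml.rels. This is only a minimal sketch: the function name hyperlink_urls is made up for illustration, and it looks at the main document part only, so links in headers or footers (which have their own relationship parts) are not included.

```python
import zipfile
import xml.etree.ElementTree as ET

PKG_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)
RELS_PART = "word/_rels/document.xml.rels"


def hyperlink_urls(docx_path):
    """Return the targets of all hyperlink relationships of the main document part.

    Hypothetical helper: docx_path may be a path or a binary file object.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read(RELS_PART))
    # Each <Relationship> element carries Id, Type and Target attributes;
    # keep only the ones whose Type marks them as hyperlinks.
    return [
        rel.get("Target")
        for rel in root.iter(f"{{{PKG_NS}}}Relationship")
        if rel.get("Type") == HYPERLINK_TYPE
    ]
```

For example, hyperlink_urls("test.docx") would return a list of URL strings in the order they appear in the relationships part, which is not necessarily the order of the links in the text.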



You can use WPS to save the document as an .html file, then work with that file instead.
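
Once the document has been exported to HTML, the links should show up as href attributes on a elements, so they can be collected with the standard library's html.parser. This is just a sketch under that assumption: the names LinkCollector and links_in_html are invented here, and the exact markup and encoding that WPS writes are not something I have verified.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_in_html(html_text):
    """Return all href values found in an HTML string (hypothetical helper)."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links
```

To use it, read the exported file into a string with whatever encoding the export used and pass that string to links_in_html.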


I'm late to this party, but if you want something that pulls all the links out of .docx files and makes a spreadsheet of them (or returns a list of them), I have a script that might do that for you. It includes both the URL and the linked text, and you can feed it a whole folder if you want.

It uses BeautifulSoup and UnicodeCSV, both of which you can also grab from that same repo. Runs in Python 3. Instructions are at the top of the file. Handles non-ASCII characters. Only tested on Mac and Ubuntu so far. Excel does not reliably import Unicode CSVs, though Google Drive does. Offer void() where prohibited.
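
That script is not reproduced here, but the general idea of pairing each link's text with its URL can be sketched with the standard library (ElementTree and csv swapped in for BeautifulSoup and UnicodeCSV). In word/document.xml each w:hyperlink element carries an r:id that points at a relationship in word/_rels/document.xml.rels. The function names below are hypothetical, and the sketch ignores links stored as HYPERLINK field codes as well as links in headers and footers.

```python
import csv
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def hyperlinks_with_text(docx_path):
    """Return (linked text, URL) pairs from the main document body."""
    with zipfile.ZipFile(docx_path) as archive:
        rels_root = ET.fromstring(archive.read("word/_rels/document.xml.rels"))
        body_root = ET.fromstring(archive.read("word/document.xml"))
    # Map relationship ids (e.g. "rId5") to their targets.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels_root.iter(f"{{{PKG_NS}}}Relationship")
    }
    pairs = []
    for link in body_root.iter(f"{{{W_NS}}}hyperlink"):
        rel_id = link.get(f"{{{R_NS}}}id")
        if rel_id not in targets:
            continue  # e.g. internal bookmark links, which have no r:id
        # The visible text may be split across several runs; join them.
        text = "".join(t.text or "" for t in link.iter(f"{{{W_NS}}}t"))
        pairs.append((text, targets[rel_id]))
    return pairs


def write_links_csv(pairs, csv_file):
    """Write the pairs to an open text file object as CSV with a header row."""
    writer = csv.writer(csv_file)
    writer.writerow(["text", "url"])
    writer.writerows(pairs)
```

When writing to disk, open the output with open(path, "w", newline="", encoding="utf-8") so the csv module controls the line endings; looping over a folder of .docx files is then a matter of calling hyperlinks_with_text on each one.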


def iter_hyperlink_rels(rels):
    for rel in rels:
        if rels[rel].reltype == RT.HYPERLINK:
            yield rels[rel]

Yielding the relationship object itself, rather than its private _target attribute, would remove the error.

