How to read the contents of a table in an MS Word file using Python?

栀梦 2020-11-27 14:11

How can I read and process contents of every cell of a table in a DOCX file?

I am using Python 3.2 on Windows 7 and PyWin32 to access the MS-Word Document.

I

3 answers
  • 2020-11-27 14:51

    Here is what works for me in Python 2.7:

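    import win32com.client as win32

    word = win32.Dispatch("Word.Application")
    word.Visible = False  # keep the Word window hidden
    # Word resolves relative paths against its own working directory,
    # so passing a full path is usually safer.
    doc = word.Documents.Open("MyDocument")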
    

    To see how many tables your document has:

    doc.Tables.Count
    

    Then you can select the table you want by its index. Note that, unlike Python, COM indexing starts at 1:

    table = doc.Tables(1)
    

    To select a cell:

    table.Cell(Row=1, Column=1)
    

    To get its content:

    table.Cell(Row=1, Column=1).Range.Text
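
    One detail worth knowing when reading cell content this way: Word normally ends the text of every table cell with a carriage return followed by the end-of-cell character (`"\r\x07"`). A minimal sketch of a helper that removes it; the name `clean_cell_text` is hypothetical and not part of PyWin32:

```python
CELL_END = "\r\x07"  # marker Word appends to the text of every table cell


def clean_cell_text(raw):
    """Return the cell text without Word's trailing end-of-cell marker."""
    if raw.endswith(CELL_END):
        return raw[:-len(CELL_END)]
    return raw


# With the COM objects from this answer, usage would look like:
#     text = clean_cell_text(table.Cell(Row=1, Column=1).Range.Text)
```

    Slicing off exactly one marker, rather than stripping characters, keeps any paragraph breaks that belong to the cell's own content.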
    

    Hope that this helps.

    EDIT:

    An example of a function that returns a column index based on its heading:

    def Column_index(header_text):
        for i in range(1, table.Columns.Count + 1):
            # Word ends each cell's text with "\r\x07", so strip it before comparing.
            text = table.Cell(Row=1, Column=i).Range.Text
            if text.rstrip("\r\x07") == header_text:
                return i
        return None  # no column has this heading
    

    Then you can access the cell you want this way, for example:

    table.Cell(Row=1, Column=Column_index("The Column Header")).Range.Text
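
    Since the question asks about processing every cell, the same calls can be put in a loop. This is only a sketch under assumptions: `table_to_rows` is a hypothetical helper name, and the table is assumed to be a regular grid without merged cells (asking COM for a cell that does not exist raises an error).

```python
def table_to_rows(table):
    """Copy a Word COM table into a list of rows, each a list of cell strings."""
    end_marker = "\r\x07"  # Word appends this to every cell's Range.Text
    rows = []
    for r in range(1, table.Rows.Count + 1):        # COM indices start at 1
        row = []
        for c in range(1, table.Columns.Count + 1):
            text = table.Cell(Row=r, Column=c).Range.Text
            if text.endswith(end_marker):
                text = text[:-len(end_marker)]
            row.append(text)
        rows.append(row)
    return rows


# Usage with the objects from this answer:
#     rows = table_to_rows(doc.Tables(1))
#     doc.Close(False)   # close without saving changes
#     word.Quit()        # shut down the hidden Word instance
```

    Closing the document and quitting Word at the end matters here because the instance was started invisibly and would otherwise keep running in the background.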
    
  • 2020-11-27 15:04

    Jumping in rather late, but thought I'd put this out anyway: now (2015) you can use the pretty neat python-docx library: https://python-docx.readthedocs.org/en/latest/. And then:

    from docx import Document
    
    wordDoc = Document('<path to docx file>')
    
    for table in wordDoc.tables:
        for row in table.rows:
            for cell in row.cells:
                print(cell.text)
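
    To process the cells rather than just print them, it can help to collect each table into records first. A minimal sketch, assuming the first row of the table is a header row; `table_to_dicts` is a hypothetical helper name, and it only relies on the `rows`, `cells` and `text` attributes used in this answer:

```python
def table_to_dicts(table):
    """Turn a table into one dict per data row, keyed by the header row."""
    rows = [[cell.text.strip() for cell in row.cells] for row in table.rows]
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Usage, assuming python-docx is installed ('report.docx' is a placeholder name):
#     from docx import Document
#     records = table_to_dicts(Document('report.docx').tables[0])
```

    Note that with merged cells python-docx returns the same cell once for each grid position it spans, so the same text can appear more than once in a row.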
    
  • 2020-11-27 15:08

    I found a simple code snippet in a blog post, "Reading Table Contents Using Python", by etienne.

    The great thing about it is that you don't need any non-standard Python libraries installed.

    The format of a docx file is described by the Office Open XML standard.

    import zipfile
    import xml.etree.ElementTree
    
    WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
    PARA = WORD_NAMESPACE + 'p'
    TEXT = WORD_NAMESPACE + 't'
    TABLE = WORD_NAMESPACE + 'tbl'
    ROW = WORD_NAMESPACE + 'tr'
    CELL = WORD_NAMESPACE + 'tc'
    
    with zipfile.ZipFile('<path to docx file>') as docx:
        tree = xml.etree.ElementTree.XML(docx.read('word/document.xml'))
    
    for table in tree.iter(TABLE):
        for row in table.iter(ROW):
            for cell in row.iter(CELL):
                print(''.join(node.text or '' for node in cell.iter(TEXT)))
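
    The same approach can return the data instead of printing it. The following is a sketch for Python 3, not the blog's original code; `read_tables` is a hypothetical name. It looks only at direct child rows and cells, so a nested table is reported as its own entry and its text is not mixed into the outer cell.

```python
import zipfile
import xml.etree.ElementTree as ET

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def read_tables(docx_file):
    """Return every table as a list of rows, each row a list of cell strings."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    tables = []
    for tbl in root.iter(W + 'tbl'):
        rows = []
        for tr in tbl.findall(W + 'tr'):        # direct child rows only
            cells = []
            for tc in tr.findall(W + 'tc'):     # direct child cells only
                # One string per paragraph; text runs inside it are joined.
                paragraphs = [
                    ''.join(t.text or '' for t in p.iter(W + 't'))
                    for p in tc.findall(W + 'p')
                ]
                cells.append('\n'.join(paragraphs))
            rows.append(cells)
        tables.append(rows)
    return tables
```

    `read_tables` accepts a path or any file-like object that `zipfile.ZipFile` can open, so `read_tables('<path to docx file>')[0]` would give the rows of the first table.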
    