Extracting bold text from resumes (.docx, .doc, PDF) using Python

攒了一身酷

I have thousands of resumes in various formats: Word (.doc and .docx) and PDF.

I want to extract the bold text from these documents using the textract library in Python. is t

1 answer

一生所求

2021-01-15 18:30
An easy solution is to use the python-docx package. Install it with pip install python-docx (in a Jupyter notebook, prefix the command with !).

python-docx only reads .docx files, so you'll need to convert your PDF files (and any legacy .doc files) to .docx first. You can do that with any online PDF-to-DOCX converter, or do the conversion in Python.

The following code extracts all bold and italic runs from a resume and saves them in a dictionary called boltalic_Dict, so you can retrieve either list later on.
```
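from docx import Document  # python-docx; import only what is needed

# Document() opens a single .docx file, so pass one file path at a time
document = Document('path_to_your_files')
bolds = []
italics = []
for para in document.paragraphs:
    for run in para.runs:
        # run.italic / run.bold are True only when set directly on the run;
        # they are None when the formatting is inherited from a style
        if run.italic:
            italics.append(run.text)
        if run.bold:
            bolds.append(run.text)

boltalic_Dict = {'bold_phrases': bolds,
                 'italic_phrases': italics}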
```
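Since you have thousands of files, it may help to see what the bold check boils down to and how to run it over a whole folder. The following is only a rough sketch using the standard library, not a replacement for python-docx: a .docx file is a zip archive whose word/document.xml stores each run as a w:r element, with direct bold formatting marked by a w:b element in the run properties. The function names bold_runs and bold_runs_in_folder are hypothetical, and bold that comes from a paragraph or character style is not detected here.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def is_bold(run):
    """Return True if the run has an explicit, enabled <w:b> flag."""
    rpr = run.find(W + "rPr")
    b = rpr.find(W + "b") if rpr is not None else None
    if b is None:
        return False  # no direct formatting; style-inherited bold is NOT detected
    # <w:b/> means on; w:val="false", "0" or "off" means explicitly off
    return b.get(W + "val", "true") not in ("false", "0", "off")


def bold_runs(docx_path):
    """Collect the text of every explicitly bold run in one .docx file."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    found = []
    for run in root.iter(W + "r"):
        if is_bold(run):
            text = "".join(t.text or "" for t in run.iter(W + "t"))
            if text.strip():  # skip runs that contain only whitespace
                found.append(text)
    return found


def bold_runs_in_folder(folder):
    """Map each .docx file name directly inside `folder` to its bold runs."""
    return {p.name: bold_runs(p) for p in sorted(Path(folder).glob("*.docx"))}
```

Calling bold_runs_in_folder on a directory of converted resumes gives one list of bold phrases per file, which you can then filter for headings or skill names.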
I hope this helps.