How to extract the title of a PDF document from within a script for renaming?

长情又很酷 2021-02-01 21:34

I have thousands of PDF files on my computer, named a0001.pdf through a3621.pdf, and each one contains a title, e.g. "aluminum carbona

6 answers
  •  挽巷 (original poster)
    2021-02-01 21:53

    Installing the package

    This cannot be solved with plain Python, because the standard library has no PDF parser. You will need an external package such as pdfrw, which lets you read PDF metadata. It installs easily with pip, the standard Python package manager.

    On Windows, first make sure you have a recent version of pip using the shell command:

    python -m pip install -U pip
    

    On Linux:

    pip install -U pip
    

    On both platforms, then install the pdfrw package using:

    pip install pdfrw
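
    To confirm that the install worked before running anything on real files, a small check like the one below may help. It is only a sketch using the standard library; `is_installed` is a helper name made up for this illustration and is not part of pip or pdfrw.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can locate `module_name`."""
    # find_spec returns None when a top-level module cannot be found
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Prints True once `pip install pdfrw` has succeeded in this environment
    print("pdfrw available:", is_installed("pdfrw"))
```

    If this prints False, pip most likely installed into a different Python interpreter than the one running the script.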
    

    The code

    I combined the approaches of zeebonk and user2125722 to write something compact and readable that stays close to your original code:

    import os
    from pdfrw import PdfReader

    path = r'C:\Users\YANN\Desktop'


    def renameFileToPDFTitle(path, fileName):
        fullName = os.path.join(path, fileName)
        # Extract the title from the PDF's Info dictionary; either may be missing
        info = PdfReader(fullName).Info
        title = info.Title if info is not None else None
        # pdfrw returns the raw PDF string, e.g. "(Some title)", so remove the
        # surrounding parentheses
        newName = title.strip('()') if title else ''
        if not newName:
            print('No title found, skipping: ' + fileName)
            return
        newFullName = os.path.join(path, newName + '.pdf')
        os.rename(fullName, newFullName)


    for fileName in os.listdir(path):
        # Rename only PDF files (the extension check ignores case)
        fullName = os.path.join(path, fileName)
        if not os.path.isfile(fullName) or not fileName.lower().endswith('.pdf'):
            continue
        renameFileToPDFTitle(path, fileName)
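
    Two things the script above does not handle: a title may contain characters that are not allowed in file names (for example ":" or "/"), and two PDFs may share the same title. On Linux os.rename silently replaces an existing file, while on Windows it raises FileExistsError. The helpers below are a minimal sketch of one way to guard against both, using only the standard library. The names `safe_pdf_name` and `unique_path`, the 150-character limit, and the " (n)" suffix are assumptions made for this illustration.

```python
import os
import re


def safe_pdf_name(title, max_length=150):
    """Turn a raw title into a file name acceptable on Windows and Linux."""
    # Replace characters Windows forbids in file names, plus control characters
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', '_', title)
    # Collapse whitespace runs, then trim spaces and dots from both ends
    cleaned = re.sub(r'\s+', ' ', cleaned).strip(' .')
    # Truncate overly long titles and fall back to a placeholder if empty
    cleaned = cleaned[:max_length].strip(' .')
    return (cleaned or 'untitled') + '.pdf'


def unique_path(directory, file_name):
    """Return a path inside `directory` that does not exist yet."""
    stem, ext = os.path.splitext(file_name)
    candidate = os.path.join(directory, file_name)
    counter = 1
    # Append " (1)", " (2)", ... until the name is free
    while os.path.exists(candidate):
        candidate = os.path.join(directory, f'{stem} ({counter}){ext}')
        counter += 1
    return candidate
```

    In renameFileToPDFTitle, the stripped title (without the '.pdf' suffix) would go through safe_pdf_name, and the result through unique_path together with the folder path, before os.rename is called.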
    
