How to extract the title of a PDF document from within a script for renaming?

长情又很酷 2021-02-01 21:34

I have thousands of PDF files on my computer, named a0001.pdf to a3621.pdf, and each of them contains a title; e.g. "aluminum carbona

6 answers
  • 2021-02-01 21:44

    What you need is a library that can actually read PDF files. For example pdfrw:

    In [8]: from pdfrw import PdfReader
    
    In [9]: reader = PdfReader('example.pdf')
    
    In [10]: reader.Info.Title
    Out[10]: 'Example PDF document'
    
  • 2021-02-01 21:52

    You can look at just the metadata using the Ghostscript tool pdf_info.ps. It used to ship with Ghostscript and is still available at https://r-forge.r-project.org/scm/viewvc.php/pkg/inst/ghostscript/pdf_info.ps?view=markup&root=tm

  • 2021-02-01 21:53

    Installing the package

    This cannot be solved with plain Python. You will need an external package such as pdfrw, which allows you to read PDF metadata. The installation is quite easy using the standard Python package manager pip.

    On Windows, first make sure you have a recent version of pip using the shell command:

    python -m pip install -U pip
    

    On Linux:

    pip install -U pip
    

    On both platforms, then install the pdfrw package using:

    pip install pdfrw
    

    The code

    I combined the approaches of zeebonk and user2125722 to write something compact and readable that stays close to your original code:

    import os
    from pdfrw import PdfReader
    
    path = r'C:\Users\YANN\Desktop'
    
    
    def renameFileToPDFTitle(path, fileName):
        fullName = os.path.join(path, fileName)
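        # Extract the title from the PDF's Info dictionary; either may be missing
        info = PdfReader(fullName).Info
        newName = info.Title if info else None
        if not newName:
            return  # No title metadata, so leave the file unchanged
        # Remove surrounding brackets that some pdf titles have
        newName = newName.strip('()') + '.pdf'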
        newFullName = os.path.join(path, newName)
        os.rename(fullName, newFullName)
    
    
    for fileName in os.listdir(path):
        # Rename only pdf files
        fullName = os.path.join(path, fileName)
        if not os.path.isfile(fullName) or not fileName.lower().endswith('.pdf'):
            continue
        renameFileToPDFTitle(path, fileName)
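
    Before renaming thousands of files, it can help to compute the new names first and review them. The following is a minimal sketch, not part of the answer above: plan_renames is a hypothetical helper that takes titles that were already extracted (for example with PdfReader(...).Info.Title), skips files without a title, and numbers duplicate titles so that no file overwrites another. It does not check for files that already exist on disk.

```python
def plan_renames(titles):
    """Map old file names to new ones without touching the disk.

    `titles` maps each existing file name to its extracted title,
    or to None when the PDF has no title metadata.
    """
    plan = {}
    used = set()
    for old_name, title in sorted(titles.items()):
        if not title:
            continue  # no title metadata: keep the old name
        base = title.strip('()').strip()
        if not base:
            continue  # title was empty apart from brackets
        new_name = base + '.pdf'
        counter = 2
        # Add a numeric suffix until the name is unique within this plan
        while new_name.lower() in used:
            new_name = '{} ({}).pdf'.format(base, counter)
            counter += 1
        used.add(new_name.lower())
        plan[old_name] = new_name
    return plan


if __name__ == '__main__':
    sample = {
        'a0001.pdf': '(Example PDF document)',
        'a0002.pdf': None,
        'a0003.pdf': '(Example PDF document)',
    }
    for old, new in plan_renames(sample).items():
        print(old, '->', new)
```

    Once the plan looks right, each pair can be passed to os.rename.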
    
  • 2021-02-01 21:53

    Once you have installed it, open the app and go to the Download folder. You will see your downloaded files there. Just long press the file you wish to rename and the Rename option will appear at the bottom.

  • 2021-02-01 22:05

    You can use the pdfminer library to parse the PDFs. The info property contains the title of the PDF. Here is what a sample info value looks like:

    [{'CreationDate': "D:20170110095753+05'30'", 'Producer': 'PDF-XChange Printer V6 (6.0 build 317.1) [Windows 10 Enterprise x64 (Build 10586)]', 'Creator': 'PDF-XChange Office Addin', 'Title': 'Python Basics'}]
    

    We can then read the title from that dictionary. Here is the whole code (including iterating over all the files and renaming them):

    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    import os
    
    for i in range(1, 3622):
        # Build the names a0001.pdf ... a3621.pdf
        file_name = "a" + str(i).zfill(4) + ".pdf"
        # Close the file before renaming it (required on Windows)
        with open(file_name, 'rb') as fp:
            parser = PDFParser(fp)
            doc = PDFDocument(parser)
            metadata = doc.info  # The "Info" metadata: a list of dictionaries
        print(metadata)
        if not metadata:
            continue  # No Info dictionary at all
        title = metadata[0].get("Title")
        if isinstance(title, bytes):
            # Under Python 3 the value may be raw bytes;
            # this assumes a plain ASCII/UTF-8 title
            title = title.decode("utf-8", errors="ignore")
        if title:
            os.rename(file_name, title + ".pdf")
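
    With pdfminer.six under Python 3, the values in doc.info are typically bytes rather than str, so the title has to be decoded before it can be used as a file name. The following is a minimal sketch of such a decoder, not part of the answer above. decode_pdf_text is a hypothetical helper, and treating PDFDocEncoding as Latin-1 is a simplifying assumption that is only safe for plain ASCII titles.

```python
def decode_pdf_text(raw):
    """Turn a raw PDF text string into a Python str.

    PDF text strings are either UTF-16BE with a leading byte order
    mark (0xFE 0xFF) or PDFDocEncoding. Latin-1 is used here as an
    approximation of PDFDocEncoding (an assumption of this sketch).
    """
    if isinstance(raw, str):
        return raw  # already decoded, nothing to do
    if raw.startswith(b'\xfe\xff'):
        # Skip the two-byte BOM, then decode the rest as UTF-16BE
        return raw[2:].decode('utf-16-be', errors='replace')
    return raw.decode('latin-1')


if __name__ == '__main__':
    print(decode_pdf_text(b'Python Basics'))
```

    The result can then be joined with ".pdf" and handed to os.rename.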
    
  • 2021-02-01 22:06

    Building on Ciprian Tomoiagă's suggestion of using pdfrw, I've uploaded a script which also:

    • renames files in sub-directories
    • adds a command-line interface
    • handles an already existing file name by appending a random string
    • strips any character that is not alphanumeric from the new file name
    • replaces non-ASCII characters (such as á è í ò ç...) with their ASCII equivalents (a e i o c) in the new file name
    • lets you set the root directory and limit the length of the new file name from the command line
    • shows a progress bar and, after the script has finished, some statistics
    • does some error handling
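
    To illustrate the clean-up steps in the list above, here is a minimal sketch using only the standard library. It is not the code from the repository: sanitize_title is a hypothetical helper, and replacing runs of non-alphanumeric characters with an underscore (rather than deleting them) is a choice made for this example.

```python
import re
import unicodedata


def sanitize_title(title, max_length=120):
    """Return a file-name-safe version of a PDF title."""
    # Decompose accented characters (á becomes a + combining accent)
    # and drop everything that cannot be expressed in ASCII
    ascii_title = (unicodedata.normalize('NFKD', title)
                   .encode('ascii', 'ignore')
                   .decode('ascii'))
    # Replace each run of non-alphanumeric characters with one underscore
    cleaned = re.sub(r'[^A-Za-z0-9]+', '_', ascii_title).strip('_')
    # Truncate overly long titles
    return cleaned[:max_length]


if __name__ == '__main__':
    print(sanitize_title('Python Basics: Part 1!'))
```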

    As TextGeek mentioned, unfortunately not all files have the title metadata, so some files won't be renamed.

    Repository: https://github.com/favict/pdf_renamefy

    Usage:

    After downloading the files, install the dependencies by running pip:

    $ pip install -r requirements.txt
    

    and then to run the script:

    $ python -m renamefy <directory> <filename maximum length>
    

    ...where directory is the full path in which to look for PDF files, and filename maximum length is the length to which the file name is truncated when the title is too long or was set incorrectly in the file.

    Both parameters are optional. If neither is provided, the directory is set to the current directory and filename maximum length is set to 120 characters.
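
    Two optional positional parameters with these defaults could be declared with argparse roughly as follows. This is only a sketch of the idea, not the repository's actual code; build_parser and the argument names directory and max_length are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Create a parser with two optional positional arguments."""
    parser = argparse.ArgumentParser(
        description='Rename PDF files after their title metadata.')
    # nargs='?' makes a positional argument optional
    parser.add_argument('directory', nargs='?', default='.',
                        help='root directory to search for PDF files')
    parser.add_argument('max_length', nargs='?', type=int, default=120,
                        help='maximum length of the new file name')
    return parser
```

    Calling build_parser().parse_args([]) yields the defaults, while passing a directory and a number overrides them in that order.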

    Example:

    $ python -m renamefy C:\Users\John\Downloads 120
    

    I used it on Windows, but it should work on Linux too.

    Feel free to copy, fork and edit as you see fit.
