PDFminer Explained

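PDF files can be everyday personal documents such as résumés, documents involved in legal or financial transactions, academic papers, and so on. This article walks through the PDFminer library and its features in detail, giving developers the basics of processing and extracting data from PDF files.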

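I. Introduction to PDFminer

PDFminer is a Python toolkit for extracting text content and metadata from PDF documents. Beyond parsing PDF files, it can trace internal links, extract text by block or by line, and more. The pdfminer.six fork installed below originally supported both Python 2 and Python 3; its current releases require Python 3. It is written in pure Python and runs on all major operating systems.

Install it with pip: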

pip install pdfminer.six
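
To confirm that the installation worked, one option is to ask the standard library which version of the distribution is installed. This is only a small sketch; the helper name installed_version is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="pdfminer.six"):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print(installed_version() or "pdfminer.six is not installed")
```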

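PDFminer offers many features, including parsing the PDF document structure, extracting text, locating objects, and handling fonts and images. We discuss each in turn.

II. PDFminer's Features

1. Parsing the PDF document structure

When PDFminer parses text out of the PDF document structure, you can control the analysis by passing layout parameters (the LAParams object). You can then choose to emit the result as XML or HTML, so there is no need to flatten the PDF to a plain-text file first.

The code below shows how to parse the structure of a PDF file with PDFminer and render it as XML: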

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
# StringIO is used by the text extraction example further on.
from io import BytesIO, StringIO

# XMLConverter encodes its output with its codec (utf-8 by default),
# so it needs a binary buffer rather than a StringIO.
output_bytes = BytesIO()
with open('example.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    document = PDFDocument(parser)
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed

    # Create PDFResourceManager object that stores shared resources.
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()

    device = XMLConverter(rsrcmgr, output_bytes, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Process each page contained in the document.
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)

    # Closing the device writes the closing </pages> tag.
    device.close()

    print(output_bytes.getvalue().decode('utf-8'))

The code above parses the PDF file example.pdf and prints it to the console as XML.
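
Once the XML is available, it can be post-processed with the standard library. The sketch below rebuilds the text lines of each page with xml.etree.ElementTree. SAMPLE_XML is a hand-written, abridged stand-in that assumes the page / textbox / textline / text nesting produced by XMLConverter; the function name lines_per_page is made up for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, abridged: real output carries more attributes.
SAMPLE_XML = """<pages>
<page id="1" bbox="0.000,0.000,612.000,792.000" rotate="0">
<textbox id="0" bbox="72.000,700.000,86.000,712.000">
<textline bbox="72.000,700.000,86.000,712.000">
<text font="Helvetica" bbox="72.000,700.000,80.000,712.000" size="12.000">H</text>
<text font="Helvetica" bbox="80.000,700.000,86.000,712.000" size="12.000">i</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>
"""


def lines_per_page(xml_text):
    """Map each page id to the list of text lines found on that page."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    result = {}
    for page in root.iter("page"):
        lines = []
        for textline in page.iter("textline"):
            # Each <text> element holds one character (or a line break).
            chars = "".join(t.text or "" for t in textline.findall("text"))
            lines.append(chars.rstrip("\n"))
        result[page.get("id")] = lines
    return result


print(lines_per_page(SAMPLE_XML))
```

Keeping the per-character elements grouped by textline preserves the reading order that the layout analysis decided on, which is usually what you want before any further text processing.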

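2. Extracting PDF text

Extracting text from PDFs is one of PDFminer's main features. We can use a TextConverter to render the content as plain text and then work with it inside a Python application.

Here is sample code for extracting text: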

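from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('example.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    document = PDFDocument(parser)
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed

    # Create PDFResourceManager object that stores shared resources.
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()

    device = TextConverter(rsrcmgr, output_string, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Process each page contained in the document.
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)

    print(output_string.getvalue())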

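The code above parses the PDF file example.pdf and collects the extracted text in the output_string object. The example uses a TextConverter to render the text as plain text; the result can be written to a file or printed to the console.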
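
As a shorter alternative, pdfminer.six also ships a high-level helper, pdfminer.high_level.extract_text, that wraps the same steps. The sketch below is illustrative only: the function names split_pages and extract_pages are made up for this example, and it assumes that pages in the extracted text are separated by a form feed character ('\f'), which is what TextConverter writes at the end of each page. The library import sits inside extract_pages so that split_pages can be tried without a PDF at hand.

```python
def split_pages(extracted):
    """Split extracted text into per-page strings on the form feed separator."""
    pages = extracted.split("\f")
    # The text ends with a form feed, so drop the empty trailing piece.
    if pages and not pages[-1].strip():
        pages.pop()
    return pages


def extract_pages(pdf_path):
    """Return the text of each page of pdf_path as a list of strings."""
    # Imported here so split_pages stays usable without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return split_pages(extract_text(pdf_path))


# Demonstration on a hand-written string shaped like TextConverter output.
sample = "First page\n\fSecond page\n\f"
for number, text in enumerate(split_pages(sample), start=1):
    print(number, repr(text))
```

With a real file, extract_pages('example.pdf') would return one string per page, which is convenient when later processing needs page numbers.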

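3. Locating objects

PDFminer also lets us look up specific objects through the parsed PDFDocument and pull information out of the PDF file.

The sample code below shows how to find all links in a PDF file: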

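from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
from pdfminer.pdfpage import PDFPage
from pdfminer.pdftypes import resolve1
from pdfminer.psparser import LIT

# PDF names such as /Link are interned PSLiteral objects, not strings.
LITERAL_LINK = LIT('Link')

with open('example.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    document = PDFDocument(parser)
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed

    for page_number, page in enumerate(PDFPage.create_pages(document), start=1):
        # Links are annotations: they live in the page's /Annots array.
        for ref in resolve1(page.annots) or []:
            annot = resolve1(ref)
            if not isinstance(annot, dict):
                continue
            if annot.get('Subtype') is not LITERAL_LINK:
                continue
            # External links carry a URI action; internal links use
            # /Dest or a GoTo action instead, so URI may be None.
            action = resolve1(annot.get('A')) or {}
            print(page_number, annot.get('Rect'), action.get('URI'))

The code above parses the PDF file example.pdf and lists the links on every page. PDFParser and PDFDocument load the file; because links are stored as annotations, the loop reads each page's annots attribute (the /Annots array), resolves the indirect references with resolve1, and keeps the entries whose Subtype is the name Link.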

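4. Handling fonts and images

PDF documents often contain fonts and images, and PDFminer can expose both. The PDFResourceManager caches shared resources such as fonts while the pages are interpreted. In the sample code below, a PDFPageAggregator device collects the layout of each page; every image found is exported to the outputdir directory with ImageWriter, and the font names used on the page are printed:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTChar, LTContainer, LTImage
from pdfminer.image import ImageWriter


def walk(item):
    """Yield a layout item and, recursively, everything nested inside it."""
    yield item
    if isinstance(item, LTContainer):
        for child in item:
            yield from walk(child)


with open('example.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    document = PDFDocument(parser)
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed

    # Create PDFResourceManager object that stores shared resources.
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()

    # PDFPageAggregator keeps the analysed layout of the last processed page.
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # Create ImageWriter object to write extracted images to a directory.
    image_writer = ImageWriter('outputdir')

    # Process each page contained in the document.
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        layout = device.get_result()

        fonts = set()
        for item in walk(layout):
            if isinstance(item, LTImage):
                # export_image returns the name of the file it wrote.
                filename = image_writer.export_image(item)
                width, height = item.srcsize
                print("Found an image with size: {}x{} in file: {}".format(
                    width, height, filename))
            elif isinstance(item, LTChar):
                fonts.add(item.fontname)

        print("Found the following fonts:\n{}".format(fonts))

In the sample code above, ImageWriter exports every image in the PDF to the outputdir directory. The font name of each character, available as the fontname attribute of LTChar, is collected into one set per page. Note that this yields the font names, not the embedded font files themselves.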
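
Font names reported for a PDF frequently carry a subset tag: six uppercase letters and a plus sign in front of the real name, for example ABCDEF+Helvetica. The following is a rough sketch for tallying names with the tag removed, using only the standard library. The function name summarize_fonts and the sample names are hypothetical.

```python
import re
from collections import Counter

# A subset tag is six uppercase letters followed by "+".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def summarize_fonts(font_names):
    """Count font names after stripping any subset tag from each one."""
    return Counter(SUBSET_TAG.sub("", name) for name in font_names)


# Hypothetical names, one per character, as they might be collected from a page.
sample_names = ["ABCDEF+Helvetica", "Helvetica", "GHIJKL+Times-Roman"]
for name, count in summarize_fonts(sample_names).most_common():
    print(name, count)
```

Counting per character rather than per page gives a quick sense of which font dominates a document, for example to tell body text from headings.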

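III. Summary

PDFminer is a Python library for working with PDF documents. It can parse PDF documents, extract text and metadata, and locate objects in a PDF together with their attributes. It can also handle the fonts and images inside PDF files. Together, these features make it easier for developers to pull information out of PDF documents and process the data more efficiently.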

Original article by HVMX. If you republish it, please credit the source: https://www.506064.com/zh-tw/n/136592.html