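PDF Data Extraction from the Perspective of Java PDF-to-Excel Conversion

I. Basics of PDF Data Extraction

Extracting data is an important part of PDF processing. PDF is designed for faithful, good-looking presentation rather than for editing, so its content usually cannot be modified or copied out in structured form directly. PDF libraries typically expose extracted content as text blocks or text streams, which must be parsed and formatted further before they can be exported. The goal of PDF data extraction is to convert and reuse the data inside a PDF, for example by converting a PDF into Excel or a similar format.

Java is a powerful programming language suited to many kinds of application development. Its large standard library and rich ecosystem of third-party libraries have made it widely used, and well studied, for PDF data extraction.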

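II. Implementing PDF Data Extraction

PDF extraction has to solve two basic problems: how to locate and extract arbitrary grids (tabular regions) in a PDF, and how to format the extracted data as readable text. PDF data extraction is divided into two parts: text extraction and table extraction.

1. Text Extraction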

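Text extraction is typically used to pull summary information out of a PDF, such as page titles and author names, or to extract the full text of a page. It can follow a stream-based strategy, which reads text in the order it appears in the page content, or a position-based strategy, which orders text by its coordinates on the page. Either way, results can be unsatisfactory when the PDF contains multiple embedded fonts, for example fonts without a usable Unicode mapping.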
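
To make the position-based strategy concrete, here is a minimal sketch in plain Java with no PDF library involved. The `Fragment` class and the `lineTolerance` parameter are hypothetical names introduced for illustration; a real extractor would take the coordinates from whatever the PDF library reports. Fragments whose vertical positions differ by no more than the tolerance are treated as one line, and each line is then read from left to right.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PositionOrder {

    /** A piece of text and the position where it starts, measured from the top-left corner (hypothetical model). */
    public static final class Fragment {
        final String text;
        final float x;
        final float y;

        public Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Groups fragments into lines by their y position, then joins each line from left to right. */
    public static List<String> toLines(List<Fragment> fragments, float lineTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        // Split the y-sorted fragments into lines; a fragment joins the current
        // line if it is close enough to the first fragment of that line.
        List<List<Fragment>> lines = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(f.y - last.get(0).y) <= lineTolerance) {
                last.add(f);
            } else {
                List<Fragment> line = new ArrayList<>();
                line.add(f);
                lines.add(line);
            }
        }

        // Within each line, order by x and join the pieces with single spaces.
        List<String> result = new ArrayList<>();
        for (List<Fragment> line : lines) {
            line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : line) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(f.text);
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

A stream-based strategy would skip the sorting entirely and emit the fragments in the order they arrive, which is cheaper but depends on how the producing application wrote the page.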

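PDFBox is an open-source Java library that supports text extraction through its PDFTextStripper class. The following sample code extracts text with PDFBox (written against the PDFBox 2.x API):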

    
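    import java.io.File;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    PDDocument document = PDDocument.load(new File("pdf.pdf")); // PDFBox 2.x API
    String content = new PDFTextStripper().getText(document);   // text of all pages
    document.close();                                           // release the file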

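2. Table Extraction

Table extraction usually means converting tables in a PDF into an Excel layout. Tables in a PDF are typically rendered as nothing more than text and ruling lines. Table extraction has to identify and separate the cells within a given region of the page, build a table model from them, and then convert that model into the target format.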
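
The "build a table model" step can be sketched in plain Java as follows. This is only an illustration under simplifying assumptions: the column left edges (`columnStarts`), the top of the table (`tableTop`) and a uniform `rowHeight` are all supplied by the caller, and the `Fragment` class is a hypothetical stand-in for positioned text coming from a PDF library. Real tables often need these values to be inferred from ruling lines or whitespace gaps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class TableModelSketch {

    /** A text fragment and the top-left-based position where it starts (hypothetical input type). */
    public static final class Fragment {
        final String text;
        final float x;
        final float y;

        public Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Assigns every fragment to a (row, column) cell and returns the rows from top to bottom.
     * columnStarts holds the left edge of each column in ascending order.
     */
    public static List<String[]> buildGrid(List<Fragment> fragments, float[] columnStarts,
                                           float tableTop, float rowHeight) {
        TreeMap<Integer, String[]> rows = new TreeMap<>();
        for (Fragment f : fragments) {
            int col = columnOf(f.x, columnStarts);
            if (col < 0 || f.y < tableTop) {
                continue; // the fragment lies outside the table region
            }
            int row = (int) ((f.y - tableTop) / rowHeight);
            String[] cells = rows.computeIfAbsent(row, r -> emptyRow(columnStarts.length));
            // Several fragments in one cell are joined with a space
            cells[col] = cells[col].isEmpty() ? f.text : cells[col] + " " + f.text;
        }
        return new ArrayList<>(rows.values());
    }

    /** Index of the last column whose left edge is at or before x, or -1 if there is none. */
    private static int columnOf(float x, float[] columnStarts) {
        int col = -1;
        for (int i = 0; i < columnStarts.length && columnStarts[i] <= x; i++) {
            col = i;
        }
        return col;
    }

    private static String[] emptyRow(int n) {
        String[] cells = new String[n];
        Arrays.fill(cells, "");
        return cells;
    }
}
```

The resulting list of `String[]` rows is the table model; converting it to the target format is then a matter of walking the rows in order.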

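A common approach to table extraction is region-based: the page is divided into regions, each containing table cells together with the characters and lines that belong to them. Each region is then extracted into a table model, and the models are merged into a single table. Commercial toolkits such as PDFlib TET (Text Extraction Toolkit) provide table detection of this kind; in open-source Java, PDFBox's PDFTextStripperByArea extracts the text inside caller-defined rectangular regions. The following sample code uses PDFTextStripperByArea for table extraction: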

    
    import java.awt.geom.Rectangle2D;
    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.text.PDFTextStripperByArea;

    public class PDFTableExtract {

        public static void main(String[] args) throws IOException {

            // One String[] per extracted table row: {ID, CELL1, CELL2, CELL3}
            List<String[]> records = new ArrayList<>();

            // PDFBox 2.x API; try-with-resources closes the document
            try (PDDocument document = PDDocument.load(new File("pdf.pdf"))) {

                int id = 1;
                for (PDPage page : document.getPages()) {

                    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                    stripper.setSortByPosition(true);

                    // PDF user space has its origin at the bottom-left corner, but
                    // PDFTextStripperByArea matches regions against text positions
                    // measured from the top-left corner, so y runs downward here.
                    float height = page.getMediaBox().getHeight();
                    float width = page.getMediaBox().getWidth();

                    // Assumed layout: three equal-width columns in a band that lies
                    // between 14 and 56 points above the bottom edge of the page.
                    // Adjust these values to match the actual table.
                    float colWidth = width / 3;
                    float top = height - 56f;
                    float bandHeight = 56f - 14f;

                    stripper.addRegion("cell1", new Rectangle2D.Float(0f, top, colWidth, bandHeight));
                    stripper.addRegion("cell2", new Rectangle2D.Float(colWidth, top, colWidth, bandHeight));
                    stripper.addRegion("cell3", new Rectangle2D.Float(2 * colWidth, top, colWidth, bandHeight));

                    stripper.extractRegions(page);

                    records.add(new String[] {
                        String.valueOf(id++),
                        stripper.getTextForRegion("cell1").trim(),
                        stripper.getTextForRegion("cell2").trim(),
                        stripper.getTextForRegion("cell3").trim()
                    });
                }
            }

            // Print the collected rows; a real program would pass them on for export
            for (String[] record : records) {
                System.out.println(String.join(" | ", record));
            }
        }
    }
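
The sample above has to reconcile two coordinate conventions: PDF measures from the bottom-left corner, while region rectangles are easier to reason about from the top-left. The sketch below, which uses only `java.awt.geom.Rectangle2D` from the JDK, shows the arithmetic in isolation. The class and method names (`RegionCoords`, `toTopLeft`, `fromFractions`) are hypothetical helpers introduced here, not part of PDFBox.

```java
import java.awt.geom.Rectangle2D;

public class RegionCoords {

    /**
     * Converts a rectangle given by its lower-left corner in bottom-left-origin
     * coordinates into a rectangle with a top-left origin and y running downward.
     */
    public static Rectangle2D.Float toTopLeft(float llx, float lly, float w, float h, float pageHeight) {
        // The upper edge sits at lly + h above the bottom of the page,
        // which is pageHeight - (lly + h) below the top of the page.
        return new Rectangle2D.Float(llx, pageHeight - (lly + h), w, h);
    }

    /**
     * Builds a top-left-origin rectangle from fractions (0..1) of the page size,
     * which keeps region definitions independent of the page dimensions.
     */
    public static Rectangle2D.Float fromFractions(float left, float top, float widthFrac, float heightFrac,
                                                  float pageWidth, float pageHeight) {
        return new Rectangle2D.Float(left * pageWidth, top * pageHeight,
                                     widthFrac * pageWidth, heightFrac * pageHeight);
    }
}
```

For example, on a page 842 points tall, a band that starts 14 points above the bottom edge and is 42 points high maps to `toTopLeft(0f, 14f, 200f, 42f, 842f)`, whose `y` is 842 - 56 = 786.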

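III. Converting PDF to Excel

A PDF extraction tool can write the extracted text or tables to an XML or JSON file, and an API such as Apache POI can then turn that file into an Excel workbook. Apache POI (the name was originally a humorous acronym for "Poor Obfuscation Implementation") is an open-source Java API for working with Microsoft Office document formats. The following sample code performs the final step of the PDF-to-Excel conversion, reading rows from a JSON file and writing them to an .xls workbook: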

    
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Iterator;
    import org.apache.poi.hssf.usermodel.HSSFCell;
    import org.apache.poi.hssf.usermodel.HSSFRow;
    import org.apache.poi.hssf.usermodel.HSSFSheet;
    import org.apache.poi.hssf.usermodel.HSSFWorkbook;
    import org.json.JSONArray;
    import org.json.JSONObject;

    public class PDF2Excel {
        public static void main(String[] args) throws IOException {

            // Read the JSON data: a top-level "rows" array of objects with string values
            String filepath = "data.json";
            String jsonData = new String(Files.readAllBytes(Paths.get(filepath)), StandardCharsets.UTF_8);
            JSONObject json = new JSONObject(jsonData);
            JSONArray jsonArray = json.getJSONArray("rows");

            // Create the workbook and output stream; both are closed automatically
            try (HSSFWorkbook workbook = new HSSFWorkbook();
                 FileOutputStream fileOutputStream = new FileOutputStream("output.xls")) {

                HSSFSheet sheet = workbook.createSheet("sheet1");

                for (int rowNum = 0; rowNum < jsonArray.length(); rowNum++) {
                    HSSFRow row = sheet.createRow(rowNum);
                    JSONObject rowJson = jsonArray.getJSONObject(rowNum);

                    // Note: org.json does not guarantee key order, so the column
                    // order may differ from the order in the file
                    int cellNum = 0;
                    Iterator<String> cellIter = rowJson.keys();
                    while (cellIter.hasNext()) {
                        String cellData = rowJson.getString(cellIter.next());
                        HSSFCell cell = row.createCell(cellNum++);
                        cell.setCellValue(cellData);
                    }
                }

                workbook.write(fileOutputStream);
            }
        }
    }
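
The converter above expects `data.json` to contain a top-level `rows` array whose elements are objects with string values. As a rough sketch of the missing link between extraction and conversion, the following plain-Java class serializes an in-memory table into that shape without any JSON library. The class name `RowsJson` and the column keys `c0`, `c1`, ... are assumptions made for this illustration; the original article does not specify the key names.

```java
import java.util.List;

public class RowsJson {

    /** Escapes characters that may not appear unescaped inside a JSON string. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the \\uXXXX form
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Serializes the rows as {"rows":[{"c0":"...","c1":"..."}, ...]}. */
    public static String toJson(List<String[]> rows) {
        StringBuilder sb = new StringBuilder("{\"rows\":[");
        for (int r = 0; r < rows.size(); r++) {
            if (r > 0) {
                sb.append(',');
            }
            sb.append('{');
            String[] cells = rows.get(r);
            for (int c = 0; c < cells.length; c++) {
                if (c > 0) {
                    sb.append(',');
                }
                sb.append("\"c").append(c).append("\":\"").append(escape(cells[c])).append('"');
            }
            sb.append('}');
        }
        return sb.append("]}").toString();
    }
}
```

Because a JSON object is an unordered collection of keys, a format that needs a stable column order may be better served by representing each row as a JSON array instead.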

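IV. Conclusion

PDF is convenient in many situations, but when its data has to be moved into Excel or a similar format for analysis, PDF conversion tools become especially important. Using Java together with libraries such as PDFBox and Apache POI, this article walked through text and table extraction from PDF and the final conversion to Excel. Hopefully it is useful to developers who extract PDF data with Java.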

Original article by GCYD. If you repost it, please credit the source: https://www.506064.com/zh-tw/n/131470.html