Python pytesseract: A Versatile OCR Library

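As computer vision and deep learning continue to advance, optical character recognition (OCR) has become an important field. pytesseract is an open-source Python library that wraps the Tesseract OCR engine and provides a simple way to recognize text in images; PDF pages must be converted to images before they can be recognized.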

1. Installation

The pytesseract library can be installed with pip:

pip install pytesseract

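pytesseract depends on the Tesseract OCR engine, which must be installed separately beforehand. On Linux and macOS it is usually available through the system package manager. Windows installers can be downloaded from: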

https://github.com/UB-Mannheim/tesseract/wiki
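
Before moving on, it can help to confirm that the Tesseract executable is actually reachable. The following is a minimal sketch, assuming the binary is named tesseract. The helper names locate_tesseract and point_pytesseract_at are hypothetical and not part of pytesseract; pytesseract.pytesseract.tesseract_cmd and pytesseract.get_tesseract_version() are provided by the library.

```python
import shutil


def locate_tesseract(names=("tesseract",)):
    """Return the path of the first matching executable on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


def point_pytesseract_at(path):
    """Tell pytesseract which executable to run and return its version."""
    # Imported lazily so locate_tesseract() works even without pytesseract.
    import pytesseract

    pytesseract.pytesseract.tesseract_cmd = path
    return pytesseract.get_tesseract_version()


found = locate_tesseract()
print(found or "Tesseract not found on PATH")
```

If Tesseract is installed in a directory that is not on PATH, which is common on Windows, the full path of the executable can be passed to point_pytesseract_at() directly.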

2. Basic Usage

Once pytesseract and Tesseract OCR are both installed, pytesseract can be used to recognize the text in an image. Here is an example:

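# Import the pytesseract library
import pytesseract
# Import the Image module
from PIL import Image

# Open the image
image = Image.open('example.png')

# Recognize the text in the image (requires the chi_sim language data)
text = pytesseract.image_to_string(image, lang='chi_sim')

# Print the recognized text
print(text)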

The code above first opens an image named 'example.png' with the Image module from PIL. It then calls pytesseract's image_to_string() function to convert the text in the image into a string. Finally, it prints the text.
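
The string returned by image_to_string() is raw engine output. It often contains trailing spaces and blank lines, and recent Tesseract versions typically append a form-feed character as a page separator. The sketch below shows one possible way to tidy it up. Both clean_ocr_text and ocr_file are hypothetical helper names, and dropping blank lines is an assumption that may not suit every document.

```python
def clean_ocr_text(raw):
    """Drop the form-feed separator, trailing spaces, and blank lines."""
    lines = (line.rstrip() for line in raw.replace("\x0c", "").splitlines())
    return "\n".join(line for line in lines if line)


def ocr_file(path, lang="eng"):
    """Run OCR on an image file and return the cleaned text."""
    # Imported lazily so clean_ocr_text() can be used without pytesseract.
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        return clean_ocr_text(pytesseract.image_to_string(image, lang=lang))


print(clean_ocr_text("Hello  \n\n world\n\x0c"))
```

With the files from the example above, the call would look like ocr_file('example.png', lang='chi_sim').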

3. Setting Parameters

In practice, pytesseract exposes several parameters that help tune recognition. The most common ones are:

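lang: The language the OCR engine should use. For example, 'eng' means English and 'chi_sim' means Simplified Chinese.
config: Extra command-line options passed to Tesseract OCR. For example, '--psm 10' tells Tesseract to treat the image as a single character.
psm: The page segmentation mode of Tesseract OCR. It is not a separate argument of image_to_string(); it is set through config as '--psm N'. Images differ in how much text they contain and how it is laid out, so choosing a matching segmentation mode can improve recognition quality.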
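
Since these options all end up in a single config string, a small helper can keep that string readable. The following is a sketch only: build_config is a hypothetical name, and the accepted ranges (0 to 13 for --psm, 0 to 3 for --oem) are taken from the Tesseract command-line help and may differ in older versions.

```python
def build_config(psm=None, oem=None, **variables):
    """Assemble a Tesseract option string for the config parameter."""
    parts = []
    if oem is not None:
        if oem not in range(4):
            raise ValueError("oem must be between 0 and 3")
        parts.append(f"--oem {oem}")
    if psm is not None:
        if psm not in range(14):
            raise ValueError("psm must be between 0 and 13")
        parts.append(f"--psm {psm}")
    # Each keyword argument becomes a "-c name=value" config variable.
    for name, value in variables.items():
        parts.append(f"-c {name}={value}")
    return " ".join(parts)


print(build_config(psm=10))
print(build_config(psm=7, oem=3, tessedit_char_whitelist="0123456789"))
```

The returned string can be passed as the config argument of image_to_string().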

The following example shows how to use these parameters:

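# Import the pytesseract library
import pytesseract
# Import the Image module
from PIL import Image

# Open the image
image = Image.open('example2.png')

# Set the Tesseract options (page segmentation mode 10: single character)
custom_config = r'--psm 10'

# Recognize the text in the image
text = pytesseract.image_to_string(image, lang='chi_sim', config=custom_config)

# Print the recognized text
print(text)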

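The code above passes the Tesseract option '--psm 10', which runs recognition in single-character mode. This improves accuracy only when the image really contains a single character; for other layouts a different mode is more appropriate.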
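
To make the choice of mode more concrete, the sketch below maps a few layout descriptions to page segmentation modes. The layout keys and the psm_config helper are hypothetical names chosen for illustration; the mode numbers and their meanings follow Tesseract's command-line help, and only a subset is listed.

```python
# A subset of Tesseract's page segmentation modes, keyed by layout.
PSM_BY_LAYOUT = {
    "page": 3,   # fully automatic page segmentation (Tesseract's default)
    "block": 6,  # a single uniform block of text
    "line": 7,   # a single text line
    "word": 8,   # a single word
    "char": 10,  # a single character
}


def psm_config(layout):
    """Return the --psm option string for a named layout."""
    try:
        return f"--psm {PSM_BY_LAYOUT[layout]}"
    except KeyError:
        raise ValueError(f"unknown layout: {layout!r}") from None


print(psm_config("char"))
print(psm_config("line"))
```

For a screenshot of a single line of text, for instance, psm_config("line") is likely a better starting point than single-character mode.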

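4. Language Support

pytesseract supports many languages, including Traditional Chinese, Simplified Chinese, English, French, German, and Spanish. To use one, set the lang parameter of image_to_string(); the matching language data must also be installed for Tesseract. Some common settings: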

# img is a PIL image, for example img = Image.open('example.png')

# Simplified Chinese OCR (use 'chi_tra' for Traditional Chinese)
text = pytesseract.image_to_string(img, lang='chi_sim')

# English OCR
text = pytesseract.image_to_string(img, lang='eng')

# French OCR
text = pytesseract.image_to_string(img, lang='fra')

# German OCR
text = pytesseract.image_to_string(img, lang='deu')

# Spanish OCR
text = pytesseract.image_to_string(img, lang='spa')
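
Tesseract also accepts several languages at once, joined with '+', which is useful for mixed-language images. The sketch below builds such a value and checks it against the installed language data. lang_spec, missing_languages, and installed_languages are hypothetical helper names; pytesseract.get_languages() is assumed to be available, which requires a reasonably recent pytesseract release.

```python
def lang_spec(*codes):
    """Join language codes into the form Tesseract expects, e.g. 'chi_sim+eng'."""
    return "+".join(codes)


def missing_languages(spec, installed):
    """Return the codes in spec that are not in the installed list."""
    return [code for code in spec.split("+") if code not in installed]


def installed_languages():
    """Ask Tesseract which language data files are installed."""
    # Imported lazily so the helpers above work without pytesseract.
    import pytesseract

    return pytesseract.get_languages(config="")


spec = lang_spec("chi_sim", "eng")
print(spec)
print(missing_languages(spec, ["eng", "osd"]))
```

In real use, the second argument of missing_languages() would come from installed_languages(), and spec would be passed as the lang argument of image_to_string().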

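5. Summary

pytesseract gives developers convenient access to the Tesseract OCR engine from Python. By tuning parameters and selecting the right languages, it can be adapted to a wide range of OCR scenarios, which makes it a popular choice among developers.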

Original article by LAVEJ. If you repost it, please credit the source: https://www.506064.com/zh-tw/n/318150.html