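I. CRNN Overview
CRNN (Convolutional Recurrent Neural Network) combines a convolutional neural network (CNN) that has no fully connected layers with a recurrent neural network (RNN). It is mainly used for scene text recognition (STR), the task of reading text in images. CRNN pairs the CNN's ability to extract high-level features with the RNN's ability to capture context along a sequence, which is why it performs well on text recognition.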
II. CRNN Structure
A CRNN has three parts: convolutional layers, recurrent layers, and a transcription layer.
1. Convolutional layers
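The convolutional part extracts features from the raw image. It is typically a stack of several convolution and pooling layers: the convolutions extract features, while the pooling layers downsample the feature maps, which reduces computation and gives some tolerance to small spatial shifts. In CRNN the resulting feature map is then read column by column as a feature sequence for the recurrent layers.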
import torch
import torch.nn as nn

class Conv(nn.Module):
    """Convolution followed by normalization and activation."""
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1, dilation=1, groups=1,
                 norm_layer=None, activation_layer=None, bias=True):
        super(Conv, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d  # default normalization
        if activation_layer is None:
            activation_layer = nn.ReLU(inplace=True)  # default activation
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding, dilation, groups, bias=bias)
        self.bn = norm_layer(out_channels)
        self.act = activation_layer

    def forward(self, x):
        x = self.conv(x)  # [b, in_channels, H, W] -> [b, out_channels, H', W']
        x = self.bn(x)
        x = self.act(x)
        return x
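As a small illustration of what pooling does to the feature map, the sketch below compares a square pooling window with a rectangular one. The tensor sizes are made up for the example. Rectangular windows that shrink only the height are often used in CRNN-style models so that the width, which becomes the sequence length, is preserved.

```python
import torch
import torch.nn as nn

# A dummy feature map: [batch, channels, height, width] (sizes chosen for illustration)
x = torch.zeros(1, 64, 32, 100)

square = nn.MaxPool2d(kernel_size=2)      # halves both height and width
tall = nn.MaxPool2d(kernel_size=(2, 1))   # halves the height only, keeps the width

square_out = square(x)
tall_out = tall(x)

print(square_out.shape)  # torch.Size([1, 64, 16, 50])
print(tall_out.shape)    # torch.Size([1, 64, 16, 100])
```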
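2. Recurrent layers
The recurrent layers process the feature sequence. Because text is a sequence, the model needs a component that can capture sequential information. A recurrent neural network (RNN) fits this role: its output at each step depends both on the previous state and on the current input. The code below uses a bidirectional LSTM, so each step sees context from both directions.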
class BidirectionalLSTM(nn.Module):
    def __init__(self, nIn, nHidden, nOut):
        super(BidirectionalLSTM, self).__init__()
        self.rnn = nn.LSTM(nIn, nHidden, bidirectional=True)
        # Forward and backward states are concatenated, hence nHidden * 2
        self.embedding = nn.Linear(nHidden * 2, nOut)

    def forward(self, input):
        recurrent, _ = self.rnn(input)  # [T, b, nHidden * 2]
        T, b, h = recurrent.size()
        t_rec = recurrent.view(T * b, h)
        output = self.embedding(t_rec)  # [T * b, nOut]
        output = output.view(T, b, -1)  # back to [T, b, nOut]
        return output
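3. Transcription layer
The transcription layer turns the sequence into text. Concretely, it takes the feature sequence produced by the convolutional and recurrent layers, predicts a class for each time step, and collapses those per-step predictions into a label string. This can be done with CTC (Connectionist Temporal Classification).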
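class Transcription(nn.Module):
    def __init__(self, n_class):
        super(Transcription, self).__init__()
        # 512 must match the feature size of the recurrent output
        # (2 * n_hidden for a bidirectional LSTM with n_hidden = 256)
        self.fc = nn.Linear(512, n_class)

    def forward(self, x):
        # x: [T, b, 512]. nn.Linear acts on the last dimension, so the batch
        # dimension is kept instead of being flattened into the features.
        x = self.fc(x)  # [T, b, n_class]
        return x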
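The training code later in this article calls `CRNN(n_channel, n_hidden, n_class)`, but the article never shows that class. The following is a minimal sketch of how the three parts might be assembled, not the author's implementation: the layer widths, the pooling layout, and the `conv_block` helper are all assumptions chosen so that a 32-pixel-high image collapses to a feature map of height 1.

```python
import torch
import torch.nn as nn


def conv_block(c_in, c_out, pool=None):
    """Hypothetical helper: 3x3 conv + batch norm + ReLU, then optional max pooling."""
    layers = [nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
    if pool is not None:
        layers.append(nn.MaxPool2d(pool))
    return nn.Sequential(*layers)


class CRNN(nn.Module):
    """Sketch of a CRNN; assumes the input height is 32."""

    def __init__(self, n_channel, n_hidden, n_class):
        super().__init__()
        self.cnn = nn.Sequential(
            conv_block(n_channel, 64, pool=2),    # H 32 -> 16, W halved
            conv_block(64, 128, pool=2),          # H 16 -> 8,  W halved
            conv_block(128, 256, pool=(2, 1)),    # H 8 -> 4,   W kept
            conv_block(256, 512, pool=(2, 1)),    # H 4 -> 2,   W kept
            nn.Conv2d(512, 512, kernel_size=2),   # H 2 -> 1,   W reduced by 1
            nn.ReLU(inplace=True),
        )
        self.rnn = nn.LSTM(512, n_hidden, bidirectional=True)
        self.fc = nn.Linear(2 * n_hidden, n_class)

    def forward(self, x):
        feat = self.cnn(x)            # [b, 512, 1, T]
        feat = feat.squeeze(2)        # [b, 512, T]
        seq = feat.permute(2, 0, 1)   # [T, b, 512]: one feature vector per image column
        out, _ = self.rnn(seq)        # [T, b, 2 * n_hidden]
        return self.fc(out)           # [T, b, n_class], unnormalized scores


model = CRNN(n_channel=1, n_hidden=256, n_class=37)
model.eval()
with torch.no_grad():
    logits = model(torch.zeros(2, 1, 32, 100))
print(logits.shape)  # torch.Size([24, 2, 37])
```

With a 100-pixel-wide input this layout yields 24 time steps, so it can only emit label strings that fit in 24 steps. A different pooling layout would change that number.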
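III. CRNN Parameter Settings
The CRNN parameters are set as follows:
n_class = 37       # 26 letters + 10 digits + 1 more class, which nn.CTCLoss needs as the blank
input_height = 32  # image height
n_channel = 1      # number of image channels, 1 for grayscale images
n_hidden = 256     # number of hidden units in the recurrent layers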
IV. CRNN Training
Training a CRNN requires a training set and a validation set, and the data is fed to the network in batches of a chosen batch size.
from torchvision import transforms, datasets
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.Grayscale(),                  # convert color images to grayscale
    transforms.Resize((input_height, 100)),  # resize to height 32 and width 100
    transforms.ToTensor(),                   # convert the image to a Tensor
])
# Note: ImageFolder assigns one integer class index per subfolder of the root directory
train_dataset = datasets.ImageFolder(root="./train", transform=transform)
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)
test_dataset = datasets.ImageFolder(root="./test", transform=transform)
test_loader = DataLoader(test_dataset, batch_size=2, shuffle=False)  # no need to shuffle for evaluation
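device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
crnn = CRNN(n_channel, n_hidden, n_class).to(device)  # CRNN: the full model built from the three parts in section II
optimizer = torch.optim.Adam(crnn.parameters(), lr=0.0001)
loss_fn = nn.CTCLoss(blank=0, zero_infinity=True)
num_epoch = 20
for epoch in range(num_epoch):
    crnn.train()
    train_loss = 0.0
    for idx, (image, label) in enumerate(train_loader):
        image = image.to(device)
        label = label.to(device)
        output = crnn(image)                   # expected shape: [T, batch, n_class]
        log_probs = output.log_softmax(dim=2)  # nn.CTCLoss expects log-probabilities
        T, batch = log_probs.size(0), log_probs.size(1)
        input_lengths = torch.full((batch,), T, dtype=torch.long)
        # Assumption: ImageFolder yields one class index per image, so each target
        # here is a length-1 sequence, shifted by 1 because index 0 is the CTC blank.
        # A real text dataset would supply encoded strings and their true lengths.
        targets = (label + 1).unsqueeze(1)
        target_lengths = torch.ones(batch, dtype=torch.long)
        loss = loss_fn(log_probs, targets, input_lengths, target_lengths)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    print("Epoch: ", epoch, "Loss: ", train_loss / len(train_loader))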
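`nn.CTCLoss` takes its targets as sequences of class indices together with the length of each sequence, which a folder-per-class dataset does not provide. The sketch below shows one way text labels might be encoded for it. The alphabet, the `char_to_id` table, and the `encode_labels` helper are assumptions for illustration and do not come from the article. Index 0 is reserved for the CTC blank and doubles as padding.

```python
import torch
import torch.nn as nn

# Hypothetical alphabet: 10 digits + 26 letters = 36 characters, plus blank = 37 classes
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"
char_to_id = {ch: i + 1 for i, ch in enumerate(ALPHABET)}  # 0 is kept for the blank


def encode_labels(texts):
    """Encode a non-empty batch of strings as padded index targets plus lengths."""
    lengths = torch.tensor([len(t) for t in texts], dtype=torch.long)
    targets = torch.zeros(len(texts), int(lengths.max()), dtype=torch.long)
    for row, text in enumerate(texts):
        ids = [char_to_id[ch] for ch in text.lower()]
        targets[row, :len(ids)] = torch.tensor(ids, dtype=torch.long)
    return targets, lengths


targets, target_lengths = encode_labels(["hello", "ab12"])

# Stand-in for the model output: [T, batch, n_class] log-probabilities
log_probs = torch.randn(24, 2, 37).log_softmax(dim=2)
input_lengths = torch.full((2,), 24, dtype=torch.long)

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
print(targets)
print(loss.item())
```

The padded positions are ignored by the loss because `target_lengths` tells it where each label ends.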
V. CRNN Recognition
Given an input image, the trained CRNN returns the corresponding text. The code is as follows:
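from PIL import Image

image_path = "./test/1.png"
image = Image.open(image_path)
image = transform(image).unsqueeze(0)  # add a batch dimension: [1, 1, 32, 100]
image = image.to(device)
crnn.eval()
with torch.no_grad():                  # no gradients are needed for inference
    output = crnn(image)               # [T, 1, n_class]
output_argmax = output.argmax(dim=2).squeeze(1)  # most likely class at each time step, [T]
# convert_to_text maps the class indices to characters using the id_to_char table
predicted_sentence = convert_to_text(output_argmax, id_to_char)
print("Predicted sentence: ", predicted_sentence)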
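The snippet above relies on `convert_to_text` and `id_to_char`, which the article does not define. A minimal sketch of what they could look like follows, using greedy CTC decoding: merge consecutive repeated indices, then drop the blank. The alphabet and the choice of index 0 as the blank are assumptions.

```python
def convert_to_text(indices, id_to_char, blank=0):
    """Greedy CTC decoding: merge consecutive repeats, then drop blanks."""
    if hasattr(indices, "tolist"):  # accept a 1-D torch tensor or a plain list
        indices = indices.tolist()
    chars = []
    previous = blank
    for idx in indices:
        # Emit a character only when it is not blank and differs from the previous step
        if idx != blank and idx != previous:
            chars.append(id_to_char[idx])
        previous = idx
    return "".join(chars)


# Hypothetical index table: 0 is the blank, 1..36 map to digits and lowercase letters
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"
id_to_char = {i + 1: ch for i, ch in enumerate(ALPHABET)}

decoded = convert_to_text([0, 18, 18, 0, 15, 22, 0, 22, 25, 25], id_to_char)
print(decoded)  # prints "hello"
```

The blank between the two occurrences of index 22 is what lets the decoder keep the double "l". Without it the repeats would be merged into one character.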
Original article by 小藍. If you repost it, please credit the source: https://www.506064.com/zh-tw/n/275820.html