CRNN Network Explained

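I. CRNN Overview

CRNN (Convolutional Recurrent Neural Network) combines a convolutional neural network (CNN) without fully connected layers and a recurrent neural network (RNN). It is mainly used for scene text recognition (STR), that is, reading text that appears in images. CRNN pairs the CNN's ability to extract high-level visual features with the RNN's ability to capture context along a sequence, which is why it performs well on text recognition tasks.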

II. CRNN Architecture

The CRNN architecture consists of three parts: the convolutional layers, the recurrent layers, and the transcription layer.

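1. Convolutional Layers

The convolutional layers extract features from the input image. This part of the network typically stacks several convolutional and pooling layers: the convolutions extract features, while pooling reduces the amount of computation and adds a degree of spatial invariance. The resulting feature map is then read column by column as a sequence of feature vectors, which becomes the input to the recurrent layers.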

import torch.nn as nn
import torch

class Conv(nn.Module):
    """Basic building block: Conv2d -> normalization -> activation."""

    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1, dilation=1, groups=1,
                 norm_layer=None, activation_layer=None, bias=True):
        super(Conv, self).__init__()
        # Default to batch normalization and ReLU when nothing else is given
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if activation_layer is None:
            activation_layer = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding, dilation, groups, bias=bias)
        self.bn = norm_layer(out_channels)
        self.act = activation_layer

    def forward(self, x):
        # x: [b, in_channels, H, W] -> [b, out_channels, H', W']
        x = self.conv(x)
        x = self.bn(x)
        x = self.act(x)
        return x
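
To make the hand-off from the convolutional layers to the recurrent layers concrete, the sketch below turns a feature map into a sequence with one feature vector per image column. The helper name `feature_map_to_sequence` and the tensor sizes (512 channels, height 1, width 26) are illustrative assumptions, not values taken from this article's model.

```python
import torch


def feature_map_to_sequence(feature_map):
    """Convert a CNN feature map [b, c, h, w] into a sequence [w, b, c * h]."""
    b, c, h, w = feature_map.size()
    # Merge the channel and height axes so each column becomes one vector
    seq = feature_map.reshape(b, c * h, w)
    # Move width to the front: nn.LSTM expects time steps first by default
    return seq.permute(2, 0, 1)


feature_map = torch.randn(2, 512, 1, 26)  # assumed CNN output for a batch of 2
sequence = feature_map_to_sequence(feature_map)
print(sequence.shape)  # torch.Size([26, 2, 512])
```

Each of the 26 time steps corresponds to one column of the feature map, so the sequence runs from left to right across the image.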

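2. Recurrent Layers

The recurrent layers process the feature sequence. Because text is inherently sequential, the model needs a component that can capture information along the sequence. An RNN (recurrent neural network) does this: its output state depends both on its previous state and on the current input.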

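class BidirectionalLSTM(nn.Module):
    """Bidirectional LSTM followed by a linear projection applied at every time step."""

    def __init__(self, nIn, nHidden, nOut):
        super(BidirectionalLSTM, self).__init__()

        self.rnn = nn.LSTM(nIn, nHidden, bidirectional=True)
        # Forward and backward states are concatenated, hence nHidden * 2
        self.embedding = nn.Linear(nHidden * 2, nOut)

    def forward(self, input):
        # input: [T, b, nIn] (time steps first)
        recurrent, _ = self.rnn(input)  # [T, b, nHidden * 2]
        T, b, h = recurrent.size()
        t_rec = recurrent.view(T * b, h)  # flatten time and batch for the linear layer

        output = self.embedding(t_rec)  # [T * b, nOut]
        output = output.view(T, b, -1)  # back to [T, b, nOut]

        return output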
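
The recurrence described above, where each state depends on the previous state and the current input, can be sketched with a scalar toy RNN. The function name `simple_rnn`, the weights `w_x` and `w_h`, and the tanh nonlinearity are illustrative assumptions; the `nn.LSTM` used in this article adds gating on top of the same basic idea.

```python
import math


def simple_rnn(inputs, w_x=0.5, w_h=0.8, h0=0.0):
    """Run a scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h = h0
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


# The final input is 1.0 in both cases, but the final states differ
# because the earlier inputs differ
states_a = simple_rnn([1.0, 0.0, 1.0])
states_b = simple_rnn([0.0, 0.0, 1.0])
print(states_a[-1], states_b[-1])
```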

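3. Transcription Layer

The transcription layer converts the network's per-frame predictions into text. Specifically, it takes the feature sequence produced by the convolutional and recurrent layers and turns it into a label sequence. CTC (Connectionist Temporal Classification) can be used for this step.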

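class Transcription(nn.Module):
    def __init__(self, n_class):
        super(Transcription, self).__init__()

        # Per-frame classifier: maps 512 features to n_class scores
        self.fc = nn.Linear(512, n_class)

    def forward(self, x):
        # x: [T, b, 512]. nn.Linear acts on the last dimension only,
        # so the time and batch dimensions are preserved.
        x = self.fc(x)  # [T, b, n_class]

        return x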
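
The article does not show how the three parts are assembled into one model. The sketch below is one possible minimal assembly, assuming a much smaller CNN than a full CRNN would use; the class name `SimpleCRNN`, the channel counts, and the use of adaptive pooling to collapse the height are all assumptions made for illustration, not the author's architecture.

```python
import torch
import torch.nn as nn


class SimpleCRNN(nn.Module):
    """Hypothetical minimal CRNN: small CNN -> bidirectional LSTM -> per-frame classifier."""

    def __init__(self, n_channel, n_hidden, n_class):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(n_channel, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # halves height and width
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # halves height and width again
            nn.AdaptiveAvgPool2d((1, None)),  # collapse height to 1, keep width
        )
        self.rnn = nn.LSTM(128, n_hidden, bidirectional=True)
        self.fc = nn.Linear(2 * n_hidden, n_class)

    def forward(self, x):
        feat = self.cnn(x)                                     # [b, 128, 1, W / 4]
        feat = feat.squeeze(2).permute(2, 0, 1).contiguous()   # [T, b, 128]
        out, _ = self.rnn(feat)                                # [T, b, 2 * n_hidden]
        return self.fc(out)                                    # [T, b, n_class], raw scores


model = SimpleCRNN(n_channel=1, n_hidden=256, n_class=37)
scores = model(torch.randn(2, 1, 32, 100))  # batch of two 32x100 grayscale images
print(scores.shape)  # torch.Size([25, 2, 37])
```

The output keeps time steps in the first dimension, which is the layout `nn.CTCLoss` expects once the scores are passed through `log_softmax`.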

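III. CRNN Parameter Settings

The CRNN network parameters are set as follows:

n_class = 37  # 26 letters + 10 digits + 1 extra symbol (here assumed to be the CTC blank)
input_height = 32  # image height
n_channel = 1  # number of image channels; 1 for grayscale images
n_hidden = 256  # number of hidden units in the recurrent layers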
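
One way to arrive at 37 classes is 10 digits plus 26 lowercase letters plus the blank label that CTC requires. The sketch below builds such a label table; the exact alphabet, the `encode` helper, and the choice of index 0 for the blank are assumptions, since the article does not list its character set.

```python
import string

BLANK_ID = 0  # assumed: index 0 is reserved for the CTC blank
alphabet = string.digits + string.ascii_lowercase  # "0"-"9" then "a"-"z", 36 characters

# Label ids start at 1 because 0 is taken by the blank
char_to_id = {ch: i + 1 for i, ch in enumerate(alphabet)}
id_to_char = {i: ch for ch, i in char_to_id.items()}


def encode(text):
    """Map a string to a list of label ids (case-insensitive)."""
    return [char_to_id[ch] for ch in text.lower()]


n_class = len(alphabet) + 1  # 36 characters + 1 blank
print(n_class)         # 37
print(encode("crnn"))  # [13, 28, 24, 24]
```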

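IV. Training CRNN

Training the CRNN network requires a training set and a validation set, and the data is fed to the network in batches of a chosen batch size.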

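from torchvision import transforms, datasets
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

transform = transforms.Compose([
    transforms.Grayscale(),  # convert color images to grayscale
    transforms.Resize((input_height, 100)),  # resize to height 32 and width 100
    transforms.ToTensor(),  # convert the image to a Tensor
])

# Note: ImageFolder yields one integer class index per image (taken from the
# folder name), so each target below is a label sequence of length 1.
train_dataset = datasets.ImageFolder(root="./train", transform=transform)
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)

test_dataset = datasets.ImageFolder(root="./test", transform=transform)
test_loader = DataLoader(test_dataset, batch_size=2, shuffle=False)  # evaluation does not need shuffling

# CRNN is assumed to be a model built from the layers above that returns
# per-frame class scores of shape [T, b, n_class]
crnn = CRNN(n_channel, n_hidden, n_class).to(device)

optimizer = torch.optim.Adam(crnn.parameters(), lr=0.0001)

loss_fn = nn.CTCLoss(blank=0)  # index 0 is the CTC blank

num_epoch = 20

for epoch in range(num_epoch):
    crnn.train()
    train_loss = 0.0
    for idx, (image, label) in enumerate(train_loader):
        image = image.to(device)
        # Shift labels by 1 so that index 0 stays reserved for the blank;
        # this assumes there are at most n_class - 1 folders
        targets = (label + 1).unsqueeze(1).to(device)  # [b, 1]
        log_probs = crnn(image).log_softmax(2)  # CTCLoss expects log-probabilities
        batch = log_probs.size(1)
        input_lengths = torch.full((batch,), log_probs.size(0), dtype=torch.long)
        target_lengths = torch.full((batch,), targets.size(1), dtype=torch.long)
        loss = loss_fn(log_probs, targets, input_lengths, target_lengths)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    print("Epoch: ", epoch, "Loss: ", train_loss / len(train_loader))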
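
Real text labels have different lengths, so it helps to see the four arguments `nn.CTCLoss` takes in isolation. The sketch below uses random scores as a stand-in for model output and two made-up target sequences; the sizes, label ids, and zero padding are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, batch, n_class = 25, 2, 37  # time steps, batch size, classes (index 0 = blank)

# Stand-in for model output: random scores turned into log-probabilities
log_probs = torch.randn(T, batch, n_class).log_softmax(2)

# Two targets of different lengths, padded to the same width.
# The trailing zeros in the second row are padding and are ignored
# because target_lengths says only the first 2 entries are valid.
targets = torch.tensor([[13, 28, 24, 24],
                        [11, 12, 0, 0]])
input_lengths = torch.full((batch,), T, dtype=torch.long)  # every sample uses all T frames
target_lengths = torch.tensor([4, 2])

loss_fn = nn.CTCLoss(blank=0)
loss = loss_fn(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```

Both length arguments are tensors with one entry per sample in the batch, which is what allows a single batch to mix short and long words.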

V. Recognition with CRNN

Given an image to recognize, the CRNN network outputs the corresponding text. The code is as follows:

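from PIL import Image

image_path = "./test/1.png"
image = Image.open(image_path)
image = transform(image).unsqueeze(0)  # add a batch dimension: [1, 1, 32, 100]
image = image.to(device)

crnn.eval()
with torch.no_grad():  # gradients are not needed at inference time
    output = crnn(image)  # [T, 1, n_class]
output_argmax = output.argmax(dim=2).squeeze(1)  # best class per time step: [T]
# convert_to_text and id_to_char are not defined in this article; they are assumed
# to perform CTC decoding (collapse repeats, drop blanks) and map label ids to characters
predicted_sentence = convert_to_text(output_argmax, id_to_char)
print("Predicted sentence: ", predicted_sentence)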
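
The article does not define `convert_to_text`. A minimal sketch of what such a helper could look like is greedy CTC decoding: collapse consecutive repeated labels, then drop the blank. The blank index of 0 and the toy `id_to_char` table are assumptions for illustration.

```python
def convert_to_text(label_ids, id_to_char, blank_id=0):
    """Greedy CTC decoding: collapse repeated labels, then drop blanks."""
    chars = []
    previous = None
    for label in label_ids:
        label = int(label)  # also accepts 0-dim tensors
        # Keep a label only when it differs from its predecessor and is not blank
        if label != previous and label != blank_id:
            chars.append(id_to_char[label])
        previous = label
    return "".join(chars)


id_to_char = {1: "h", 2: "e", 3: "l", 4: "o"}  # toy mapping for illustration
frames = [1, 1, 0, 2, 3, 3, 0, 3, 4, 4, 0]     # per-frame argmax labels
print(convert_to_text(frames, id_to_char))     # hello
```

The blank between the two runs of label 3 is what lets the decoder produce the double "l"; without it, the repeated frames would collapse into a single character.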

Original article by 小蓝. If you repost it, please credit the source: https://www.506064.com/n/275820.html