I. Overview of the VOC2012 Dataset
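The VOC2012 dataset is the dataset used in the Visual Object Classes 2012 challenge (PASCAL VOC 2012) and is hosted on University of Oxford servers. It covers 20 object categories. The images were collected from the internet and vary widely in resolution, content, lighting conditions, background, and occlusion, so the dataset can be used for tasks such as object detection and semantic segmentation. VOC2012 gives computer vision researchers and engineers a solid experimental platform, and it has become one of the standard benchmarks for evaluating object detection and image segmentation methods.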
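II. How to Use the VOC2012 Dataset
The workflow for using the VOC2012 dataset for object detection and image segmentation is as follows:
1. Download the VOC2012 dataset: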
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
2. Extract the archive:
tar -xvf VOCtrainval_11-May-2012.tar
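After extraction, it can help to confirm that the expected subdirectories are present before going any further. The sketch below assumes the archive unpacks to ./VOCdevkit/VOC2012 (the path used later in this article); the helper name check_voc_layout and the EXPECTED_DIRS tuple are hypothetical, introduced only for illustration.

```python
import os

# Subdirectories referenced in this article; SegmentationClass is only
# needed for the semantic segmentation example.
EXPECTED_DIRS = ("Annotations", "ImageSets", "JPEGImages", "SegmentationClass")

def check_voc_layout(root, expected=EXPECTED_DIRS):
    """Return {subdirectory name: True/False} depending on whether it exists under root."""
    return {name: os.path.isdir(os.path.join(root, name)) for name in expected}

if __name__ == "__main__":
    # Assumed extraction path; adjust it if the archive was unpacked elsewhere.
    for name, present in check_voc_layout("./VOCdevkit/VOC2012").items():
        print(f"{name}: {'found' if present else 'MISSING'}")
```

A missing directory usually means the archive was extracted somewhere else or the extraction was interrupted.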
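3. Parse the annotation files. The extracted dataset contains directories such as Annotations, ImageSets, and JPEGImages; each XML file in Annotations can be parsed as follows: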
import xml.etree.ElementTree as ET

def parse_annotation(annotation_path):
    """Parse one VOC XML annotation file."""
    tree = ET.parse(annotation_path)
    root = tree.getroot()
    # File name of the image this annotation describes
    image_path = root.find('filename').text
    # Image dimensions
    size = root.find('size')
    width = int(size.find('width').text)
    height = int(size.find('height').text)
    depth = int(size.find('depth').text)
    # One dict per annotated object: class name plus bounding box corners
    objects = []
    for obj in root.findall('object'):
        obj_struct = {}
        obj_struct['name'] = obj.find('name').text
        # int(float(...)) also accepts coordinates written with a decimal point
        obj_struct['xmin'] = int(float(obj.find('bndbox/xmin').text))
        obj_struct['ymin'] = int(float(obj.find('bndbox/ymin').text))
        obj_struct['xmax'] = int(float(obj.find('bndbox/xmax').text))
        obj_struct['ymax'] = int(float(obj.find('bndbox/ymax').text))
        objects.append(obj_struct)
    return image_path, objects, width, height, depth
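Each object element in a VOC annotation also carries truncated and difficult flags, which the parser above does not read. The following is a minimal sketch of how they could be extracted; the function name parse_object_flags is hypothetical, and SAMPLE_XML is a hand-written fragment for illustration, not a file from the dataset.

```python
import xml.etree.ElementTree as ET

# Hand-written annotation fragment for illustration only.
SAMPLE_XML = """
<annotation>
  <filename>example.jpg</filename>
  <size><width>500</width><height>375</height><depth>3</depth></size>
  <object>
    <name>dog</name>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
  <object>
    <name>person</name>
    <truncated>1</truncated>
    <difficult>1</difficult>
    <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>370</ymax></bndbox>
  </object>
</annotation>
"""

def parse_object_flags(xml_text):
    """Return the name, truncated flag, and difficult flag of every object."""
    root = ET.fromstring(xml_text)
    results = []
    for obj in root.findall("object"):
        results.append({
            "name": obj.findtext("name"),
            # findtext falls back to the default when the tag is absent
            "truncated": int(obj.findtext("truncated", default="0")),
            "difficult": int(obj.findtext("difficult", default="0")),
        })
    return results

flags = parse_object_flags(SAMPLE_XML)
for entry in flags:
    print(entry)
```

Objects marked difficult are commonly excluded when detection results are scored, so keeping the flag alongside each box is useful if the data will later be used for evaluation.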
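4. Read the image lists in the ImageSets folder, build the paths to the corresponding files in JPEGImages, and divide them into a training set and a test set: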
import os

VOC_ROOT = './VOCdevkit/VOC2012'

def read_image_ids(split):
    # Each line of ImageSets/Main/<split>.txt is one image ID (file name without extension)
    list_path = os.path.join(VOC_ROOT, 'ImageSets', 'Main', split + '.txt')
    with open(list_path) as f:
        return [line.strip() for line in f if line.strip()]

trainval_image_names = read_image_ids('trainval')
# test.txt is not part of the trainval archive downloaded above; it is only
# present once the separately distributed test data has been extracted.
# With the trainval archive alone, 'train' and 'val' can be used instead.
test_image_names = read_image_ids('test')

trainval_image_paths = [os.path.join(VOC_ROOT, 'JPEGImages', n + '.jpg') for n in trainval_image_names]
test_image_paths = [os.path.join(VOC_ROOT, 'JPEGImages', n + '.jpg') for n in test_image_names]
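When only the trainval archive is available, one option is to carve a held-out subset out of the trainval image IDs. The sketch below shows one way to do that deterministically; the function name split_ids, the 20% fraction, and the placeholder IDs are assumptions made for illustration, not part of the dataset or its tooling.

```python
import random

def split_ids(image_ids, holdout_fraction=0.2, seed=0):
    """Shuffle image IDs deterministically and split off a held-out subset."""
    ids = sorted(image_ids)           # sort first so the result does not depend on input order
    random.Random(seed).shuffle(ids)  # local RNG: leaves the global random state untouched
    n_holdout = int(len(ids) * holdout_fraction)
    return ids[n_holdout:], ids[:n_holdout]

# Placeholder IDs for illustration; in practice these would come from trainval.txt.
example_ids = [f"img_{i:03d}" for i in range(10)]
train_ids, holdout_ids = split_ids(example_ids, holdout_fraction=0.2, seed=0)
print(len(train_ids), len(holdout_ids))  # prints: 8 2
```

Fixing the seed keeps the split reproducible across runs, which matters when results from different experiments are compared.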
III. Object Categories in the VOC2012 Dataset
The VOC2012 dataset contains the following 20 object categories:
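- Person: images of people
- Bird: images of birds
- Car: images of cars
- Bus: images of buses
- Cow: images of cows
- Sheep: images of sheep
- Aeroplane: images of aeroplanes
- Bicycle: images of bicycles
- Horse: images of horses
- Motorbike: images of motorbikes
- Potted Plant: images of potted plants
- Diningtable: images of dining tables
- Chair: images of chairs
- Cat: images of cats
- Dog: images of dogs
- Boat: images of boats
- Train: images of trains
- Sofa: images of sofas
- Bottle: images of bottles
- Tv/Monitor: images of televisions and monitors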
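In the XML annotations, the category names appear in lowercase without spaces (for example pottedplant and tvmonitor). The sketch below builds a name-to-label mapping from those names; numbering the classes 1 to 20 in alphabetical order is a common convention and an assumption here, not something the dataset itself mandates. The variable names are illustrative.

```python
# The 20 category names as they are written in the annotation files.
VOC_CLASSES = (
    "aeroplane", "bicycle", "bird", "boat", "bottle",
    "bus", "car", "cat", "chair", "cow",
    "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
)

# Labels start at 1 so that 0 stays free for a background class.
label_map = {name: index + 1 for index, name in enumerate(VOC_CLASSES)}

# Reverse lookup, handy when printing predictions.
reverse_label_map = {index: name for name, index in label_map.items()}

print(len(label_map), label_map["aeroplane"], label_map["tvmonitor"])  # prints: 20 1 20
```

Keeping the mapping in one place avoids mismatches between the labels used for training and the names shown when results are inspected.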
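IV. Application Examples with the VOC2012 Dataset
The following code examples use the VOC2012 dataset for object detection and semantic segmentation. The directory structure is as follows: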
VOCdevkit
└── VOC2012
    ├── Annotations
    ├── ImageSets
    ├── JPEGImages
    └── SegmentationClass
Object detection code:
import os
import torch
import torchvision
from PIL import Image

def get_transform(train):
    # Only ToTensor is applied in this example, for training and testing alike
    transforms = []
    transforms.append(torchvision.transforms.ToTensor())
    return torchvision.transforms.Compose(transforms)

class VOCDataset(torch.utils.data.Dataset):
    def __init__(self, data_folder, split, transform=None):
        self.split = split.upper()
        assert self.split in {'TRAIN', 'TEST'}
        self.year = "2012"
        self.data_folder = data_folder
        self.transform = transform
        # VOC2012 has only 20 classes, labelled 1-20
        self.label_map = {
            "aeroplane": 1, "bicycle": 2, "bird": 3, "boat": 4,
            "bottle": 5, "bus": 6, "car": 7, "cat": 8,
            # ... (the remaining classes are elided in the original)
        }
        if self.split == 'TRAIN':
            data_file = os.path.join(data_folder, 'ImageSets', 'Main', 'trainval.txt')
        else:
            # test.txt is only present once the separate test data has been extracted
            data_file = os.path.join(data_folder, 'ImageSets', 'Main', 'test.txt')
        self.image_names = []
        with open(data_file, 'r') as f:
            for line in f.readlines():
                self.image_names.append(line.strip() + '.jpg')

    def __getitem__(self, index):
        image_path = os.path.join(self.data_folder, 'JPEGImages', self.image_names[index])
        annotation_path = os.path.join(self.data_folder, 'Annotations',
                                       self.image_names[index].replace('.jpg', '.xml'))
        # Parse the XML annotation file with parse_annotation from step 3 of Section II
        image_name, objects, image_width, image_height, image_depth = parse_annotation(annotation_path)
        # Convert everything into a torch.Tensor
        boxes = torch.zeros((len(objects), 4), dtype=torch.float32)
        labels = torch.zeros((len(objects)), dtype=torch.int64)
        # parse_annotation does not read the <difficult> flag, so this stays all zeros
        difficulties = torch.zeros((len(objects)), dtype=torch.uint8)
        for i, object_dict in enumerate(objects):
            boxes[i, 0] = object_dict['xmin']
            boxes[i, 1] = object_dict['ymin']
            boxes[i, 2] = object_dict['xmax']
            boxes[i, 3] = object_dict['ymax']
            labels[i] = self.label_map[object_dict['name']]
            difficulties[i] = 0
        image = Image.open(image_path).convert("RGB")
        # Apply the transformations
        if self.transform is not None:
            image = self.transform(image)
        return image, boxes, labels, difficulties

    def __len__(self):
        return len(self.image_names)

def collate_fn(batch):
    # Images differ in size and in number of boxes, so keep each field as a
    # tuple instead of letting the default collation try to stack them
    return tuple(zip(*batch))

test_dataset = VOCDataset(data_folder='./VOCdevkit/VOC2012',
                          split='TEST',
                          transform=get_transform(train=False))
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=8, shuffle=False,
                                          collate_fn=collate_fn)

# Load a pre-trained Faster R-CNN model (newer torchvision releases prefer the
# `weights` argument over `pretrained`). The pre-trained weights come from COCO,
# so the predicted label IDs follow COCO categories rather than label_map above.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

def predict_my_image(image_path, confidence_threshold=0.5):
    # Load the image
    image = Image.open(image_path).convert("RGB")
    # Define and apply the transforms
    transform = get_transform(train=False)
    image = transform(image)
    # Unsqueeze to add a batch dimension
    image = image.unsqueeze(0)
    # Pass the image through the model
    with torch.no_grad():
        predictions = model(image)
    # Move the outputs to the CPU
    predictions = [{k: v.to(torch.device('cpu')) for k, v in t.items()} for t in predictions]
    # The model returns one dict per input image; keep every detection whose
    # score is above the confidence threshold
    prediction = predictions[0]
    keep = prediction['scores'] > confidence_threshold
    kept = zip(prediction['labels'][keep], prediction['scores'][keep], prediction['boxes'][keep])
    for i, (label, score, box) in enumerate(kept):
        print(f"Prediction {i}: {label.item()}, {score.item():.3f}, {box.tolist()}")
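To compare predicted boxes with the ground-truth boxes returned by the dataset, the usual measure is intersection over union (IoU); in the PASCAL VOC detection criterion, a detection counts as correct when its IoU with a ground-truth box exceeds 0.5. The sketch below is a minimal pure-Python version for boxes in (xmin, ymin, xmax, ymax) format. The function name box_iou is illustrative, and the code treats coordinates as continuous values, which may differ slightly from implementations that treat them as inclusive pixel indices.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region; zero when the boxes do not overlap
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# Two 10x10 boxes overlapping in a 5x5 region: 25 / (100 + 100 - 25)
iou = box_iou((0, 0, 10, 10), (5, 5, 15, 15))
print(f"IoU = {iou:.4f}")
```

For large numbers of boxes, a vectorized implementation is preferable; the scalar version here is only meant to make the definition concrete.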
Semantic segmentation code:
import os
import numpy as np
import torch
import torchvision
from PIL import Image

class VOCSemanticSegmentation(torch.utils.data.Dataset):
    """ Dataset for Semantic Segmentation on the Pascal VOC dataset """

    def __init__(self, data_folder, split="TRAIN"):
        self.split = split.upper()
        assert self.split in {"TRAIN", "VAL", "TEST"}
        self.data_folder = data_folder
        # The split lists live in ImageSets/Segmentation as train.txt, val.txt, ...
        # (test.txt is only present once the separate test data has been extracted)
        self.image_name_list_file = os.path.join(data_folder, "ImageSets", "Segmentation",
                                                 self.split.lower() + ".txt")
        with open(self.image_name_list_file, "r") as f:
            self.image_names = [x.strip() for x in f.readlines()]

    def __len__(self):
        return len(self.image_names)

    def __getitem__(self, index):
        # Load image and target
        image_name = self.image_names[index]
        image_path = os.path.join(self.data_folder, "JPEGImages", image_name + ".jpg")
        target_path = os.path.join(self.data_folder, "SegmentationClass", image_name + ".png")
        image = Image.open(image_path).convert("RGB")
        target = Image.open(target_path)
        # The masks are palette-indexed PNGs, so np.array yields class indices
        # directly (0 = background, 1-20 = object classes, 255 = void boundary),
        # not colour values; 255 is typically ignored when computing the loss
        labels = np.array(target).astype(np.int64)
        # Convert everything into a torch.Tensor and return
        image = torchvision.transforms.functional.to_tensor(image)
        labels = torch.from_numpy(labels).long()
        return image, labels
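The segmentation masks look coloured when opened in an image viewer because each class index is mapped to a colour through the PNG palette. The sketch below reproduces the bit-interleaving scheme that is commonly used to generate that palette; it is offered as an illustration of how index values relate to colours such as 128 and 192, and the function name voc_colormap is an assumption.

```python
def voc_colormap(num_entries=256):
    """Build a palette by spreading the bits of each index across R, G and B."""
    colormap = []
    for index in range(num_entries):
        r = g = b = 0
        value = index
        for bit in range(8):
            # Bits 0, 1, 2 of the remaining value feed R, G, B, most significant first
            r |= ((value >> 0) & 1) << (7 - bit)
            g |= ((value >> 1) & 1) << (7 - bit)
            b |= ((value >> 2) & 1) << (7 - bit)
            value >>= 3
        colormap.append((r, g, b))
    return colormap

cmap = voc_colormap()
print(cmap[0])    # index 0 (background)  -> (0, 0, 0)
print(cmap[1])    # index 1               -> (128, 0, 0)
print(cmap[15])   # index 15              -> (192, 128, 128)
print(cmap[255])  # index 255 (void)      -> (224, 224, 192)
```

The practical consequence is that label values should be read as palette indices from the mask file; colour components such as 128 or 192 only appear after the mask is converted to RGB.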
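Summary
This article has introduced the VOC2012 dataset and explained how to use it. The dataset has become one of the benchmarks of the computer vision field and is widely used for object detection, image segmentation, and related tasks. Hopefully this article is helpful for your study and research.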
Original article by PUNX. If you repost it, please credit the source: https://www.506064.com/zh-tw/n/132819.html