[Day 4] Data Preprocessing: Image Classification and Cropping

Current Status

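1. After using a YOLOv4 model to box the Chinese characters in each image, we split the dataset (about 70,000 images) into the following categories:

1.1 word (exactly 1 Chinese character)

1.2 words (2 or more Chinese characters)

1.3 no_word (no text)

2. Because "in the official competition, each image contains only one most-correct Chinese character," we kept only the images with exactly 1 Chinese character as the new dataset.

3. After filtering down to the word images (exactly 1 Chinese character), the following problems remained.

3.1 Even in images with only 1 Chinese character, there is still a large blank area outside the detection box, so we used opencv-python to crop away the blank background.

3.2 Many of the cropped images carry wrong labels (for example, the character in the image is 「鸿」 but the label is 「卓」).

3.3 After the labels were corrected, some label names fall outside the 800 characters provided by the organizer.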

Tools/Packages

opencv-python
shutil
numpy

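Content

Image classification

1.1 Object detection: load the YOLOv4 model to box the Chinese characters and return the detection boxes (boxes). len(boxes) tells us how many Chinese characters were boxed in the image.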

import cv2
import numpy as np
import os
import shutil

# Load the model config and trained weights
def initNet():
    CONFIG = 'yolov4-tiny-myobj.cfg'
    WEIGHT = 'yolov4-tiny-myobj_last.weights'

    net   = cv2.dnn.readNet(CONFIG,WEIGHT)
    model = cv2.dnn_DetectionModel(net)
    model.setInputParams(size=(416,416),scale=1/255.0)
    model.setInputSwapRB(True)
    return model

# Object detection
def nnProcess(image, model):
    # Arguments after the image: confidence threshold 0.4, NMS threshold 0.1
    classes, confs, boxes = model.detect(image, 0.4, 0.1)
    return classes, confs, boxes

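1.2 Image classification

To avoid modifying the original dataset, we use shutil to copy the classified files into new folders.

Images are classified by the number of boxed Chinese characters (box_num). The code is as follows.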

# Classify by the number of detected objects
# l, m, n: counts of no_word / word / words images, passed in by the caller
def copyClassify(file, input, boxes, file_name, l, m, n):
    box_num = len(boxes)
    if box_num == 0:
        shutil.copy2(input, './02_yolo_classify3/03_no_word/{}'.format(file_name))
        print('※ {} copied to no_word'.format(file))
    elif box_num == 1:
        shutil.copy2(input, './02_yolo_classify3/01_word/{}'.format(file_name))
        print('※ {} copied to word'.format(file))
    else:
        shutil.copy2(input, './02_yolo_classify3/02_words/{}'.format(file_name))
        print('※ {} copied to words'.format(file))
    print('  no character: {} images'.format(l))
    print('  1 character: {} images'.format(m))
    print('  2 or more characters: {} images'.format(n))
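The function above receives the counters l, m, n but does not update them, and it assumes the three destination folders already exist. Below is a minimal sketch of one way to cover both points. `category_for`, `copy_by_box_count` and `CATEGORY_DIRS` are hypothetical names that are not part of the original script; only the folder names are taken from the code above.

```python
import os
import shutil
from collections import Counter

# Folder names copied from the original script; everything else is hypothetical.
CATEGORY_DIRS = {
    'no_word': '03_no_word',
    'word': '01_word',
    'words': '02_words',
}

def category_for(box_num):
    """Map the number of detected boxes to a category name."""
    if box_num == 0:
        return 'no_word'
    if box_num == 1:
        return 'word'
    return 'words'

def copy_by_box_count(src_path, box_num, dest_root, counts):
    """Copy src_path into its category folder and update the running tally."""
    category = category_for(box_num)
    dest_dir = os.path.join(dest_root, CATEGORY_DIRS[category])
    os.makedirs(dest_dir, exist_ok=True)  # create the folder on first use
    dest_path = os.path.join(dest_dir, os.path.basename(src_path))
    shutil.copy2(src_path, dest_path)
    counts[category] += 1
    return dest_path
```

A caller would create `counts = Counter()` once, call `copy_by_box_count(path, len(boxes), './02_yolo_classify3', counts)` for each image, and print `counts` at the end instead of passing three separate numbers around.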

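Saving/reading images

When saving an image with opencv-python, if the output path contains Chinese characters, use cv2.imencode( ) and write the encoded buffer to disk with tofile( ). (cv2.imwrite does not reliably handle non-ASCII paths, notably on Windows.)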

# Save a preprocessed image (the path may contain Chinese characters)
def saveClassify(image, output, p):
    cv2.imencode(ext='.jpg', img=image)[1].tofile(output)
    print('Image {}: boxed and saved'.format(p))

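When reading an image whose path contains Chinese characters, use cv2.imdecode( ) together with np.fromfile( ). (cv2.imread does not reliably handle non-ASCII paths.)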

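# Read an image (the path may contain Chinese characters)
# cv2.IMREAD_UNCHANGED is the named constant for -1
img = cv2.imdecode(np.fromfile(img_path, dtype=np.uint8), cv2.IMREAD_UNCHANGED)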

Image cropping

3.1 Cropping: the detection box is drawn with a 2px border, so take care that the crop does not go outside the image.

# Box the detected objects and crop them
def drawBox(image, classes, confs, boxes):
    new_image = image.copy()
    cut_img_list = []
    for (classid, conf, box) in zip(classes, confs, boxes):
        x, y, w, h = box
        # Keep x and y from going outside the image
        if x - 2 < 0:
            x = 2
        if y - 2 < 0:
            y = 2
        # Draw the detection box
        cv2.rectangle(new_image, (x - 2, y - 2), (x + w + 2, y + h + 2), (0, 255, 0), 2)
        # Crop the Chinese character inside the box from the image passed in
        cut_img = image[y:y + h + 2, x:x + w + 2]
        cut_img_list.append(cut_img)
    # Raises IndexError when nothing was detected; the caller handles it
    return new_image, cut_img_list[0]
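The function above only guards the left and top edges, by moving x or y to 2. On the right and bottom edges, NumPy slicing silently truncates a slice that runs past the array. As a variant, here is a minimal sketch that pads the box by the same amount on all four sides and clips it to the image. `padded_crop_box` and `crop` are hypothetical helpers, not the original method.

```python
def padded_crop_box(x, y, w, h, img_w, img_h, pad=2):
    """Return (left, top, right, bottom): the box grown by pad, clipped to the image."""
    left = max(x - pad, 0)
    top = max(y - pad, 0)
    right = min(x + w + pad, img_w)
    bottom = min(y + h + pad, img_h)
    return left, top, right, bottom

def crop(image, box, pad=2):
    """Crop a NumPy image array (rows = height, columns = width) around box."""
    img_h, img_w = image.shape[:2]
    left, top, right, bottom = padded_crop_box(*box, img_w, img_h, pad)
    return image[top:bottom, left:right]
```

With this variant a box touching the left edge keeps its detected position and simply receives less padding on that side.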

3.2 After cropping, the images are saved over the new dataset (the classified images).

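if __name__ == '__main__':
    # Dataset provided by the organizer (about 70,000 images)
    source = './01_origin/'
    files = os.listdir(source)
    # Sort by the leading integer in the file name
    files.sort(key=lambda x:int(x[:-6]))
    model = initNet()
    # p is the 1-based index of the current image, used in the save message
    for p, file in enumerate(files, start=1):
        img = cv2.imdecode(np.fromfile(source+file,dtype=np.uint8), -1)
        classes, confs, boxes = nnProcess(img, model)
        try:
            frame, cut = drawBox(img, classes, confs, boxes)
            # Image with the detection box drawn
            frame = cv2.resize(frame, (240, 200), interpolation=cv2.INTER_CUBIC)
            # Show the boxed image
            cv2.imshow('img', frame)
            # Cropped image
            cut2 = cv2.resize(cut, (80, 60), interpolation=cv2.INTER_CUBIC)
            cv2.imshow('cut', cut2)
            # Wait for a key press before moving on
            cv2.waitKey()
            saveClassify(cut2, './02_yolo_classify3/cut2/' + file, p) # save the cropped image
        except Exception:
            # Skip images that cannot be processed (e.g. nothing was detected)
            continue
    print('Finished')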
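The sort key `int(x[:-6])` works only when exactly six characters follow the number, as in a name like `123_鸿.jpg`. That pattern is an assumption inferred from the slicing in the script. A minimal sketch of a slightly more defensive key that splits on the first underscore instead (`numeric_id` is a hypothetical helper):

```python
def numeric_id(file_name):
    """Return the integer before the first underscore, e.g. '12_鸿.jpg' -> 12."""
    prefix, _, _ = file_name.partition('_')
    return int(prefix)

files = ['10_卓.jpg', '2_鸿.jpg', '1_鸿.jpg']
files.sort(key=numeric_id)
print(files)  # ['1_鸿.jpg', '2_鸿.jpg', '10_卓.jpg']
```

Sorting by the integer matters because a plain string sort would place `10_...` before `2_...`.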

3.3 Results

Before cropping
After cropping

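Label errors:
- "Before reaching artificial intelligence, you inevitably go through manual-labor intelligence first." With many teammates on hand, we checked the image labels one by one and corrected them by hand.
- With a full 66,000 images, teammates not only spent a great deal of time checking and fixing labels, they could also get bleary-eyed and misread characters, so the process was very inefficient. (Pain level: 300 points)
- If you know a better technique for correcting labels, please leave a comment and let me know. Thanks!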

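Labels outside the 800 characters

5.1 The 800-character dictionary (txt file)

5.2 Check whether each label is among the 800 characters. The code is as follows.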

import os
import shutil

# Read the txt file
def read_dicts(path):
    with open(path, 'rt', encoding="utf-8") as file1:
        words = file1.read().split('\n')
    return words

# Check whether each label is in the dictionary
def chech_in_dicts(source, words):
    files = os.listdir(source)
    files.sort(key=lambda x:int(x[:-6]))
    move_record = ''
    print('※ Checking whether each label is in the dictionary...')
    for file in files:
        # The label is the single character right before the 4-character extension
        if file[-5:-4] in words:
            print('{} is in the dictionary'.format(file))
        else:
            print('{} is not in the dictionary'.format(file))
            file += ','
            move_record += file
    print('Check finished')
    return move_record

# Move files to the destination folder
def move_to_des(move_record, source, destination):
    move_list = move_record.split(',')[:-1]
    print('※ Moving files to the destination folder')
    for move_it in move_list:
        shutil.move(source+move_it, destination)
        print('{} moved to the folder for other characters'.format(move_it))
    print('Move finished')

if __name__ == '__main__':
    # training data dic.txt
    dics = './data/training data dic.txt'
    # Folder to check
    source = './data/04_清洗标签後图片/origin/'
    # Destination folder
    destination = './data/04_清洗标签後图片/800字外/'
    words = read_dicts(dics)
    move_record = chech_in_dicts(source, words)
    move_to_des(move_record, source, destination)
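The check above takes the label from a fixed position in the file name and looks it up in a list. Below is a minimal sketch of the same idea using `os.path.splitext` and a set, so that the extension length does not matter and lookups stay fast. It assumes file names of the form `<number>_<label>.<ext>`; `label_of` and `split_by_dictionary` are hypothetical helpers, not part of the original script.

```python
import os

def label_of(file_name):
    """Return the character right before the extension, e.g. '12_鸿.jpg' -> '鸿'."""
    stem, _ext = os.path.splitext(file_name)
    return stem[-1:]

def split_by_dictionary(file_names, allowed):
    """Split file names into (inside, outside) according to their label."""
    # Drop empty entries such as the one left by a trailing newline in the txt file
    allowed = {word for word in allowed if word}
    inside, outside = [], []
    for name in file_names:
        if label_of(name) in allowed:
            inside.append(name)
        else:
            outside.append(name)
    return inside, outside
```

The `outside` list could then be passed to a move step like the one above, without building and re-splitting a comma-separated string.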

5.3 Results