[Day 26] Review and Improvement: Adding Training Samples (Part 1)

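Summary

Preface
Workflow
Open-source handwritten Chinese character datasets
Blank background images
Filtering to the official 800 characters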

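Content

Preface

1.1 From exchanges after the competition, we learned that the winning teams focused on the dataset itself rather than on designing or adopting newer, more powerful model architectures. Their talks mentioned: "fixing incorrect labels, adding a large number of realistic training samples, improving class imbalance in the data, and so on."

1.2 Looking back, we really did not put much effort into these areas. So over the next few days we will try these techniques hands-on.

Workflow (today's progress covers 2.1 to 2.3)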

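2.1 Open-source handwritten Chinese character datasets

2.2 Blank background images

2.3 Filtering to the official 800 characters

2.4 Synthesizing a new training set with OpenCV

Open-source handwritten Chinese character datasets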

3.1 AI-FREE-Team/Traditional-Chinese-Handwriting-Dataset

3.2 kirosc/chinese-calligraphy-dataset

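Blank background images

4.1 Looking through the official dataset, we found that it contains quite a few images with nothing but a blank background, as shown in the figure below.

4.2 In [Day 4] Data Preprocessing: Image Classification and Cropping, we used a trained YOLOv4 model to draw bounding boxes around Chinese characters, which told us how many characters were detected in each image.

4.3 We classified the images by the number of detected characters and pulled out the ones with no characters (no_word).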

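import shutil

# Copy one image into a folder chosen by the number of characters YOLOv4 detected.
# l, m, n are only printed here; they are presumably running totals kept by the caller
# (no characters / 1 character / 2 or more characters).
# Note: the parameter name `input` shadows the Python built-in inside this function.
def copyClassify(file, input, boxes, file_name, l, m, n):
    box_num = len(boxes)
    if box_num == 0:
        shutil.copy2(input, './02_yolo_classify3/03_no_word/{}'.format(file_name))
        print('※ {} copied to no_word'.format(file))
    elif box_num == 1:
        shutil.copy2(input, './02_yolo_classify3/01_word/{}'.format(file_name))
        print('※ {} copied to word'.format(file))
    else:
        shutil.copy2(input, './02_yolo_classify3/02_words/{}'.format(file_name))
        print('※ {} copied to words'.format(file))
    print('  No characters: {} images'.format(l))
    print('  1 character: {} images'.format(m))
    print('  2 or more characters: {} images'.format(n))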
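The function above only prints the three totals, so they have to be tracked somewhere else. A minimal sketch of one way to do that follows, assuming the per-image detection counts are already available as a list of integers. The helper names `bucket_for` and `tally_buckets` are hypothetical and not part of the original script; only the folder names are taken from it.

```python
from collections import Counter

def bucket_for(box_num):
    """Return the sub-folder name used above for a given number of detected boxes."""
    if box_num == 0:
        return '03_no_word'
    if box_num == 1:
        return '01_word'
    return '02_words'

def tally_buckets(box_counts):
    """Count how many images fall into each bucket.

    box_counts: iterable of ints, one per image (number of detected characters).
    """
    return Counter(bucket_for(n) for n in box_counts)

if __name__ == '__main__':
    # Made-up detection counts for five images, for illustration only.
    counts = tally_buckets([0, 1, 3, 0, 2])
    print(counts['03_no_word'], counts['01_word'], counts['02_words'])
```

Keeping the bucket decision in a small pure function makes it easy to check without copying any files.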

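4.4 In the end, we obtained about 400 no_word images, as shown in the figure below.

Filtering to the official 800 characters

5.1 The two open-source handwritten Chinese character datasets contain 265,249 images in total. They include characters beyond the official 800, so they must be filtered first.

5.2 Code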

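import os
import shutil

# Read the dictionary txt file (one character per line)
def read_dicts(path):
    with open(path, 'rt', encoding='utf-8') as file1:
        # Skip blank lines; a set gives fast membership checks
        words = {line.strip() for line in file1.read().splitlines() if line.strip()}
    return words

# Check whether each file's character is in the dictionary
def chech_in_dicts(source, words):
    files = os.listdir(source)
    move_record = []
    print('※ Checking whether each file is in the dictionary...')
    for file in files:
        # The first character of the file name is taken to be the handwritten character
        if file[0] in words:
            print('{} is in the dictionary'.format(file))
        else:
            print('{} is not in the dictionary'.format(file))
            move_record.append(file)
    print('=' * 50)
    print('※ Check complete')
    print('=' * 50)
    return move_record

# Move files to the destination folder
def move_to_des(move_record, source, destination):
    # Create the destination folder if it does not exist yet
    os.makedirs(destination, exist_ok=True)
    print('※ Moving files to the destination folder')
    for move_it in move_record:
        shutil.move(os.path.join(source, move_it), destination)
        print('{} moved to the folder: other characters'.format(move_it))
    print('=' * 50)
    print('※ Move complete')

if __name__ == '__main__':
    # training data dic.txt
    dics = './data/training data dic.txt'
    # Folder to check
    source = './base/'
    # Destination folder
    destination = './out800/'

    # Run the task
    words = read_dicts(dics)
    move_record = chech_in_dicts(source, words)
    move_to_des(move_record, source, destination)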
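Since the script moves files on disk, it can help to preview the split first. The following is a minimal dry-run sketch, assuming (as the `file[0]` check in the script implies) that each file name begins with the character it depicts. The helper `partition_by_dictionary` and the sample file names are hypothetical, for illustration only.

```python
def partition_by_dictionary(file_names, words):
    """Split file names into (keep, move) without touching the disk.

    keep: names whose first character is in the dictionary.
    move: names whose first character is not in the dictionary.
    """
    allowed = set(words)
    keep, move = [], []
    for name in file_names:
        # name[:1] is the first character, or '' for an empty name
        if name[:1] in allowed:
            keep.append(name)
        else:
            move.append(name)
    return keep, move

if __name__ == '__main__':
    # Hypothetical dictionary and file names, for illustration only.
    keep, move = partition_by_dictionary(['丁_0.png', '龘_1.png'], ['丁', '七'])
    print('keep:', keep)
    print('move:', move)
```

In practice the names would come from `os.listdir('./base/')` and the dictionary from `read_dicts`.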

5.3 Results

Before filtering
After filtering
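One way to sanity-check the filtered folder is to count how many images remain per character: the number of distinct characters should not exceed 800, and the per-character counts give a first look at the class imbalance mentioned in 1.1. This is only a sketch, again assuming the first character of each file name is the label. `count_per_character` and the sample names are hypothetical.

```python
from collections import Counter

def count_per_character(file_names):
    """Count images per character, using the first character of each name as the label."""
    return Counter(name[0] for name in file_names if name)

if __name__ == '__main__':
    # Hypothetical file names, for illustration only.
    sample = ['丁_0.png', '丁_1.png', '七_0.png']
    counts = count_per_character(sample)
    print('distinct characters:', len(counts))
    print('most common:', counts.most_common(1))
```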

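Summary

With the handwritten Chinese characters within the official 800 and the blank backgrounds in hand, the goal of the next chapter is to "synthesize new training samples with OpenCV."

Let's keep going...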

References

AI-FREE-Team/Traditional-Chinese-Handwriting-Dataset
- The dataset is AI . FREE Team development from [STUST EECS_Chinese MNIST(总集)]. If used, modified, or shared, please cite the source and the message.
- (source: https://github.com/AI-FREE-Team/Traditional-Chinese-Handwriting-Dataset )
kirosc/chinese-calligraphy-dataset
