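[Day 22] Training Models: Model Ensembling and isnull Recognition (Part 3)

Summary

  1. Cross-validating the accuracy of models ensembled with different methods

    1.1 Parameters

    1.2 Code

  2. Choosing the ensembling method

    2.1 Cross-comparison results

    2.2 Considerations

    2.3 Conclusion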


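Content

  1. Cross-validating the accuracy of models ensembled with different methods.

    1.1 Parameters

    • s1 → weighting basis: 1 = each character's share of accuracy; 2 = probability share computed separately for each of the 800 Chinese characters
    • s2 → basis for the isnull decision: 1 = model voting (majority rule); 2 = weight each character's probability and take the highest probability
    • s3 → isnull threshold: 1 = minimum probability; 2 = mean probability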
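
To make the s2 and s3 options concrete, here is a minimal sketch on toy numbers. The probabilities, the weights `w`, and the `threshold` value are all invented for illustration; the post's own helpers (`get_new_model`, `get_min01`) are not shown, so this is not their implementation, only the general idea of majority voting versus weighted probabilities with an isnull cutoff.

```r
# Toy example: 3 models, 4 candidate characters (all values are made up).
chars <- c("A", "B", "C", "D")
probs <- rbind(
  m1 = c(0.60, 0.30, 0.05, 0.05),
  m2 = c(0.20, 0.50, 0.20, 0.10),
  m3 = c(0.10, 0.45, 0.40, 0.05)
)
colnames(probs) <- chars

# Hypothetical per-model weights (for example, each model's share of accuracy).
w <- c(m1 = 0.6, m2 = 0.2, m3 = 0.2)

# s2 = 1: majority vote over each model's top-1 prediction.
votes <- chars[apply(probs, 1, which.max)]
vote_pred <- names(which.max(table(votes)))

# s2 = 2: weight each model's probabilities, sum across models, take the maximum.
# `probs * w` recycles w down the rows, so row i is scaled by w[i].
combined <- colSums(probs * w)
weighted_pred <- names(which.max(combined))

# isnull rule: reject the prediction when its combined probability is too low.
threshold <- 0.5  # hypothetical; the post derives it from the min or mean probability
final_pred <- if (max(combined) < threshold) "isnull" else weighted_pred
```

In this toy case the two rules disagree: two of three models vote for "B", while the heavily weighted first model pulls the weighted result to "A", and the low combined probability then turns the final answer into "isnull".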

    1.2 Code

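    # Enumerate the model combinations; each one is cross-compared under 2*2*2 = 8 method settings.
    # BitMatrix is not defined in this post; it presumably returns one 0/1 row per selection pattern.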
    k <- c(1,2,3,4,5,6,7)
    selectMatrix <- BitMatrix(length(k))
    model_count = apply(selectMatrix, 1, function(x){ k[which(x==1)] })
    model_count = model_count[-1]
    
    # Initialize the result vectors
    mm_name = NULL
    
    in800 = NULL
    out800 = NULL
    test = NULL
    
    in800_acc = NULL
    out800_acc = NULL
    test_acc = NULL
    
    in800_confus = NULL
    test_confus = NULL
    
    test_o800_acc = NULL
    
    s1list = NULL
    s2list = NULL
    s3list = NULL
    
    count2 = 1
    
    # Cross-compare every model combination under every (s1, s2, s3) setting
    for(count in seq_along(model_count)){
      nmodel = length(model_count[[count]])
      model_name = c(1,2,3,'ex3','ex4','ex5','ex6')[model_count[[count]]]
      print(model_name)
      for(s1 in 1:2){
        # s1 = 1: weight by per-character accuracy; s1 = 2: weight by per-character mean probability
        weight_stat = if(s1 == 1) 'acc' else 'mean_prob'
        # get_new_model builds the weight table and the new probability table
        new_model_i800 = get_new_model(namesmodel = model_name,stat = weight_stat,dataset = "offical_in800")
        new_model_o800 = get_new_model(namesmodel = model_name,stat = weight_stat,dataset = "offical_noin800")
        new_model_test = get_new_model(namesmodel = model_name,stat = weight_stat,dataset = "test_data")
        # [[2]] is later passed to get_min01 as new_stat; [[1]] is the prediction data
        new_model_i800_new = new_model_i800[[2]]
        new_model_o800_new = new_model_o800[[2]]
        new_model_test_new = new_model_test[[2]]
        new_model_i800 = new_model_i800[[1]]
        new_model_o800 = new_model_o800[[1]]
        new_model_test = new_model_test[[1]]
        for(s2 in 1:2){
          for(s3 in 1:2){
            # s3 = 1: threshold from the minimum probability; s3 = 2: from the mean probability
            null_stat = if(s3 == 1) 'min_prob' else 'mean_prob'
            if(s2 == 1){
              # s2 = 1: decide isnull by model voting
              new_model_i800$acc_null = get_min01(namesmodel = model_name,stat = null_stat,dataset = "offical_in800")
              new_model_o800$acc_null = get_min01(namesmodel = model_name,stat = null_stat,dataset = "offical_noin800")
              new_model_test$acc_null = get_min01(namesmodel = model_name,stat = null_stat,dataset = "test_data")
            } else {
              # s2 = 2: decide isnull after combining the models
              new_model_i800$acc_null = get_min01(namesmodel = model_name,stat = null_stat,dataset = "offical_in800",
                                               new_data = new_model_i800,new_stat = new_model_i800_new)
              new_model_o800$acc_null = get_min01(namesmodel = model_name,stat = null_stat,dataset = "offical_noin800",
                                               new_data = new_model_o800,new_stat = new_model_o800_new)
              new_model_test$acc_null = get_min01(namesmodel = model_name,stat = null_stat,dataset = "test_data",
                                               new_data = new_model_test,new_stat = new_model_test_new)
            }

            # Name of the model combination and the method setting
            mm_name[count2] = paste(model_name,collapse = ",")
            s1list[count2] = s1
            s2list[count2] = s2
            s3list[count2] = s3

            # Number of records in each dataset
            in800[count2] = nrow(new_model_i800)
            out800[count2] = nrow(new_model_o800)
            test[count2] = nrow(new_model_test)

            # Accuracy on each dataset
            in800_acc[count2] = mean(new_model_i800$acc)
            out800_acc[count2] = mean(new_model_o800$acc_null)
            test_acc[count2] = mean(new_model_test$acc[new_model_test$origin_word != 'isnull'])

            # isnull accuracy on the test rows whose true label is isnull
            test_o800_acc[count2] = mean(new_model_test$acc_null[new_model_test$origin_word == 'isnull'])

            # acc_null restricted to the correctly recognized rows
            in800_confus[count2] = mean(new_model_i800$acc_null[new_model_i800$acc == 1])
            test_confus[count2] = mean(new_model_test$acc_null[new_model_test$acc == 1])
            count2 = count2 + 1
          }
        }
      }
    }
    
    result = data.frame(
      mm_name = mm_name,
      s1list = s1list,
      s2list = s2list,
      s3list = s3list,
      in800 = in800,
      in800_acc = in800_acc,
      in800_confus = in800_confus,
      out800 = out800,
      out800_acc = out800_acc,
      test = test,
      test_acc = test_acc,
      test_confus = test_confus,
      test_null = test_o800_acc
    )
    write.csv(result,file = "C:/Users/wooden/Desktop/dl/model/model_statement.csv",row.names = FALSE)
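
The `BitMatrix` helper used above is not defined in this post. As a rough sketch of what the enumeration step does, the same set of non-empty model subsets can be built with base R only. The variable names `bits` and `subsets` are chosen here for illustration and do not appear in the original code.

```r
k <- c(1, 2, 3, 4, 5, 6, 7)
n <- length(k)

# Each row of `bits` is one 0/1 selection pattern; there are 2^n rows in total.
bits <- as.matrix(expand.grid(rep(list(0:1), n)))

# Drop the all-zero row (no model selected).
bits <- bits[rowSums(bits) > 0, , drop = FALSE]

# Convert each row into the indices of the selected models.
subsets <- lapply(seq_len(nrow(bits)), function(i) k[bits[i, ] == 1])

n_subsets <- length(subsets)       # non-empty subsets of 7 models
n_rows <- n_subsets * 2 * 2 * 2    # one result row per subset and (s1, s2, s3) setting
```

With 7 models this gives 2^7 - 1 = 127 subsets, so the cross-comparison table would hold 127 * 8 = 1016 rows if every subset is evaluated.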
    
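  2. Choosing the ensembling method

    2.1 Cross-comparison results

    2.2 Considerations

    • Recognition time

      • As the figure below shows, the 4-model combinations (region A) recognize better than the 3-model combinations (region B).
      • However, in the official round the result must be returned within 1 second of receiving the organizer's request. A 4-model combination may take too long and miss that deadline, so we give priority to 3-model combinations.

    • Recognition accuracy

      • Weighing accuracy on the official in-800 characters, the official out-of-800 characters, and the test round together, we selected the 3-model combination.
      • Recognition performance: A > C > B. The ensembling method uses [s1list]=2, [s2list]=2, [s3list]=2 (see the figure below).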
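
The selection logic above can be sketched as a filter and a ranking over the `result` table. The rows below are placeholders invented purely to show the mechanics; they are not the post's measurements. The scoring rule (an unweighted mean of three accuracy columns) and the `n_models` and `score` columns are assumptions, since the post only says the metrics were weighed together.

```r
# Placeholder rows only: these numbers are invented to show the mechanics.
result <- data.frame(
  mm_name    = c("3,ex3,ex4", "3,ex3,ex4", "1,3,ex3,ex4", "2,ex5"),
  s1list     = c(2, 1, 2, 2),
  s2list     = c(2, 1, 2, 2),
  s3list     = c(2, 2, 2, 1),
  in800_acc  = c(0.95, 0.93, 0.96, 0.90),
  out800_acc = c(0.80, 0.78, 0.82, 0.70),
  test_acc   = c(0.94, 0.92, 0.95, 0.88),
  stringsAsFactors = FALSE
)

# Number of models in each combination, taken from the comma-separated name.
result$n_models <- lengths(strsplit(result$mm_name, ","))

# Keep combinations of at most 3 models, following the 1-second response concern.
# The cap comes from the post's reasoning, not from a measured timing.
candidates <- result[result$n_models <= 3, ]

# Rank by the unweighted mean of the three accuracy columns (an assumed criterion).
candidates$score <- rowMeans(candidates[, c("in800_acc", "out800_acc", "test_acc")])
candidates <- candidates[order(-candidates$score), ]
best <- candidates[1, ]
```

In practice the table would be read back from `model_statement.csv`, and measured recognition times would be a better filter than the model count alone.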

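    2.3 Conclusion

    • Selected ensembling method

      • s1, weighting basis: probability share computed separately for each of the 800 Chinese characters (2)
      • s2, basis for the isnull decision: weight each character's probability and take the highest probability (2)
      • s3, isnull threshold: mean probability (2)
    • Selected models to combine

      • Model 3: InceptionResNetV2
      • Model ex3: Xception
      • Model ex4: ResNet152V2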

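Wrap-up

  1. After cross-validation, we chose the ensembling method and the models. Although isnull is distinguished with 100% accuracy, the test-round dataset is small and isnull appears in it rarely, so performance in the official round will probably be somewhat lower.
  2. Next, we need to package the model and deploy it on GCP, serving handwritten Chinese character recognition as an API. So the next stop is: "how to enable GCP services and set up a VM with the Compute Engine API".

Let's keep going...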

