IT Ironman Day 29: Elasticsearch, Querying Data with Python, Aggregations: Terms

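Today's article covers one kind of Bucket Aggregation. There are still many Metrics Aggregations I have not finished covering, but with only two days left I want to share the aggregation I use most often.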

Test data for this article:

Terms

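・This aggregation creates one bucket for each unique value of the specified field, then iterates over all documents and collects each one into the bucket that matches its value for that field.

・By default it returns the top 10 buckets; this can be changed with the size parameter.

・Normally ES collects results shard by shard. Each shard returns its own top buckets (with size=1, a shard returns only its single largest bucket), and the per-shard results are then merged and reduced to the final list that is sent back to the client. If the number of unique values is greater than size, the returned list can drift away from the true counts, so the result is not fully accurate. That sounds abstract, so here is a concrete example. To keep it simple, the example assumes every shard returns exactly size buckets; in practice each shard returns shard_size buckets, which defaults to size * 1.5 + 10.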

size=3

Shard 1

top1 "铁" 10
top2 "人" 8
top3 "赛" 7
top4 "IT" 6
Returns "铁", "人", "赛"

Shard 2

top1 "IT" 5
top2 "人" 4
top3 "铁" 3
top4 "赛" 2
Returns "IT", "人", "铁"

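Final result
top1 "铁": 13, top2 "人": 12, top3 "赛": 7
As you can see, "IT" should have been top3 (11 documents), but it ranked top4 in shard 1, outside our size=3, so that shard never returned it. You can also see that "赛" dropped from its true count of 9 to 7. This is what to watch out for when setting size.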
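The merge described above can be reproduced in a few lines of Python. This is only a sketch of the idea, not how ES is implemented: the shard contents are copied from the example, and the function names are my own.

```python
from collections import Counter

# Per-shard term counts copied from the size=3 example above.
SHARDS = [
    {"铁": 10, "人": 8, "赛": 7, "IT": 6},  # shard 1
    {"IT": 5, "人": 4, "铁": 3, "赛": 2},   # shard 2
]


def merged_top_terms(shards, size):
    """Merge the top `size` terms of every shard, then keep the top `size` overall."""
    merged = Counter()
    for shard in shards:
        # Each shard only reports its own top `size` terms.
        merged.update(dict(Counter(shard).most_common(size)))
    return merged.most_common(size)


def exact_top_terms(shards, size):
    """Ground truth: sum every term over all shards before ranking."""
    total = Counter()
    for shard in shards:
        total.update(shard)
    return total.most_common(size)


print(merged_top_terms(SHARDS, 3))  # [('铁', 13), ('人', 12), ('赛', 7)]
print(exact_top_terms(SHARDS, 3))   # [('铁', 13), ('人', 12), ('IT', 11)]
```

The first line is what the client would receive under the simplified assumption, the second is what the data really contains.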

Let's set a goal first: suppose we want to know which classes exist and how many students each one has.
aggs query:

{
  "aggs": {
    "class_num": {
      "terms": {
        "field": "class",
        "size": 10
      }
    }
  }
}

Result:

"aggregations" : {
  "class_num" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : "资工一1",
        "doc_count" : 3
      },
      {
        "key" : "资工一2",
        "doc_count" : 3
      }
    ]
  }
}

Internally, ES first creates a "资工一1" bucket and a "资工一2" bucket, and then collects the documents into them.
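Since this series queries ES from Python, here is a rough sketch of how the query body could be built and how the buckets could be read back. The helper names are my own, and the call to a real cluster is left as a comment because the host, the index name and the exact client call depend on your setup and client version.

```python
def terms_body(field, size=10):
    """Build the same request body as the aggs query above."""
    return {"aggs": {"class_num": {"terms": {"field": field, "size": size}}}}


def bucket_counts(response, agg_name="class_num"):
    """Turn the buckets of a terms aggregation into a {key: doc_count} dict."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# With a running cluster the body would be sent roughly like this
# (host and index name are placeholders, not taken from the article):
#   es = Elasticsearch("http://localhost:9200")
#   response = es.search(index="my_index", body=terms_body("class"))
# Here the response shown above is reused instead.
sample_response = {
    "aggregations": {
        "class_num": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "资工一1", "doc_count": 3},
                {"key": "资工一2", "doc_count": 3},
            ],
        }
    }
}

print(terms_body("class"))
print(bucket_counts(sample_response))  # {'资工一1': 3, '资工一2': 3}
```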

・show_term_doc_count_error

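This is an optional parameter of the terms aggregation. Setting it to true adds one more value to every bucket, doc_count_error_upper_bound: the largest number of documents that could belong to that term but were not counted in the final result.
aggs query:

{
  "aggs": {
    "class_num": {
      "terms": {
        "field": "class",
        "size": 10,
        "show_term_doc_count_error": true
      }
    }
  }
}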

Result:

"aggregations" : {
  "class_num" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : "资工一1",
        "doc_count" : 3,
        "doc_count_error_upper_bound" : 0
      },
      {
        "key" : "资工一2",
        "doc_count" : 3,
        "doc_count_error_upper_bound" : 0
      }
    ]
  }
}
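With only two classes the error is always 0, so the value is easier to understand on the size=3 shard example from earlier in the article. As far as I understand the documented behaviour, the per-term error is the sum, over every shard that did not return the term, of the count of the last term that shard did return. The sketch below is my own simulation of that rule, not ES code.

```python
from collections import Counter

# Per-shard term counts copied from the size=3 shard example earlier in the article.
SHARDS = [
    {"铁": 10, "人": 8, "赛": 7, "IT": 6},  # shard 1
    {"IT": 5, "人": 4, "铁": 3, "赛": 2},   # shard 2
]


def term_error_bounds(shards, size):
    """Upper bound of the uncounted documents for each term seen after the merge.

    Assumes size >= 1 and that every shard returns exactly its top `size` terms.
    """
    returned = [Counter(shard).most_common(size) for shard in shards]
    seen_terms = {term for top in returned for term, _ in top}
    bounds = {}
    for term in seen_terms:
        bound = 0
        for shard, top in zip(shards, returned):
            # A shard that held back terms can hide at most as many documents
            # for this term as its last returned term had.
            if term not in dict(top) and len(shard) > len(top):
                bound += top[-1][1]
        bounds[term] = bound
    return bounds


bounds = term_error_bounds(SHARDS, 3)
print(bounds["赛"])  # 3: shard 2 did not return "赛", its last returned term had 3 documents
print(bounds["铁"])  # 0: both shards returned "铁"
```

The real count missing for "赛" is 2, which is indeed within the bound of 3.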

・min_doc_count

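Sets the minimum number of documents a bucket must contain to be returned. Both classes here have only 3 documents, so with min_doc_count set to 10 no bucket comes back.
aggs query: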

{
  "aggs": {
    "class_num": {
      "terms": {
        "field": "class",
        "size": 10,
        "show_term_doc_count_error": true,
        "min_doc_count": 10
      }
    }
  }
}

Result:

"aggregations" : {
  "class_num" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [ ]
  }
}
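The effect is the same as a simple filter on doc_count. A tiny sketch (the function name is my own) using the two buckets from the earlier results:

```python
def apply_min_doc_count(buckets, min_doc_count):
    """Keep only the buckets that contain at least `min_doc_count` documents."""
    return [bucket for bucket in buckets if bucket["doc_count"] >= min_doc_count]


# Buckets copied from the earlier result.
buckets = [
    {"key": "资工一1", "doc_count": 3},
    {"key": "资工一2", "doc_count": 3},
]

print(apply_min_doc_count(buckets, 10))  # []
print(apply_min_doc_count(buckets, 3))   # both buckets are kept
```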

・order

Sets how the returned buckets are sorted. By default they are sorted by doc_count, largest first.

Sorting alphabetically by key:

{
  "aggs": {
    "class_num": {
      "terms": {
        "field": "name",
        "size": 10,
        "order": {
          "_key": "asc"
        }
      }
    }
  }
}

You can also sort by a single-value metric aggregation. The example below sorts the classes by their average math grade.
aggs query:

{
  "aggs": {
    "class_num": {
      "terms": {
        "field": "class",
        "size": 10,
        "order": {
          "math_avg": "desc"
        }
      },
      "aggs": {
        "math_avg": {
          "avg": {
            "field": "grades.math"
          }
        }
      }
    }
  }
}

Result:

"aggregations" : {
  "class_num" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : "资工一1",
        "doc_count" : 3,
        "math_avg" : {
          "value" : 74.66666666666667
        }
      },
      {
        "key" : "资工一2",
        "doc_count" : 3,
        "math_avg" : {
          "value" : 67.0
        }
      }
    ]
  }
}

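A multi-value metric aggregation works too, but you have to say which of its values to sort by.
The following sorts by the minimum value instead.
aggs query: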

{
  "aggs": {
    "class_num": {
      "terms": {
        "field": "class",
        "size": 10,
        "order": {
          "math_avg.min": "desc"
        }
      },
      "aggs": {
        "math_avg": {
          "stats": {
            "field": "grades.math"
          }
        }
      }
    }
  }
}

Result:

"aggregations" : {
  "class_num" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : "资工一1",
        "doc_count" : 3,
        "math_avg" : {
          "count" : 3,
          "min" : 60.0,
          "max" : 91.0,
          "avg" : 74.66666666666667,
          "sum" : 224.0
        }
      },
      {
        "key" : "资工一2",
        "doc_count" : 3,
        "math_avg" : {
          "count" : 3,
          "min" : 34.0,
          "max" : 91.0,
          "avg" : 67.0,
          "sum" : 201.0
        }
      }
    ]
  }
}
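To make the order paths less mysterious, here is a small sketch that sorts the buckets from the result above on the client side, using the same path notation. It only mimics the behaviour for illustration; the helper names are my own.

```python
def order_value(bucket, path):
    """Resolve an order path such as "_key", "_count", "math_avg" or "math_avg.min"."""
    if path == "_key":
        return bucket["key"]
    if path == "_count":
        return bucket["doc_count"]
    agg_name, _, metric = path.partition(".")
    # Single-value metrics expose "value"; multi-value ones need the metric name.
    return bucket[agg_name][metric or "value"]


def sort_buckets(buckets, path, direction):
    """Sort buckets by the given order path, "asc" or "desc"."""
    return sorted(buckets, key=lambda b: order_value(b, path), reverse=(direction == "desc"))


# Buckets copied from the stats result above.
buckets = [
    {"key": "资工一1", "doc_count": 3,
     "math_avg": {"count": 3, "min": 60.0, "max": 91.0, "avg": 74.66666666666667, "sum": 224.0}},
    {"key": "资工一2", "doc_count": 3,
     "math_avg": {"count": 3, "min": 34.0, "max": 91.0, "avg": 67.0, "sum": 201.0}},
]

print([b["key"] for b in sort_buckets(buckets, "math_avg.min", "desc")])  # ['资工一1', '资工一2']
print([b["key"] for b in sort_buckets(buckets, "math_avg.min", "asc")])   # ['资工一2', '资工一1']
```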

・script

A script can be used to combine fields and then count the combined values.
aggs query:

{
  "aggs": {
    "class_num": {
      "terms": {
        "field": "class",
        "size": 10,
        "script": {
          "lang": "painless",
          "source": "doc['class'].value + doc['name'].value"
        }
      }
    }
  }
}

Result:

"aggregations" : {
  "class_num" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : "资工一1小新",
        "doc_count" : 1
      },
      {
        "key" : "资工一1王小明",
        "doc_count" : 1
      },
      {
        "key" : "资工一1风间",
        "doc_count" : 1
      },
      {
        "key" : "资工一2正男",
        "doc_count" : 1
      },
      {
        "key" : "资工一2许小美",
        "doc_count" : 1
      },
      {
        "key" : "资工一2阿呆",
        "doc_count" : 1
      }
    ]
  }
}
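What the script does can be pictured in plain Python: build one key per document by concatenating the two fields, then count the keys. The documents below are reconstructed from the keys in the result above; the function name is my own.

```python
from collections import Counter

# Class/name pairs reconstructed from the bucket keys in the result above.
DOCS = [
    {"class": "资工一1", "name": "小新"},
    {"class": "资工一1", "name": "王小明"},
    {"class": "资工一1", "name": "风间"},
    {"class": "资工一2", "name": "正男"},
    {"class": "资工一2", "name": "许小美"},
    {"class": "资工一2", "name": "阿呆"},
]


def combined_key_counts(docs):
    """Count documents per class + name key, like the painless script does."""
    return Counter(doc["class"] + doc["name"] for doc in docs)


for key, count in sorted(combined_key_counts(DOCS).items()):
    print(key, count)
```

Every student appears once, so each combined key ends up with a doc_count of 1, just like in the result.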

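・include/exclude

These two parameters filter which terms get a bucket, based on the patterns you set.
Suppose I want every 资工 (computer science and information engineering) class but do not want to include 资工一2.
aggs query: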

{
  "aggs": {
    "class_num": {
      "terms": {
        "field": "class",
        "size": 10,
        "include": ".*资工.*",
        "exclude": ".*一2.*"
      }
    }
  }
}

Result:

"aggregations" : {
  "class_num" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : "资工一1",
        "doc_count" : 3
      }
    ]
  }
}
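The filtering logic can be sketched with Python's re module. Note that ES uses its own regular expression syntax, so this is only an approximation for simple patterns like the ones above, and "电机一1" is a made-up term added just to show include doing something.

```python
import re


def filter_terms(terms, include=None, exclude=None):
    """Keep terms that fully match `include` and do not fully match `exclude`."""
    kept = []
    for term in terms:
        if include is not None and not re.fullmatch(include, term):
            continue
        # exclude is checked after include, so it wins when both match.
        if exclude is not None and re.fullmatch(exclude, term):
            continue
        kept.append(term)
    return kept


# "电机一1" is a hypothetical class that is not part of the article's data.
terms = ["资工一1", "资工一2", "电机一1"]
print(filter_terms(terms, include=".*资工.*", exclude=".*一2.*"))  # ['资工一1']
```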

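・partition/num_partitions

When a single request would have to process too many unique values, these two parameters split the terms into several partitions so that each request only handles one of them.
partition: which partition this request should process (numbered from 0)
num_partitions: how many partitions to split the terms into

aggs query: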

{
  "aggs": {
    "class_num": {
      "terms": {
        "field": "class",
        "size": 10,
        "include": {
          "partition": 1,
          "num_partitions": 3
        }
      }
    }
  }
}

Result:

"aggregations" : {
  "class_num" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : "资工一2",
        "doc_count" : 3
      }
    ]
  }
}
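To cover all the terms you send one request per partition. Below is a minimal sketch that generates the request bodies; the function name is my own, and actually sending each body to ES is left out.

```python
def partition_bodies(field, num_partitions, size=10):
    """Yield one terms aggregation body per partition, numbered from 0."""
    for partition in range(num_partitions):
        yield {
            "aggs": {
                "class_num": {
                    "terms": {
                        "field": field,
                        "size": size,
                        "include": {
                            "partition": partition,
                            "num_partitions": num_partitions,
                        },
                    }
                }
            }
        }


bodies = list(partition_bodies("class", 3))
print(len(bodies))  # 3
print(bodies[1]["aggs"]["class_num"]["terms"]["include"])  # {'partition': 1, 'num_partitions': 3}
```

The second body is the same as the aggs query shown above.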

That's all for today's article. Thanks, everyone.
