Getting to Know ElasticSearch | 点滴诗词

What is ElasticSearch

Let's start with the definition from the official website.

Elasticsearch is a real-time, distributed storage, search, and analytics engine

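To understand what it is, first look at what it can do. The definition is clear: a distributed storage, search, and analytics engine.

At first glance, a database can do all of these as well.

Distributed storage - a database can also run as a primary/replica cluster
Search - a database can also search with LIKE '%...%'

True, that works, and MySQL supports full-text search too. But there is a problem: a LIKE pattern with a leading wildcard ('%keyword%') cannot use an ordinary B-tree index, so every query scans the whole table. Once the data volume is very large, such queries take seconds or longer.

There is one more concept I want to bring up: full-text search.

As with a search engine, the input varies widely. Different people phrase the same meaning in different ways. A database handles such input with poor accuracy and low efficiency, and under high concurrency it can be dragged down.

ElasticSearch is built specifically for search: it analyzes the user's input and efficiently finds the documents that match it best.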
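To make the contrast with LIKE '%...%' concrete, here is a minimal sketch of an inverted index, the kind of structure a full-text engine looks terms up in instead of scanning every row. The tokenizer (lowercase, split on non-alphanumerics) and the sample documents are assumptions for illustration; Lucene's real analyzers and index structures are far more elaborate.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms (toy analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Made-up sample documents
docs = {
    1: "Core Java Volume I",
    2: "Effective Java",
    3: "Learning Python",
}
index = build_inverted_index(docs)
print(sorted(search_all_terms(index, "java core")))
```

A lookup touches only the entries for the query terms, regardless of word order or letter case in the input, which is the property a leading-wildcard LIKE cannot offer.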

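Elasticsearch Basic Concepts

Near real-time (NRT)

ElasticSearch is built on the Lucene library. A newly indexed document first lands in an in-memory buffer, where it cannot be searched yet. It becomes searchable only after a refresh writes the buffer out as a new Lucene segment (the segment only needs to reach the filesystem cache, a full fsync to disk is not required). By default ElasticSearch refreshes once per second, so a change to a document becomes visible about one second later. That is why search is called near real-time. The refresh interval can be adjusted to suit your needs.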

A Lucene index with new documents in the in-memory buffer
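The refresh behaviour described above can be mimicked with a toy model. This is only a sketch: the class and method names are invented for illustration and say nothing about how Lucene or ElasticSearch implement segments internally.

```python
class ToyShard:
    """Toy model: documents sit in a buffer until refresh() makes them searchable."""

    def __init__(self):
        self.buffer = []       # indexed but not yet searchable
        self.searchable = []   # visible to search after a refresh

    def index(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        # Move buffered documents into the searchable set (a new "segment").
        self.searchable.extend(self.buffer)
        self.buffer.clear()

    def search(self, predicate):
        return [doc for doc in self.searchable if predicate(doc)]


shard = ToyShard()
shard.index({"title": "Core Java"})
before = shard.search(lambda doc: "Java" in doc["title"])
shard.refresh()
after = shard.search(lambda doc: "Java" in doc["title"])
print(len(before), len(after))
```

The document is invisible to the first search and visible to the second, which is the gap that the one-second default refresh keeps short.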

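Cluster

A single machine cannot store massive amounts of data, so a cluster is needed: multiple nodes are organized together to hold all the data jointly and to provide indexing and search together.

Node

A node is one server in the cluster. It stores part of the data and takes part in indexing and search.

Shards (shards & replicas)

An index may hold more data than the hardware of a single node can handle. To solve this, Elasticsearch can split an index into multiple pieces, called shards. To survive a single point of failure, each shard is kept in more than one copy: one primary shard and any number of replica shards. The number of replicas can be changed dynamically, and replicas also improve read performance.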
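As a rough sketch of how documents get spread over the primary shards: conceptually, Elasticsearch hashes a routing value (the document id by default) and takes it modulo the number of primary shards. The code below uses zlib.crc32 as a stand-in hash, which is an assumption made for illustration and not the hash Elasticsearch uses; the document ids are made up.

```python
import zlib


def pick_shard(routing, number_of_primary_shards):
    """Choose a primary shard for a routing value (stand-in hash: CRC32)."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


doc_ids = [f"doc-{i}" for i in range(1000)]
counts = [0, 0, 0]
for doc_id in doc_ids:
    counts[pick_shard(doc_id, 3)] += 1
print(counts)  # how many documents landed on each of the 3 shards
```

Because the shard number depends on the shard count, changing the number of primary shards would send existing ids to different shards.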

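Document

A document is the basic unit of information that can be indexed. Documents are expressed in JSON (JavaScript Object Notation).

Index

An index is a collection of documents that share somewhat similar characteristics.

Type

A type distinguishes different kinds of data within one index; each document carries a _type field for this purpose. Support for multiple types in one index was removed in stages: from 6.x an index can hold only one type, and in 7.x the _type is fixed and defaults to _doc.

As shown below, older versions of ES let you put documents of a chosen type into an index:

curl -XPOST localhost:9200/indexname/typename -H 'Content-Type:application/json' -d '{"data": 1234}'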

ElasticSearch RESTful API

ES exposes a RESTful API for reading and writing data, configuring the cluster, and querying cluster state.

Cluster status APIs

Cluster health

GET /_cluster/health

curl http://localhost:9200/_cluster/health --user xx:xxxx
{
  "cluster_name" : "es",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 1216,
  "active_shards" : 2432,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
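A response like the one above is easy to consume from a script. The sketch below turns a health document into a one-line verdict; the function name is hypothetical and the sample is a trimmed copy of the output shown above.

```python
import json


def summarize_health(health):
    """Turn a _cluster/health response into a short human-readable verdict."""
    status = health["status"]
    unassigned = health["unassigned_shards"]
    if status == "green":
        verdict = "all primary and replica shards are assigned"
    elif status == "yellow":
        verdict = f"all primaries assigned, {unassigned} replica shard(s) unassigned"
    else:
        verdict = f"at least one primary shard is unassigned ({unassigned} unassigned in total)"
    return f"{health['cluster_name']}: {status} - {verdict}"


sample = json.loads("""
{
  "cluster_name": "es",
  "status": "green",
  "number_of_nodes": 2,
  "active_primary_shards": 1216,
  "active_shards": 2432,
  "unassigned_shards": 0
}
""")
print(summarize_health(sample))
```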

Cluster node list

curl http://localhost:9200/_cat/nodes?v --user xxx:xxxx
ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role   master name
9.135.145.82            25          92   0    0.10    0.13     0.21 cdfhilmrstw *      es-node0
9.135.91.111            21          99   0    0.01    0.07     0.08 cdfhilmrstw -      es-node1
9.135.170.150           48          36   2    0.38    0.33     0.26 cdfhilmrstw -      es-node2
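The _cat output is meant for humans, but it can be parsed when needed. A minimal sketch, assuming the ?v header is present and no column value is empty or contains spaces; the helper name is hypothetical and the sample is the output shown above.

```python
def parse_cat_table(text):
    """Parse whitespace-separated _cat output (with ?v header) into dicts."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


sample = """
ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role   master name
9.135.145.82            25          92   0    0.10    0.13     0.21 cdfhilmrstw *      es-node0
9.135.91.111            21          99   0    0.01    0.07     0.08 cdfhilmrstw -      es-node1
9.135.170.150           48          36   2    0.38    0.33     0.26 cdfhilmrstw -      es-node2
"""
nodes = parse_cat_table(sample)
# The elected master is marked with "*" in the master column.
master = next(node["name"] for node in nodes if node["master"] == "*")
print(master)
```

For anything beyond a quick script, requesting the cat API with format=json avoids this kind of column splitting.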

Cluster health (cat API)

The result is consistent with _cluster/health.

curl --user xx:xxxx http://10.1.1.45:9200/_cat/health?v

epoch      timestamp cluster     status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1620725415 09:30:15  es-nnx25yd7 green          11         8   4632 2316    0    0        0             0                  -                100.0%

Shard allocation and disk usage per node

curl --user xx:xxxx http://10.1.1.45:9200/_cat/allocation?v

shards disk.indices disk.used disk.avail disk.total disk.percent host          ip            node
     9       38.8mb     9.1gb      8.6gb     17.7gb           51 192.168.2.114 192.168.2.114 node-1
     9       38.8mb     4.7gb       13gb     17.7gb           26 192.168.2.116 192.168.2.116 node-2

Index and document operations

List indices

curl http://localhost:9200/_cat/indices?v --user xx:xxxx

View index settings

curl http://localhost:9200/[index_name]/_settings

View index mappings

curl http://localhost:9200/[index_name]/_mapping --user xx:xxx

Create an index

# number_of_shards: primary shards; number_of_replicas: replica copies of each primary
curl -H "Content-Type: application/json" -XPUT localhost:9200/blogs -d '
{
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1
    }
}'

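The number of primary shards is fixed once the index is created and cannot be changed. To change it, create a new index and reindex the data into it.

At most n-1 replicas of a shard can actually be assigned (n is the number of data nodes), because a replica is never placed on the same node as its primary. The number of replicas can be changed at any time:

curl -H "Content-Type: application/json" -XPUT localhost:9200/blogs/_settings -d '
{
    "number_of_replicas": 2
}'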

Reindex

curl -H "Content-Type: application/json" -XPOST localhost:9200/_reindex -d '
{
    "source": {
        "index": "accesslog"
    },
    "dest": {
        "index": "newlog"
    }  
}'
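The same reindex call can be issued from a script. The sketch below only builds the request with the Python standard library and does not send it; the helper name is hypothetical, and the base URL and index names are the ones from the curl example above.

```python
import json
import urllib.request


def build_reindex_request(base_url, source_index, dest_index):
    """Build (but do not send) a POST /_reindex request."""
    body = {"source": {"index": source_index}, "dest": {"index": dest_index}}
    return urllib.request.Request(
        url=f"{base_url}/_reindex",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_reindex_request("http://localhost:9200", "accesslog", "newlog")
print(req.get_method(), req.full_url)
# To run it against a live cluster: urllib.request.urlopen(req)
```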

Delete an index

curl -H "Content-Type: application/json" -XDELETE localhost:9200/[indexname]

Querying documents

POST http://localhost:9200/indexname/_search

Match all

curl -XPOST http://localhost:9200/indexname/_search -H "Content-Type:application/json" -d '{"query":{"match_all":{} } }'

Exact match (documents with price=549)

curl -XPOST http://localhost:9200/indexname/_search -H "Content-Type:application/json" -d '{"query":{"constant_score":{"filter":{"term":{"price":549} } } } }'

term query (title="java")

curl -XPOST http://localhost:9200/indexname/_search -H "Content-Type:application/json" -d '{"query":{"term":{"title":"java"} } }'

Analyzed (full-text) query

curl -XPOST http://localhost:9200/indexname/_search -H "Content-Type:application/json" -d '{"query":{"match":{"title":"Core Java"} } }'

Analyzed query (all terms must match)

curl -XPOST http://localhost:9200/indexname/_search -H "Content-Type:application/json" -d '{"query":{"match":{"title":{"query":"Core Java", "operator":"and"} } } }'
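The difference between these queries is easier to see in a toy model. The sketch below imitates, in plain Python, how a match query analyzes its input and combines terms with or/and, and how a term query compares its value to the indexed terms without analysis. The analyzer (lowercase, split on non-alphanumerics) is an assumption; real analyzers and relevance scoring are much richer.

```python
import re


def analyze(text):
    """Toy analyzer: lowercase and split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match(field_value, query, operator="or"):
    """Toy match query on one analyzed text field."""
    doc_terms = set(analyze(field_value))
    hits = [term in doc_terms for term in analyze(query)]
    return all(hits) if operator == "and" else any(hits)


def term(indexed_terms, value):
    """Toy term query: the value is compared to indexed terms as-is."""
    return value in indexed_terms


title = "Effective Java"
print(match(title, "Core Java"))                  # "or": one matching term is enough
print(match(title, "Core Java", operator="and"))  # "and": every term must be present
print(term(analyze(title), "java"), term(analyze(title), "Java"))
```

In this model the term lookup for "Java" finds nothing because the indexed term is the lowercased "java", which is the usual reason a term query on an analyzed text field comes back empty.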

Index templates

dynamic template

"dynamic_templates": [
    {
      "my_template_name": { 
        ...  match conditions ... 
        "mapping": { ... }    # mapping applied to the matching fields
      }
    },
    ...
  ]
# The match conditions can include any of : match_mapping_type, match, match_pattern, unmatch, path_match, path_unmatch.

match_mapping_type

PUT my_index
{
  "mappings": {
    "dynamic_templates": [
      {
        "integers": {                        # template name
          "match_mapping_type": "long",      # every field whose detected type is long
          "mapping": {
            "type": "integer"                # map it as integer instead
          }
        }
      },
      {
        "string_not_analyzed": {
          "match_mapping_type": "string",    # every field detected as a string
          "mapping": {
            "type": "text",
            "fields": {
              "raw": {
                "type":  "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    ]
  }
}

match and unmatch

match and unmatch define patterns that are applied to the field name.

The following template matches all string fields whose names start with long_ and do not end with _text:

PUT my_index
{
  "mappings": {
    "dynamic_templates": [
      {
        "longs_as_strings": {
          "match_mapping_type": "string",
          "match":   "long_*",
          "unmatch": "*_text",
          "mapping": {
            "type": "long"
          }
        }
      }
    ]
  }
}
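To check which field names a template like longs_as_strings would catch, the conditions can be imitated in a few lines. This is a sketch under assumptions: Python's fnmatchcase stands in for the simple * wildcard matching, and the function name and sample field names are made up.

```python
from fnmatch import fnmatchcase


def template_applies(field_name, detected_type, template):
    """Check a field against match_mapping_type / match / unmatch (toy version)."""
    wanted_type = template.get("match_mapping_type")
    if wanted_type is not None and wanted_type != detected_type:
        return False
    if "match" in template and not fnmatchcase(field_name, template["match"]):
        return False
    if "unmatch" in template and fnmatchcase(field_name, template["unmatch"]):
        return False
    return True


longs_as_strings = {
    "match_mapping_type": "string",
    "match": "long_*",
    "unmatch": "*_text",
}
for name in ("long_num", "long_text", "short_num"):
    print(name, template_applies(name, "string", longs_as_strings))
```

Only long_num passes all three conditions: long_text is excluded by unmatch and short_num never satisfies match.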

Example

curl -XPOST http://10.1.1.12:9200/_template/default@template --user xx:xxxx -H "Content-Type:application/json" -d '{
    "order" : 1,
    "index_patterns" : [
      "*"
    ],
    "settings" : {
      "index" : {
        "max_result_window" : "65536",
        "refresh_interval" : "30s",
        "unassigned" : {
          "node_left" : {
            "delayed_timeout" : "5m"
          }
        },
        "translog" : {
          "sync_interval" : "5s",
          "durability" : "async"
        },
        "number_of_replicas" : "1"
      }
    },
    "mappings" : {
      "dynamic_templates" : [
        {
          "message_full" : {
            "mapping" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "ignore_above" : 2048,
                  "type" : "keyword"
                }
              }
            },
            "match" : "message_full"
          }
        },
        {
          "msg" : {
            "mapping" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "ignore_above" : 2048,
                  "type" : "keyword"
                }
              }
            },
            "match_pattern": "regex",
            "match" : "msg|pl_message|json"
          }
        },
        {
          "payload_data" : {
            "mapping" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "ignore_above" : 2048,
                  "type" : "keyword"
                }
              }
            },
            "match" : "*payload"
          }
        },
        {
          "message" : {
            "mapping" : {
              "type" : "text"
            },
            "match" : "message"
          }
        },
        {
          "strings" : {
            "mapping" : {
              "type" : "keyword"
            },
            "match_mapping_type" : "string"
          }
        }
      ]
    },
    "aliases" : { }
  }'
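Dynamic templates are tried in order and the first one that matches wins, so the order of the entries in the template above matters. The sketch below replays that order for a few field names. The rule table is transcribed from the example, while the matching helpers are assumptions: fnmatchcase stands in for the simple wildcard matching and re.fullmatch for match_pattern regex.

```python
import re
from fnmatch import fnmatchcase

# (template name, match_pattern, match, match_mapping_type), in template order
RULES = [
    ("message_full", "simple", "message_full", None),
    ("msg", "regex", "msg|pl_message|json", None),
    ("payload_data", "simple", "*payload", None),
    ("message", "simple", "message", None),
    ("strings", "simple", None, "string"),
]


def resolve(field_name, detected_type):
    """Return the name of the first rule that matches the field, or None."""
    for name, pattern_kind, pattern, wanted_type in RULES:
        if wanted_type is not None and wanted_type != detected_type:
            continue
        if pattern is not None:
            if pattern_kind == "regex":
                if not re.fullmatch(pattern, field_name):
                    continue
            elif not fnmatchcase(field_name, pattern):
                continue
        return name
    return None


# Made-up field names
for field in ("message", "pl_message", "http_payload", "user"):
    print(field, "->", resolve(field, "string"))
```

The catch-all strings rule sits last on purpose: placed first, it would claim every string field before the more specific rules were tried.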

Snapshots

# register a snapshot repository
PUT /_snapshot/my_fs_backup
{
    "type": "fs",
    "settings": {
        "location": "/opt/backup_es",
        "compress": true
    }
}

location (for example my_fs_backup_location): the path must first be registered under path.repo in elasticsearch.yml

path.repo: /opt/backup_es


`location`	Location of the snapshots. Mandatory.
`compress`	Turns on compression of the snapshot files. Compression is applied only to metadata files (index mapping and settings). Data files are not compressed. Defaults to `true`.
`chunk_size`	Big files can be broken down into chunks during snapshotting if needed. Specify the chunk size as a value and unit, for example: `1GB`, `10MB`, `5KB`, `500B`. Defaults to `null` (unlimited chunk size).
`max_restore_bytes_per_sec`	Throttles per node restore rate. Defaults to `40mb` per second.
`max_snapshot_bytes_per_sec`	Throttles per node snapshot rate. Defaults to `40mb` per second.
`readonly`	Makes repository read-only. Defaults to `false`.
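A small pre-flight check can catch the most common registration mistake, a location that is not covered by path.repo. The sketch below is an approximation with hypothetical helper names: it handles absolute POSIX paths only and does not reproduce every rule Elasticsearch applies when it validates a repository.

```python
from pathlib import PurePosixPath


def location_allowed(location, path_repo):
    """True if `location` equals or lies under one of the path.repo entries."""
    loc = PurePosixPath(location)
    for root in path_repo:
        root_path = PurePosixPath(root)
        if loc == root_path or root_path in loc.parents:
            return True
    return False


def fs_repository_body(location, compress=True):
    """Request body for PUT /_snapshot/<name> with a shared-filesystem repository."""
    return {"type": "fs", "settings": {"location": location, "compress": compress}}


path_repo = ["/opt/backup_es"]
print(location_allowed("/opt/backup_es", path_repo))
print(location_allowed("/opt/other", path_repo))
print(fs_repository_body("/opt/backup_es"))
```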

Snapshot policy

SLM (snapshot lifecycle management)

Setting passwords for Elasticsearch

Add the following to elasticsearch.yml:

xpack.security.enabled: true
xpack.license.self_generated.type: basic
xpack.security.transport.ssl.enabled: true

Restart ES, then run:

bin/elasticsearch-setup-passwords interactive

This prompts interactively for a password for each built-in user, such as elastic, kibana, logstash_system, and beats_system.

Change a password:

curl -H "Content-Type:application/json" -XPOST -u elastic 'http://127.0.0.1:9200/_xpack/security/user/elastic/_password' -d '{ "password" : "123456" }'

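Index settings

index.refresh_interval

Newly indexed data cannot be searched right away; it becomes searchable only after a refresh. This setting controls how often that refresh happens, and therefore how long it takes for new data to become searchable.

index.translog

sync_interval - how often the translog is fsynced to disk (5s by default)
durability - request (the default) fsyncs the translog before each request is acknowledged; async fsyncs it only every sync_interval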

Why yellow

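A yellow status means every primary shard is assigned but at least one replica shard is not. Common causes of unassigned shards:

Failure of multiple data nodes
A corrupted or red shard being used for the index
High JVM memory pressure or CPU utilization
Insufficient disk space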

Fix yellow

List unassigned shards

curl -XGET 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED

Output:

xxxxx                             0 r UNASSIGNED INDEX_CREATED
yyyyy                             0 r UNASSIGNED INDEX_CREATED
zzzzz                             0 r UNASSIGNED INDEX_CREATED
rrrrr                             0 r UNASSIGNED INDEX_CREATED

This lists every unassigned shard.
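When the list is long, it helps to group the unassigned shards by reason instead of reading the grep output line by line. A minimal sketch, assuming the column order requested above (index, shard, prirep, state, unassigned.reason); the function name is hypothetical and the STARTED row in the sample is made up to show that assigned shards are skipped.

```python
from collections import Counter


def unassigned_shards(cat_shards_output):
    """Extract UNASSIGNED rows from `_cat/shards?h=index,shard,prirep,state,unassigned.reason`."""
    rows = []
    for line in cat_shards_output.strip().splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[3] == "UNASSIGNED":
            reason = parts[4] if len(parts) > 4 else ""
            rows.append({
                "index": parts[0],
                "shard": int(parts[1]),
                "prirep": parts[2],   # "p" = primary, "r" = replica
                "reason": reason,
            })
    return rows


sample = """
xxxxx 0 r UNASSIGNED INDEX_CREATED
yyyyy 0 r UNASSIGNED INDEX_CREATED
zzzzz 0 p STARTED
"""
rows = unassigned_shards(sample)
by_reason = Counter(row["reason"] for row in rows)
print(by_reason)
```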

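Find out why a shard is unassigned

curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty' -H 'Content-Type:application/json' -d'{"index": "xxxxx", "shard": 0, "primary":false}'

Output: (not recorded)

The response explains, for every node in the cluster, why the shard cannot be allocated to it.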

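Resolution

If the cause is insufficient disk space, delete indices you no longer need. For other causes, address whatever is blocking allocation. A few common causes:

a. cluster.max_shards_per_node defaults to 1000, and the node has already reached its shard limit.

b. Disk usage has reached a configured threshold. For example, once a node passes the low watermark (85% by default), no more shards are allocated to it.

c. The index requires its shards to be placed on hot nodes, and no eligible hot node is available.

The current disk allocation settings can be checked with:

curl -XGET 'localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty'

Output (truncated, the full output is long):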

{
  "persistent" : {
    "cluster.routing.allocation.disk.watermark.flood_stage" : "95%",
    "cluster.routing.allocation.disk.watermark.high" : "90%",
    "cluster.routing.allocation.disk.watermark.low" : "85%"
  },
  "transient" : {
    "cluster.max_shards_per_node" : "10000",
    "cluster.routing.allocation.disk.watermark.flood_stage" : "95%",
    "cluster.routing.allocation.disk.watermark.high" : "90%",
    "cluster.routing.allocation.disk.watermark.low" : "85%"
  },
  .....
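The three watermarks above can be read as thresholds on a node's disk usage. The sketch below classifies a usage percentage against flat settings like the ones in this output. It is a simplification with hypothetical function names: it handles percentage values only (watermarks may also be given as absolute sizes) and ignores the finer points of how allocation decisions are made.

```python
PREFIX = "cluster.routing.allocation.disk.watermark."


def parse_percent(value):
    """Convert a watermark such as '85%' to a float (percent values only)."""
    return float(value.rstrip("%"))


def disk_state(used_percent, settings):
    """Classify a node's disk usage against the three disk watermarks."""
    if used_percent > parse_percent(settings[PREFIX + "flood_stage"]):
        return "flood_stage"   # indices with a shard on the node get a read-only block
    if used_percent > parse_percent(settings[PREFIX + "high"]):
        return "high"          # shards are relocated away from the node
    if used_percent > parse_percent(settings[PREFIX + "low"]):
        return "low"           # no new shards are allocated to the node
    return "ok"


settings = {
    PREFIX + "flood_stage": "95%",
    PREFIX + "high": "90%",
    PREFIX + "low": "85%",
}
for used in (51, 87, 92, 96):
    print(used, disk_state(used, settings))
```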

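Index lifecycle management (ILM)

ILM suits a single index that keeps growing: configure an ILM rollover to split it by size or by document count.

For daily indices, configure rules in the delete phase.

Create an ILM policy (hot/warm/cold/delete)
Create an index template that defines which indices ILM applies to
Create the first rollover index. Its name must end with a number so that each rollover increments it by one, e.g. carlshi-000001. Set the is_write_index option on its alias
Write data through the alias, which targets the current write index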

For Example:

# Create the index template
# index_patterns: names of the indices the template applies to; mappings: field mappings
PUT /_template/carl_template
{
  "index_patterns": [
    "carlshi-*"
  ],
  "settings": {
      "refresh_interval": "30s",
      "number_of_shards": "1",
      "number_of_replicas": "0"
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      }
    }
  }
}

Create the index:

# Create the first index
# carlshi-index is the write alias: anything written to carlshi-index goes to carlshi-000001
PUT /carlshi-000001
{
  "aliases": {
    "carlshi-index": {
      "is_write_index": true
    }
  }
}
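Two details of the rollover setup can be sketched in code. The function below imitates how a numeric suffix is incremented on rollover (zero-padded to six digits), and the dictionary shows the two index settings a template would need so that ILM manages the matching indices. The function name and the policy name carlshi-policy are hypothetical; the alias is the one created above.

```python
import re


def next_rollover_name(index_name):
    """Compute the next index name by incrementing the numeric suffix."""
    m = re.fullmatch(r"(.*)-(\d+)", index_name)
    if m is None:
        raise ValueError("index name must end with '-' followed by a number")
    prefix, counter = m.group(1), int(m.group(2))
    return f"{prefix}-{counter + 1:06d}"


# Settings that tie indices to an ILM policy and a rollover alias.
# "carlshi-policy" is a hypothetical policy name.
ilm_settings = {
    "index.lifecycle.name": "carlshi-policy",
    "index.lifecycle.rollover_alias": "carlshi-index",
}

print(next_rollover_name("carlshi-000001"))
```

A name without a numeric suffix, such as carlshi, raises ValueError here, mirroring the requirement that the first index name ends in a number.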

elasticsearch docker

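Run elasticsearch directly; Docker pulls the image automatically and starts it. All docker run options, including -v, must come before the image name:

docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -v /usr/share/elasticsearch/data:/usr/share/elasticsearch/data elasticsearch:7.5.1

Once it is running, use curl to fetch basic information: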

curl localhost:9200
{
  "name" : "be856c56d8bd",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "bsnwunE2SnWcBIqoxbgnUw",
  "version" : {
    "number" : "7.5.1",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "3ae9ac9a93c95bd0cdc054951cf95d88e1e18d96",
    "build_date" : "2019-12-16T22:57:37.835892Z",
    "build_snapshot" : false,
    "lucene_version" : "8.3.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

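Summary

ElasticSearch is a powerful full-text search tool. Its REST API makes it very easy to use, it provides strong high availability for data, and it can be configured for different levels of availability and performance to suit your needs.

This post gave a brief introduction to the commonly used parts of ElasticSearch: basic index properties and operations (create, read, update, delete), dynamic templates, snapshot backups, and index lifecycle management. It also recorded how to troubleshoot a yellow cluster. Later posts will go deeper into the configuration of each part and even the internal implementation.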