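Out of the box, Elasticsearch handles Chinese by splitting the text into individual characters; it has no notion of words. In practice, users search by words, so if documents are segmented into words, they match the user's query terms more precisely and lookups tend to be faster.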

GET _analyze
{
  "text": ["wang ting niubi", "今天给力"]
}
# Result:
{
  "tokens" : [
    {
      "token" : "wang",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "ting",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "niubi",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "今",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "天",
      "start_offset" : 17,
      "end_offset" : 18,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "给",
      "start_offset" : 18,
      "end_offset" : 19,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "力",
      "start_offset" : 19,
      "end_offset" : 20,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    }
  ]
}

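The example makes the problem clear. For "wang ting niubi" and "今天给力", splitting the English text on whitespace is reasonable, but splitting the Chinese text into single characters is not: the two words 今天 ("today") and 给力 ("awesome") are not recognized.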

Download URL for the analyzer plugin:

# Go to the plugins directory of ES
aaa@ops01:/home >cd /opt/module/elasticsearch-6.6.0/plugins/
# Create a plugin directory (each plugin gets its own subdirectory under plugins)
aaa@ops01:/opt/module/elasticsearch-6.6.0/plugins >mkdir ik
aaa@ops01:/opt/module/elasticsearch-6.6.0/plugins >cd ik
# Download the IK plugin zip package into this directory
aaa@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik >ls
elasticsearch-analysis-ik-6.6.0.zip
# Unzip the package and remove the zip file
aaa@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik >unzip elasticsearch-analysis-ik-6.6.0.zip && rm elasticsearch-analysis-ik-6.6.0.zip
# Directory layout
aaa@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik >ll
total 1432
-rw-r--r-- 1 aaa aaa 263965 May  6  2018 commons-codec-1.9.jar
-rw-r--r-- 1 aaa aaa  61829 May  6  2018 commons-logging-1.2.jar
drwxr-xr-x 2 aaa aaa   4096 Aug 26  2018 config
-rw-r--r-- 1 aaa aaa  54693 Jan 30  2019 elasticsearch-analysis-ik-6.6.0.jar
-rw-r--r-- 1 aaa aaa 736658 May  6  2018 httpclient-4.5.2.jar
-rw-r--r-- 1 aaa aaa 326724 May  6  2018 httpcore-4.4.4.jar
-rw-r--r-- 1 aaa aaa   1805 Jan 30  2019 plugin-descriptor.properties
-rw-r--r-- 1 aaa aaa    125 Jan 30  2019 plugin-security.policy
# Distribute the plugin to the other nodes
root@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik >cd ..
root@ops01:/opt/module/elasticsearch-6.6.0/plugins >scp -r ik ops02:/opt/module/elasticsearch-6.6.0/plugins/
root@ops01:/opt/module/elasticsearch-6.6.0/plugins >scp -r ik ops03:/opt/module/elasticsearch-6.6.0/plugins/

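Testing Chinese segmentation with IK

IK provides two commonly used analyzers, ik_smart and ik_max_word.

ik_smart

Coarse-grained segmentation: the text is split into non-overlapping words, so each character is used only once.

ik_max_word

Fine-grained segmentation: every word that a character can form with its neighbours is emitted, so tokens may overlap and as many terms as possible are produced.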
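
The two analyzers can be compared by sending the same text to the `_analyze` API once with each. A minimal sketch using curl, assuming a node is reachable at `http://localhost:9200`; the `ES_URL` variable and the helper names `analyze_body` and `analyze` are illustrative and not part of IK or Elasticsearch:

```shell
#!/usr/bin/env bash
# Assumption: a node is reachable at ES_URL; adjust for your cluster.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the JSON body for the _analyze API.
# $1 = analyzer name, $2 = text. This sketch does no JSON escaping, so the
# text must not contain double quotes or backslashes.
analyze_body() {
  printf '{"analyzer":"%s","text":"%s"}' "$1" "$2"
}

# Send the request and pretty-print the response.
analyze() {
  curl -s -X POST "$ES_URL/_analyze?pretty" \
       -H 'Content-Type: application/json' \
       -d "$(analyze_body "$1" "$2")"
}

# Usage (requires a running cluster with the IK plugin installed):
#   analyze ik_smart    "今天给力"
#   analyze ik_max_word "今天给力"
```

Running both calls on the same sentence shows how the token lists differ between the two analyzers.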

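[Note] Different analyzers segment the same text in clearly different ways. From now on, do not rely on the default mapping when defining a type; create the mapping by hand and specify the analyzer, choosing the one that suits the use case.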
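
A sketch of such a hand-written mapping for Elasticsearch 6.x, assuming a hypothetical index `articles` with a single `content` field. Indexing with `ik_max_word` and searching with `ik_smart` is one common pairing, not something this article prescribes; the helper names are made up for illustration:

```shell
#!/usr/bin/env bash
# Assumption: a node is reachable at ES_URL; adjust for your cluster.
ES_URL="${ES_URL:-http://localhost:9200}"

# Print the index-creation body. In 6.x the mapping is nested under a type
# name (here _doc); "articles" and "content" are hypothetical names.
ik_mapping_body() {
  cat <<'EOF'
{
  "mappings": {
    "_doc": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        }
      }
    }
  }
}
EOF
}

# Create the index with the explicit mapping.
create_articles_index() {
  ik_mapping_body | curl -s -X PUT "$ES_URL/articles?pretty" \
       -H 'Content-Type: application/json' --data-binary @-
}

# Usage (requires a running cluster with the IK plugin installed):
#   create_articles_index
```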

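Custom Chinese dictionary

New buzzwords appear all the time. The one I have run into most recently is yyds, short for 永远的神 (literally "eternal god"). With the stock dictionary, 永远的神 may be split into 永远, 的 and 神 instead of being kept as the single term we have in mind.

Handling this requires maintaining a user-defined Chinese dictionary.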
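
An IK dictionary is a plain UTF-8 text file with one word per line. A small sketch of a helper that appends a word only if the file does not already contain it; the function name `add_word` is illustrative:

```shell
#!/usr/bin/env bash
# Append a word to a dictionary file unless an identical line already exists.
# $1 = word, $2 = path of the dictionary file.
add_word() {
  local word="$1" dict="$2"
  touch "$dict"
  # -x: match the whole line, -F: treat the word as a literal string.
  grep -qxF -- "$word" "$dict" || printf '%s\n' "$word" >> "$dict"
}

# Usage:
#   add_word "永远的神" /path/to/dictionary.txt
```

Compared with a bare `echo "word" >> file`, this keeps the file free of duplicate entries when the same word is added twice.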

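Before setting up the custom dictionary, run an example query and keep the result so it can be compared with the output once the dictionary is in place:

Before installation: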
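
To keep the baseline compact, the `token` values can be pulled out of the `_analyze` response. A rough sketch using grep and sed, assuming the pretty-printed response layout shown earlier (`"token" : "..."`); the helper name `extract_tokens` and the file name `before.txt` are made up for illustration:

```shell
#!/usr/bin/env bash
# Read an _analyze response on stdin and print one token per line.
# Relies on the "token" : "value" layout of the response; a JSON-aware tool
# such as jq would be more robust if it is available.
extract_tokens() {
  grep -ao '"token" *: *"[^"]*"' | sed 's/^"token" *: *"//; s/"$//'
}

# Usage (requires a running cluster with the IK plugin installed):
#   curl -s "http://localhost:9200/_analyze?pretty" \
#        -H 'Content-Type: application/json' \
#        -d '{"analyzer":"ik_smart","text":"永远的神"}' | extract_tokens > before.txt
```

Saving the same output to a second file after the dictionary is deployed lets `diff` show exactly which tokens changed.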

Install and deploy the custom dictionary:

root@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik/config >pwd
/opt/module/elasticsearch-6.6.0/plugins/ik/config
root@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik/config >ls
extra_main.dic  extra_single_word.dic  extra_single_word_full.dic  extra_single_word_low_freq.dic  extra_stopword.dic  IKAnalyzer.cfg.xml  main.dic  preposition.dic  quantifier.dic  stopword.dic  suffix.dic  surname.dic
# Edit config/IKAnalyzer.cfg.xml of the IK plugin
# Set <entry key="remote_ext_dict"> to the dictionary URL served by nginx
root@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik/config >vim IKAnalyzer.cfg.xml 
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
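        <comment>IK Analyzer extension configuration</comment>
        <!-- Local extension dictionaries can be configured here -->
        <entry key="ext_dict"></entry>
        <!-- Local extension stopword dictionaries can be configured here -->
        <entry key="ext_stopwords"></entry>
        <!-- Remote extension dictionary; NGINX_HOST is a placeholder for the address of the nginx machine (ops04 in this walkthrough) -->
        <entry key="remote_ext_dict">http://NGINX_HOST/fenci/esword.txt</entry>
        <!-- Remote extension stopword dictionary -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->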
</properties>

# Switch to the machine that runs nginx (install nginx first if it is not there)
root@ops04:/usr/local/nginx-1.10/conf #cd /usr/local/nginx-1.10/
root@ops04:/usr/local/nginx-1.10 #mkdir ik
root@ops04:/usr/local/nginx-1.10 #cd ik
root@ops04:/usr/local/nginx-1.10/ik #mkdir fenci
root@ops04:/usr/local/nginx-1.10/ik #cd fenci
root@ops04:/usr/local/nginx-1.10/ik/fenci #echo "王亭" >> esword.txt
root@ops04:/usr/local/nginx-1.10/ik/fenci #echo "永远的神" >> esword.txt
root@ops04:/usr/local/nginx-1.10/ik/fenci #echo "神圣赞美诗" >> esword.txt
root@ops04:/usr/local/nginx-1.10/ik/fenci #cat esword.txt 
王亭
永远的神
神圣赞美诗

root@ops04:/usr/local/nginx-1.10/ik/fenci #vim /usr/local/nginx-1.10/conf/nginx.conf
        listen       80;
        server_name  localhost;

        location / {
            root   html;
            index  index.html index.htm;
        }

        # Add the following block:
        location /fenci/ {
            root   ik;
        }

root@ops04:/usr/local/nginx-1.10/ik/fenci #/usr/local/nginx-1.10/sbin/nginx -s reload

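# The URL must match the remote_ext_dict entry in IKAnalyzer.cfg.xml; it is more sensible to get nginx working first and edit IKAnalyzer.cfg.xml afterwards
# (NGINX_HOST is a placeholder for the address of the nginx machine)
root@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik/config >curl http://NGINX_HOST/fenci/esword.txt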
王亭
永远的神
神圣赞美诗

# Distribute the modified XML configuration to the other nodes
root@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik/config >scp IKAnalyzer.cfg.xml ops02:/opt/module/elasticsearch-6.6.0/plugins/ik/config/ 
root@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik/config >scp IKAnalyzer.cfg.xml ops03:/opt/module/elasticsearch-6.6.0/plugins/ik/config/ 

# Restart ES (on every node)
root@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik/config >jps | grep Elasticsearch|awk -F" " '{print $1}'
13077
root@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik/config >kill -9 13077
root@ops01:/opt/module/elasticsearch-6.6.0/plugins/ik/config >cd /opt/module/elasticsearch-6.6.0/bin/
root@ops01:/opt/module/elasticsearch-6.6.0/bin >./elasticsearch -d

After restarting ES, run the test again: the newly defined words are now recognized.
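
According to the IK plugin's README, a remote dictionary is polled periodically and fetched again whenever the `Last-Modified` or `ETag` response header changes, so words appended to the file later should be picked up without another restart. A small sketch for checking that the server sends both headers; the helper name `has_reload_headers` is made up, and `NGINX_HOST` in the usage comment is a placeholder:

```shell
#!/usr/bin/env bash
# Read HTTP response headers on stdin; succeed only if both Last-Modified
# and ETag are present (header names are matched case-insensitively).
has_reload_headers() {
  local headers
  headers="$(cat)"
  printf '%s\n' "$headers" | grep -qi '^last-modified:' &&
  printf '%s\n' "$headers" | grep -qi '^etag:'
}

# Usage (replace the URL with your remote_ext_dict address):
#   curl -sI "http://NGINX_HOST/fenci/esword.txt" | has_reload_headers \
#     && echo "hot reload headers present"
```

nginx normally sends both headers for static files, so the check mainly guards against a proxy or custom server in front of the dictionary stripping them.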
