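This post is a short record of the pitfalls I ran into while setting up the HanLP plugin for Elasticsearch (ES).
HanLP plugins for ES
Most of the HanLP plugins for ES are listed on this HanLP wiki page: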
https://github.com/hankcs/HanLP/wiki/%E8%A1%8D%E7%94%9F%E9%A1%B9%E7%9B%AE
The one I picked this time is https://github.com/shikeio/elasticsearch-analysis-hanlp
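Preparation
- Elasticsearch 6.2.4
- Kibana 6.2.4
- The plugin's prebuilt release, analysis-hanlp-6.2.4.zip
  - You can also download the source and build it yourself (the "gradle mvn" mentioned on its GitHub page), but the elasticsearch-gradle dependency that the build references requires JDK 10 to compile.
  - The prebuilt release was compiled with JDK 9, so its class files have version 53. As a result, Elasticsearch must later be run on at least Java 9 as well.
- HanLP data: data-for-1.6.4.zip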
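The class-file version mentioned in the list above can be checked directly. The following is a minimal sketch, not part of the plugin: the class name `ClassFileVersion` and its methods are hypothetical helpers written for this post. It reads the 8-byte header of a `.class` file (extracted from the plugin jar) and maps the major version to a Java release.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical helper: reports which Java release a .class file targets.
public class ClassFileVersion {

    // Returns the major version stored in the class-file header,
    // or -1 if the header is too short or the magic number is wrong.
    public static int majorVersion(byte[] header) {
        if (header == null || header.length < 8) {
            return -1;
        }
        ByteBuffer buf = ByteBuffer.wrap(header); // big-endian by default
        if (buf.getInt() != 0xCAFEBABE) {
            return -1;
        }
        buf.getShort();                 // skip the minor version
        return buf.getShort() & 0xFFFF; // major version, unsigned
    }

    // Maps a major version to a Java release: 52 -> 8, 53 -> 9, 54 -> 10.
    public static int javaRelease(int major) {
        return major - 44;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ClassFileVersion <path/to/Some.class>");
            return;
        }
        byte[] header = new byte[8];
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            in.readFully(header);
        }
        int major = majorVersion(header);
        if (major < 0) {
            System.out.println("not a class file: " + args[0]);
        } else {
            System.out.println("major version " + major
                    + ", needs Java " + javaRelease(major) + " or newer");
        }
    }
}
```

To use it, extract any `.class` file from the plugin's jar and pass its path as the only argument. A major version of 53 corresponds to Java 9.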
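Assume ES is unpacked into {es_home}, and pick any other, unrelated directory as {hanlp_home}. Unzip data-for-1.6.4.zip into {hanlp_home}, which gives the following directory structure: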
    .
Getting started
Initial startup
- Start ES
- Start Kibana
- Open http://127.0.0.1:5601/ in a browser
- Go to Dev Tools
- Enter the following in the Console text box and click send
    GET _search
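If none of the steps above reports an error, everything is working in its initial state. If something is wrong, fix that first on your own.
Deploying the plugin
Unzip analysis-hanlp-6.2.4.zip to get an elasticsearch directory, rename it to analysis-hanlp, and move it into {es_home}/plugins. Then edit hanlp.properties and {es_home}/config/jvm.options as described in the plugin's documentation, and restart ES (Kibana does not need a restart).
If all goes as expected, you will get the following error, in which HanLP reports that it could not find hanlp.properties: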
    Jun 30, 2018 10:38:11 PM com.hankcs.hanlp.HanLP$Config <clinit>
    SEVERE: 没有找到hanlp.properties,可能会导致找不到data
    ========Tips========
    请将hanlp.properties放在下列目录:
    Web项目则请放到下列目录:
    Webapp/WEB-INF/lib
    Webapp/WEB-INF/classes
    Appserver/lib
    JRE/lib
    并且编辑root=PARENT/path/to/your/data
    现在HanLP将尝试从/home/gordon/Dev/es/6.2.4/elasticsearch-6.2.4读取data……
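It looks as if the default configuration kicked in after the config file was not found, and nothing has failed yet. But as soon as you send a request that uses hanlp, ES crashes: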
    [2018-06-30T22:38:18,158][INFO ][o.e.c.s.ClusterApplierService] [3dzf4Ix] new_master {3dzf4Ix}{3dzf4Ix_RgWyUEeodaAkDA}{oT5oXRhcTNuDvM8NbIAqNg}{127.0.0.1}{127.0.0.1:9300}, reason: apply cluster state (from master [master {3dzf4Ix}{3dzf4Ix_RgWyUEeodaAkDA}{oT5oXRhcTNuDvM8NbIAqNg}{127.0.0.1}{127.0.0.1:9300} committed version [1] source [zen-disco-elected-as-master ([0] nodes joined)]])
    [2018-06-30T22:38:18,183][INFO ][o.e.h.n.Netty4HttpServerTransport] [3dzf4Ix] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}
    [2018-06-30T22:38:18,183][INFO ][o.e.n.Node ] [3dzf4Ix] started
    [2018-06-30T22:38:18,197][INFO ][o.e.g.GatewayService ] [3dzf4Ix] recovered [0] indices into cluster_state
    [2018-06-30T22:43:31,191][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [] fatal error in thread [elasticsearch[3dzf4Ix][index][T#1]], exiting
    java.lang.ExceptionInInitializerError: null
        at com.hankcs.hanlp.seg.common.Vertex.newB(Vertex.java:462) ~[?:?]
        at com.hankcs.hanlp.seg.common.WordNet.<init>(WordNet.java:73) ~[?:?]
        at com.hankcs.hanlp.seg.Viterbi.ViterbiSegment.segSentence(ViterbiSegment.java:40) ~[?:?]
        at com.hankcs.hanlp.seg.Segment.seg(Segment.java:557) ~[?:?]
        at com.hankcs.lucene.SegmentWrapper.next(SegmentWrapper.java:98) ~[?:?]
        at com.hankcs.lucene.HanLPTokenizer.incrementToken(HanLPTokenizer.java:67) ~[?:?]
        at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.simpleAnalyze(TransportAnalyzeAction.java:266) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.analyze(TransportAnalyzeAction.java:243) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:164) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:80) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:293) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:286) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:656) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.4.jar:6.2.4]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:844) [?:?]
    Caused by: java.lang.IllegalArgumentException: 核心词典data/dictionary/CoreNatureDictionary.txt加载失败
        at com.hankcs.hanlp.dictionary.CoreDictionary.<clinit>(CoreDictionary.java:44) ~[?:?]
        ... 20 more
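So the question is: why can't it find hanlp.properties?
Debugging
Many Google searches turned up no explanation, and the documentation says nothing either, so the only option left was to read the source. The message is printed from the class com.hankcs.hanlp.HanLP, so I went to the HanLP source: https://github.com/hankcs/HanLP
After downloading it, a full-text search for 『没有找到hanlp.properties』 ("hanlp.properties not found") returns exactly one hit, in src/main/java/com/hankcs/hanlp/HanLP.java. The key part is: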
    p.load(new InputStreamReader(Predefine.HANLP_PROPERTIES_PATH == null ?
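In the source, Predefine.HANLP_PROPERTIES_PATH is only declared and never assigned, so at first I assumed the failing branch was loader.getResourceAsStream. I tried placing hanlp.properties in all sorts of locations, and none of them worked. Next I tried adding one more loading step after the failure, so I modified the source, rebuilt it, and replaced the hanlp-1.6.4.jar inside the plugin. That revealed that when the plugin runs, Predefine.HANLP_PROPERTIES_PATH does have a value: analysis-hanlp/hanlp.properties. My modified logic did not swallow the exception, so when the FileInputStream was constructed and the file was missing, the printed exception included the file's absolute path, and that is how I found the problem. Looking at the source of HanLP.java left me rather speechless...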
    Properties p = new Properties();
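It swallows the exception, and the message it prints is of little use. Worse, instead of printing the exception when it occurs, it checks whether a different file exists and infers from that whether hanlp.properties exists... Keep it simple, just say it simply...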
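For contrast, here is a minimal sketch of a loader that does not swallow the failure. This is not HanLP's code and not the exact patch described above. The class `LoudPropertiesLoader` and its methods are hypothetical, and it assumes that a null path means "look on the classpath" while a non-null path means "read this file". In this sketch a relative path resolves against the JVM's working directory, which may differ from how the plugin resolves it.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical loader that reports exactly where it looked when loading fails.
public class LoudPropertiesLoader {

    // Loads hanlp.properties from the given file path, or from the classpath
    // when path is null. Throws with a precise location instead of guessing.
    public static Properties load(String path) throws IOException {
        InputStream raw;
        if (path == null) {
            raw = LoudPropertiesLoader.class.getClassLoader()
                    .getResourceAsStream("hanlp.properties");
            if (raw == null) {
                throw new FileNotFoundException("hanlp.properties is not on the classpath");
            }
        } else {
            File file = new File(path);
            if (!file.isFile()) {
                // The absolute path is the piece of information that was missing.
                throw new FileNotFoundException("no such file: " + file.getAbsolutePath());
            }
            raw = new FileInputStream(file);
        }
        Properties p = new Properties();
        try (Reader reader = new InputStreamReader(raw, StandardCharsets.UTF_8)) {
            p.load(reader);
        }
        return p;
    }

    // Returns null on success, otherwise the failure message.
    public static String failureMessage(String path) {
        try {
            load(path);
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        String path = args.length > 0 ? args[0] : "analysis-hanlp/hanlp.properties";
        String failure = failureMessage(path);
        System.out.println(failure == null ? "loaded " + path : "FAILED: " + failure);
    }
}
```

Run from a directory that does not contain the file, it prints the absolute path it tried, which is the kind of message that would have shortened this debugging session considerably.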
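So the fix is to make the config file available at {es_home}/config/analysis-hanlp/hanlp.properties. You can do this with a link or by copying the file there, whichever you prefer.
Then restart ES and send the following from Kibana: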
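The copy variant can be sketched as follows. The class `HanlpConfigPlacement` is a hypothetical helper, and the two command-line arguments (the ES home directory and the path of the edited hanlp.properties) are assumptions about how you would call it. Only the target location comes from the finding above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: copies hanlp.properties to where the plugin looks for it.
public class HanlpConfigPlacement {

    // {es_home}/config/analysis-hanlp/hanlp.properties
    public static Path expectedConfig(Path esHome) {
        return esHome.resolve("config")
                .resolve("analysis-hanlp")
                .resolve("hanlp.properties");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java HanlpConfigPlacement <es_home> <hanlp.properties>");
            return;
        }
        Path target = expectedConfig(Paths.get(args[0]));
        Files.createDirectories(target.getParent());
        Files.copy(Paths.get(args[1]), target, StandardCopyOption.REPLACE_EXISTING);
        System.out.println("copied to " + target.toAbsolutePath());
    }
}
```

If you prefer a link over a copy, `Files.createSymbolicLink(target, source)` could replace the `Files.copy` call, provided the platform and file system allow symbolic links.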
    GET /_analyze?pretty=true
(The snippet in the plugin's GitHub README does not work.)
If everything is working, you will get:
    {
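The same check can be run outside Kibana. The sketch below rests on several assumptions: the analyzer is registered under the name `hanlp`, ES listens on 127.0.0.1:9200 as in the startup log above, and the `_analyze` endpoint accepts a JSON body with `analyzer` and `text` fields sent via POST. The class `AnalyzeRequest` and its methods are hypothetical, and the sample text is supplied by the caller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical client: posts a text to the _analyze endpoint and prints the reply.
public class AnalyzeRequest {

    // Escapes a string for use inside a JSON string literal.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Builds the JSON request body; "analyzer" and "text" are the assumed field names.
    public static String analyzeBody(String analyzer, String text) {
        return "{\"analyzer\":\"" + escape(analyzer) + "\",\"text\":\"" + escape(text) + "\"}";
    }

    public static void main(String[] args) throws IOException {
        String text = args.length > 0 ? args[0] : "hello";
        byte[] body = analyzeBody("hanlp", text).getBytes(StandardCharsets.UTF_8);

        URL url = new URL("http://127.0.0.1:9200/_analyze?pretty=true");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        try (InputStream in = conn.getInputStream()) {
            // readAllBytes requires Java 9, which this setup already needs.
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```

Pass the sentence to analyze as the first argument. If the plugin loaded its dictionaries correctly, the response should be a JSON object listing tokens. If ES exits instead, the config path is probably still wrong.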
That's it.