Using Elasticsearch deduplication in a Java project

ES environment configuration

  1. ES version: 6.4.0
  2. ES cluster: 3 servers

     IP          Notes
     10.80.5.6   master node
     10.80.5.7   data node
     10.80.5.8   data node

Sample configuration file

cluster.name: mic-ga-es # cluster name
node.name: node-001 # name of this node
node.master: true # whether this node is master-eligible
network.host: 10.80.5.9
node.data: true # whether this node holds data
path.data: /opt/elasticsearch-6.4.0/data
path.logs: /opt/elasticsearch-6.4.0/logs
transport.tcp.port: 9300
http.port: 9200
discovery.zen.ping.unicast.hosts: ["10.80.5.6:9300","10.80.5.7:9300","10.80.5.8:9300"]
discovery.zen.minimum_master_nodes: 1
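
One thing to watch in the configuration above: with the 6.x zen discovery, `discovery.zen.minimum_master_nodes` is normally set to a quorum of the master-eligible nodes, (N / 2) + 1, to avoid split brain. A value of 1 is only safe when exactly one node has `node.master: true`. The helper below is just that arithmetic; `ZenQuorum` and `minimumMasterNodes` are illustrative names, not part of Elasticsearch.

```java
public class ZenQuorum {
    /** Quorum of master-eligible nodes: (N / 2) + 1, using integer division. */
    public static int minimumMasterNodes(int masterEligibleNodes) {
        if (masterEligibleNodes < 1) {
            throw new IllegalArgumentException("need at least one master-eligible node");
        }
        return masterEligibleNodes / 2 + 1;
    }

    public static void main(String[] args) {
        // Hypothetical case: all three servers are master-eligible.
        System.out.println(minimumMasterNodes(3));
        // Case matching the server table: a single master node.
        System.out.println(minimumMasterNodes(1));
    }
}
```
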
  3. Startup command

/opt/elasticsearch-6.4.0/bin/elasticsearch -d

Requirements

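Many of the documents imported into ES have different URLs but the same title. Queries need to deduplicate documents that share a title, while still supporting pagination and highlighting.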
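
To make the requirement concrete, here is a minimal in-memory sketch of what "deduplicate by title, then paginate" means. It is not how Elasticsearch implements it; the `Hit` class, the sample data, and the method names are hypothetical and exist only for illustration. It assumes the hits arrive already sorted by score, so the first hit seen for each title is the one that is kept.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CollapseSketch {
    /** Hypothetical stand-in for a search hit: only the fields the sketch needs. */
    public static class Hit {
        public final String url;
        public final String title;

        public Hit(String url, String title) {
            this.url = url;
            this.title = title;
        }
    }

    /** Keeps the first hit for each title; input order (assumed: score order) is preserved. */
    public static List<Hit> collapseByTitle(List<Hit> sortedHits) {
        Map<String, Hit> firstPerTitle = new LinkedHashMap<>();
        for (Hit hit : sortedHits) {
            firstPerTitle.putIfAbsent(hit.title, hit);
        }
        return new ArrayList<>(firstPerTitle.values());
    }

    /** Returns one page of the collapsed list; the page number is zero-based. */
    public static List<Hit> page(List<Hit> hits, int pageNumber, int pageSize) {
        int from = Math.min(pageNumber * pageSize, hits.size());
        int to = Math.min(from + pageSize, hits.size());
        return new ArrayList<>(hits.subList(from, to));
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("http://a.example/1", "Market"));
        hits.add(new Hit("http://b.example/1", "Market"));
        hits.add(new Hit("http://c.example/1", "Forum"));
        for (Hit h : page(collapseByTitle(hits), 0, 10)) {
            System.out.println(h.title + " -> " + h.url);
        }
    }
}
```
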

Project configuration

The project is built on the Spring Boot framework.

Add the dependency (Gradle)

compile 'org.springframework.boot:spring-boot-starter-data-elasticsearch'

YAML configuration

spring:
  data:
    elasticsearch:
      cluster-name: mic-ga-es
      cluster-nodes: 10.80.5.6:9300,10.80.5.7:9300,10.80.5.8:9300
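
The `cluster-nodes` value is a comma-separated list of `host:port` pairs that point at the transport port (9300), not the HTTP port (9200). Purely as an illustration of that format, the sketch below splits such a string into addresses; `ClusterNodesParser` is a hypothetical helper, not the code Spring Boot uses internally.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

public class ClusterNodesParser {
    /** Splits "host:port,host:port" into unresolved socket addresses. */
    public static List<InetSocketAddress> parse(String clusterNodes) {
        List<InetSocketAddress> result = new ArrayList<>();
        for (String node : clusterNodes.split(",")) {
            String trimmed = node.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate a trailing comma
            }
            int colon = trimmed.lastIndexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("expected host:port but got: " + trimmed);
            }
            String host = trimmed.substring(0, colon);
            int port = Integer.parseInt(trimmed.substring(colon + 1));
            // createUnresolved avoids a DNS lookup; the sketch only checks the format.
            result.add(InetSocketAddress.createUnresolved(host, port));
        }
        return result;
    }

    public static void main(String[] args) {
        for (InetSocketAddress a : parse("10.80.5.6:9300,10.80.5.7:9300,10.80.5.8:9300")) {
            System.out.println(a.getHostString() + " on port " + a.getPort());
        }
    }
}
```
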

Document structure

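import java.util.Date;
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.annotations.Field;
import org.springframework.data.elasticsearch.annotations.FieldType;

@Document(indexName = "onion_page", shards = 15, replicas = 1, indexStoreType = "fs", createIndex = true)
public class OnionPage {
    @Id
    private String id;
    @Field(type = FieldType.Keyword)
    private String url;
    // Keyword (not analyzed), so the field can be used for collapse and cardinality
    @Field(type = FieldType.Keyword)
    private String title;
    // Analyzed full text; this is the field that gets highlighted
    @Field(index = true, type = FieldType.Text)
    private String snapshot;
    @Field(type = FieldType.Keyword)
    private String type;
    @Field(type = FieldType.Date)
    private Date createTime;
    @Field(type = FieldType.Date)
    private Date updateTime;

    // Getters and setters omitted for brevity
}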

Business code

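import java.util.ArrayList;
import java.util.List;
// Assumed to be fastjson, whose JSONObject.parseObject(String, Class) matches the call below
import com.alibaba.fastjson.JSONObject;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.DisMaxQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.metrics.cardinality.Cardinality;
import org.elasticsearch.search.collapse.CollapseBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightField;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.domain.Pageable;
import org.springframework.stereotype.Service;

@Service
public class OnionPageService {
    @Autowired
    private TransportClient client;

    /**
     * Fetches one page of results, deduplicated by title.
     */
    public List<OnionPage> searchPageAfterDistinct(String content, Pageable pageable) {
        // dis_max: a document is scored by its best-matching field, plus
        // tieBreaker times the scores of the other matching fields.
        // The original added an empty dis_max as a third should clause, which had no effect.
        // boost sets the relative weight of each field.
        DisMaxQueryBuilder query = QueryBuilders.disMaxQuery()
                .add(QueryBuilders.matchPhraseQuery("snapshot", content).slop(5).boost(1f))
                .add(QueryBuilders.matchPhraseQuery("title", content).slop(5).boost(1.5f))
                .tieBreaker(0.1f);
        // Field collapsing is available since ES 5.3 and requires a
        // keyword or numeric field with doc_values enabled.
        CollapseBuilder cb = new CollapseBuilder("title");
        SearchResponse response = client.prepareSearch("onion_page").setQuery(query).setCollapse(cb)
                .highlighter(new HighlightBuilder().field("snapshot"))
                .setSize(pageable.getPageSize())
                // "from" is a hit offset, not a page number: pageNumber * pageSize (zero-based pages)
                .setFrom((int) pageable.getOffset()).get();
        List<OnionPage> list = new ArrayList<>();
        for (SearchHit searchHit : response.getHits()) {
            OnionPage bean = JSONObject.parseObject(searchHit.getSourceAsString(), OnionPage.class);
            // A hit that matched only on title has no snapshot highlight; keep the raw snapshot then
            HighlightField highlight = searchHit.getHighlightFields().get("snapshot");
            if (highlight != null && highlight.getFragments().length > 0) {
                bean.setSnapshot(highlight.getFragments()[0].toString());
            }
            list.add(bean);
        }
        return list;
    }

    /**
     * Total number of results after deduplication (uses the cardinality aggregation,
     * which returns an approximate distinct count).
     */
    public Long searchPageAfterDistinctCount(String content) {
        BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
        boolQueryBuilder.should(QueryBuilders.matchPhraseQuery("snapshot", content).slop(5))
                .should(QueryBuilders.matchPhraseQuery("title", content).slop(5));
        SearchResponse response = client.prepareSearch("onion_page").setQuery(boolQueryBuilder)
                .setSize(0) // only the aggregation is needed, not the hits
                .addAggregation(AggregationBuilders.cardinality("title_distinct").field("title")).get();
        Cardinality titleAgg = response.getAggregations().get("title_distinct");
        return titleAgg.getValue();
    }
}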
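
Because the list query and the count query are separate calls, the caller has to combine them to build a page. The sketch below shows only that arithmetic, assuming zero-based page numbers as in Spring Data's `Pageable`; `PageMath` and its method names are hypothetical. Since the cardinality aggregation is approximate, a page count derived from it is an estimate for large result sets.

```java
public class PageMath {
    /** Offset of the first hit on a zero-based page, i.e. the value for the search "from". */
    public static int fromOffset(int pageNumber, int pageSize) {
        if (pageNumber < 0 || pageSize < 1) {
            throw new IllegalArgumentException("pageNumber must be >= 0 and pageSize >= 1");
        }
        return Math.multiplyExact(pageNumber, pageSize);
    }

    /** Number of pages needed to show distinctCount results (ceiling division). */
    public static long totalPages(long distinctCount, int pageSize) {
        if (distinctCount < 0 || pageSize < 1) {
            throw new IllegalArgumentException("distinctCount must be >= 0 and pageSize >= 1");
        }
        return (distinctCount + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        // Third page (index 2) with 10 results per page starts at this offset.
        System.out.println(fromOffset(2, 10));
        // Pages needed for 21 distinct titles at 10 per page.
        System.out.println(totalPages(21, 10));
    }
}
```
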

Summary

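The ElasticsearchTemplate API provided by Spring Data Elasticsearch does not yet support this kind of deduplication, so the query has to be issued and its results parsed with TransportClient. My approach was to experiment with the ES RESTful API first to see the structure of the returned data, and then call the corresponding TransportClient methods in Java and parse the response. ElasticsearchTemplate makes basic ES operations much more convenient, but you still need to understand the native ES API in order to solve problems when more complex operations come up.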
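
As an illustration of the "try it over REST first" approach, the sketch below only builds a JSON body that one might POST to `/onion_page/_search` to test collapse, paging, and highlighting before writing the Java code. It reuses the index and field names from the service above; the class name `CollapseRequestBody`, the naive string escaping, and the simplified query (a single `match_phrase` on `snapshot`) are assumptions made to keep the example short.

```java
public class CollapseRequestBody {
    /** Naive escaping for backslashes and double quotes; enough for a quick test only. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Builds the _search body; pageNumber is zero-based. */
    public static String build(String content, int pageNumber, int pageSize) {
        int from = pageNumber * pageSize;
        return "{"
                + "\"from\":" + from + ","
                + "\"size\":" + pageSize + ","
                + "\"query\":{\"match_phrase\":{\"snapshot\":{\"query\":\""
                + escape(content) + "\",\"slop\":5}}},"
                // collapse keeps one hit per distinct title
                + "\"collapse\":{\"field\":\"title\"},"
                + "\"highlight\":{\"fields\":{\"snapshot\":{}}},"
                // the aggregation is computed over all matching documents
                + "\"aggs\":{\"title_distinct\":{\"cardinality\":{\"field\":\"title\"}}}"
                + "}";
    }

    public static void main(String[] args) {
        System.out.println(build("example phrase", 0, 10));
    }
}
```

Collapsing only affects the returned hits, while aggregations still run over all matching documents, so a single request of this shape can return both the page and the approximate distinct count. The printed body can be sent with any HTTP client to port 9200 of one of the nodes, using a `Content-Type: application/json` header.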