ClickHouse JOIN and concurrency test

Cluster size

Machines	Memory
6	40 GB

Data size

Dataset	Rows per day	Total rows	Time range
Data A	30 million	90 million	3 days
Data B	600 million	1.8 billion	3 days

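Test scenario

Query Data A for the values of condition a within a given time range. These values, about 4k of them, become the filter for Data B.
Using time and a as conditions, select two time slices from the database and self-join them to get the current minute's data and the previous minute's data.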

SQL template

select
	a,
	unix_time,
	count1,
	count2
from
	(
		select
			a,
			unix_time,
			sum( hit ) as count1
		from
			db.tableA
		where
			dt = '2019-03-30'
			and unix_time >= 1553918100
			and unix_time <= 1553919000
			and a in(
				xxxx
			)
		group by
			a,
			unix_time
	) all left join(
		select
			a,
			unix_time,
			sum( hit ) as count2
		from
			db.tableB
		where
			dt = '2019-03-30'
			and unix_time >= 1553918400
			and unix_time <= 1553919300
			and a in(
				xxxx
			)
		group by
			a,
			unix_time
	)
	using(
		a,
		unix_time
	)
where
	unix_time >= 1553918400
	and unix_time <= 1553919300
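
The two windows in the template are related by fixed offsets: each spans 900 seconds, and the second starts 300 seconds after the first. Below is a minimal Python sketch of how the placeholders might be filled in. It assumes the a values are integers, and `build_windows` and `in_list` are hypothetical helper names, not part of the original test.

```python
from typing import Iterable, Tuple

Window = Tuple[int, int]


def build_windows(start: int, span: int = 900, shift: int = 300) -> Tuple[Window, Window]:
    """Return the two inclusive unix-time windows used by the subqueries.

    Assumption: the second window is the first one moved forward by `shift` seconds.
    """
    first = (start, start + span)
    second = (start + shift, start + shift + span)
    return first, second


def in_list(values: Iterable[int]) -> str:
    """Render integer values as the body of an `a in(...)` clause."""
    return ", ".join(str(int(v)) for v in values)


first, second = build_windows(1553918100)
print(first)               # (1553918100, 1553919000)
print(second)              # (1553918400, 1553919300)
print(in_list([3, 1, 2]))  # 3, 1, 2
```

Casting each value with `int` keeps anything other than a number out of the generated SQL text.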

Load test results

A single query with no concurrency takes 8.2s.

The query is called once every 1s.

Threads	Max latency	Avg latency	P95 latency	Time span
1	12.9s	8.0s	10.5s	15 min
5	29s	16.6s	21s	15 min
10	34s	22s	34s	15 min

Threads	Max latency	Avg latency	P95 latency	Time span
1	16.1s	10.0s	16.1s	30 min
5	28s	13s	28s	30 min
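
A load test of this shape could be driven by something like the sketch below. It is not the harness used for the numbers above. It assumes each thread pauses `interval` seconds between calls and that the 95th percentile is computed by the nearest-rank method. `run_query` is a hypothetical callable standing in for the real ClickHouse client call.

```python
import statistics
import threading
import time
from typing import Callable, Dict, List


def summarize(latencies: List[float]) -> Dict[str, float]:
    """Max, mean, and nearest-rank 95th percentile of latencies in seconds."""
    ordered = sorted(latencies)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return {
        "max": ordered[-1],
        "avg": statistics.mean(ordered),
        "p95": ordered[rank - 1],
    }


def run_load_test(
    run_query: Callable[[], None],
    threads: int,
    calls_per_thread: int,
    interval: float = 1.0,
) -> List[float]:
    """Run `run_query` from several threads and collect per-call latencies."""
    latencies: List[float] = []
    lock = threading.Lock()

    def worker() -> None:
        for _ in range(calls_per_thread):
            started = time.perf_counter()
            run_query()
            elapsed = time.perf_counter() - started
            with lock:
                latencies.append(elapsed)
            time.sleep(interval)  # assumed pause between calls

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return latencies


# Stand-in query that just sleeps briefly; replace with a real client call.
results = run_load_test(lambda: time.sleep(0.01), threads=2, calls_per_thread=3, interval=0.0)
print(summarize(results))
```

Collecting raw latencies and summarizing afterwards keeps the timing loop free of any statistics work.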

Threads	Database load	Time span
1	None	15 min
5	Queries under pressure	15 min
10	Queries may become unresponsive	15 min
20	Queries unresponsive	15 min

Threads	Database load	Time span
1	None	30 min
5	Queries under pressure	30 min
10	Queries may become unresponsive	30 min

Summary

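ClickHouse does not handle concurrency well. With complex queries, too many concurrent requests leave the database unresponsive, and every in-flight query slows down together: the query that takes 8.2s on its own averaged 16.6s with 5 threads and 22s with 10 threads over the 15-minute span.

Originally tableB used a as the first column of its index. But the query filters on about 4k values of a along with many other conditions, so a single server had to scan 700 million rows. One day of data is only 1.6 billion rows, about 300 million per server across the 6 servers, so each server scanned roughly twice its average share. When the query pattern is known in advance, making time the first index column reduces the scan load. For queries over a very large time range, make a the first index column instead. If necessary, store a second copy of the data under the other index order, trading space for time. The table at the end of this section lists how long a single ClickHouse machine takes to scan a given number of rows.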
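
The effect of index order can be illustrated with a toy model. This is not ClickHouse's actual index lookup: fixed-size blocks of sorted rows stand in for index granules, and a block counts as read if it contains at least one matching row. The row counts, block size, and `granules_touched` helper are all made up for illustration.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, int]


def granules_touched(rows: List[Row], matches: Callable[[Row], bool], granule_size: int) -> int:
    """Count fixed-size blocks of the sorted rows that hold a matching row."""
    ordered = sorted(rows)
    return sum(
        1
        for start in range(0, len(ordered), granule_size)
        if any(matches(row) for row in ordered[start:start + granule_size])
    )


A_VALUES = range(200)             # hypothetical distinct values of a
TIMES = range(600)                # hypothetical time slots
wanted_a = set(range(0, 200, 4))  # many scattered a values in the filter
lo, hi = 100, 114                 # narrow time window, inclusive

# Rows sorted with a first: every wanted a sits in a different region.
a_first = granules_touched(
    [(a, t) for a in A_VALUES for t in TIMES],
    lambda row: row[0] in wanted_a and lo <= row[1] <= hi,
    1000,
)

# Rows sorted with time first: the window is one contiguous region.
time_first = granules_touched(
    [(t, a) for a in A_VALUES for t in TIMES],
    lambda row: row[1] in wanted_a and lo <= row[0] <= hi,
    1000,
)

print(a_first, time_first)
```

In this model a narrow time window with many scattered a values touches far fewer blocks when time leads the sort order, which matches the direction of the recommendation above.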

Rows scanned	Time
400 million	4s
700 million	5s
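
The figures in the summary and the scan-time table can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in this post. The constant and function names are illustrative.

```python
# Figures taken from the summary and the scan-time table.
AVG_ROWS_PER_SERVER = 300_000_000      # average share per server
ROWS_SCANNED_PER_SERVER = 700_000_000  # scanned with a as the first index column


def scan_rate(rows: int, seconds: float) -> float:
    """Rows scanned per second on a single machine."""
    return rows / seconds


amplification = ROWS_SCANNED_PER_SERVER / AVG_ROWS_PER_SERVER
print(round(amplification, 2))    # 2.33
print(scan_rate(400_000_000, 4))  # 100000000.0
print(scan_rate(700_000_000, 5))  # 140000000.0
```

Going by these two data points, scan time grows more slowly than the number of rows scanned.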