Aapche Drill:一款分布式查询引擎

Aapche Drill 是什么?

Apache Drill™ is a distributed MPP query layer that supports SQL and alternative query languages against NoSQL and Hadoop data storage systems. It was inspired in part by Google’s Dremel.

Drill

(图片来源:Pexels™ 官网,已确认无版权)

优缺点

优势

  • 支持自定义的嵌套数据结构
  • 兼容 Hive(包括 Hive 的 UDF,且支持自定义 UDF)
  • 高性能、低延迟的 SQL 查询
  • 支持多数据源(插件化,包括 Apache KafkaApache HBase、Apache Hive、OpenTSDB、S3
UDF(User Defined Funcation):用户定义普通函数,只作用于单行记录
UDAF(User Defined Aggregation Funcation):用户定义聚合函数,只作用于多行记录
UDTF(User Defined Table Generating Funcation):用户定义表生成函数,可以输入一行记录输出多行记录

劣势

  • 与标准 SQL 略有不同
  • 外部依赖较多(基于 Apache ZooKeeper 实现分布式、基于 Apache Calcite 实现 SQL 解析)
  • 比较小众,相关资料缺乏

比对

Apache Drill vs Presto

Apache Drill Presto
针对领域 非关系型数据库 分布式数据库
企业级
成功案例 MapR™ Teradata™
部署 较繁琐 较快捷

实战

下载

1
2
3
4
$ wget https://mirror.bit.edu.cn/apache/drill/drill-1.18.0/apache-drill-1.18.0.tar.gz
$ tar zxvf apache-drill-1.18.0.tar.gz
$ ln -s apache-drill-1.18.0 drill
$ cd drill

启动

1
$ bin/drill-embedded
1
2
Apache Drill 1.18.0
"In Drill We Trust."

展示所有 Tables

1
apache drill> !tables
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
+-----------+--------------------+----------------------+--------------+---------+----------+------------+-----------+---------------------------+----------------+
| TABLE_CAT | TABLE_SCHEM | TABLE_NAME | TABLE_TYPE | REMARKS | TYPE_CAT | TYPE_SCHEM | TYPE_NAME | SELF_REFERENCING_COL_NAME | REF_GENERATION |
+-----------+--------------------+----------------------+--------------+---------+----------+------------+-----------+---------------------------+----------------+
| DRILL | information_schema | CATALOGS | SYSTEM TABLE | null | null | null | null | null | null |
| DRILL | information_schema | COLUMNS | SYSTEM TABLE | null | null | null | null | null | null |
| DRILL | information_schema | FILES | SYSTEM TABLE | null | null | null | null | null | null |
| DRILL | information_schema | PARTITIONS | SYSTEM TABLE | null | null | null | null | null | null |
| DRILL | information_schema | SCHEMATA | SYSTEM TABLE | null | null | null | null | null | null |
| DRILL | information_schema | TABLES | SYSTEM TABLE | null | null | null | null | null | null |
| DRILL | information_schema | VIEWS | SYSTEM TABLE | null | null | null | null | null | null |
| DRILL | sys | boot | SYSTEM TABLE | null | null | null | null | null | null |
| DRILL | sys | connections | SYSTEM TABLE | null | null | null | null | null | null |
| DRILL | sys | drillbits | SYSTEM TABLE | null | null | null | null | null | null |
| DRILL | sys | functions | SYSTEM TABLE | null | null | null | null | null | null |
| DRILL | sys | internal_options | SYSTEM TABLE | null | null | null | null | null | null |
| DRILL | sys | internal_options_old | SYSTEM TABLE | null | null | null | null | null | null |
| DRILL | sys | memory | SYSTEM TABLE | null | null | null | null | null | null |
| DRILL | sys | options | SYSTEM TABLE | null | null | null | null | null | null |
| DRILL | sys | options_old | SYSTEM TABLE | null | null | null | null | null | null |
| DRILL | sys | profiles | SYSTEM TABLE | null | null | null | null | null | null |
| DRILL | sys | profiles_json | SYSTEM TABLE | null | null | null | null | null | null |
| DRILL | sys | threads | SYSTEM TABLE | null | null | null | null | null | null |
| DRILL | sys | version | SYSTEM TABLE | null | null | null | null | null | null |
+-----------+--------------------+----------------------+--------------+---------+----------+------------+-----------+---------------------------+----------------+

查询

1
apache drill> select * from cp.`employee.json` limit 2;
1
2
3
4
5
6
7
+-------------+-----------------+------------+-----------+-------------+--------------------+----------+---------------+------------+-----------------------+---------+---------------+-----------------+----------------+--------+-------------------+
| employee_id | full_name | first_name | last_name | position_id | position_title | store_id | department_id | birth_date | hire_date | salary | supervisor_id | education_level | marital_status | gender | management_role |
+-------------+-----------------+------------+-----------+-------------+--------------------+----------+---------------+------------+-----------------------+---------+---------------+-----------------+----------------+--------+-------------------+
| 1 | Sheri Nowmer | Sheri | Nowmer | 1 | President | 0 | 1 | 1961-08-26 | 1994-12-01 00:00:00.0 | 80000.0 | 0 | Graduate Degree | S | F | Senior Management |
| 2 | Derrick Whelply | Derrick | Whelply | 2 | VP Country Manager | 0 | 1 | 1915-07-03 | 1994-12-01 00:00:00.0 | 40000.0 | 1 | Graduate Degree | M | M | Senior Management |
+-------------+-----------------+------------+-----------+-------------+--------------------+----------+---------------+------------+-----------------------+---------+---------------+-----------------+----------------+--------+-------------------+
2 rows selected (0.281 seconds)

退出

1
apache drill> !quit

踩过的坑

默认时区是 UTC

解决

1
SELECT to_timestamp(CONCAT(`timestamp`, ' +0800'), 'YYYY-MM-dd HH:mm:ss.SSS Z') AS tm FROM `yuzhouwan`.`blog`

资料

Doc

欢迎加入我们的技术群,一起交流学习

群名称 群号
人工智能(高级)
人工智能(进阶)
BigData
算法