What is HAWQ?
HAWQ is a Hadoop-native SQL query engine that combines the key technological advantages of an MPP database with the scalability and convenience of Hadoop. HAWQ reads data from and writes data to HDFS natively.
HAWQ delivers industry-leading performance and linear scalability. It provides users the tools to confidently and successfully interact with petabyte-range data sets, along with a complete, standards-compliant SQL interface. More specifically, HAWQ has the following features:
- On-premise or cloud deployment
- Robust ANSI SQL compliance: SQL-92, SQL-99, SQL-2003, OLAP extensions
- Extremely high performance: many times faster than other Hadoop SQL engines
- World-class parallel optimizer
- Full transactional capability and consistency guarantees (ACID)
- Dynamic data flow engine through a high-speed, UDP-based interconnect
- Elastic execution engine based on on-demand virtual segments and data locality
- Support for multi-level partitioning and list/range-partitioned tables
- Support for multiple compression methods: snappy, gzip
- Multi-language user-defined function (UDF) support: Python, Perl, Java, C/C++, R
- Advanced machine learning and data mining functionality through MADlib
- Dynamic node expansion: in seconds
- Three-level resource management: integrates with YARN and supports hierarchical resource queues
- Easy access to all HDFS data and external system data (for example, HBase)
- Hadoop native: from storage (HDFS) and resource management (YARN) to deployment (Ambari)
- Authentication and granular authorization: Kerberos, SSL, and role-based access
- Advanced C/C++ access library to HDFS and YARN: libhdfs3 and libYARN
- Support for most third-party tools: Tableau, SAS, and others
- Standard connectivity: JDBC/ODBC
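To make the partitioning and compression features above concrete, here is a sketch of what a table definition might look like. The table and column names are invented for this example; HAWQ's DDL follows PostgreSQL/Greenplum conventions:

```sql
-- Hypothetical example: a gzip-compressed, append-only table,
-- hash-distributed on id and range-partitioned by month.
CREATE TABLE sales (
    id      bigint,
    region  text,
    amount  numeric,
    sold_at date
)
WITH (appendonly=true, compresstype=gzip, compresslevel=5)
DISTRIBUTED BY (id)
PARTITION BY RANGE (sold_at)
(
    START (date '2015-01-01') INCLUSIVE
    END   (date '2016-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);
```

Each month becomes its own partition, so queries filtering on `sold_at` can skip partitions that cannot match.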
HAWQ breaks complex queries into small tasks and distributes them to MPP query processing units for execution.
HAWQ’s basic unit of parallelism is the segment instance. Multiple segment instances on commodity servers work together to form a single parallel query processing system. A query submitted to HAWQ is optimized, broken into smaller components, and dispatched to segments that work together to deliver a single result set. All relational operations, such as table scans, joins, aggregations, and sorts, execute simultaneously in parallel across the segments. Data from upstream components in the dynamic pipeline is transmitted to downstream components through the scalable User Datagram Protocol (UDP) interconnect.
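The dispatch-and-gather flow described above can be sketched very loosely as a scatter/gather pattern: a coordinator hash-distributes rows across segments, each segment computes a partial result over its local slice, and the partials are merged into one answer. This toy Python sketch is an illustration of the idea only, not HAWQ code; all names here are invented:

```python
def dispatch(rows, n_segments):
    """Hash-distribute (key, amount) rows across segments,
    loosely mimicking a DISTRIBUTED BY clause."""
    segments = [[] for _ in range(n_segments)]
    for key, amount in rows:
        segments[hash(key) % n_segments].append((key, amount))
    return segments

def scan_and_aggregate(segment_rows):
    """Each 'segment' scans its local slice and returns a partial sum."""
    return sum(amount for _, amount in segment_rows)

def run_query(rows, n_segments=4):
    """'Optimize and dispatch' a SUM(amount) query: every segment works
    on its own slice (sequentially here for clarity), then the partial
    results are gathered into a single result."""
    partials = [scan_and_aggregate(seg) for seg in dispatch(rows, n_segments)]
    return sum(partials)
```

In a real MPP system the per-segment work runs on separate hosts and the gather step streams over the interconnect; the sketch only shows the decomposition.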
Based on Hadoop’s distributed storage, HAWQ has no single point of failure and supports fully automatic online recovery. System states are continuously monitored; if a segment fails, it is automatically removed from the cluster. During this process, the system continues serving customer queries, and the segments can be added back to the system when necessary.
These topics provide more information about HAWQ and its main components: