Table Distribution and Storage
HAWQ stores all table data, except system catalog tables, in HDFS. When a user creates a table, the table's metadata is stored on the master's local file system and the table content is stored in HDFS.
To simplify table data management, all data files for a single relation are saved under one HDFS directory.
Both HAWQ table storage formats, AO (append-only) and Parquet, produce splittable data files, so HAWQ can assign multiple virtual segments to read a single data file concurrently. This increases the degree of query parallelism.
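The storage format is chosen per table at creation time. A minimal sketch in HAWQ DDL (table and column names are illustrative):

```sql
-- Row-oriented append-only table; its files land under the
-- relation's single HDFS directory described above.
CREATE TABLE rank_ao (id INT, score INT, created DATE)
    WITH (APPENDONLY=true, ORIENTATION=row);

-- Parquet-format table; like AO files, Parquet files are splittable,
-- so multiple virtual segments can scan one file concurrently.
CREATE TABLE rank_parquet (id INT, score INT, created DATE)
    WITH (APPENDONLY=true, ORIENTATION=parquet);
```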
Table Distribution Policy
The default table distribution policy in HAWQ is random.
Randomly distributed tables have some advantages over hash-distributed tables. For example, after a cluster expansion, HAWQ can use the added resources automatically without redistributing existing data; for huge tables, redistribution is very expensive. Data locality of randomly distributed tables is also better preserved when the underlying HDFS redistributes data during rebalancing or after DataNode failures, which is common in large clusters.
On the other hand, hash-distributed tables are faster than randomly distributed tables for certain queries; for example, hash distribution benefits several TPC-H queries. Choose the distribution policy best suited to your application's workload.
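A minimal sketch of both policies in HAWQ DDL (table and column names are illustrative):

```sql
-- Default policy: rows are spread randomly across virtual segments,
-- so no data movement is needed after cluster expansion.
CREATE TABLE sales_random (trans_id INT, amount DECIMAL)
    DISTRIBUTED RANDOMLY;

-- Hash policy: rows are placed by hashing trans_id, which can speed
-- up joins and aggregates keyed on that column but makes
-- redistribution after expansion expensive.
CREATE TABLE sales_hash (trans_id INT, amount DECIMAL)
    DISTRIBUTED BY (trans_id);
```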
See Choosing the Table Distribution Policy for more details.
Data Locality
Data is distributed across HDFS DataNodes. Because remote reads incur network I/O, HAWQ applies a data locality algorithm to improve the local read ratio. The algorithm considers three aspects when allocating data blocks to virtual segments:
- Ratio of local read
- Continuity of file read
- Data balance among virtual segments
External Data Access
HAWQ can access data in external files using the HAWQ Extension Framework (PXF). PXF is an extensible framework that allows HAWQ to access data in external sources as readable or writable HAWQ tables. PXF has built-in connectors for accessing data inside HDFS files, Hive tables, and HBase tables. PXF also integrates with HCatalog to query Hive tables directly. See Using PXF with Unmanaged Data for more details.
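For example, a readable external table over a delimited file in HDFS might look like the following sketch; the host, port, HDFS path, and column list are placeholders, and the PXF service port in your deployment may differ:

```sql
-- Readable external table served by the PXF HdfsTextSimple profile.
-- 'namenode:51200' and the HDFS path are illustrative placeholders.
CREATE EXTERNAL TABLE ext_sales (trans_id INT, amount DECIMAL)
    LOCATION ('pxf://namenode:51200/data/sales?PROFILE=HdfsTextSimple')
    FORMAT 'TEXT' (DELIMITER ',');
```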
Users can create custom PXF connectors to access other parallel data stores or processing engines. Connectors are Java plug-ins that use the PXF API. For more information, see PXF External Tables and API.
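As a sketch, a custom connector is wired in by naming its plug-in classes in the table's LOCATION URI instead of a built-in profile; the com.example classes below are hypothetical stand-ins for real Fragmenter, Accessor, and Resolver implementations:

```sql
-- External table backed by a hypothetical custom PXF connector.
-- Adjacent string literals on separate lines are concatenated.
CREATE EXTERNAL TABLE ext_custom (id INT, payload TEXT)
    LOCATION ('pxf://namenode:51200/custom/source'
              '?FRAGMENTER=com.example.pxf.MyFragmenter'
              '&ACCESSOR=com.example.pxf.MyAccessor'
              '&RESOLVER=com.example.pxf.MyResolver')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```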