Lesson 5 - HAWQ Tables
HAWQ writes data to, and reads data from, HDFS natively. HAWQ tables are similar to tables in any relational database, except that table rows (data) are distributed across the different segments in the cluster.
In this exercise, you will run scripts that use the SQL CREATE TABLE
command to create HAWQ tables. You will load the Retail demo fact data into the HAWQ tables using the SQL COPY
command. You will then perform simple and complex queries on the data.
Prerequisites
Ensure that you have:
- Set Up your HAWQ Runtime Environment
- Created the HAWQ Tutorial Database
- Downloaded the Retail Data and Script Files
- Created the Retail Demo HAWQ Schema
- Started your HAWQ cluster.
Exercise: Create, Add Data to, and Query HAWQ Retail Demo Tables
Perform the following steps to create and load HAWQ tables from the sample Retail demo data set.
Navigate to the HAWQ script directory:
gpadmin@master$ cd $HAWQGSBASE/tutorials/getstart/hawq
Create tables for the Retail demo fact data using the script provided:
gpadmin@master$ psql -f ./create_hawq_tables.sql psql:./create_hawq_tables.sql:2: NOTICE: table "order_lineitems_hawq" does not exist, skipping DROP TABLE CREATE TABLE psql:./create_hawq_tables.sql:41: NOTICE: table "orders_hawq" does not exist, skipping DROP TABLE CREATE TABLE
Note: The
create_hawq_tables.sql
script deletes each table before attempting to create it. If this is your first time performing this exercise, you can safely ignore thepsql
“table does not exist, skipping” messages.)Let’s take a look at the
create_hawq_tables.sql
script; for example:gpadmin@master$ vi create_hawq_tables.sql
Notice the use of the
retail_demo.
schema name prefix to theorder_lineitems_hawq
table name:DROP TABLE IF EXISTS retail_demo.order_lineitems_hawq; CREATE TABLE retail_demo.order_lineitems_hawq ( order_id TEXT, order_item_id TEXT, product_id TEXT, product_name TEXT, customer_id TEXT, store_id TEXT, item_shipment_status_code TEXT, order_datetime TEXT, ship_datetime TEXT, item_return_datetime TEXT, item_refund_datetime TEXT, product_category_id TEXT, product_category_name TEXT, payment_method_code TEXT, tax_amount TEXT, item_quantity TEXT, item_price TEXT, discount_amount TEXT, coupon_code TEXT, coupon_amount TEXT, ship_address_line1 TEXT, ship_address_line2 TEXT, ship_address_line3 TEXT, ship_address_city TEXT, ship_address_state TEXT, ship_address_postal_code TEXT, ship_address_country TEXT, ship_phone_number TEXT, ship_customer_name TEXT, ship_customer_email_address TEXT, ordering_session_id TEXT, website_url TEXT ) WITH (appendonly=true, compresstype=zlib) DISTRIBUTED RANDOMLY;
The
CREATE TABLE
statement above creates a table namedorder_lineitems_hawq
in theretail_demo
schema.order_lineitems_hawq
has several columns.order_id
andcustomer_id
provide keys into the orders fact and customers dimension tables. The data inorder_lineitems_hawq
is distributed randomly and is compressed using thezlib
compression algorithm.The
create_hawq_tables.sql
script also creates theorders_hawq
fact table.Take a look at the
load_hawq_tables.sh
script:gpadmin@master$ vi load_hawq_tables.sh
Again, notice the use of the
retail_demo.
schema name prefix to the table names.Examine the
psql -c
COPY
commands:zcat $DATADIR/order_lineitems.tsv.gz | psql -d hawqgsdb -c "COPY retail_demo.order_lineitems_hawq FROM STDIN DELIMITER E'\t' NULL E'';" zcat $DATADIR/orders.tsv.gz | psql -d hawqgsdb -c "COPY retail_demo.orders_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
The
load_hawq_tables.sh
shell script uses thezcat
command to uncompress the.tsv.gz
data files. The SQLCOPY
command copiesSTDIN
(i.e. the output of thezcat
command) to the HAWQ table. TheCOPY
command also identifies theDELIMITER
used in the file (tab) and theNULL
string (“).Use the
load_hawq_tables.sh
script to load the Retail demo fact data into the newly-created tables. This process may take some time to complete.gpadmin@master$ ./load_hawq_tables.sh
Use the provided script to verify that the Retail demo fact tables were loaded successfully:
gpadmin@master$ ./verify_load_hawq_tables.sh
The output of the
verify_load_hawq_tables.sh
script should match the following:Table Name | Count ------------------------------+------------------------ order_lineitems_hawq | 744196 orders_hawq | 512071 ------------------------------+------------------------
Run a query on the
order_lineitems_hawq
table that returns theproduct_id
,item_quantity
,item_price
, andcoupon_amount
for all order line items associated with order id8467975147
:gpadmin@master$ psql hawqgsdb=# SELECT product_id, item_quantity, item_price, coupon_amount FROM retail_demo.order_lineitems_hawq WHERE order_id='8467975147' ORDER BY item_price; product_id | item_quantity | item_price | coupon_amount ------------+---------------+------------+--------------- 1611429 | 1 | 11.38 | 0.00000 1035114 | 1 | 12.95 | 0.15000 1382850 | 1 | 17.56 | 0.50000 1562908 | 1 | 18.50 | 0.00000 1248913 | 1 | 34.99 | 0.50000 741706 | 1 | 45.99 | 0.00000 (6 rows)
The
ORDER BY
clause identifies the sort column,item_price
. If you do not specify anORDER BY
column(s), the rows are returned in the order in which they were added to the table.Determine the top three postal codes by order revenue by running the following query on the
orders_hawq
table:hawqgsdb=# SELECT billing_address_postal_code, sum(total_paid_amount::float8) AS total, sum(total_tax_amount::float8) AS tax FROM retail_demo.orders_hawq GROUP BY billing_address_postal_code ORDER BY total DESC LIMIT 3;
Notice the use of the
sum()
aggregate function to add the order totals (total_amount_paid
) and tax totals (total_tax_paid
) for all orders. These totals are grouped/summed for eachbilling_address_postal_code
.Compare your output to the following:
billing_address_postal_code | total | tax ----------------------------+-----------+----------- 48001 | 111868.32 | 6712.0992 15329 | 107958.24 | 6477.4944 42714 | 103244.58 | 6194.6748 (3 rows)
Run the following query on the
orders_hawq
andorder_lineitems_hawq
tables to display theproduct_id
,item_quantity
, anditem_price
for all line items identifying aproduct_id
of1869831
:hawqgsdb=# SELECT retail_demo.order_lineitems_hawq.order_id, product_id, item_quantity, item_price FROM retail_demo.order_lineitems_hawq, retail_demo.orders_hawq WHERE retail_demo.order_lineitems_hawq.order_id=retail_demo.orders_hawq.order_id AND retail_demo.order_lineitems_hawq.product_id=1869831 ORDER BY retail_demo.order_lineitems_hawq.order_id, product_id; order_id | product_id | item_quantity | item_price ------------+------------+---------------+------------ 4831097728 | 1869831 | 1 | 11.87 6734073469 | 1869831 | 1 | 11.87 (2 rows)
Exit the
psql
subsystem:hawqgsdb=# \q
Summary
In this lesson, you created and loaded Retail order and order line item data into HAWQ fact tables. You also queried these tables, learning how to filter the data to your needs.
In Lesson 6, you use PXF external tables to similarly access dimension data stored in HDFS.
Lesson 6: HAWQ Extension Framework (PXF)