Lesson 5 - HAWQ Tables
HAWQ writes data to, and reads data from, HDFS natively. HAWQ tables are similar to tables in any relational database, except that table rows (data) are distributed across the different segments in the cluster.
In this exercise, you will run scripts that use the SQL CREATE TABLE command to create HAWQ tables. You will load the Retail demo fact data into the HAWQ tables using the SQL COPY command. You will then perform simple and complex queries on the data.
Prerequisites
Ensure that you have:
- Set Up your HAWQ Runtime Environment
- Created the HAWQ Tutorial Database
- Downloaded the Retail Data and Script Files
- Created the Retail Demo HAWQ Schema
- Started your HAWQ cluster.
Exercise: Create, Add Data to, and Query HAWQ Retail Demo Tables
Perform the following steps to create and load HAWQ tables from the sample Retail demo data set.
Navigate to the HAWQ script directory:
gpadmin@master$ cd $HAWQGSBASE/tutorials/getstart/hawqCreate tables for the Retail demo fact data using the script provided:
gpadmin@master$ psql -f ./create_hawq_tables.sql psql:./create_hawq_tables.sql:2: NOTICE: table "order_lineitems_hawq" does not exist, skipping DROP TABLE CREATE TABLE psql:./create_hawq_tables.sql:41: NOTICE: table "orders_hawq" does not exist, skipping DROP TABLE CREATE TABLENote: The
create_hawq_tables.sqlscript deletes each table before attempting to create it. If this is your first time performing this exercise, you can safely ignore thepsql“table does not exist, skipping” messages.)Let’s take a look at the
create_hawq_tables.sqlscript; for example:gpadmin@master$ vi create_hawq_tables.sqlNotice the use of the
retail_demo.schema name prefix to theorder_lineitems_hawqtable name:DROP TABLE IF EXISTS retail_demo.order_lineitems_hawq; CREATE TABLE retail_demo.order_lineitems_hawq ( order_id TEXT, order_item_id TEXT, product_id TEXT, product_name TEXT, customer_id TEXT, store_id TEXT, item_shipment_status_code TEXT, order_datetime TEXT, ship_datetime TEXT, item_return_datetime TEXT, item_refund_datetime TEXT, product_category_id TEXT, product_category_name TEXT, payment_method_code TEXT, tax_amount TEXT, item_quantity TEXT, item_price TEXT, discount_amount TEXT, coupon_code TEXT, coupon_amount TEXT, ship_address_line1 TEXT, ship_address_line2 TEXT, ship_address_line3 TEXT, ship_address_city TEXT, ship_address_state TEXT, ship_address_postal_code TEXT, ship_address_country TEXT, ship_phone_number TEXT, ship_customer_name TEXT, ship_customer_email_address TEXT, ordering_session_id TEXT, website_url TEXT ) WITH (appendonly=true, compresstype=zlib) DISTRIBUTED RANDOMLY;The
CREATE TABLEstatement above creates a table namedorder_lineitems_hawqin theretail_demoschema.order_lineitems_hawqhas several columns.order_idandcustomer_idprovide keys into the orders fact and customers dimension tables. The data inorder_lineitems_hawqis distributed randomly and is compressed using thezlibcompression algorithm.The
create_hawq_tables.sqlscript also creates theorders_hawqfact table.Take a look at the
load_hawq_tables.shscript:gpadmin@master$ vi load_hawq_tables.shAgain, notice the use of the
retail_demo.schema name prefix to the table names.Examine the
psql -cCOPYcommands:zcat $DATADIR/order_lineitems.tsv.gz | psql -d hawqgsdb -c "COPY retail_demo.order_lineitems_hawq FROM STDIN DELIMITER E'\t' NULL E'';" zcat $DATADIR/orders.tsv.gz | psql -d hawqgsdb -c "COPY retail_demo.orders_hawq FROM STDIN DELIMITER E'\t' NULL E'';"The
load_hawq_tables.shshell script uses thezcatcommand to uncompress the.tsv.gzdata files. The SQLCOPYcommand copiesSTDIN(i.e. the output of thezcatcommand) to the HAWQ table. TheCOPYcommand also identifies theDELIMITERused in the file (tab) and theNULLstring (“).Use the
load_hawq_tables.shscript to load the Retail demo fact data into the newly-created tables. This process may take some time to complete.gpadmin@master$ ./load_hawq_tables.shUse the provided script to verify that the Retail demo fact tables were loaded successfully:
gpadmin@master$ ./verify_load_hawq_tables.shThe output of the
verify_load_hawq_tables.shscript should match the following:Table Name | Count ------------------------------+------------------------ order_lineitems_hawq | 744196 orders_hawq | 512071 ------------------------------+------------------------Run a query on the
order_lineitems_hawqtable that returns theproduct_id,item_quantity,item_price, andcoupon_amountfor all order line items associated with order id8467975147:gpadmin@master$ psql hawqgsdb=# SELECT product_id, item_quantity, item_price, coupon_amount FROM retail_demo.order_lineitems_hawq WHERE order_id='8467975147' ORDER BY item_price; product_id | item_quantity | item_price | coupon_amount ------------+---------------+------------+--------------- 1611429 | 1 | 11.38 | 0.00000 1035114 | 1 | 12.95 | 0.15000 1382850 | 1 | 17.56 | 0.50000 1562908 | 1 | 18.50 | 0.00000 1248913 | 1 | 34.99 | 0.50000 741706 | 1 | 45.99 | 0.00000 (6 rows)The
ORDER BYclause identifies the sort column,item_price. If you do not specify anORDER BYcolumn(s), the rows are returned in the order in which they were added to the table.Determine the top three postal codes by order revenue by running the following query on the
orders_hawqtable:hawqgsdb=# SELECT billing_address_postal_code, sum(total_paid_amount::float8) AS total, sum(total_tax_amount::float8) AS tax FROM retail_demo.orders_hawq GROUP BY billing_address_postal_code ORDER BY total DESC LIMIT 3;Notice the use of the
sum()aggregate function to add the order totals (total_amount_paid) and tax totals (total_tax_paid) for all orders. These totals are grouped/summed for eachbilling_address_postal_code.Compare your output to the following:
billing_address_postal_code | total | tax ----------------------------+-----------+----------- 48001 | 111868.32 | 6712.0992 15329 | 107958.24 | 6477.4944 42714 | 103244.58 | 6194.6748 (3 rows)Run the following query on the
orders_hawqandorder_lineitems_hawqtables to display theproduct_id,item_quantity, anditem_pricefor all line items identifying aproduct_idof1869831:hawqgsdb=# SELECT retail_demo.order_lineitems_hawq.order_id, product_id, item_quantity, item_price FROM retail_demo.order_lineitems_hawq, retail_demo.orders_hawq WHERE retail_demo.order_lineitems_hawq.order_id=retail_demo.orders_hawq.order_id AND retail_demo.order_lineitems_hawq.product_id=1869831 ORDER BY retail_demo.order_lineitems_hawq.order_id, product_id; order_id | product_id | item_quantity | item_price ------------+------------+---------------+------------ 4831097728 | 1869831 | 1 | 11.87 6734073469 | 1869831 | 1 | 11.87 (2 rows)Exit the
psqlsubsystem:hawqgsdb=# \q
Summary
In this lesson, you created and loaded Retail order and order line item data into HAWQ fact tables. You also queried these tables, learning how to filter the data to your needs.
In Lesson 6, you use PXF external tables to similarly access dimension data stored in HDFS.
Lesson 6: HAWQ Extension Framework (PXF)