Lesson 4 - Sample Data Set and HAWQ Schemas
The sample Retail demo data set used in the tutorial exercises models an online retail store operation. The store carries different categories of products. Customers order the products. The company delivers the products to the customers.
This and later exercises operate on this example data set. The data set is provided in a set of gzip
’d .tsv
(tab-separated values) text files. The exercises also reference scripts and other supporting files that operate on the data set.
In this section, you are introduced to the Retail demo data schema. You will download and examine the data set and work files. You will also load some of the data into HDFS.
Prerequisites
Ensure that you have Created the HAWQ Tutorial Database and that your HAWQ cluster is up and running.
Exercise: Download the Retail Demo Data and Script Files
Perform the following steps to download the sample data set and scripts:
Open a terminal window and log in to the HAWQ master node as the
gpadmin
user:$ ssh gpadmin@<master>
Create a working directory for the data files and scripts:
gpadmin@master$ mkdir /tmp/hawq_getstart gpadmin@master$ cd /tmp/hawq_getstart
You may choose a different base work directory. If you do, ensure that all path components up to and including the
hawq_getstart
directory have read and execute permissions for all.Download the tutorial work and data files from
github
, checking out the appropriate tag/branch:gpadmin@master$ git clone https://github.com/pivotalsoftware/hawq-samples.git Cloning into 'hawq-samples'... remote: Counting objects: 42, done. remote: Total 42 (delta 0), reused 0 (delta 0), pack-reused 42 Unpacking objects: 100% (42/42), done. Checking out files: 100% (18/18), done. gpadmin@master$ cd hawq-samples gpadmin@master$ git checkout hawq2x_tutorial
Save the path to the work files base directory:
gpadmin@master$ export HAWQGSBASE=/tmp/hawq_getstart/hawq-samples
(If you chose a different base work directory, modify the command as appropriate.)
Add the
$HAWQGSBASE
environment variable setting to your.bash_profile
.Examine the tutorial files. Exercises in this guide reference data files and SQL and shell scripts residing in the
hawq-samples
repository. Specifically:Directory Content datasets/retail/ Retail demo data set data files ( .tsv.gz
format)tutorials/getstart/ Getting Started with HAWQ guide work files tutorials/getstart/hawq/ SQL and shell scripts used by the HAWQ tables exercises tutorials/getstart/pxf/ SQL and shell scripts used by the PXF exercises (
hawq-samples
repository directories not mentioned in the table above are not used by the Getting Started with HAWQ exercises.)
Exercise: Create the Retail Demo HAWQ Schema
A HAWQ schema is a namespace for a database. It contains named objects like tables, data types, functions, and operators. Access these objects by qualifying their name with the prefix <schema-name>
.
Perform the following steps to create the Retail demo data schema:
Start the
psql
subsystem:gpadmin@master$ psql hawqgsdb=#
You are connected to the
hawqgsdb
database.List the HAWQ schemas:
hawqgsdb=# \dn List of schemas Name | Owner --------------------+--------- hawq_toolkit | gpadmin information_schema | gpadmin pg_aoseg | gpadmin pg_bitmapindex | gpadmin pg_catalog | gpadmin pg_toast | gpadmin public | gpadmin (7 rows)
Every database includes a schema named
public
. Database objects you create without specifying a schema are created in the default schema. The default HAWQ schema is thepublic
schema, unless you explicitly set it to another schema. (More about this later.)Display the tables in the
public
schema:hawqgsdb=#\dt public.* List of relations Schema | Name | Type | Owner | Storage --------+-----------+-------+---------+------------- public | first_tbl | table | gpadmin | append only (1 row)
In Lesson 3, you created the
first_tbl
table in thepublic
schema.Create a schema named
retail_demo
to represent the Retail demo namespace:hawqgsdb=# CREATE SCHEMA retail_demo; CREATE SCHEMA
The
search_path
server configuration parameter identifies the order in which HAWQ should search or apply schemas for objects. Set the schema search path to include the newretail_demo
schema first:hawqgsdb=# SET search_path TO retail_demo, public; SET
retail_demo
, the first schema in yoursearch_path
, becomes your default schema.Note: Setting
search_path
in this manner sets the parameter only for the currentpsql
session. You must re-setsearch_path
in subsequentpsql
sessions.Create another table named
first_tbl
:hawqgsdb=# CREATE TABLE first_tbl( i int ); CREATE TABLE hawqgsdb=# INSERT INTO first_tbl SELECT generate_series(100,103); INSERT 0 4 hawqgsdb=# SELECT * FROM first_tbl; i ----- 100 101 102 103 (4 rows)
HAWQ creates this table named
first_tbl
in your default schema since no schema was explicitly identified for the table. Your default schema isretail_demo
due to your currentsearch_path
schema ordering.Verify that this
first_tbl
was created in theretail_demo
schema by displaying the tables in this schema:hawqgsdb=#\dt retail_demo.* List of relations Schema | Name | Type | Owner | Storage -------------+----------------------+-------+---------+------------- retail_demo | first_tbl | table | gpadmin | append only (1 row)
Query the
first_tbl
table that you created in Lesson 3:hawqgsdb=# SELECT * from public.first_tbl; i --- 1 2 3 4 5 (5 rows)
You must prepend the table name with
public.
to explicitly identify thefirst_tbl
table in which you are interested.Exit
psql
:hawqgsdb=# \q
Exercise: Load the Dimension Data to HDFS
The Retail demo data set includes the entities described in the table below. A fact table consists of business facts. Orders and order line items are fact tables. Dimension tables provide descriptive information for the measurements in a fact table. The other entities are represented in dimension tables.
Entity | Description |
---|---|
customers_dim | Customer data: first/last name, id, gender |
customer_addresses_dim | Address and phone number of each customer |
email_addresses_dim | Customer e-mail addresses |
categories_dim | Product category name, id |
products_dim | Product details including name, id, category, and price |
date_dim | Date information including year, quarter, month, week, day of week |
payment_methods | Payment method code, id |
orders | Details of an order such as the id, payment method, billing address, day/time, and other fields. Each order is associated with a specific customer. |
order_lineitems | Details of an order line item such as the id, item id, category, store, shipping address, and other fields. Each line item references a specific product from a specific order from a specific customer. |
Perform the following steps to load the Retail demo dimension data into HDFS for later consumption:
Navigate to the PXF script directory:
gpadmin@master$ cd $HAWQGSBASE/tutorials/getstart/pxf
Using the provided script, load the sample data files representing dimension data into an HDFS directory named
/retail_demo
. The script removes any existing/retail_demo
directory and contents before loading the data:gpadmin@master$ ./load_data_to_HDFS.sh running: sudo -u hdfs hdfs -rm -r -f -skipTrash /retail_demo sudo -u hdfs hdfs dfs -mkdir /retail_demo/categories_dim sudo -u hdfs hdfs dfs -put /tmp/hawq_getstart/hawq-samples/datasets/retail/categories_dim.tsv.gz /retail_demo/categories_dim/ sudo -u hdfs hdfs dfs -mkdir /retail_demo/customer_addresses_dim sudo -u hdfs hdfs dfs -put /tmp/hawq_getstart/hawq-samples/datasets/retail/customer_addresses_dim.tsv.gz /retail_demo/customer_addresses_dim/ ...
load_to_HDFS.sh
loads the dimension data.tsv.gz
files directly into HDFS. Each file is loaded to its respective/retail_demo/<basename>/<basename>.tsv.gz
file path.View the contents of the HDFS
/retail_demo
directory hierarchy:gpadmin@master$ sudo -u hdfs hdfs dfs -ls /retail_demo/* -rw-r--r-- 3 hdfs hdfs 590 2017-04-10 19:59 /retail_demo/categories_dim/categories_dim.tsv.gz Found 1 items -rw-r--r-- 3 hdfs hdfs 53995977 2017-04-10 19:59 /retail_demo/customer_addresses_dim/customer_addresses_dim.tsv.gz Found 1 items -rw-r--r-- 3 hdfs hdfs 4646775 2017-04-10 19:59 /retail_demo/customers_dim/customers_dim.tsv.gz Found 1 items ... Because the retail demo data exists only as `.tsv.gz` files in HDFS, you cannot immediately query the data using HAWQ. In the next lesson, you create HAWQ external tables that reference these data files, after which you can query them via PXF.
Summary
In this lesson, you downloaded the tutorial data set and work files, created the retail_demo
HAWQ schema, and loaded the Retail demo dimension data into HDFS.
In Lessons 5 and 6, you will create and query HAWQ internal and external tables in the retail_demo
schema.
Lesson 5: HAWQ Tables