Lesson 4 - Sample Data Set and HAWQ Schemas
The sample Retail demo data set used in the tutorial exercises models an online retail store operation. The store carries different categories of products. Customers order the products. The company delivers the products to the customers.
This and later exercises operate on this example data set. The data set is provided in a set of gzip’d .tsv (tab-separated values) text files. The exercises also reference scripts and other supporting files that operate on the data set.
In this section, you are introduced to the Retail demo data schema. You will download and examine the data set and work files. You will also load some of the data into HDFS.
Prerequisites
Ensure that you have Created the HAWQ Tutorial Database and that your HAWQ cluster is up and running.
Exercise: Download the Retail Demo Data and Script Files
Perform the following steps to download the sample data set and scripts:
Open a terminal window and log in to the HAWQ master node as the
gpadminuser:$ ssh gpadmin@<master>Create a working directory for the data files and scripts:
gpadmin@master$ mkdir /tmp/hawq_getstart gpadmin@master$ cd /tmp/hawq_getstartYou may choose a different base work directory. If you do, ensure that all path components up to and including the
hawq_getstartdirectory have read and execute permissions for all.Download the tutorial work and data files from
github, checking out the appropriate tag/branch:gpadmin@master$ git clone https://github.com/pivotalsoftware/hawq-samples.git Cloning into 'hawq-samples'... remote: Counting objects: 42, done. remote: Total 42 (delta 0), reused 0 (delta 0), pack-reused 42 Unpacking objects: 100% (42/42), done. Checking out files: 100% (18/18), done. gpadmin@master$ cd hawq-samples gpadmin@master$ git checkout hawq2x_tutorialSave the path to the work files base directory:
gpadmin@master$ export HAWQGSBASE=/tmp/hawq_getstart/hawq-samples(If you chose a different base work directory, modify the command as appropriate.)
Add the
$HAWQGSBASEenvironment variable setting to your.bash_profile.Examine the tutorial files. Exercises in this guide reference data files and SQL and shell scripts residing in the
hawq-samplesrepository. Specifically:Directory Content datasets/retail/ Retail demo data set data files ( .tsv.gzformat)tutorials/getstart/ Getting Started with HAWQ guide work files tutorials/getstart/hawq/ SQL and shell scripts used by the HAWQ tables exercises tutorials/getstart/pxf/ SQL and shell scripts used by the PXF exercises (
hawq-samplesrepository directories not mentioned in the table above are not used by the Getting Started with HAWQ exercises.)
Exercise: Create the Retail Demo HAWQ Schema
A HAWQ schema is a namespace for a database. It contains named objects like tables, data types, functions, and operators. Access these objects by qualifying their name with the prefix <schema-name>.
Perform the following steps to create the Retail demo data schema:
Start the
psqlsubsystem:gpadmin@master$ psql hawqgsdb=#You are connected to the
hawqgsdbdatabase.List the HAWQ schemas:
hawqgsdb=# \dn List of schemas Name | Owner --------------------+--------- hawq_toolkit | gpadmin information_schema | gpadmin pg_aoseg | gpadmin pg_bitmapindex | gpadmin pg_catalog | gpadmin pg_toast | gpadmin public | gpadmin (7 rows)Every database includes a schema named
public. Database objects you create without specifying a schema are created in the default schema. The default HAWQ schema is thepublicschema, unless you explicitly set it to another schema. (More about this later.)Display the tables in the
publicschema:hawqgsdb=#\dt public.* List of relations Schema | Name | Type | Owner | Storage --------+-----------+-------+---------+------------- public | first_tbl | table | gpadmin | append only (1 row)In Lesson 3, you created the
first_tbltable in thepublicschema.Create a schema named
retail_demoto represent the Retail demo namespace:hawqgsdb=# CREATE SCHEMA retail_demo; CREATE SCHEMAThe
search_pathserver configuration parameter identifies the order in which HAWQ should search or apply schemas for objects. Set the schema search path to include the newretail_demoschema first:hawqgsdb=# SET search_path TO retail_demo, public; SETretail_demo, the first schema in yoursearch_path, becomes your default schema.Note: Setting
search_pathin this manner sets the parameter only for the currentpsqlsession. You must re-setsearch_pathin subsequentpsqlsessions.Create another table named
first_tbl:hawqgsdb=# CREATE TABLE first_tbl( i int ); CREATE TABLE hawqgsdb=# INSERT INTO first_tbl SELECT generate_series(100,103); INSERT 0 4 hawqgsdb=# SELECT * FROM first_tbl; i ----- 100 101 102 103 (4 rows)HAWQ creates this table named
first_tblin your default schema since no schema was explicitly identified for the table. Your default schema isretail_demodue to your currentsearch_pathschema ordering.Verify that this
first_tblwas created in theretail_demoschema by displaying the tables in this schema:hawqgsdb=#\dt retail_demo.* List of relations Schema | Name | Type | Owner | Storage -------------+----------------------+-------+---------+------------- retail_demo | first_tbl | table | gpadmin | append only (1 row)Query the
first_tbltable that you created in Lesson 3:hawqgsdb=# SELECT * from public.first_tbl; i --- 1 2 3 4 5 (5 rows)You must prepend the table name with
public.to explicitly identify thefirst_tbltable in which you are interested.Exit
psql:hawqgsdb=# \q
Exercise: Load the Dimension Data to HDFS
The Retail demo data set includes the entities described in the table below. A fact table consists of business facts. Orders and order line items are fact tables. Dimension tables provide descriptive information for the measurements in a fact table. The other entities are represented in dimension tables.
| Entity | Description |
|---|---|
| customers_dim | Customer data: first/last name, id, gender |
| customer_addresses_dim | Address and phone number of each customer |
| email_addresses_dim | Customer e-mail addresses |
| categories_dim | Product category name, id |
| products_dim | Product details including name, id, category, and price |
| date_dim | Date information including year, quarter, month, week, day of week |
| payment_methods | Payment method code, id |
| orders | Details of an order such as the id, payment method, billing address, day/time, and other fields. Each order is associated with a specific customer. |
| order_lineitems | Details of an order line item such as the id, item id, category, store, shipping address, and other fields. Each line item references a specific product from a specific order from a specific customer. |
Perform the following steps to load the Retail demo dimension data into HDFS for later consumption:
Navigate to the PXF script directory:
gpadmin@master$ cd $HAWQGSBASE/tutorials/getstart/pxfUsing the provided script, load the sample data files representing dimension data into an HDFS directory named
/retail_demo. The script removes any existing/retail_demodirectory and contents before loading the data:gpadmin@master$ ./load_data_to_HDFS.sh running: sudo -u hdfs hdfs -rm -r -f -skipTrash /retail_demo sudo -u hdfs hdfs dfs -mkdir /retail_demo/categories_dim sudo -u hdfs hdfs dfs -put /tmp/hawq_getstart/hawq-samples/datasets/retail/categories_dim.tsv.gz /retail_demo/categories_dim/ sudo -u hdfs hdfs dfs -mkdir /retail_demo/customer_addresses_dim sudo -u hdfs hdfs dfs -put /tmp/hawq_getstart/hawq-samples/datasets/retail/customer_addresses_dim.tsv.gz /retail_demo/customer_addresses_dim/ ...load_to_HDFS.shloads the dimension data.tsv.gzfiles directly into HDFS. Each file is loaded to its respective/retail_demo/<basename>/<basename>.tsv.gzfile path.View the contents of the HDFS
/retail_demodirectory hierarchy:gpadmin@master$ sudo -u hdfs hdfs dfs -ls /retail_demo/* -rw-r--r-- 3 hdfs hdfs 590 2017-04-10 19:59 /retail_demo/categories_dim/categories_dim.tsv.gz Found 1 items -rw-r--r-- 3 hdfs hdfs 53995977 2017-04-10 19:59 /retail_demo/customer_addresses_dim/customer_addresses_dim.tsv.gz Found 1 items -rw-r--r-- 3 hdfs hdfs 4646775 2017-04-10 19:59 /retail_demo/customers_dim/customers_dim.tsv.gz Found 1 items ... Because the retail demo data exists only as `.tsv.gz` files in HDFS, you cannot immediately query the data using HAWQ. In the next lesson, you create HAWQ external tables that reference these data files, after which you can query them via PXF.
Summary
In this lesson, you downloaded the tutorial data set and work files, created the retail_demo HAWQ schema, and loaded the Retail demo dimension data into HDFS.
In Lessons 5 and 6, you will create and query HAWQ internal and external tables in the retail_demo schema.
Lesson 5: HAWQ Tables