Lesson 4 - Sample Data Set and HAWQ Schemas

The sample Retail demo data set used in the tutorial exercises models an online retail store operation. The store carries different categories of products. Customers order the products. The company delivers the products to the customers.

This and later exercises operate on this example data set. The data set is provided in a set of gzip’d .tsv (tab-separated values) text files. The exercises also reference scripts and other supporting files that operate on the data set.

In this section, you are introduced to the Retail demo data schema. You will download and examine the data set and work files. You will also load some of the data into HDFS.

Prerequisites

Ensure that you have Created the HAWQ Tutorial Database and that your HAWQ cluster is up and running.

Exercise: Download the Retail Demo Data and Script Files

Perform the following steps to download the sample data set and scripts:

  1. Open a terminal window and log in to the HAWQ master node as the gpadmin user:

    $ ssh gpadmin@<master>
    
  2. Create a working directory for the data files and scripts:

    gpadmin@master$ mkdir /tmp/hawq_getstart
    gpadmin@master$ cd /tmp/hawq_getstart
    

    You may choose a different base work directory. If you do, ensure that all path components up to and including the hawq_getstart directory have read and execute permissions for all.

  3. Download the tutorial work and data files from github, checking out the appropriate tag/branch:

    gpadmin@master$ git clone https://github.com/pivotalsoftware/hawq-samples.git
    Cloning into 'hawq-samples'...
    remote: Counting objects: 42, done.
    remote: Total 42 (delta 0), reused 0 (delta 0), pack-reused 42
    Unpacking objects: 100% (42/42), done.
    Checking out files: 100% (18/18), done.
    gpadmin@master$ cd hawq-samples
    gpadmin@master$ git checkout hawq2x_tutorial
    
  4. Save the path to the work files base directory:

    gpadmin@master$ export HAWQGSBASE=/tmp/hawq_getstart/hawq-samples
    

    (If you chose a different base work directory, modify the command as appropriate.)

  5. Add the $HAWQGSBASE environment variable setting to your .bash_profile.

  6. Examine the tutorial files. Exercises in this guide reference data files and SQL and shell scripts residing in the hawq-samples repository. Specifically:

    Directory Content
    datasets/retail/ Retail demo data set data files (.tsv.gz format)
    tutorials/getstart/ Getting Started with HAWQ guide work files
    tutorials/getstart/hawq/ SQL and shell scripts used by the HAWQ tables exercises
    tutorials/getstart/pxf/ SQL and shell scripts used by the PXF exercises

    (hawq-samples repository directories not mentioned in the table above are not used by the Getting Started with HAWQ exercises.)

Exercise: Create the Retail Demo HAWQ Schema

A HAWQ schema is a namespace for a database. It contains named objects like tables, data types, functions, and operators. Access these objects by qualifying their name with the prefix <schema-name>.

Perform the following steps to create the Retail demo data schema:

  1. Start the psql subsystem:

    gpadmin@master$ psql
    hawqgsdb=#
    

    You are connected to the hawqgsdb database.

  2. List the HAWQ schemas:

    hawqgsdb=# \dn
           List of schemas
            Name        |  Owner  
    --------------------+---------
     hawq_toolkit       | gpadmin
     information_schema | gpadmin
     pg_aoseg           | gpadmin
     pg_bitmapindex     | gpadmin
     pg_catalog         | gpadmin
     pg_toast           | gpadmin
     public             | gpadmin
    (7 rows)
    

    Every database includes a schema named public. Database objects you create without specifying a schema are created in the default schema. The default HAWQ schema is the public schema, unless you explicitly set it to another schema. (More about this later.)

  3. Display the tables in the public schema:

    hawqgsdb=#\dt public.*
               List of relations
     Schema |    Name   | Type  |  Owner  |   Storage   
    --------+-----------+-------+---------+-------------
     public | first_tbl | table | gpadmin | append only
    (1 row)
    

    In Lesson 3, you created the first_tbl table in the public schema.

  4. Create a schema named retail_demo to represent the Retail demo namespace:

    hawqgsdb=# CREATE SCHEMA retail_demo;
    CREATE SCHEMA
    
  5. The search_path server configuration parameter identifies the order in which HAWQ should search or apply schemas for objects. Set the schema search path to include the new retail_demo schema first:

    hawqgsdb=# SET search_path TO retail_demo, public;
    SET
    

    retail_demo, the first schema in your search_path, becomes your default schema.

    Note: Setting search_path in this manner sets the parameter only for the current psql session. You must re-set search_path in subsequent psql sessions.

  6. Create another table named first_tbl:

    hawqgsdb=# CREATE TABLE first_tbl( i int );
    CREATE TABLE
    hawqgsdb=# INSERT INTO first_tbl SELECT generate_series(100,103);
    INSERT 0 4
    hawqgsdb=# SELECT * FROM first_tbl;
      i  
    -----
     100
     101
     102
     103
    (4 rows)
    

    HAWQ creates this table named first_tbl in your default schema since no schema was explicitly identified for the table. Your default schema is retail_demo due to your current search_path schema ordering.

  7. Verify that this first_tbl was created in the retail_demo schema by displaying the tables in this schema:

    hawqgsdb=#\dt retail_demo.*
                         List of relations
       Schema    |         Name         | Type  |  Owner  |   Storage   
    -------------+----------------------+-------+---------+-------------
     retail_demo | first_tbl            | table | gpadmin | append only
    (1 row)
    
  8. Query the first_tbl table that you created in Lesson 3:

    hawqgsdb=# SELECT * from public.first_tbl;
      i 
    ---
     1
     2
     3
     4
     5
    (5 rows)
    

    You must prepend the table name with public. to explicitly identify the first_tbl table in which you are interested.

  9. Exit psql:

    hawqgsdb=# \q
    

Exercise: Load the Dimension Data to HDFS

The Retail demo data set includes the entities described in the table below. A fact table consists of business facts. Orders and order line items are fact tables. Dimension tables provide descriptive information for the measurements in a fact table. The other entities are represented in dimension tables.

Entity Description
customers_dim Customer data: first/last name, id, gender
customer_addresses_dim Address and phone number of each customer
email_addresses_dim Customer e-mail addresses
categories_dim Product category name, id
products_dim Product details including name, id, category, and price
date_dim Date information including year, quarter, month, week, day of week
payment_methods Payment method code, id
orders Details of an order such as the id, payment method, billing address, day/time, and other fields. Each order is associated with a specific customer.
order_lineitems Details of an order line item such as the id, item id, category, store, shipping address, and other fields. Each line item references a specific product from a specific order from a specific customer.

Perform the following steps to load the Retail demo dimension data into HDFS for later consumption:

  1. Navigate to the PXF script directory:

    gpadmin@master$ cd $HAWQGSBASE/tutorials/getstart/pxf
    
  2. Using the provided script, load the sample data files representing dimension data into an HDFS directory named /retail_demo. The script removes any existing /retail_demo directory and contents before loading the data:

    gpadmin@master$ ./load_data_to_HDFS.sh
    running: sudo -u hdfs hdfs -rm -r -f -skipTrash /retail_demo
    sudo -u hdfs hdfs dfs -mkdir /retail_demo/categories_dim
    sudo -u hdfs hdfs dfs -put /tmp/hawq_getstart/hawq-samples/datasets/retail/categories_dim.tsv.gz /retail_demo/categories_dim/
    sudo -u hdfs hdfs dfs -mkdir /retail_demo/customer_addresses_dim
    sudo -u hdfs hdfs dfs -put /tmp/hawq_getstart/hawq-samples/datasets/retail/customer_addresses_dim.tsv.gz /retail_demo/customer_addresses_dim/
    ...
    

    load_to_HDFS.sh loads the dimension data .tsv.gz files directly into HDFS. Each file is loaded to its respective /retail_demo/<basename>/<basename>.tsv.gz file path.

  3. View the contents of the HDFS /retail_demo directory hierarchy:

    gpadmin@master$ sudo -u hdfs hdfs dfs -ls /retail_demo/*
    -rw-r--r--   3 hdfs hdfs        590 2017-04-10 19:59 /retail_demo/categories_dim/categories_dim.tsv.gz
    Found 1 items
    -rw-r--r--   3 hdfs hdfs   53995977 2017-04-10 19:59 /retail_demo/customer_addresses_dim/customer_addresses_dim.tsv.gz
    Found 1 items
    -rw-r--r--   3 hdfs hdfs    4646775 2017-04-10 19:59 /retail_demo/customers_dim/customers_dim.tsv.gz
    Found 1 items
    ...
    
    Because the retail demo data exists only as `.tsv.gz` files in HDFS, you cannot immediately query the data using HAWQ. In the next lesson, you create HAWQ external tables that reference these data files, after which you can query them via PXF.
    

Summary

In this lesson, you downloaded the tutorial data set and work files, created the retail_demo HAWQ schema, and loaded the Retail demo dimension data into HDFS.

In Lessons 5 and 6, you will create and query HAWQ internal and external tables in the retail_demo schema.

Lesson 5: HAWQ Tables