hawq extract

Extracts the metadata of a specified table into a YAML file.

Synopsis

hawq extract [<connection_options>] [-o <output_file>] <tablename>

hawq extract -?

hawq extract --version

where:

<connection_options> =
  [-h <host>] 
  [-p <port>] 
  [-U <username>] 
  [-d <database>]
  [-W]

Description

hawq extract is a utility that extracts a table's metadata into a YAML-formatted file. HAWQ's InputFormat uses this YAML file to read a HAWQ table stored on HDFS directly into a MapReduce program. The YAML configuration file can also be used to provide the metadata for registering files into HAWQ with the hawq register command.

Note: hawq extract is bound by the following rules:

  • HAWQ must be running before you use hawq extract.
  • hawq extract only supports AO and Parquet tables.
  • hawq extract supports partitioned tables, but does not support sub-partitions.

Arguments

<tablename>
The name of the table whose metadata you want to extract. You can use the format namespace_name.table_name.

Options

-o <output_file>
The name of the file to which hawq extract writes the metadata. If you do not specify a name, hawq extract writes to stdout.

-v (verbose mode)
Displays the verbose output of the extraction process.

-? (help)
Displays the online help.

--version
Displays the version of this utility.

<connection_options>

-h <host>
Specifies the host name of the machine on which the HAWQ master database server is running. If not specified, it reads from the environment variable $PGHOST or defaults to localhost.

-p <port>
Specifies the TCP port on which the HAWQ master database server is listening for connections. If not specified, reads from the environment variable $PGPORT or defaults to 5432.

-U <username>
The database role name to connect as. If not specified, reads from the environment variable $PGUSER or defaults to the current system user name.

-d <database>
The database to connect to. If not specified, it reads from the environment variable $PGDATABASE or defaults to template1.

-W (force password prompt)
Force a password prompt. If not specified, reads the password from the environment variable $PGPASSWORD or from a password file specified by $PGPASSFILE or in ~/.pgpass.

Metadata File Format

hawq extract exports the table metadata into a file using the YAML 1.1 document format. The file contains key information about the table, such as its schema, data file locations and sizes, and partition constraints.

The basic structure of the metadata file is as follows:

Version: string (1.0.0)
DBVersion: string 
FileFormat: string (AO/Parquet) 
TableName: string (schemaname.tablename)
DFS_URL: string (hdfs://127.0.0.1:9000)
Encoding: UTF8
AO_Schema:
    - name: string
      type: string
Bucketnum: 6
Distribution_policy: DISTRIBUTED RANDOMLY

AO_FileLocations:
      Blocksize: int
      Checksum: boolean
      CompressionType: string
      CompressionLevel: int
      PartitionBy: string ('PARTITION BY ...')
      Files:
      - path: string (/gpseg0/16385/35469/35470.1)
        size: long

      Partitions:
      - Blocksize: int
        Checksum: Boolean
        CompressionType: string
        CompressionLevel: int
        Name: string
        Constraint: string (PARTITION Jan08 START (date '2008-01-01') INCLUSIVE)
        Files:
        - path: string
          size: long

Parquet_Schema: 
    - name: string
      type: string
       
Parquet_FileLocations:
  RowGroupSize: long
  PageSize: long
  CompressionType: string
  CompressionLevel: int
  Checksum: boolean
  EnableDictionary: boolean
  PartitionBy: string
  Files:
  - path: string
    size: long
  Partitions:
  - Name: string
    RowGroupSize: long
    PageSize: long
    CompressionType: string
    CompressionLevel: int
    Checksum: boolean
    EnableDictionary: boolean
    Constraint: string
    Files:
    - path: string
      size: long
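Any YAML 1.1 parser can consume a file in this format. As an illustrative sketch (not part of the utility), the following Python snippet parses a trimmed-down extract of an AO table and resolves each data file to its absolute HDFS URL. The paths and sizes are stand-ins, and the third-party PyYAML package is assumed to be available:

```python
import yaml

# Trimmed stand-in for a real hawq extract metadata file (AO table).
# Only the fields this snippet reads are included.
METADATA = """
Version: 1.0.0
FileFormat: AO
TableName: public.rank
DFS_URL: hdfs://127.0.0.1:9000
AO_FileLocations:
  Blocksize: 32768
  Files:
  - path: /gpseg0/16385/35469/35692.1
    size: 0
  - path: /gpseg1/16385/35469/35709.1
    size: 90071512
"""

meta = yaml.safe_load(METADATA)
dfs_url = meta["DFS_URL"]
files = meta["AO_FileLocations"]["Files"]

# Join DFS_URL with each relative path to get the absolute HDFS URL,
# and sum the sizes to estimate the table's on-disk footprint.
urls = [dfs_url + f["path"] for f in files]
total = sum(f["size"] for f in files)
for u in urls:
    print(u)
print("total bytes:", total)
```

A downstream reader (for example, a MapReduce driver) would apply the same traversal to the real file, using the Partitions list as well when the table is partitioned.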

Example - Extracting an AO table

Extract the rank table’s metadata into a file named rank_table.yaml:

$ hawq extract -o rank_table.yaml -d postgres rank

Output content in rank_table.yaml

AO_FileLocations:
    Blocksize: 32768
    Checksum: false
    CompressionLevel: 0
    CompressionType: null
    Files:
    - path: /gpseg0/16385/35469/35692.1
      size: 0
    - path: /gpseg1/16385/35469/35692.1
      size: 0
    PartitionBy: PARTITION BY list (gender)
    Partitions:
    - Blocksize: 32768
      Checksum: false
      CompressionLevel: 0
      CompressionType: null
      Constraint: PARTITION girls VALUES('F') WITH (appendonly=true)
      Files:
      - path: /gpseg0/16385/35469/35697.1
        size: 0
      - path: /gpseg1/16385/35469/35697.1
        size: 0
      Name: girls
    - Blocksize: 32768
      Checksum: false
      CompressionLevel: 0
      CompressionType: null
      Constraint: PARTITION boys VALUES('M') WITH (appendonly=true)
      Files:
      - path: /gpseg0/16385/35469/35703.1
        size: 0
      - path: /gpseg1/16385/35469/35703.1
        size: 0
      Name: boys
    - Blocksize: 32768
      Checksum: false
      CompressionLevel: 0
      CompressionType: null
      Constraint: DEFAULT PARTITION other WITH (appendonly=true)
      Files:
      - path: /gpseg0/16385/35469/35709.1
        size: 90071728
      - path: /gpseg1/16385/35469/35709.1
        size: 90071512
      Name: other
AO_Schema:
- name: id
  type: int4
- name: rank
  type: int4
- name: year
  type: int4
- name: gender
  type: bpchar
- name: count
  type: int4
DFS_URL: hdfs://127.0.0.1:9000
Distribution_policy: DISTRIBUTED RANDOMLY
Encoding: UTF8
FileFormat: AO
TableName: public.rank
Version: 1.0.0
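For a partitioned table like the one above, each partition's footprint can be computed by iterating the Partitions list. A minimal sketch, again assuming PyYAML and using abbreviated stand-in data from the rank example:

```python
import yaml

# Abbreviated stand-in for the AO metadata of a partitioned table;
# only the fields this snippet reads are included.
METADATA = """
AO_FileLocations:
    PartitionBy: PARTITION BY list (gender)
    Partitions:
    - Name: girls
      Files:
      - {path: /gpseg0/16385/35469/35697.1, size: 0}
    - Name: other
      Files:
      - {path: /gpseg0/16385/35469/35709.1, size: 90071728}
      - {path: /gpseg1/16385/35469/35709.1, size: 90071512}
"""

meta = yaml.safe_load(METADATA)

# Sum the reported file sizes per partition name.
sizes = {
    part["Name"]: sum(f["size"] for f in part["Files"])
    for part in meta["AO_FileLocations"]["Partitions"]
}
print(sizes)
```

A size of 0 for a partition's files (as with girls above) simply means no rows have been written to it yet.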

Example - Extracting a Parquet table

Extract the orders table’s metadata into a file named orders.yaml:

$ hawq extract -o orders.yaml -d postgres orders

Output content in orders.yaml

DFS_URL: hdfs://127.0.0.1:9000
Encoding: UTF8
FileFormat: Parquet
TableName: public.orders
Version: 1.0.0
Parquet_FileLocations:
  Checksum: false
  CompressionLevel: 0
  CompressionType: none
  EnableDictionary: false
  Files:
  - path: /hawq-data/gpseg0/16385/16626/16657.1
    size: 0
  - path: /hawq-data/gpseg1/16385/16626/16657.1
    size: 0
  PageSize: 1048576
  PartitionBy: PARTITION BY range (o_orderdate)
  Partitions:
  - Checksum: false
    CompressionLevel: 0
    CompressionType: none
    Constraint: PARTITION p1_1 START ('1992-01-01'::date) END ('1994-12-31'::date)
      EVERY ('3 years'::interval) WITH (appendonly=true, orientation=parquet, pagesize=1048576,
      rowgroupsize=8388608, compresstype=none, compresslevel=0)
    EnableDictionary: false
    Files:
    - path: /hawq-data/gpseg0/16385/16626/16662.1
      size: 8140599
    - path: /hawq-data/gpseg1/16385/16626/16662.1
      size: 8099760
    Name: orders_1_prt_p1_1
    PageSize: 1048576
    RowGroupSize: 8388608
  - Checksum: false
    CompressionLevel: 0
    CompressionType: none
    Constraint: PARTITION p1_11 START ('1995-01-01'::date) END ('1997-12-31'::date)
      EVERY ('3 years'::interval) WITH (appendonly=true, orientation=parquet, pagesize=1048576,
      rowgroupsize=8388608, compresstype=none, compresslevel=0)
    EnableDictionary: false
    Files:
    - path: /hawq-data/gpseg0/16385/16626/16668.1
      size: 8088559
    - path: /hawq-data/gpseg1/16385/16626/16668.1
      size: 8075056
    Name: orders_1_prt_p1_11
    PageSize: 1048576
    RowGroupSize: 8388608
  - Checksum: false
    CompressionLevel: 0
    CompressionType: none
    Constraint: PARTITION p1_21 START ('1998-01-01'::date) END ('2000-12-31'::date)
      EVERY ('3 years'::interval) WITH (appendonly=true, orientation=parquet, pagesize=1048576,
      rowgroupsize=8388608, compresstype=none, compresslevel=0)
    EnableDictionary: false
    Files:
    - path: /hawq-data/gpseg0/16385/16626/16674.1
      size: 8065770
    - path: /hawq-data/gpseg1/16385/16626/16674.1
      size: 8126669
    Name: orders_1_prt_p1_21
    PageSize: 1048576
    RowGroupSize: 8388608
  RowGroupSize: 8388608
Parquet_Schema:
- name: o_orderkey
  type: int8
- name: o_custkey
  type: int4
- name: o_orderstatus
  type: bpchar
- name: o_totalprice
  type: numeric
- name: o_orderdate
  type: date
- name: o_orderpriority
  type: bpchar
- name: o_clerk
  type: bpchar
- name: o_shippriority
  type: int4
- name: o_comment
  type: varchar
Distribution_policy: DISTRIBUTED RANDOMLY

See Also

hawq load, hawq register