Write the gpfdist Configuration

The gpfdist configuration is specified as a YAML 1.1 document. It specifies rules that gpfdist uses to select a Transform to apply when loading or extracting data.

This example gpfdist configuration contains the following items:

  • the config.yaml file defining TRANSFORMATIONS
  • the input_transform.sh wrapper script, referenced in the config.yaml file
  • the input_transform.stx joost transformation, called from input_transform.sh

Aside from the ordinary YAML rules, such as starting the document with three dashes (---), a gpfdist configuration must conform to the following restrictions:

  1. a VERSION setting must be present with the value 1.0.0.1.
  2. a TRANSFORMATIONS setting must be present and contain one or more mappings.
  3. Each mapping in the TRANSFORMATION must contain:

    • a TYPE with the value ‘input’ or 'output’
    • a COMMAND indicating how the transform is run.
  4. Each mapping in the TRANSFORMATION can contain optional CONTENT, SAFE, and STDERR settings.

The following gpfdist configuration called config.YAML applies to the prices example. The initial indentation on each line is significant and reflects the hierarchical nature of the specification. The name prices_input in the following example will be referenced later when creating the table in SQL.

---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  prices_input:
    TYPE:     input
    COMMAND:  /bin/bash input_transform.sh %filename%

The COMMAND setting uses a wrapper script called input_transform.sh with a %filename% placeholder. When gpfdist runs the prices_input transform, it invokes input_transform.sh with /bin/bash and replaces the %filename% placeholder with the path to the input file to transform. The wrapper script called input_transform.sh contains the logic to invoke the STX transformation and return the output.

If Joost is used, the Joost STX engine must be installed.

#!/bin/bash
# input_transform.sh - sample input transformation, 
# demonstrating use of Java and Joost STX to convert XML into
# text to load into HAWQ.
# java arguments:
#   -jar joost.jar         joost STX engine
#   -nodecl                  don't generate a <?xml?> declaration
#   $1                        filename to process
#   input_transform.stx    the STX transformation
#
# the AWK step eliminates a blank line joost emits at the end
java \
    -jar joost.jar \
    -nodecl \
    $1 \
    input_transform.stx \
 | awk 'NF>0

The input_transform.sh file uses the Joost STX engine with the AWK interpreter. The following diagram shows the process flow as gpfdist runs the transformation.