Write the gpfdist Configuration
The gpfdist configuration is specified as a YAML 1.1 document. It specifies rules that gpfdist uses to select a transform to apply when loading or extracting data.
This example gpfdist configuration contains the following items:
- the config.yaml file defining TRANSFORMATIONS
- the input_transform.sh wrapper script, referenced in the config.yaml file
- the input_transform.stx Joost transformation, called from input_transform.sh
Aside from the ordinary YAML rules, such as starting the document with three dashes (---), a gpfdist configuration must conform to the following restrictions:
- a VERSION setting must be present with the value 1.0.0.1.
- a TRANSFORMATIONS setting must be present and contain one or more mappings.
Each mapping in the TRANSFORMATIONS setting must contain:
- a TYPE with the value 'input' or 'output'.
- a COMMAND indicating how the transform is run.
Each mapping in the TRANSFORMATIONS setting can contain optional CONTENT, SAFE, and STDERR settings.
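The restrictions above can be sketched as a quick pre-flight check to run before handing the file to gpfdist. The helper below is hypothetical (the name check_gpfdist_config is not part of gpfdist) and assumes a simple layout like the prices example. It matches lines with grep rather than parsing YAML, so it only confirms that the required settings appear at least once; it does not verify every mapping or catch malformed YAML.

```shell
#!/bin/bash
# check_gpfdist_config (hypothetical helper): rough structural check of a
# gpfdist transform configuration. Uses grep pattern matching, not a YAML
# parser, so it detects missing settings but not malformed YAML.
check_gpfdist_config() {
    local file="$1"
    local status=0

    # The document must start with three dashes.
    if [ "$(head -n 1 "$file")" != "---" ]; then
        echo "missing leading ---"; status=1
    fi
    # VERSION must be present with the value 1.0.0.1.
    if ! grep -Eq '^VERSION:[[:space:]]*1\.0\.0\.1[[:space:]]*$' "$file"; then
        echo "missing VERSION: 1.0.0.1"; status=1
    fi
    # TRANSFORMATIONS must be present.
    if ! grep -Eq '^TRANSFORMATIONS:[[:space:]]*$' "$file"; then
        echo "missing TRANSFORMATIONS"; status=1
    fi
    # At least one indented TYPE with the value input or output ...
    if ! grep -Eq '^[[:space:]]+TYPE:[[:space:]]*(input|output)[[:space:]]*$' "$file"; then
        echo "missing TYPE: input|output"; status=1
    fi
    # ... and at least one indented, non-empty COMMAND.
    if ! grep -Eq '^[[:space:]]+COMMAND:[[:space:]]*[^[:space:]]' "$file"; then
        echo "missing COMMAND"; status=1
    fi
    return "$status"
}
```

After sourcing the function, calling check_gpfdist_config config.yaml prints one line per missing setting and returns a non-zero status if anything is missing.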
The following gpfdist configuration, called config.yaml, applies to the prices example. The initial indentation on each line is significant and reflects the hierarchical nature of the specification. The name prices_input in the following example is referenced later when creating the table in SQL.
---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  prices_input:
    TYPE: input
    COMMAND: /bin/bash input_transform.sh %filename%
The COMMAND setting uses a wrapper script called input_transform.sh with a %filename% placeholder. When gpfdist runs the prices_input transform, it invokes input_transform.sh with /bin/bash and replaces the %filename% placeholder with the path to the input file to transform. The wrapper script input_transform.sh contains the logic to invoke the STX transformation and return the output.
If Joost is used, the Joost STX engine must be installed.
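The placeholder substitution described above can be pictured with a small bash sketch. The function expand_command is hypothetical and only illustrates the text replacement; it is not how gpfdist is implemented, and the path /data/prices.xml is an assumed example.

```shell
#!/bin/bash
# expand_command (hypothetical helper): substitute a data file path for the
# %filename% placeholder in a COMMAND string. Illustration only; this is not
# gpfdist's own implementation.
expand_command() {
    local template="$1"
    local path="$2"
    local expanded
    # Replace every occurrence of the literal text %filename% with the path.
    expanded=${template//%filename%/"$path"}
    printf '%s\n' "$expanded"
}
```

Calling expand_command '/bin/bash input_transform.sh %filename%' /data/prices.xml prints /bin/bash input_transform.sh /data/prices.xml, which is the shape of the command line the wrapper script receives. The wrapper script itself is shown next.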
#!/bin/bash
# input_transform.sh - sample input transformation,
# demonstrating use of Java and Joost STX to convert XML into
# text to load into HAWQ.
#
# java arguments:
#   -jar joost.jar         joost STX engine
#   -nodecl                don't generate a <?xml?> declaration
#   $1                     filename to process
#   input_transform.stx    the STX transformation
#
# the AWK step eliminates a blank line joost emits at the end
java \
    -jar joost.jar \
    -nodecl \
    "$1" \
    input_transform.stx \
    | awk 'NF>0'
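The final AWK stage of the script can be tried on its own, without Java or Joost. The sketch below wraps the same filter in a hypothetical function, drop_blank_lines, so its effect on sample text is easy to see; the sample rows are made up for illustration.

```shell
#!/bin/bash
# drop_blank_lines (hypothetical helper): the same filter as the last stage
# of input_transform.sh. NF is the number of fields awk finds on a line; a
# line that is empty or holds only whitespace has NF == 0 and is not printed.
drop_blank_lines() {
    awk 'NF>0'
}
```

For example, printf 'a|1\n\n   \nb|2\n\n' | drop_blank_lines prints only the two lines a|1 and b|2.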
The input_transform.sh file uses the Joost STX engine with the AWK interpreter. The following diagram shows the process flow as gpfdist runs the transformation.