Accessing JSON File Data
The PXF JSON plug-in reads native JSON stored in HDFS. The plug-in supports common data types, as well as basic (N-level) projection and arrays.
To access JSON file data with HAWQ, the data must be stored in HDFS and an external table created from the HDFS data store.
Prerequisites
Before working with JSON file data using HAWQ and PXF, ensure that:
- The PXF HDFS plug-in is installed on all cluster nodes.
- The PXF JSON plug-in is installed on all cluster nodes.
- You have tested PXF on HDFS.
Working with JSON Files
JSON is a text-based data-interchange format. JSON data is typically stored in a file with a `.json` suffix. A `.json` file contains a collection of objects. A JSON object is an unordered collection of name/value pairs. A value can be a string, a number, true, false, null, or an object or array. Objects and arrays can be nested.
Refer to Introducing JSON for specific information on JSON syntax.
Sample JSON data file content:
```json
{
  "created_at":"Mon Sep 30 04:04:53 +0000 2013",
  "id_str":"384529256681725952",
  "user": {
    "id":31424214,
    "location":"COLUMBUS"
  },
  "coordinates":null
}
```
JSON to HAWQ Data Type Mapping
To represent JSON data in HAWQ, map data values that use a primitive data type to HAWQ columns of the same type. JSON also supports the complex data types object and array. Use N-level projection to map the members of nested objects and arrays to primitive data types.
The following table summarizes external mapping rules for JSON data.
| JSON Data Type | HAWQ Data Type |
|----------------|----------------|
| Primitive type (integer, float, string, boolean, null) | Use the corresponding HAWQ built-in data type; see Data Types. |
| Array | Use `[]` brackets to identify a specific element of an array of primitive type. |
| Object | Use dot `.` notation to specify each level of projection (nesting) to a member of a primitive data type. |
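For instance, a sketch of column definitions that apply these rules to the sample object above (the table name and file path here are placeholders; Example 1 below shows a complete, working definition):

```sql
-- Sketch only: maps primitive and nested members of the sample JSON
-- object to HAWQ columns. Projected column names must be quoted.
CREATE EXTERNAL TABLE json_mapping_sketch (
    created_at      TEXT,     -- primitive string member
    "user.id"       INTEGER,  -- nested object member, dot notation
    "user.location" TEXT      -- nested object member, dot notation
)
LOCATION ('pxf://namenode:51200/user/data/sample.json?PROFILE=Json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```

An array element would be mapped the same way, with the `[]` index included in the quoted column name, for example `"coordinates.values[0]" INTEGER`.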
JSON File Read Modes
The PXF JSON plug-in reads data in one of two modes. The default mode expects one full JSON record per line. The JSON plug-in also supports a read mode operating on multi-line JSON records.
In the following discussion, a data set defined by a sample schema will be represented using each read mode of the PXF JSON plug-in. The sample schema contains data fields with the following names and data types:
- `created_at` - text
- `id_str` - text
- `user` - object
  - `id` - integer
  - `location` - text
- `coordinates` - object (optional)
  - `type` - text
  - `values` - array
    - [0] - integer
    - [1] - integer
Example 1 - Data Set for Single-JSON-Record-Per-Line Read Mode:
{"created_at":"FriJun0722:45:03+00002013","id_str":"343136551322136576","user":{
"id":395504494,"location":"NearCornwall"},"coordinates":{"type":"Point","values"
: [ 6, 50 ]}},
{"created_at":"FriJun0722:45:02+00002013","id_str":"343136547115253761","user":{
"id":26643566,"location":"Austin,Texas"}, "coordinates": null},
{"created_at":"FriJun0722:45:02+00002013","id_str":"343136547136233472","user":{
"id":287819058,"location":""}, "coordinates": null}
Example 2 - Data Set for Multi-Line JSON Record Read Mode:
```json
{
  "root":[
    {
      "record_obj":{
        "created_at":"Mon Sep 30 04:04:53 +0000 2013",
        "id_str":"384529256681725952",
        "user":{
          "id":31424214,
          "location":"COLUMBUS"
        },
        "coordinates":null
      }
    },
    {
      "record_obj":{
        "created_at":"Mon Sep 30 04:04:54 +0000 2013",
        "id_str":"384529260872228864",
        "user":{
          "id":67600981,
          "location":"KryberWorld"
        },
        "coordinates":{
          "type":"Point",
          "values":[
            8,
            52
          ]
        }
      }
    }
  ]
}
```
Loading JSON Data to HDFS
The PXF JSON plug-in reads native JSON stored in HDFS. Before JSON data can be queried via HAWQ, it must first be loaded to an HDFS data store.
Copy and paste the single-line JSON record data set to a file named `singleline.json`. Similarly, copy and paste the multi-line JSON record data set to a file named `multiline.json`.
Note: Ensure there are no blank lines in your JSON files.
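If your copied files do contain blank lines, one way to remove them (a sketch assuming GNU `sed`) is:

```shell
$ sed -i '/^[[:space:]]*$/d' singleline.json multiline.json
```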
Add the data set files to the HDFS data store:
```shell
$ hdfs dfs -mkdir /user/data
$ hdfs dfs -put singleline.json /user/data
$ hdfs dfs -put multiline.json /user/data
```
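You can confirm that the files are in place by listing the target directory:

```shell
$ hdfs dfs -ls /user/data
```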
Once loaded to HDFS, JSON data may be queried and analyzed via HAWQ.
Querying External JSON Data
Use the following syntax to create an external table representing JSON data:
```sql
CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ( 'pxf://<host>[:<port>]/<path-to-data>?PROFILE=Json[&IDENTIFIER=<value>]' )
FORMAT 'CUSTOM' ( FORMATTER='pxfwritable_import' );
```
JSON-plug-in-specific keywords and values used in the `CREATE EXTERNAL TABLE` call are described below.
| Keyword | Value |
|---------|-------|
| <host> | The PXF host. While <host> may identify any PXF agent node, use the HDFS NameNode as it is guaranteed to be available in a running HDFS cluster. If HDFS High Availability is enabled, <host> must identify the HDFS NameService. |
| <port> | The PXF port. If <port> is omitted, PXF assumes <host> identifies a High Availability HDFS NameService and connects to the port number designated by the `pxf_service_port` server configuration parameter value. Default is 51200. |
| PROFILE | The `PROFILE` keyword must specify the value `Json`. |
| IDENTIFIER | Include the `IDENTIFIER` keyword and <value> in the `LOCATION` string only when accessing a JSON file with multi-line records. <value> should identify the member name used to determine the encapsulating JSON object to return. (If the JSON file is the multi-line record Example 2 above, `&IDENTIFIER=created_at` would be specified.) |
| FORMAT | The `FORMAT` clause must specify `CUSTOM`. |
| FORMATTER | The JSON `CUSTOM` format supports only the built-in `pxfwritable_import` `FORMATTER`. |
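For example, if HDFS High Availability is enabled, the LOCATION string would identify the HDFS NameService and omit the port. (The NameService name `hdfscluster` below is a hypothetical example.)

```sql
LOCATION ('pxf://hdfscluster/user/data/singleline.json?PROFILE=Json')
```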
Note: The `ANALYZE` command is not supported on external tables you create with the `Json` profile.
Example 1
The following CREATE EXTERNAL TABLE SQL call creates a queryable external table based on the data in the single-line-per-record JSON example.
```sql
CREATE EXTERNAL TABLE sample_json_singleline_tbl(
    created_at TEXT,
    id_str TEXT,
    "user.id" INTEGER,
    "user.location" TEXT,
    "coordinates.values[0]" INTEGER,
    "coordinates.values[1]" INTEGER
)
LOCATION('pxf://namenode:51200/user/data/singleline.json?PROFILE=Json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```

To query this external table:

```sql
SELECT * FROM sample_json_singleline_tbl;
```
Notice the use of `.` projection to access the nested fields in the `user` and `coordinates` objects, and the use of `[]` brackets to access specific elements of the `coordinates.values` array.
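Projected columns behave like ordinary columns in queries. For instance, a sketch of a query restricted to the records that include coordinates:

```sql
-- Sketch: nested-member and array-element columns can appear in
-- SELECT lists and WHERE clauses like any other column.
SELECT id_str, "user.location",
       "coordinates.values[0]", "coordinates.values[1]"
FROM   sample_json_singleline_tbl
WHERE  "coordinates.values[0]" IS NOT NULL;
```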
Example 2
A `CREATE EXTERNAL TABLE` SQL call to create a queryable external table based on the multi-line-per-record JSON data set would be very similar to that of the single-line data set above. You might specify a different table name, for example `sample_json_multiline_tbl`.
The `LOCATION` clause would differ. The `IDENTIFIER` keyword and an associated value must be specified when reading from multi-line JSON records:

```sql
LOCATION('pxf://namenode:51200/user/data/multiline.json?PROFILE=Json&IDENTIFIER=created_at')
```
`created_at` identifies the member name used to determine the encapsulating JSON object, `record_obj` in this case.
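Putting it together, the full definition might look like the following (a sketch that mirrors Example 1's column list, per the note above):

```sql
-- Sketch: identical to Example 1 except for the table name and the
-- LOCATION string, which adds IDENTIFIER=created_at for multi-line records.
CREATE EXTERNAL TABLE sample_json_multiline_tbl(
    created_at TEXT,
    id_str TEXT,
    "user.id" INTEGER,
    "user.location" TEXT,
    "coordinates.values[0]" INTEGER,
    "coordinates.values[1]" INTEGER
)
LOCATION('pxf://namenode:51200/user/data/multiline.json?PROFILE=Json&IDENTIFIER=created_at')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```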
To query this external table populated with JSON data:
```sql
SELECT * FROM sample_json_multiline_tbl;
```