Using Profiles to Read and Write Data
PXF profiles are collections of common metadata attributes that can be used to simplify the reading and writing of data. You can use any of the built-in profiles that come with PXF or you can create your own.
For example, if you are writing single-line records to text files on HDFS, you could use the built-in HdfsTextSimple profile. You specify this profile when you create the PXF external table used to write the data to HDFS.
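As a minimal sketch of that example, the following Python helper assembles the DDL for a writable external table that names the HdfsTextSimple profile. The host `namenode`, port `51200`, HDFS path, table name, and column list are illustrative assumptions, as is the `pxf://host:port/path?PROFILE=...` form of the LOCATION URI; verify the values and the exact syntax against your own installation.

```python
def pxf_external_table_ddl(table, columns, host, port, path, profile,
                           writable=False, delimiter=","):
    """Build a CREATE EXTERNAL TABLE statement that selects a PXF profile.

    This only formats a string; it does not contact a database or HDFS.
    The URI layout and FORMAT clause are assumptions for illustration.
    """
    kind = "WRITABLE EXTERNAL" if writable else "EXTERNAL"
    column_list = ", ".join(f"{name} {sqltype}" for name, sqltype in columns)
    location = f"pxf://{host}:{port}/{path.lstrip('/')}?PROFILE={profile}"
    return (
        f"CREATE {kind} TABLE {table} ({column_list})\n"
        f"LOCATION ('{location}')\n"
        f"FORMAT 'TEXT' (DELIMITER '{delimiter}');"
    )


# Hypothetical table, host, port, and path, chosen only for this example.
ddl = pxf_external_table_ddl(
    table="sales_export",
    columns=[("region", "text"), ("amount", "numeric")],
    host="namenode",
    port=51200,
    path="/data/pxf_examples/sales_export",
    profile="HdfsTextSimple",
    writable=True,
)
print(ddl)
```

Passing `writable=False` with a read-capable profile yields the corresponding readable external table definition.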
Built-In Profiles
PXF comes with several built-in profiles, each of which groups a collection of metadata attributes. The built-in profiles simplify access to the following types of data storage systems:
- HDFS File Data (Read + Write)
- Hive (Read only)
- HBase (Read only)
- JSON (Read only)
You can specify a built-in profile when you want to read data that exists inside HDFS files, Hive tables, HBase tables, or JSON files, and when you want to write data into HDFS files.
Profile | Description | Fragmenter/Accessor/Resolver/Metadata/OutputFormat |
---|---|---|
HdfsTextSimple | Read or write delimited single-line records from or to plain text files on HDFS. | |
HdfsTextMulti | Read delimited single-line or multi-line records (with quoted linefeeds) from plain text files on HDFS. This profile is not splittable (non-parallel); reading is slower than reading with HdfsTextSimple. | |
Hive | Read a Hive table with any of the available storage formats: text, RC, ORC, Sequence, or Parquet. | |
HiveRC | Optimized read of a Hive table where each partition is stored as an RCFile. Note: The DELIMITER parameter is mandatory. | |
HiveORC | Optimized read of a Hive table where each partition is stored as an ORC file. | |
HiveVectorizedORC | Optimized bulk/batch read of a Hive table where each partition is stored as an ORC file. | |
HiveText | Optimized read of a Hive table where each partition is stored as a text file. Note: The DELIMITER parameter is mandatory. | |
HBase | Read data from an HBase data store. | |
Avro | Read Avro files (fileName.avro). | |
JSON | Read JSON files (fileName.json) from HDFS. | |
Notes: Metadata identifies the Java class that provides field definitions in the relation. OutputFormat identifies the output serialization format (text or binary) for which a specific profile is optimized. While the built-in Hive* profiles provide Metadata and OutputFormat classes, other profiles may have no need to implement or specify these classes.
Adding and Updating Profiles
Each profile has a mandatory unique name and an optional description. In addition, each profile contains a set of plug-ins, which form an extensible set of metadata attributes. Administrators can add new profiles or edit the built-in profiles defined in /etc/pxf/conf/pxf-profiles.xml.
Note: Add the JAR files associated with custom PXF plug-ins to the /etc/pxf/conf/pxf-public.classpath configuration file.
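The structure of a profile described above (a unique name, an optional description, and a set of plug-ins) can be sketched with Python's standard `xml.etree.ElementTree` module. The element names (`profiles`, `profile`, `name`, `description`, `plugins`, `fragmenter`, `accessor`, `resolver`), the profile name `MyCustomProfile`, and the `com.example.pxf.*` class names are assumptions made for illustration; compare them with the entries already present in your pxf-profiles.xml before relying on them.

```python
import xml.etree.ElementTree as ET


def add_profile(root, name, plugins, description=None):
    """Append a <profile> element to a <profiles> root.

    Raises ValueError if a profile with exactly the same name already
    exists, since each profile name must be unique.
    """
    existing = {p.findtext("name") for p in root.findall("profile")}
    if name in existing:
        raise ValueError(f"profile name already defined: {name}")
    profile = ET.SubElement(root, "profile")
    ET.SubElement(profile, "name").text = name
    if description:  # the description is optional
        ET.SubElement(profile, "description").text = description
    plugins_el = ET.SubElement(profile, "plugins")
    for role, class_name in plugins.items():
        ET.SubElement(plugins_el, role).text = class_name
    return profile


# An empty stand-in for the parsed contents of pxf-profiles.xml.
root = ET.fromstring("<profiles/>")

# Hypothetical profile and plug-in class names.
add_profile(
    root,
    "MyCustomProfile",
    {
        "fragmenter": "com.example.pxf.MyFragmenter",
        "accessor": "com.example.pxf.MyAccessor",
        "resolver": "com.example.pxf.MyResolver",
    },
    description="Example profile backed by custom plug-ins",
)

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The sketch builds the entry in memory only. Writing it back to the real configuration file, and comparing names case-insensitively if your PXF version treats them that way, are left to the administrator.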
After you make changes in pxf-profiles.xml (or any other PXF configuration file), propagate the changes to all nodes with PXF installed, and then restart the PXF service on all nodes.