HDFS Configuration Reference

This reference page describes the HDFS configuration values that are set for HAWQ in hdfs-site.xml, core-site.xml, or hdfs-client.xml.

HDFS Site Configuration (hdfs-site.xml and core-site.xml)

This topic provides a reference of the HDFS site configuration values recommended for HAWQ installations. These parameters are located in either hdfs-site.xml or core-site.xml of your HDFS deployment.

This table describes the configuration parameters and values that are recommended for HAWQ installations. Only HDFS parameters that need to be modified or customized for HAWQ are listed.

| Parameter | Description | Recommended Value for HAWQ Installs | Comments |
|-----------|-------------|-------------------------------------|----------|
| dfs.allow.truncate | Allows truncate. | true | HAWQ requires that you enable dfs.allow.truncate. The HAWQ service will fail to start if dfs.allow.truncate is not set to true. |
| dfs.block.access.token.enable | If true, access tokens are used as capabilities for accessing DataNodes. If false, no access tokens are checked when accessing DataNodes. | false for an unsecured HDFS cluster, or true for a secure cluster | |
| dfs.block.local-path-access.user | Comma-separated list of the users allowed to open block files on legacy short-circuit local reads. | gpadmin | |
| dfs.client.read.shortcircuit | Turns on short-circuit local reads. | true | In Ambari, this parameter corresponds to HDFS Short-circuit read. The value for this parameter should be the same in hdfs-site.xml and HAWQ's hdfs-client.xml. |
| dfs.client.socket-timeout | The amount of time, in milliseconds, before a client connection times out when establishing a connection or reading. | 300000000 | |
| dfs.client.use.legacy.blockreader.local | Setting this value to false specifies that the new version of the short-circuit reader is used. Setting this value to true means that the legacy short-circuit reader is used. | false | |
| dfs.datanode.data.dir.perm | Permissions for the directories on the local filesystem where the DFS DataNode stores its blocks. The permissions can be either octal or symbolic. | 750 | In Ambari, this parameter corresponds to DataNode directories permission. |
| dfs.datanode.handler.count | The number of server threads for the DataNode. | 60 | |
| dfs.datanode.max.transfer.threads | The maximum number of threads to use for transferring data in and out of the DataNode. | 40960 | In Ambari, this parameter corresponds to DataNode max data transfer threads. |
| dfs.datanode.socket.write.timeout | The amount of time, in milliseconds, before a write operation times out. | 7200000 | |
| dfs.domain.socket.path | (Optional.) The path to a UNIX domain socket to use for communication between the DataNode and local HDFS clients. If the string "_PORT" is present in this path, it is replaced by the TCP port of the DataNode. | | If set, the value for this parameter should be the same in hdfs-site.xml and HAWQ's hdfs-client.xml. |
| dfs.namenode.accesstime.precision | The access time for an HDFS file is precise up to this value. Setting a value of 0 disables access times for HDFS. | 0 | In Ambari, this parameter corresponds to Access time precision. |
| dfs.namenode.handler.count | The number of server threads for the NameNode. | 600 | |
| dfs.support.append | Whether HDFS is allowed to append to files. | true | |
| ipc.client.connection.maxidletime | The maximum time, in milliseconds, after which a client brings down the connection to the server. | 3600000 | In core-site.xml |
| ipc.client.connect.timeout | The number of milliseconds a client waits for the socket to establish a server connection. | 300000 | In core-site.xml |
| ipc.server.listen.queue.size | The length of the listen queue for servers accepting client connections. | 3300 | In core-site.xml |
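As a concrete illustration, the recommended values above are expressed as standard Hadoop property entries. The following excerpts are a sketch only, showing a few parameters from the table in the standard Hadoop configuration file format; your deployment's files will contain many other properties as well:

```xml
<!-- hdfs-site.xml: excerpt with HAWQ-recommended values from the table above -->
<configuration>
  <property>
    <name>dfs.allow.truncate</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir.perm</name>
    <value>750</value>
  </property>
</configuration>
```

```xml
<!-- core-site.xml: excerpt with the ipc.* parameters from the table above -->
<configuration>
  <property>
    <name>ipc.client.connection.maxidletime</name>
    <value>3600000</value>
  </property>
  <property>
    <name>ipc.server.listen.queue.size</name>
    <value>3300</value>
  </property>
</configuration>
```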

HDFS Client Configuration (hdfs-client.xml)

This topic provides a reference of the HAWQ configuration values located in $GPHOME/etc/hdfs-client.xml.

This table describes the configuration parameters and their default values:

| Parameter | Description | Default Value | Comments |
|-----------|-------------|---------------|----------|
| dfs.client.failover.max.attempts | The maximum number of times that the DFS client retries issuing an RPC call when multiple NameNodes are configured. | 15 | |
| dfs.client.log.severity | The minimal log severity level. Valid values include: FATAL, ERROR, INFO, DEBUG1, DEBUG2, and DEBUG3. | INFO | |
| dfs.client.read.shortcircuit | Determines whether the DataNode is bypassed when reading file blocks, if the block and client are on the same node. The default value, true, bypasses the DataNode. | true | The value for this parameter should be the same in hdfs-site.xml and HAWQ's hdfs-client.xml. |
| dfs.client.use.legacy.blockreader.local | Determines whether the legacy short-circuit reader implementation, based on HDFS-2246, is used. Set this property to true on non-Linux platforms that do not have the new implementation based on HDFS-347. | false | |
| dfs.default.blocksize | Default block size, in bytes. | 134217728 | Default is equal to 128 MB. |
| dfs.default.replica | The default number of replicas. | 3 | |
| dfs.domain.socket.path | (Optional.) The path to a UNIX domain socket to use for communication between the DataNode and local HDFS clients. If the string "_PORT" is present in this path, it is replaced by the TCP port of the DataNode. | | If set, the value for this parameter should be the same in hdfs-site.xml and HAWQ's hdfs-client.xml. |
| dfs.prefetchsize | The number of blocks for which information is pre-fetched. | 10 | |
| hadoop.security.authentication | Specifies the type of RPC authentication to use. A value of simple indicates no authentication. A value of kerberos enables authentication by Kerberos. | simple | |
| input.connect.timeout | The timeout interval, in milliseconds, for when the input stream is setting up a connection to a DataNode. | 600000 | Default is equal to 10 minutes. |
| input.localread.blockinfo.cachesize | The size of the file block path information cache, in bytes. | 1000 | |
| input.localread.default.buffersize | The size of the buffer, in bytes, used to hold data from the file block and verify the checksum. This value is used only when dfs.client.read.shortcircuit is set to true. | 1048576 | Default is equal to 1 MB. If an older version of hdfs-client.xml is retained during an upgrade, set input.localread.default.buffersize to 2097152 to avoid performance degradation. |
| input.read.getblockinfo.retry | The maximum number of times the client should retry getting block information from the NameNode. | 3 | |
| input.read.timeout | The timeout interval, in milliseconds, for when the input stream is reading from a DataNode. | 3600000 | Default is equal to 1 hour. |
| input.write.timeout | The timeout interval, in milliseconds, for when the input stream is writing to a DataNode. | 3600000 | Default is equal to 1 hour. |
| output.close.timeout | The timeout interval, in milliseconds, for closing an output stream. | 900000 | Default is equal to 15 minutes. |
| output.connect.timeout | The timeout interval, in milliseconds, for when the output stream is setting up a connection to a DataNode. | 600000 | Default is equal to 10 minutes. |
| output.default.chunksize | The chunk size of the pipeline, in bytes. | 512 | |
| output.default.packetsize | The packet size of the pipeline, in bytes. | 65536 | Default is equal to 64 KB. |
| output.default.write.retry | The maximum number of times that the client should reattempt to set up a failed pipeline. | 10 | |
| output.packetpool.size | The maximum number of packets in a file's packet pool. | 1024 | |
| output.read.timeout | The timeout interval, in milliseconds, for when the output stream is reading from a DataNode. | 3600000 | Default is equal to 1 hour. |
| output.replace-datanode-on-failure | Determines whether the client adds a new DataNode to the pipeline if the number of nodes in the pipeline is less than the specified number of replicas. | false (if # of nodes is less than or equal to 4), otherwise true | When you deploy a HAWQ cluster, the hawq init utility detects the number of nodes in the cluster and updates this configuration parameter accordingly. However, when expanding an existing cluster to 4 or more nodes, you must manually set this value to true. Set this value to false if you remove existing nodes and fall under 4 nodes. |
| output.write.timeout | The timeout interval, in milliseconds, for when the output stream is writing to a DataNode. | 3600000 | Default is equal to 1 hour. |
| rpc.client.connect.retry | The maximum number of times to retry a connection if the RPC client fails to connect to the server. | 10 | |
| rpc.client.connect.tcpnodelay | Determines whether TCP_NODELAY is used when connecting to the RPC server. | true | |
| rpc.client.connect.timeout | The timeout interval, in milliseconds, for establishing the RPC client connection. | 600000 | Default is equal to 10 minutes. |
| rpc.client.max.idle | The maximum idle time, in milliseconds, for an RPC connection. | 10000 | Default is equal to 10 seconds. |
| rpc.client.ping.interval | The interval, in milliseconds, at which the RPC client sends a heartbeat to the server. A value of 0 disables the heartbeat. | 10000 | |
| rpc.client.read.timeout | The timeout interval, in milliseconds, for when the RPC client is reading from the server. | 3600000 | Default is equal to 1 hour. |
| rpc.client.socket.linger.timeout | The value to set for the SO_LINGER socket option when connecting to the RPC server. | -1 | |
| rpc.client.timeout | The timeout interval, in milliseconds, of an RPC invocation. | 3600000 | Default is equal to 1 hour. |
| rpc.client.write.timeout | The timeout interval, in milliseconds, for when the RPC client is writing to the server. | 3600000 | Default is equal to 1 hour. |
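Overrides in $GPHOME/etc/hdfs-client.xml use the same Hadoop property format. The following sketch shows two parameters from the table above: enabling short-circuit reads (which must match hdfs-site.xml) and the input.localread.default.buffersize value recommended when an older hdfs-client.xml is retained during upgrade; the excerpt is illustrative, not a complete file:

```xml
<!-- $GPHOME/etc/hdfs-client.xml: excerpt -->
<configuration>
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <property>
    <name>input.localread.default.buffersize</name>
    <value>2097152</value>
  </property>
</configuration>
```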