Expanding a Cluster
Apache HAWQ supports dynamic node expansion. You can add segment nodes while HAWQ is running without having to suspend or terminate cluster operations.
Note: This topic describes how to expand a cluster using the command-line interface. If you are using Ambari to manage your HAWQ cluster, see Expanding the HAWQ Cluster in Managing HAWQ Using Ambari.
Guidelines for Cluster Expansion
Keep the following recommendations in mind when modifying the size of your running HAWQ cluster:
- When you add a new node, install both a DataNode and a physical segment on the new node. If you are using YARN to manage HAWQ resources, you must also configure a YARN NodeManager on the new node.
- After adding a new node, you should always rebalance HDFS data to maintain cluster performance.
- Adding or removing a node also requires an update to the HDFS metadata cache. This update happens eventually, but can take some time. To speed up the update, execute the following in psql:

      select gp_metadata_cache_clear();

- For hash-distributed tables, expanding the cluster does not immediately improve performance because hash-distributed tables use a fixed number of virtual segments. To obtain better performance with hash-distributed tables, redistribute the table data across the updated cluster using either the ALTER TABLE or CREATE TABLE AS command (see the sketch after this list).
- If you are using hash tables, consider updating the default_hash_table_bucket_number server configuration parameter to a larger value after expanding the cluster but before redistributing the hash tables.
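The following is a minimal sketch of the two actions above, run from the HAWQ master. The table orders and its distribution column order_id are hypothetical placeholders for your own hash-distributed tables, and the ALTER TABLE form shown assumes your HAWQ release accepts SET DISTRIBUTED BY for redistribution:

    # Clear the HDFS metadata cache so the new node's block locations are picked up sooner.
    $ psql -d postgres -c "SELECT gp_metadata_cache_clear();"

    # Redistribute one hash-distributed table across the expanded cluster.
    $ psql -d postgres -c "ALTER TABLE orders SET DISTRIBUTED BY (order_id);"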
Adding a New Node to an Existing HAWQ Cluster
The following procedure describes the steps required to add a node to an existing HAWQ cluster. First ensure that the new node has been configured per the instructions found in Apache HAWQ System Requirements and Select HAWQ Host Machines.
This procedure uses a new node named sdw4 as an example.
1. Prepare the target machine by checking operating system configurations and setting up passwordless ssh. HAWQ requires passwordless ssh access to all cluster nodes. To set up passwordless ssh on the new node, perform the following steps:

    1. Login to the master HAWQ node as gpadmin. If you are logged in as a different user, switch to the gpadmin user and source the greenplum_path.sh file.

            $ su - gpadmin
            $ source /usr/local/hawq/greenplum_path.sh

    2. On the HAWQ master node, change directories to /usr/local/hawq/etc. In this location, create a file named new_hosts and add the hostname(s) of the node(s) you wish to add to the existing HAWQ cluster, one per line. For example:

            sdw4

    3. Login to the master HAWQ node as root and source the greenplum_path.sh file.

            $ su - root
            $ source /usr/local/hawq/greenplum_path.sh

    4. Execute the following hawq command to set up passwordless ssh for root on the new host machine:

            $ hawq ssh-exkeys -e hawq_hosts -x new_hosts

    5. Create the gpadmin user on the new host(s).

            $ hawq ssh -f new_hosts -e '/usr/sbin/useradd gpadmin'
            $ hawq ssh -f new_hosts -e 'echo -e "changeme\nchangeme" | passwd gpadmin'

    6. Switch to the gpadmin user and source the greenplum_path.sh file again.

            $ su - gpadmin
            $ source /usr/local/hawq/greenplum_path.sh

    7. Execute the following hawq command a second time to set up passwordless ssh for the gpadmin user:

            $ hawq ssh-exkeys -e hawq_hosts -x new_hosts

    8. (Optional) If you enabled temporary password-based authentication while preparing and configuring your new HAWQ host system, turn off password-based authentication as described in Apache HAWQ System Requirements.
2. After setting up passwordless ssh, you can execute the following hawq command to check the target machine's configuration:

        $ hawq check -f new_hosts

3. Configure operating system parameters as needed on the host machine. See the HAWQ installation documentation for a list of specific operating system parameters to configure.
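    One way to apply these settings, assuming the master's /etc/sysctl.conf already contains the recommended values and that you run the commands as root, is to push that file to the new host and reload it:

        $ hawq scp -f new_hosts /etc/sysctl.conf =:/etc/sysctl.conf
        $ hawq ssh -f new_hosts -e 'sysctl -p'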
4. Login to the target host machine sdw4 as the root user. If you are logged in as a different user, switch to the root account:

        $ su - root

5. If not already installed, install the target machine (sdw4) as an HDFS DataNode.

6. If you have any user-defined function (UDF) libraries installed in your existing HAWQ cluster, install them on the new node.
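    For example, shared libraries can be pushed from the master with hawq scp; the path below is a hypothetical placeholder for your own UDF library locations:

        $ hawq scp -f new_hosts /usr/local/hawq/lib/postgresql/my_udf.so =:/usr/local/hawq/lib/postgresql/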
7. Download and install HAWQ on the target machine (sdw4) as described in the software build instructions or in the distribution installation documentation.

8. On the HAWQ master node, check current cluster and host information using psql.

        $ psql -d postgres
        postgres=# SELECT * FROM gp_segment_configuration;

         registration_order | role | status | port  | hostname |  address
        --------------------+------+--------+-------+----------+-----------
                         -1 | s    | u      |  5432 | sdw1     | 192.0.2.0
                          0 | m    | u      |  5432 | mdw      | rhel64-1
                          1 | p    | u      | 40000 | sdw3     | 192.0.2.2
                          2 | p    | u      | 40000 | sdw2     | 192.0.2.1
        (4 rows)

    At this point the new node does not appear in the cluster.
9. Execute the following command to confirm that HAWQ was installed on the new host:

        $ hawq ssh -f new_hosts -e "ls -l $GPHOME"

10. On the master node, use a text editor to add hostname sdw4 to the hawq_hosts file you created during HAWQ installation. (If you do not already have this file, create it first and list all the nodes in your cluster.)

        mdw
        smdw
        sdw1
        sdw2
        sdw3
        sdw4

11. On the master node, use a text editor to add hostname sdw4 to the $GPHOME/etc/slaves file. This file lists all the segment host names for your cluster. For example:

        sdw1
        sdw2
        sdw3
        sdw4

12. Sync the hawq-site.xml and slaves configuration files to all nodes in the cluster (as listed in hawq_hosts):

        $ hawq scp -f hawq_hosts hawq-site.xml slaves =:$GPHOME/etc/

13. Make sure that the HDFS DataNode service has started on the new node.
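    One way to verify this, assuming your HDFS superuser account is named hdfs, is to confirm that sdw4 is listed as a live DataNode:

        $ sudo -u hdfs hdfs dfsadmin -report | grep -A 2 sdw4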
14. On sdw4, create directories based on the values assigned to the following properties in hawq-site.xml. These new directories must be owned by the same database user (for example, gpadmin) who will execute the hawq init segment command in the next step (see the sketch after this list):

    - hawq_segment_directory
    - hawq_segment_temp_directory

    Note: The hawq_segment_directory must be empty.
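    A minimal sketch, run as root on sdw4 and assuming the two properties point to the hypothetical paths /data/hawq/segment and /data/hawq/tmp:

        $ mkdir -p /data/hawq/segment /data/hawq/tmp
        $ chown -R gpadmin:gpadmin /data/hawq/segment /data/hawq/tmp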
15. On sdw4, switch to the database user (for example, gpadmin), and initialize the segment.

        $ su - gpadmin
        $ hawq init segment

16. On the master node, check current cluster and host information using psql to verify that the new sdw4 node has initialized successfully.

        $ psql -d postgres
        postgres=# SELECT * FROM gp_segment_configuration;

         registration_order | role | status | port  | hostname |  address
        --------------------+------+--------+-------+----------+-----------
                         -1 | s    | u      |  5432 | sdw1     | 192.0.2.0
                          0 | m    | u      |  5432 | mdw      | rhel64-1
                          1 | p    | u      | 40000 | sdw3     | 192.0.2.2
                          2 | p    | u      | 40000 | sdw2     | 192.0.2.1
                          3 | p    | u      | 40000 | sdw4     | 192.0.2.3
        (5 rows)

17. To maintain optimal cluster performance, rebalance HDFS data by running the following command:

        $ sudo -u hdfs hdfs balancer -threshold threshold_value

    where threshold_value represents how much a DataNode's disk usage, as a percentage, can differ from overall disk usage in the cluster. Adjust the threshold value according to the needs of your production data and disks; the smaller the value, the longer the rebalance takes.

    Note: If you do not specify a threshold, a default value of 20 is used. If the balancer detects that a DataNode is using less than a 20% difference of the cluster's overall disk usage, data on that node will not be rebalanced. For example, if disk usage across all DataNodes in the cluster is 40% of the cluster's total disk-storage capacity, the balancer ensures that each DataNode's disk usage stays between 20% and 60% of that DataNode's disk-storage capacity. DataNodes whose disk usage falls within that range are not rebalanced.
18. Rebalance time is also affected by network bandwidth. You can adjust the network bandwidth used by the balancer with the following command:

        $ sudo -u hdfs hdfs dfsadmin -setBalancerBandwidth network_bandwidth

    The default value is 1MB/s. Adjust the value according to your network.
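    For example, a run that tolerates a 10% spread and allows roughly 10 MB/s per DataNode might look like the following; the bandwidth argument is in bytes per second, and both values are illustrative rather than recommendations:

        $ sudo -u hdfs hdfs dfsadmin -setBalancerBandwidth 10485760
        $ sudo -u hdfs hdfs balancer -threshold 10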
19. Speed up the clearing of the metadata cache by using the following command:

        $ psql -d postgres
        postgres=# SELECT gp_metadata_cache_clear();

20. After expansion, if the new size of your cluster is greater than or equal to 4 nodes (#nodes >= 4), change the value of the output.replace-datanode-on-failure HDFS parameter in hdfs-client.xml to false.

21. (Optional) If you are using hash tables, adjust the default_hash_table_bucket_number server configuration property to reflect the cluster's new size. Update this configuration's value by multiplying the new number of nodes in the cluster by the appropriate amount indicated below.

    | Number of Nodes After Expansion | Suggested default_hash_table_bucket_number value |
    |---------------------------------|--------------------------------------------------|
    | <= 85                           | 6 * #nodes                                       |
    | > 85 and <= 102                 | 5 * #nodes                                       |
    | > 102 and <= 128                | 4 * #nodes                                       |
    | > 128 and <= 170                | 3 * #nodes                                       |
    | > 170 and <= 256                | 2 * #nodes                                       |
    | > 256 and <= 512                | 1 * #nodes                                       |
    | > 512                           | 512                                              |

22. If you are using hash distributed tables and wish to take advantage of the performance benefits of a larger cluster, redistribute the data in all hash-distributed tables by using either the ALTER TABLE or CREATE TABLE AS command, as shown in the sketch below. You should redistribute the table data if you modified the default_hash_table_bucket_number configuration parameter.

    Note: The redistribution of table data can take a significant amount of time.
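    A minimal sketch of steps 21 and 22 for a hypothetical 6-node cluster, assuming the hawq config utility is available for updating server configuration parameters and using the placeholder table orders with distribution key order_id:

        # Suggested value for <= 85 nodes is 6 * #nodes; for 6 nodes that is 36.
        $ hawq config -c default_hash_table_bucket_number -v 36
        $ hawq restart cluster

        # Rebuild and swap one hash-distributed table so it picks up the new default bucket number.
        $ psql -d postgres -c "CREATE TABLE orders_new AS SELECT * FROM orders DISTRIBUTED BY (order_id);"
        $ psql -d postgres -c "BEGIN; DROP TABLE orders; ALTER TABLE orders_new RENAME TO orders; COMMIT;"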