17.3. Node Management

17.3.1. Node Decommission

You can stop an individual RegionServer by running the following script in the HBase directory on the particular node:

$ ./bin/hbase-daemon.sh stop regionserver

The RegionServer will first close all regions and then shut itself down. On shutdown, the RegionServer's ephemeral node in ZooKeeper will expire. The master will notice the RegionServer gone and will treat it as a 'crashed' server; it will reassign the nodes the RegionServer was carrying.

Disable the Load Balancer before Decommissioning a node

If the load balancer runs while a node is shutting down, then there could be contention between the Load Balancer and the Master's recovery of the just decommissioned RegionServer. Avoid any problems by disabling the balancer first. See Load Balancer below.

A downside to the above stop of a RegionServer is that regions could be offline for a good period of time. Regions are closed in order. If many regions on the server, the first region to close may not be back online until all regions close and after the master notices the RegionServer's znode gone. In Apache HBase 0.90.2, we added facility for having a node gradually shed its load and then shutdown itself down. Apache HBase 0.90.2 added the graceful_stop.sh script. Here is its usage:

$ ./bin/graceful_stop.sh
Usage: graceful_stop.sh [--config &conf-dir>] [--restart] [--reload] [--thrift] [--rest] &hostname>
 thrift      If we should stop/start thrift before/after the hbase stop/start
 rest        If we should stop/start rest before/after the hbase stop/start
 restart     If we should restart after graceful stop
 reload      Move offloaded regions back on to the stopped server
 debug       Move offloaded regions back on to the stopped server
 hostname    Hostname of server we are to stop

To decommission a loaded RegionServer, run the following: $ ./bin/graceful_stop.sh HOSTNAME where HOSTNAME is the host carrying the RegionServer you would decommission.

On HOSTNAME

The HOSTNAME passed to graceful_stop.sh must match the hostname that hbase is using to identify RegionServers. Check the list of RegionServers in the master UI for how HBase is referring to servers. Its usually hostname but can also be FQDN. Whatever HBase is using, this is what you should pass the graceful_stop.sh decommission script. If you pass IPs, the script is not yet smart enough to make a hostname (or FQDN) of it and so it will fail when it checks if server is currently running; the graceful unloading of regions will not run.

The graceful_stop.sh script will move the regions off the decommissioned RegionServer one at a time to minimize region churn. It will verify the region deployed in the new location before it will moves the next region and so on until the decommissioned server is carrying zero regions. At this point, the graceful_stop.sh tells the RegionServer stop. The master will at this point notice the RegionServer gone but all regions will have already been redeployed and because the RegionServer went down cleanly, there will be no WAL logs to split.

Load Balancer

It is assumed that the Region Load Balancer is disabled while the graceful_stop script runs (otherwise the balancer and the decommission script will end up fighting over region deployments). Use the shell to disable the balancer:

hbase(main):001:0> balance_switch false
true
0 row(s) in 0.3590 seconds

This turns the balancer OFF. To reenable, do:

hbase(main):001:0> balance_switch true
false
0 row(s) in 0.3590 seconds

The graceful_stop will check the balancer and if enabled, will turn it off before it goes to work. If it exits prematurely because of error, it will not have reset the balancer. Hence, it is better to manage the balancer apart from graceful_stop reenabling it after you are done w/ graceful_stop.

17.3.1.1. Decommissioning several Regions Servers concurrently

If you have a large cluster, you may want to decommission more than one machine at a time by gracefully stopping mutiple RegionServers concurrently. To gracefully drain multiple regionservers at the same time, RegionServers can be put into a "draining" state. This is done by marking a RegionServer as a draining node by creating an entry in ZooKeeper under the hbase_root/draining znode. This znode has format name,port,startcode just like the regionserver entries under hbase_root/rs znode.

Without this facility, decommissioning mulitple nodes may be non-optimal because regions that are being drained from one region server may be moved to other regionservers that are also draining. Marking RegionServers to be in the draining state prevents this from happening[32].

17.3.1.2. Bad or Failing Disk

It is good having Section 2.5.2.2.1, “dfs.datanode.failed.volumes.tolerated” set if you have a decent number of disks per machine for the case where a disk plain dies. But usually disks do the "John Wayne" -- i.e. take a while to go down spewing errors in dmesg -- or for some reason, run much slower than their companions. In this case you want to decommission the disk. You have two options. You can decommission the datanode or, less disruptive in that only the bad disks data will be rereplicated, can stop the datanode, unmount the bad volume (You can't umount a volume while the datanode is using it), and then restart the datanode (presuming you have set dfs.datanode.failed.volumes.tolerated > 0). The regionserver will throw some errors in its logs as it recalibrates where to get its data from -- it will likely roll its WAL log too -- but in general but for some latency spikes, it should keep on chugging.

Short Circuit Reads

If you are doing short-circuit reads, you will have to move the regions off the regionserver before you stop the datanode; when short-circuiting reading, though chmod'd so regionserver cannot have access, because it already has the files open, it will be able to keep reading the file blocks from the bad disk even though the datanode is down. Move the regions back after you restart the datanode.

17.3.2. Rolling Restart

You can also ask this script to restart a RegionServer after the shutdown AND move its old regions back into place. The latter you might do to retain data locality. A primitive rolling restart might be effected by running something like the following:

$ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &

Tail the output of /tmp/log.txt to follow the scripts progress. The above does RegionServers only. The script will also disable the load balancer before moving the regions. You'd need to do the master update separately. Do it before you run the above script. Here is a pseudo-script for how you might craft a rolling restart script:

  1. Untar your release, make sure of its configuration and then rsync it across the cluster. If this is 0.90.2, patch it with HBASE-3744 and HBASE-3756.

  2. Run hbck to ensure the cluster consistent

    $ ./bin/hbase hbck

    Effect repairs if inconsistent.

  3. Restart the Master:

    $ ./bin/hbase-daemon.sh stop master; ./bin/hbase-daemon.sh start master

  4. Run the graceful_stop.sh script per RegionServer. For example:

    $ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &
                

    If you are running thrift or rest servers on the RegionServer, pass --thrift or --rest options (See usage for graceful_stop.sh script).

  5. Restart the Master again. This will clear out dead servers list and reenable the balancer.

  6. Run hbck to ensure the cluster is consistent.

It is important to drain HBase regions slowly when restarting regionservers. Otherwise, multiple regions go offline simultaneously as they are re-assigned to other nodes. Depending on your usage patterns, this might not be desirable.

17.3.3. Adding a New Node

Adding a new regionserver in HBase is essentially free, you simply start it like this: $ ./bin/hbase-daemon.sh start regionserver and it will register itself with the master. Ideally you also started a DataNode on the same machine so that the RS can eventually start to have local files. If you rely on ssh to start your daemons, don't forget to add the new hostname in conf/regionservers on the master.

At this point the region server isn't serving data because no regions have moved to it yet. If the balancer is enabled, it will start moving regions to the new RS. On a small/medium cluster this can have a very adverse effect on latency as a lot of regions will be offline at the same time. It is thus recommended to disable the balancer the same way it's done when decommissioning a node and move the regions manually (or even better, using a script that moves them one by one).

The moved regions will all have 0% locality and won't have any blocks in cache so the region server will have to use the network to serve requests. Apart from resulting in higher latency, it may also be able to use all of your network card's capacity. For practical purposes, consider that a standard 1GigE NIC won't be able to read much more than 100MB/s. In this case, or if you are in a OLAP environment and require having locality, then it is recommended to major compact the moved regions.



[32] See this blog post for more details.

comments powered by Disqus