For more information on the RegionServers, see Section 9.6, “RegionServer”.
The Master believes the RegionServers have the IP of 127.0.0.1 - which is localhost and resolves to the master's own localhost.
The RegionServers are erroneously informing the Master that their IP addresses are 127.0.0.1.
Modify /etc/hosts on the region servers, from...
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               fully.qualified.regionservername regionservername localhost.localdomain localhost
::1                     localhost6.localdomain6 localhost6
... to (removing the RegionServer's own name from the localhost line)...
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               localhost.localdomain localhost
::1                     localhost6.localdomain6 localhost6
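One way to spot this misconfiguration is to scan the hosts file for non-localhost names bound to a loopback address. The following is a minimal sketch and not part of HBase; the helper name `loopback_aliases` and the set of names treated as harmless are assumptions made for illustration.

```python
# Sketch: report hostnames that a hosts file maps to a loopback address.
# The helper name and the LOCAL_NAMES set are illustrative assumptions.
LOCAL_NAMES = {
    "localhost", "localhost.localdomain",
    "localhost6", "localhost6.localdomain6",
}


def loopback_aliases(hosts_text):
    """Return non-localhost names bound to 127.x.x.x or ::1."""
    suspicious = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        if address.startswith("127.") or address == "::1":
            suspicious.extend(n for n in names if n not in LOCAL_NAMES)
    return suspicious
```

Passing the contents of /etc/hosts from each region server to `loopback_aliases` should return an empty list once the file has been corrected.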
Since compression algorithms such as LZO need to be installed and configured on each cluster, this is a frequent source of startup errors. If you see messages like this...
11/02/20 01:32:15 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1734)
        at java.lang.Runtime.loadLibrary0(Runtime.java:823)
        at java.lang.System.loadLibrary(System.java:1028)
... then there is a path issue with the compression libraries. See the Configuration section on LZO compression configuration.
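To narrow down the path issue, it can help to check which directories on java.library.path actually contain the native library. The sketch below assumes a Linux host, where the JVM maps the library name gplcompression to the file libgplcompression.so; the helper name `find_native_lib` is hypothetical.

```python
import os


def find_native_lib(library_path, lib_name="gplcompression"):
    """Return the directories in a java.library.path string that hold the library."""
    file_name = "lib" + lib_name + ".so"  # assumption: Linux shared-library naming
    hits = []
    for directory in library_path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, file_name)):
            hits.append(directory)
    return hits
```

An empty result for the path the RegionServer JVM was started with is consistent with the UnsatisfiedLinkError shown above.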
Are you running an old JVM (older than 1.6.0_u21)? When you look at a thread dump, does it look like threads are BLOCKED but no one holds the lock they are all blocked on? See HBASE-3622 Deadlock in HBaseServer (JVM bug?). Adding -XX:+UseMembar to the HBase HBASE_OPTS in conf/hbase-env.sh may fix it.
If you see log messages like this...
2010-09-13 01:24:17,336 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Disk-related IOException in BlockReceiver constructor. Cause is java.io.IOException: Too many open files
        at java.io.UnixFileSystem.createFileExclusively(Native Method)
        at java.io.File.createNewFile(File.java:883)
... see the Getting Started section on ulimit and nproc configuration.
This typically shows up in the DataNode logs.
See the Getting Started section on xceivers configuration.
See the Getting Started section on ulimit and nproc configuration. The default on recent Linux distributions is 1024 - which is far too low for HBase.
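The limits in effect can be inspected from a process started by the same user that runs HBase and HDFS, since limits are applied per user session. The following is a minimal sketch using Python's standard `resource` module (Unix only); the 1024 threshold simply mirrors the default mentioned above, and the function names are illustrative.

```python
import resource

# Assumption: treat anything at or below the 1024 default as too low.
# The text above does not say what the right value is for a given cluster.
DEFAULT_TOO_LOW = 1024


def limit_too_low(soft_limit, threshold=DEFAULT_TOO_LOW):
    """True if the soft limit is finite and at or below the threshold."""
    if soft_limit == resource.RLIM_INFINITY:
        return False
    return soft_limit <= threshold


def check_limits():
    """Return {limit name: (soft limit, too_low)} for the current process."""
    report = {}
    for name in ("RLIMIT_NOFILE", "RLIMIT_NPROC"):
        res = getattr(resource, name, None)  # RLIMIT_NPROC is not on every platform
        if res is None:
            continue
        soft, _hard = resource.getrlimit(res)
        report[name] = (soft, limit_too_low(soft))
    return report


if __name__ == "__main__":
    for name, (soft, low) in check_limits().items():
        print(name, soft, "TOO LOW" if low else "ok")
```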
If you see warning messages like this...
2009-02-24 10:01:33,516 WARN org.apache.hadoop.hbase.util.Sleeper: We slept xxx ms, ten times longer than scheduled: 10000
2009-02-24 10:01:33,516 WARN org.apache.hadoop.hbase.util.Sleeper: We slept xxx ms, ten times longer than scheduled: 15000
2009-02-24 10:01:36,472 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to master for xxx milliseconds - retrying
... or see full GC compactions, then you may be experiencing full GCs.
These errors can happen either when running out of OS file handles or in periods of severe network problems where the nodes are unreachable.
See the Getting Started section on ulimit and nproc configuration and check your network.
Master or RegionServers shutting down with messages like these in the logs:
WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x278bd16a96000f to sun.nio.ch.SelectionKeyImpl@355811ec
java.io.IOException: TIMED OUT
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
WARN org.apache.hadoop.hbase.util.Sleeper: We slept 79410ms, ten times longer than scheduled: 5000
INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server hostname/IP:PORT
INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/IP:PORT remote=hostname/IP:PORT]
INFO org.apache.zookeeper.ClientCnxn: Server connection successful
WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x278bd16a96000d to sun.nio.ch.SelectionKeyImpl@3544d65e
java.io.IOException: Session Expired
        at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
        at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired
The JVM is doing a long-running garbage collection, which pauses every thread (aka "stop the world"). Since the RegionServer's local ZooKeeper client cannot send heartbeats, the session times out. By design, we shut down any node that isn't able to contact the ZooKeeper ensemble after getting a timeout so that it stops serving data that may already be assigned elsewhere.
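To judge whether pauses are long enough to expire the session, the Sleeper warnings can be pulled out of a log and compared with the configured session timeout. This is a rough sketch: the regular expression is written against the message format quoted above, and the function name and the timeout argument are assumptions for illustration.

```python
import re

# Matches the Sleeper warning quoted above, with or without a space before "ms".
SLEEPER_RE = re.compile(
    r"Sleeper: We slept (\d+)\s*ms, ten times longer than scheduled: (\d+)"
)


def find_long_pauses(log_lines, session_timeout_ms):
    """Return (slept_ms, scheduled_ms, exceeds_timeout) for each Sleeper warning."""
    pauses = []
    for line in log_lines:
        match = SLEEPER_RE.search(line)
        if match:
            slept, scheduled = int(match.group(1)), int(match.group(2))
            pauses.append((slept, scheduled, slept >= session_timeout_ms))
    return pauses
```

With a 60-second timeout, the 79410 ms pause from the sample log is flagged as exceeding the session timeout, which matches the session expiry that follows it.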
Make sure you give plenty of RAM (in conf/hbase-env.sh); the default of 1GB won't be able to sustain long running imports.
Make sure you don't swap; the JVM never behaves well under swapping.
Make sure you are not CPU starving the RegionServer thread. For example, if you are running a MapReduce job using 6 CPU-intensive tasks on a machine with 4 cores, you are probably starving the RegionServer enough to create longer garbage collection pauses.
Increase the ZooKeeper session timeout
If you wish to increase the session timeout, add the following to your hbase-site.xml to increase the timeout from the default of 60 seconds to 120 seconds.
<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>
<property>
  <name>hbase.zookeeper.property.tickTime</name>
  <value>6000</value>
</property>
Be aware that setting a higher timeout means that the regions served by a failed RegionServer will take at least that amount of time to be transferred to another RegionServer. For a production system serving live requests, we would instead recommend setting it lower than 1 minute and over-provisioning your cluster in order to lower the memory load on each machine (hence having less garbage to collect per machine).
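The sample configuration raises hbase.zookeeper.property.tickTime together with zookeeper.session.timeout. A likely reason is that a ZooKeeper server bounds the session timeout it grants: by default between 2 and 20 times tickTime. The sketch below illustrates that arithmetic, assuming minSessionTimeout and maxSessionTimeout are left at their defaults; the function name is hypothetical.

```python
def negotiated_timeout_ms(requested_ms, tick_time_ms):
    """Clamp a requested session timeout to ZooKeeper's default bounds."""
    min_ms = 2 * tick_time_ms    # default minSessionTimeout
    max_ms = 20 * tick_time_ms   # default maxSessionTimeout
    return max(min_ms, min(requested_ms, max_ms))


if __name__ == "__main__":
    # A 120-second request fits exactly under a 6000 ms tickTime.
    print(negotiated_timeout_ms(120000, 6000))
    # The same request under a 2000 ms tickTime is cut to 40 seconds.
    print(negotiated_timeout_ms(120000, 2000))
```

Under these assumptions, requesting a longer session timeout without also raising tickTime may have no effect beyond the 20-tick ceiling.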
If this is happening during an upload which only happens once (like initially loading all your data into HBase), consider bulk loading.
See Section 15.11.2, “ZooKeeper, The Cluster Canary” for other general information about ZooKeeper troubleshooting.
This exception is "normal" when found in the RegionServer logs at DEBUG level. The exception is returned to the client, which then goes back to hbase:meta to find the new location of the moved region.
However, if the NotServingRegionException is logged at ERROR level, then the client ran out of retries and something is probably wrong.
Fix your DNS. In versions of Apache HBase before 0.92.x, reverse DNS needs to give the same answer as the forward lookup. See HBASE-3431 RegionServer is not using the name given it by the master; double entry in master listing of servers for gory details.
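Whether forward and reverse lookups agree for a given host can be checked with a short script. The following is a minimal sketch built on Python's standard `socket.gethostbyname` and `socket.gethostbyaddr`; the function name `dns_round_trip` is an assumption, and the resolvers are passed as parameters only so the logic can be exercised without a live DNS server.

```python
import socket


def dns_round_trip(hostname, forward=socket.gethostbyname,
                   reverse=socket.gethostbyaddr):
    """Return (ip, reverse_name, matches) for a forward-then-reverse lookup."""
    ip = forward(hostname)
    reverse_name = reverse(ip)[0]  # gethostbyaddr returns (name, aliases, addresses)
    # Compare case-insensitively and ignore a trailing dot on either name.
    matches = reverse_name.rstrip(".").lower() == hostname.rstrip(".").lower()
    return ip, reverse_name, matches
```

Running `dns_round_trip` with each RegionServer's hostname, from the Master and from the RegionServer itself, shows whether the two lookups give the same answer on both machines.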
We are not using the native versions of compression libraries. See HBASE-1900 Put back native support when hadoop 0.21 is released. Copy the native libs from Hadoop into the HBase lib directory, or symlink them into place, and the message should go away.
If you see this type of message, it means that the region server was trying to read/send data from/to a client that had already gone away. Typical causes are that the client was killed (you see a storm of messages like this when a MapReduce job is killed or fails) or that the client received a SocketTimeoutException. It's harmless, but you should consider digging in a bit more if you aren't doing something to trigger them.
Several operations within HBase, including snapshots, rely on properly configured reverse DNS. Some environments, such as Amazon EC2, have trouble with reverse DNS. If you see errors like the following on your RegionServers, check your reverse DNS configuration:
2013-05-01 00:04:56,356 DEBUG org.apache.hadoop.hbase.procedure.Subprocedure: Subprocedure 'backup1' coordinator notified of 'acquire', waiting on 'reached' or 'abort' from coordinator.
In general, the hostname reported by the RegionServer needs to be the same as the hostname the Master is trying to reach. You can see a hostname mismatch by looking for the following type of message in the RegionServer's logs at start-up.
2013-05-01 00:03:00,614 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Master passed us hostname to use. Was=myhost-1234, Now=ip-10-55-88-99.ec2.internal
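The start-up message can be searched for mechanically across RegionServer logs. This is a small sketch whose regular expression is written against the message format shown above; the function name is hypothetical, and it reports only lines where the two names differ.

```python
import re

# Matches the "Master passed us hostname to use" start-up message quoted above.
HOSTNAME_RE = re.compile(
    r"Master passed us hostname to use\. Was=([^,\s]+), Now=(\S+)"
)


def hostname_mismatches(log_lines):
    """Return (was, now) pairs where the two reported hostnames differ."""
    found = []
    for line in log_lines:
        match = HOSTNAME_RE.search(line)
        if match and match.group(1) != match.group(2):
            found.append((match.group(1), match.group(2)))
    return found
```

Any pair returned is a candidate for the reverse DNS check described above.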