Chapter 13. Troubleshooting and Debugging Apache HBase

Table of Contents

13.1. General Guidelines
13.2. Logs
13.2.1. Log Locations
13.2.2. Log Levels
13.2.3. JVM Garbage Collection Logs
13.3. Resources
13.3.1. search-hadoop.com
13.3.2. Mailing Lists
13.3.3. IRC
13.3.4. JIRA
13.4. Tools
13.4.1. Builtin Tools
13.4.2. External Tools
13.5. Client
13.5.1. ScannerTimeoutException or UnknownScannerException
13.5.2. LeaseException when calling Scanner.next
13.5.3. Shell or client application throws lots of scary exceptions during normal operation
13.5.4. Long Client Pauses With Compression
13.5.5. ZooKeeper Client Connection Errors
13.5.6. Client running out of memory though heap size seems to be stable (but the off-heap/direct heap keeps growing)
13.5.7. Client Slowdown When Calling Admin Methods (flush, compact, etc.)
13.5.8. Secure Client Cannot Connect ([Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)])
13.6. MapReduce
13.6.1. You Think You're On The Cluster, But You're Actually Local
13.7. NameNode
13.7.1. HDFS Utilization of Tables and Regions
13.7.2. Browsing HDFS for HBase Objects
13.8. Network
13.8.1. Network Spikes
13.8.2. Loopback IP
13.8.3. Network Interfaces
13.9. RegionServer
13.9.1. Startup Errors
13.9.2. Runtime Errors
13.9.3. Shutdown Errors
13.10. Master
13.10.1. Startup Errors
13.10.2. Shutdown Errors
13.11. ZooKeeper
13.11.1. Startup Errors
13.11.2. ZooKeeper, The Cluster Canary
13.12. Amazon EC2
13.12.1. ZooKeeper does not seem to work on Amazon EC2
13.12.2. Instability on Amazon EC2
13.12.3. Remote Java Connection into EC2 Cluster Not Working
13.13. HBase and Hadoop version issues
13.13.1. NoClassDefFoundError when trying to run 0.90.x on hadoop-0.20.205.x (or hadoop-1.0.x)
13.13.2. ...cannot communicate with client version...
13.14. Running unit or integration tests
13.14.1. Runtime exceptions from MiniDFSCluster when running tests
13.15. Case Studies
13.16. Cryptographic Features
13.16.1. sun.security.pkcs11.wrapper.PKCS11Exception: CKR_ARGUMENTS_BAD

13.1. General Guidelines

Always start with the master log (TODO: Which lines?). Normally it just prints the same lines over and over; if it does not, there is an issue. Searching Google or search-hadoop.com for the exceptions you are seeing should return some hits.
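A minimal sketch of that first pass over the master log. The log path and sample lines below are fabricated for illustration; adjust the path for your installation:

```shell
# Fabricated master log sample; a real master log typically lives at
# $HBASE_HOME/logs/hbase-<user>-master-<host>.log (path is an assumption).
cat > /tmp/master-sample.log <<'EOF'
2012-04-01 12:00:00,000 INFO org.apache.hadoop.hbase.master.HMaster: balance run complete
2012-04-01 12:00:05,000 WARN org.apache.hadoop.hbase.master.HMaster: org.apache.hadoop.hbase.NotServingRegionException: region is offline
EOF

# Filter out the routine INFO chatter to surface anything unusual:
grep -vn " INFO " /tmp/master-sample.log
```

If the only output is routine INFO lines repeating, the master is probably healthy; anything else is a lead worth searching for.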

An error rarely comes alone in Apache HBase; when something goes wrong, what follows may be hundreds of exceptions and stack traces from all over the place. The best way to approach this type of problem is to walk the log back to where it all began. For example, one trick with RegionServers is that they print some metrics when aborting, so grepping for Dump should get you near the start of the problem.
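The grep-for-Dump trick can be sketched as follows. The log path and log lines are fabricated for illustration; the exact dump wording varies by HBase version:

```shell
# Fabricated RegionServer log sample; real logs are typically at
# $HBASE_HOME/logs/hbase-<user>-regionserver-<host>.log (path is an assumption).
cat > /tmp/regionserver-sample.log <<'EOF'
2012-04-01 12:00:01,000 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: serving requests
2012-04-01 12:00:02,000 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: requests=0, regions=12, stores=24
2012-04-01 12:00:03,000 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting
EOF

# The metrics dump is printed as part of the abort, so its line number
# marks roughly where the trouble began; read the log upward from there.
grep -n "Dump" /tmp/regionserver-sample.log
```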

RegionServer suicides are “normal”: aborting is what they do when something goes wrong. For example, if ulimit and xcievers (the two most important initial settings; see Section 2.1.2.5, “ulimit and nproc”) are left unchanged, at some point the DataNodes will be unable to create new threads, which from the HBase point of view looks as if HDFS were gone. Think about what would happen if your MySQL database were suddenly unable to access files on your local file system; it is the same with HBase and HDFS. Another very common reason for RegionServers committing seppuku is a prolonged garbage collection pause that lasts longer than the default ZooKeeper session timeout. For more information on GC pauses, see the three-part blog post by Todd Lipcon and Section 12.3.1.1, “Long GC pauses” above.
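A quick way to check the two settings mentioned above. The 32768 and 4096 values follow the recommendations elsewhere in this guide; verify the property name and file locations against your Hadoop version:

```shell
# Open-file limit for the current shell; the common OS default of 1024 is far
# too low for HBase, which recommends raising it (e.g. to 32768) via
# /etc/security/limits.conf for the user running HBase.
ulimit -n

# xcievers is a DataNode-side setting in hdfs-site.xml (note the historical
# misspelling in the property name), e.g.:
#   <property>
#     <name>dfs.datanode.max.xcievers</name>
#     <value>4096</value>
#   </property>
```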
