HBaseContext

Instance Constructors

new HBaseContext(sc: SparkContext, config: Configuration, tmpHdfsConfgFile: String = null)

Type Members

class WriterLength extends AnyRef

This is a wrapper class around StoreFileWriter.

Value Members

final def !=(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def !=(arg0: Any): Boolean

Definition Classes
Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def ==(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def ==(arg0: Any): Boolean

Definition Classes
Any
var appliedCredentials: Boolean
def applyCreds[T](): Unit
final def asInstanceOf[T0]: T0

Definition Classes
Any
val broadcastedConf: Broadcast[SerializableWritable[Configuration]]
def bulkDelete[T](rdd: RDD[T], tableName: TableName, f: (T) ⇒ Delete, batchSize: Integer): Unit

A simple abstraction over the HBaseContext.foreachPartition method.
It lets a user take an RDD, generate Deletes from it, and send them to HBase. The complexity of managing the Connection is removed from the developer.
rdd
Original RDD with data to iterate over
tableName
The name of the table to delete from
f
Function to convert a value in the RDD to an HBase Delete
batchSize
The number of Deletes to batch before sending to HBase
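A minimal sketch of a bulkDelete call, assuming the hbase-spark module is on the classpath and an HBaseContext has already been constructed. The object name, the table name "t1", and the batch size of 4 are hypothetical choices for illustration.

```scala
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Delete
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.spark.rdd.RDD

object BulkDeleteSketch {
  // Converts one RDD element (a row key) into a Delete for the whole row.
  def toDelete(rowKey: Array[Byte]): Delete = new Delete(rowKey)

  // Sends one Delete per row key in the RDD, batched 4 at a time.
  def run(hbaseContext: HBaseContext, rowKeys: RDD[Array[Byte]]): Unit =
    hbaseContext.bulkDelete[Array[Byte]](
      rowKeys,
      TableName.valueOf("t1"), // hypothetical table name
      toDelete,
      4)                       // batchSize, an arbitrary value
}
```

The conversion function is kept separate from the call so it can be checked without a cluster.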
def bulkGet[T, U](tableName: TableName, batchSize: Integer, rdd: RDD[T], makeGet: (T) ⇒ Get, convertResult: (Result) ⇒ U)(implicit arg0: ClassTag[U]): RDD[U]

A simple abstraction over the HBaseContext.mapPartitions method.
It lets a user take an RDD and generate a new RDD based on Gets and the results they bring back from HBase.
tableName
The name of the table to get from
batchSize
The number of Gets to be sent in a single batch
rdd
Original RDD with data to iterate over
makeGet
Function to convert a value in the RDD to an HBase Get
convertResult
Function to convert the HBase Result object to whatever the user wants to put in the resulting RDD
returns
New RDD that is created by the Gets to HBase
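As an illustration, a bulkGet call might look like the sketch below. It assumes hbase-spark is on the classpath and an HBaseContext already exists. The object name, the table name "t1", the batch size, and the choice to return row keys as strings are all hypothetical.

```scala
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Get, Result}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

object BulkGetSketch {
  // makeGet: one RDD element (a row key) becomes one Get.
  def toGet(rowKey: Array[Byte]): Get = new Get(rowKey)

  // convertResult: keep only the row key, or "" when the row was not found.
  def rowKeyOf(result: Result): String =
    if (result.isEmpty) "" else Bytes.toString(result.getRow)

  // Note the argument order: tableName and batchSize come before the RDD.
  def run(hbaseContext: HBaseContext, rowKeys: RDD[Array[Byte]]): RDD[String] =
    hbaseContext.bulkGet[Array[Byte], String](
      TableName.valueOf("t1"), // hypothetical table name
      2,                       // batchSize, an arbitrary value
      rowKeys,
      toGet,
      rowKeyOf)
}
```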
def bulkLoad[T](rdd: RDD[T], tableName: TableName, flatMap: (T) ⇒ Iterator[(KeyFamilyQualifier, Array[Byte])], stagingDir: String, familyHFileWriteOptionsMap: Map[Array[Byte], FamilyHFileWriteOptions] = ..., compactionExclude: Boolean = false, maxSize: Long = HConstants.DEFAULT_MAX_FILE_SIZE): Unit

Spark implementation of HBase bulk load for wide rows, or for when values are not already combined at the time of the map process.
This will take the content from an existing RDD, then sort and shuffle it with respect to region splits. The result of that sort and shuffle will be written to HFiles.
After this function is executed, the user will have to call LoadIncrementalHFiles.doBulkLoad(...) to move the files into HBase.
Also note that this version of bulk load differs from past versions in that it includes the qualifier as part of the sort process. The reason for this is to support rows with a very large number of columns.
T
The Type of values in the original RDD
rdd
The RDD we are bulk loading from
tableName
The HBase table we are loading into
flatMap
A flatMap function that turns every row in the RDD into N cells for the bulk load
stagingDir
The location on the FileSystem to bulk load into
familyHFileWriteOptionsMap
Options that will define how the HFile for a column family is written
compactionExclude
Whether the HFiles are excluded from compaction
maxSize
Max size for the HFiles before they roll
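A minimal sketch of the flatMap function and the bulkLoad call, assuming hbase-spark is on the classpath and an HBaseContext already exists. The object name, the CellRecord type alias, and the table name "t1" are hypothetical. The optional parameters are left at their defaults.

```scala
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.spark.{HBaseContext, KeyFamilyQualifier}
import org.apache.spark.rdd.RDD

object BulkLoadSketch {
  // Hypothetical input shape: (rowKey, (family, qualifier, value)).
  type CellRecord = (Array[Byte], (Array[Byte], Array[Byte], Array[Byte]))

  // Emits one (KeyFamilyQualifier, value) pair per input record.
  // A wider input row could emit N pairs here instead of one.
  def toCells(record: CellRecord): Iterator[(KeyFamilyQualifier, Array[Byte])] = {
    val (rowKey, (family, qualifier, value)) = record
    Iterator((new KeyFamilyQualifier(rowKey, family, qualifier), value))
  }

  // Writes sorted HFiles under stagingDir. The files still have to be
  // moved into HBase afterwards with LoadIncrementalHFiles.doBulkLoad(...).
  def run(hbaseContext: HBaseContext, rdd: RDD[CellRecord], stagingDir: String): Unit =
    hbaseContext.bulkLoad[CellRecord](
      rdd,
      TableName.valueOf("t1"), // hypothetical table name
      toCells,
      stagingDir)
}
```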
def bulkLoadThinRows[T](rdd: RDD[T], tableName: TableName, mapFunction: (T) ⇒ (ByteArrayWrapper, FamiliesQualifiersValues), stagingDir: String, familyHFileWriteOptionsMap: Map[Array[Byte], FamilyHFileWriteOptions] = ..., compactionExclude: Boolean = false, maxSize: Long = HConstants.DEFAULT_MAX_FILE_SIZE): Unit

Spark implementation of HBase bulk load for short rows, somewhere fewer than 1000 columns. This bulk load should be faster for tables with thinner rows than the other Spark implementation of bulk load, which puts only one value into each record going into the shuffle.
This will take the content from an existing RDD, then sort and shuffle it with respect to region splits. The result of that sort and shuffle will be written to HFiles.
After this function is executed, the user will have to call LoadIncrementalHFiles.doBulkLoad(...) to move the files into HBase.
In this implementation, only the row key is given to the shuffle as the key, and all the columns are already linked to the row key before the shuffle stage. The qualifiers are sorted in memory outside of the shuffle stage.
Also make sure that incoming RDDs have only one record for every row key.
T
The Type of values in the original RDD
rdd
The RDD we are bulk loading from
tableName
The HBase table we are loading into
mapFunction
A function that will convert the RDD records to the key value format used for the shuffle to prep for writing to the bulk loaded HFiles
stagingDir
The location on the FileSystem to bulk load into
familyHFileWriteOptionsMap
Options that will define how the HFile for a column family is written
compactionExclude
Whether the HFiles are excluded from compaction
maxSize
Max size for the HFiles before they roll
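A sketch of a mapFunction for bulkLoadThinRows, assuming hbase-spark is on the classpath and an HBaseContext already exists. The object name, the ThinRow type alias, and the table name "t1" are hypothetical. The input is assumed to be grouped already so that each row key appears once.

```scala
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.spark.{ByteArrayWrapper, FamiliesQualifiersValues, HBaseContext}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

object BulkLoadThinRowsSketch {
  // Hypothetical input shape: (rowKey, all (family, qualifier, value) triples of that row).
  type ThinRow = (String, Seq[(Array[Byte], Array[Byte], Array[Byte])])

  // Links every column of the row to its row key before the shuffle.
  def toShuffleRecord(row: ThinRow): (ByteArrayWrapper, FamiliesQualifiersValues) = {
    val columns = new FamiliesQualifiersValues
    row._2.foreach { case (family, qualifier, value) =>
      columns.+=(family, qualifier, value)
    }
    (new ByteArrayWrapper(Bytes.toBytes(row._1)), columns)
  }

  // Writes HFiles under stagingDir; loading them into HBase is a separate step.
  def run(hbaseContext: HBaseContext, rdd: RDD[ThinRow], stagingDir: String): Unit =
    hbaseContext.bulkLoadThinRows[ThinRow](
      rdd,
      TableName.valueOf("t1"), // hypothetical table name
      toShuffleRecord,
      stagingDir)
}
```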
def bulkPut[T](rdd: RDD[T], tableName: TableName, f: (T) ⇒ Put): Unit

A simple abstraction over the HBaseContext.foreachPartition method.
It lets a user take an RDD, generate Puts from it, and send them to HBase. The complexity of managing the Connection is removed from the developer.
rdd
Original RDD with data to iterate over
tableName
The name of the table to put into
f
Function to convert a value in the RDD to an HBase Put
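A minimal sketch of a bulkPut call, assuming hbase-spark is on the classpath and an HBaseContext already exists. The object name, the table name "t1", the column family "cf", and the qualifier "q" are hypothetical.

```scala
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

object BulkPutSketch {
  // Converts a (rowKey, value) pair into a Put on a hypothetical cf:q column.
  def toPut(record: (String, String)): Put = {
    val put = new Put(Bytes.toBytes(record._1))
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(record._2))
    put
  }

  // Sends one Put per RDD element; the Connection is managed by HBaseContext.
  def run(hbaseContext: HBaseContext, rdd: RDD[(String, String)]): Unit =
    hbaseContext.bulkPut[(String, String)](
      rdd,
      TableName.valueOf("t1"), // hypothetical table name
      toPut)
}
```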
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
val config: Configuration
var credentials: Credentials
val credentialsConf: Broadcast[SerializableWritable[Credentials]]
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
def foreachPartition[T](dstream: DStream[T], f: (Iterator[T], Connection) ⇒ Unit): Unit

A simple enrichment of the traditional Spark Streaming DStream foreach. This function differs from the original in that it offers the developer access to an already connected Connection object.
Note: Do not close the Connection object. All Connection management is handled outside this method.
dstream
Original DStream with data to iterate over
f
Function to be given an iterator to iterate through the DStream values and a Connection object to interact with HBase
def foreachPartition[T](rdd: RDD[T], f: (Iterator[T], Connection) ⇒ Unit): Unit

A simple enrichment of the traditional Spark RDD foreachPartition. This function differs from the original in that it offers the developer access to an already connected Connection object.
Note: Do not close the Connection object. All Connection management is handled outside this method.
rdd
Original RDD with data to iterate over
f
Function to be given an iterator to iterate through the RDD values and a Connection object to interact with HBase
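A sketch of a partition function for foreachPartition, assuming hbase-spark is on the classpath and an HBaseContext already exists. The object name, the table name "t1", and the cf:q column are hypothetical. The function closes the BufferedMutator it opened, but leaves the Connection alone, as the note above requires.

```scala
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Connection, Put}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

object ForeachPartitionSketch {
  // Builds a Put for a (rowKey, value) pair on a hypothetical cf:q column.
  def toPut(record: (String, String)): Put = {
    val put = new Put(Bytes.toBytes(record._1))
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(record._2))
    put
  }

  // Runs once per partition with a Connection supplied by HBaseContext.
  def writePartition(rows: Iterator[(String, String)], connection: Connection): Unit = {
    val mutator = connection.getBufferedMutator(TableName.valueOf("t1"))
    try {
      rows.foreach(record => mutator.mutate(toPut(record)))
      mutator.flush()
    } finally {
      mutator.close() // closes only the mutator, never the shared Connection
    }
  }

  def run(hbaseContext: HBaseContext, rdd: RDD[(String, String)]): Unit =
    hbaseContext.foreachPartition[(String, String)](rdd, writePartition _)
}
```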
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
def hashCode(): Int

Definition Classes
AnyRef → Any
def hbaseRDD(tableName: TableName, scans: Scan): RDD[(ImmutableBytesWritable, Result)]

An overloaded version of HBaseContext hbaseRDD that defines the type of the resulting RDD
tableName
the name of the table to scan
scans
the HBase scan object to use to read data from HBase
returns
New RDD with results from scan
def hbaseRDD[U](tableName: TableName, scan: Scan, f: ((ImmutableBytesWritable, Result)) ⇒ U)(implicit arg0: ClassTag[U]): RDD[U]

This function will use the native HBase TableInputFormat with the given scan object to generate a new RDD
tableName
the name of the table to scan
scan
the HBase scan object to use to read data from HBase
f
function to convert a Result object from HBase into what the user wants in the final generated RDD
returns
new RDD with results from scan
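A minimal sketch of the three-argument hbaseRDD overload, assuming hbase-spark is on the classpath and an HBaseContext already exists. The object name, the table name "t1", and the caching value of 100 are hypothetical.

```scala
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

object HBaseRddSketch {
  // A full-table scan that fetches 100 rows per RPC (an arbitrary value).
  def buildScan(): Scan = {
    val scan = new Scan()
    scan.setCaching(100)
    scan
  }

  // f: reduces each (key, Result) pair to the row key as a String.
  def rowKeyOf(pair: (ImmutableBytesWritable, Result)): String =
    Bytes.toString(pair._2.getRow)

  def run(hbaseContext: HBaseContext): RDD[String] =
    hbaseContext.hbaseRDD[String](
      TableName.valueOf("t1"), // hypothetical table name
      buildScan(),
      rowKeyOf _)
}
```

Dropping the third argument selects the two-argument overload, which returns the raw RDD[(ImmutableBytesWritable, Result)].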
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
def isTraceEnabled(): Boolean

Attributes
protected
Definition Classes
Logging
val job: Job
def log: Logger

Attributes
protected
Definition Classes
Logging
def logDebug(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logDebug(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
def logError(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logError(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
def logInfo(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logInfo(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
def logName: String

Attributes
protected
Definition Classes
Logging
def logTrace(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logTrace(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
def logWarning(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logWarning(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
def mapPartitions[T, R](rdd: RDD[T], mp: (Iterator[T], Connection) ⇒ Iterator[R])(implicit arg0: ClassTag[R]): RDD[R]

A simple enrichment of the traditional Spark RDD mapPartitions. This function differs from the original in that it offers the developer access to an already connected Connection object.
Note: Do not close the Connection object. All Connection management is handled outside this method.
rdd
Original RDD with data to iterate over
mp
Function to be given an iterator to iterate through the RDD values and a Connection object to interact with HBase
returns
A new RDD generated by the user-defined function, just like a normal mapPartitions
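A sketch of a partition function for mapPartitions, assuming hbase-spark is on the classpath and an HBaseContext already exists. The object name and the table name "t1" are hypothetical. The results are materialized before the Table is closed, which assumes each partition is small enough to hold in memory.

```scala
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Connection, Get}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.spark.rdd.RDD

object MapPartitionsSketch {
  // One Get per row key.
  def toGet(rowKey: Array[Byte]): Get = new Get(rowKey)

  // Checks, for every row key in the partition, whether the row exists.
  def rowsExist(rowKeys: Iterator[Array[Byte]], connection: Connection): Iterator[Boolean] = {
    val table = connection.getTable(TableName.valueOf("t1"))
    try {
      // Force evaluation now; a lazy iterator would outlive the Table.
      rowKeys.map(key => table.exists(toGet(key))).toList.iterator
    } finally {
      table.close() // closes only the Table, never the shared Connection
    }
  }

  def run(hbaseContext: HBaseContext, rowKeys: RDD[Array[Byte]]): RDD[Boolean] =
    hbaseContext.mapPartitions[Array[Byte], Boolean](rowKeys, rowsExist _)
}
```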
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
def streamBulkDelete[T](dstream: DStream[T], tableName: TableName, f: (T) ⇒ Delete, batchSize: Integer): Unit

A simple abstraction over the HBaseContext.streamBulkMutation method.
It lets a user take a DStream, generate Deletes from it, and send them to HBase.
The complexity of managing the Connection is removed from the developer.
dstream
Original DStream with data to iterate over
tableName
The name of the table to delete from
f
Function to convert a value in the DStream to an HBase Delete
batchSize
The number of deletes to batch before sending to HBase
def streamBulkGet[T, U](tableName: TableName, batchSize: Integer, dStream: DStream[T], makeGet: (T) ⇒ Get, convertResult: (Result) ⇒ U)(implicit arg0: ClassTag[U]): DStream[U]

A simple abstraction over the HBaseContext.streamMap method.
It lets a user take a DStream and generate a new DStream based on Gets and the results they bring back from HBase.
tableName
The name of the table to get from
batchSize
The number of Gets to be sent in a single batch
dStream
Original DStream with data to iterate over
makeGet
Function to convert a value in the DStream to an HBase Get
convertResult
Function to convert the HBase Result object to whatever the user wants to put in the resulting DStream
returns
A new DStream that is created by the Get to HBase
def streamBulkPut[T](dstream: DStream[T], tableName: TableName, f: (T) ⇒ Put): Unit

A simple abstraction over the HBaseContext.streamMapPartition method.
It lets a user take a DStream, generate Puts from it, and send them to HBase.
The complexity of managing the Connection is removed from the developer.
dstream
Original DStream with data to iterate over
tableName
The name of the table to put into
f
Function to convert a value in the DStream to an HBase Put
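A minimal sketch of a streamBulkPut call, assuming hbase-spark and Spark Streaming are on the classpath and an HBaseContext already exists. The object name, the table name "t1", the cf:q column, and the "rowKey,value" line format are hypothetical.

```scala
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.dstream.DStream

object StreamBulkPutSketch {
  // Parses a hypothetical "rowKey,value" line into a Put on cf:q.
  // Assumes every line contains a comma; a line without one fails the match.
  def toPut(line: String): Put = {
    val Array(rowKey, value) = line.split(",", 2)
    val put = new Put(Bytes.toBytes(rowKey))
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(value))
    put
  }

  // Registers the write on the stream; Puts are sent as each batch arrives.
  def run(hbaseContext: HBaseContext, lines: DStream[String]): Unit =
    hbaseContext.streamBulkPut[String](
      lines,
      TableName.valueOf("t1"), // hypothetical table name
      toPut)
}
```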
def streamForeachPartition[T](dstream: DStream[T], f: (Iterator[T], Connection) ⇒ Unit): Unit

A simple enrichment of the traditional Spark Streaming DStream foreachPartition.
This function differs from the original in that it offers the developer access to an already connected Connection object.
Note: Do not close the Connection object. All Connection management is handled outside this method.
Note: Make sure to partition correctly to avoid memory issues when getting data from HBase.
dstream
Original DStream with data to iterate over
f
Function to be given an iterator to iterate through the DStream values and a Connection object to interact with HBase
def streamMapPartitions[T, U](dstream: DStream[T], f: (Iterator[T], Connection) ⇒ Iterator[U])(implicit arg0: ClassTag[U]): DStream[U]

A simple enrichment of the traditional Spark Streaming DStream mapPartitions.
This function differs from the original in that it offers the developer access to an already connected Connection object.
Note: Do not close the Connection object. All Connection management is handled outside this method.
Note: Make sure to partition correctly to avoid memory issues when getting data from HBase.
dstream
Original DStream with data to iterate over
f
Function to be given an iterator to iterate through the DStream values and a Connection object to interact with HBase
returns
A new DStream generated by the user-defined function, just like a normal mapPartitions
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
val tmpHdfsConfgFile: String
var tmpHdfsConfiguration: Configuration
def toString(): String

Definition Classes
AnyRef → Any
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )

class HBaseContext extends Serializable with Logging

