This is a wrapper class around StoreFileWriter.
A simple abstraction over the HBaseContext.foreachPartition method.
It allows additional support for a user to take an RDD, generate Deletes, and send them to HBase. The complexity of managing the Connection is removed from the developer
Original RDD with data to iterate over
The name of the table to delete from
Function to convert a value in the RDD to an HBase Delete
The number of Deletes to batch before sending to HBase
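A usage sketch of this bulk delete (assuming the hbase-spark module on the classpath; the table name "myTable" and helper `deleteRows` are illustrative, not part of the API):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Delete
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.spark.SparkContext

def deleteRows(sc: SparkContext, rowKeys: Seq[Array[Byte]]): Unit = {
  val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())
  hbaseContext.bulkDelete[Array[Byte]](
    sc.parallelize(rowKeys),           // original RDD to iterate over
    TableName.valueOf("myTable"),      // table to delete from
    rowKey => new Delete(rowKey),      // convert each record to a Delete
    4)                                 // number of Deletes to batch per call
}
```

No Connection handling appears in the user code; the context opens and closes it per partition.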
A simple abstraction over the HBaseContext.mapPartition method.
It allows additional support for a user to take an RDD and generate a new RDD based on Gets and the results they bring back from HBase
The name of the table to get from
Original RDD with data to iterate over
Function to convert a value in the RDD to an HBase Get
Function to convert the HBase Result object to whatever the user wants to put in the resulting RDD
A new RDD that is created by the Gets to HBase
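A usage sketch of this bulk get (hbase-spark module assumed; "myTable" and `fetchRows` are illustrative):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Get, Result}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def fetchRows(sc: SparkContext, rowKeys: Seq[Array[Byte]]): RDD[String] = {
  val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())
  hbaseContext.bulkGet[Array[Byte], String](
    TableName.valueOf("myTable"),        // table to get from
    2,                                   // Gets per batch
    sc.parallelize(rowKeys),             // original RDD to iterate over
    rowKey => new Get(rowKey),           // record -> Get
    (result: Result) =>                  // Result -> element of the new RDD
      Bytes.toString(result.getRow))
}
```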
Spark Implementation of HBase Bulk load for wide rows or when values are not already combined at the time of the map process
This will take the content from an existing RDD then sort and shuffle it with respect to region splits. The result of that sort and shuffle will be written to HFiles.
After this function is executed the user will have to call LoadIncrementalHFiles.doBulkLoad(...) to move the files into HBase
Also note this version of bulk load is different from past versions in that it includes the qualifier as part of the sort process. The reason for this is to be able to support rows with a very large number of columns.
The Type of values in the original RDD
The RDD we are bulk loading from
The HBase table we are loading into
A flatMap function that will make every row in the RDD into N cells for the bulk load
The location on the FileSystem to bulk load into
Options that will define how the HFile for a column family is written
Compaction excluded for the HFiles
Max size for the HFiles before they roll
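A minimal sketch of the wide-row bulk load under these parameters (hbase-spark module assumed; "myTable", the column family "cf", the staging path, and `loadWideRows` are illustrative):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.spark.{HBaseContext, KeyFamilyQualifier}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext

// Each (rowKey, qualifier, value) triple becomes one cell in the shuffle,
// so the qualifier participates in the sort as described above.
def loadWideRows(sc: SparkContext,
                 cells: Seq[(Array[Byte], Array[Byte], Array[Byte])]): Unit = {
  val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())
  val family = Bytes.toBytes("cf")
  hbaseContext.bulkLoad[(Array[Byte], Array[Byte], Array[Byte])](
    sc.parallelize(cells),
    TableName.valueOf("myTable"),
    t => Iterator((new KeyFamilyQualifier(t._1, family, t._2), t._3)),
    "/tmp/hfiles")   // staging dir; afterwards run LoadIncrementalHFiles.doBulkLoad(...)
}
```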
Spark implementation of HBase bulk load for short rows, somewhere under 1000 columns. This bulk load should be faster for tables with thinner rows than the other Spark implementation of bulk load, which puts only one value into a record going into the shuffle
This will take the content from an existing RDD then sort and shuffle it with respect to region splits. The result of that sort and shuffle will be written to HFiles.
After this function is executed the user will have to call LoadIncrementalHFiles.doBulkLoad(...) to move the files into HBase
In this implementation, only the rowKey is given to the shuffle as the key, and all the columns are already linked to the rowKey before the shuffle stage. The sorting of the qualifiers is done in memory, outside of the shuffle stage
Also make sure that incoming RDDs only have one record for every row key.
The Type of values in the original RDD
The RDD we are bulk loading from
The HBase table we are loading into
A function that will convert the RDD records to the key value format used for the shuffle to prep for writing to the bulk loaded HFiles
The location on the FileSystem to bulk load into
Options that will define how the HFile for a column family is written
Compaction excluded for the HFiles
Max size for the HFiles before they roll
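A minimal sketch of the thin-row variant (hbase-spark module assumed; "myTable", "cf", the staging path, and `loadThinRows` are illustrative). Note each input record carries all columns for one row key:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.spark.{ByteArrayWrapper, FamiliesQualifiersValues, HBaseContext}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext

// One record per row key: (rowKey, Seq[(qualifier, value)])
def loadThinRows(sc: SparkContext,
                 rows: Seq[(Array[Byte], Seq[(Array[Byte], Array[Byte])])]): Unit = {
  val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())
  val family = Bytes.toBytes("cf")
  hbaseContext.bulkLoadThinRows[(Array[Byte], Seq[(Array[Byte], Array[Byte])])](
    sc.parallelize(rows),
    TableName.valueOf("myTable"),
    t => {
      // Link every column to the rowKey before the shuffle stage
      val familyQualifiersValues = new FamiliesQualifiersValues
      t._2.foreach { case (qualifier, value) =>
        familyQualifiersValues += (family, qualifier, value)
      }
      (new ByteArrayWrapper(t._1), familyQualifiersValues)
    },
    "/tmp/hfiles")   // staging dir; afterwards run LoadIncrementalHFiles.doBulkLoad(...)
}
```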
A simple abstraction over the HBaseContext.foreachPartition method.
It allows additional support for a user to take an RDD, generate Puts, and send them to HBase. The complexity of managing the Connection is removed from the developer
Original RDD with data to iterate over
The name of the table to put into
Function to convert a value in the RDD to an HBase Put
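A usage sketch of this bulk put (hbase-spark module assumed; "myTable", "cf", "q", and `putRows` are illustrative):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext

def putRows(sc: SparkContext, rows: Seq[(String, String)]): Unit = {
  val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())
  hbaseContext.bulkPut[(String, String)](
    sc.parallelize(rows),             // original RDD to iterate over
    TableName.valueOf("myTable"),     // table to put into
    t => new Put(Bytes.toBytes(t._1)) // record -> Put
      .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(t._2)))
}
```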
A simple enrichment of the traditional Spark Streaming DStream foreach. This function differs from the original in that it offers the developer access to an already connected Connection object
Note: Do not close the Connection object. All Connection management is handled outside this method
Original DStream with data to iterate over
Function to be given an iterator to iterate through the DStream values and a Connection object to interact with HBase
A simple enrichment of the traditional Spark RDD foreachPartition. This function differs from the original in that it offers the developer access to an already connected Connection object
Note: Do not close the Connection object. All Connection management is handled outside this method
Original RDD with data to iterate over
Function to be given an iterator to iterate through the RDD values and a Connection object to interact with HBase
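A usage sketch of the enriched foreachPartition (hbase-spark module assumed; "myTable" and `touchRows` are illustrative):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, Get}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.spark.SparkContext

def touchRows(sc: SparkContext, rowKeys: Seq[Array[Byte]]): Unit = {
  val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())
  hbaseContext.foreachPartition(sc.parallelize(rowKeys),
    (it: Iterator[Array[Byte]], connection: Connection) => {
      // The Connection arrives already open; do NOT close it here
      val table = connection.getTable(TableName.valueOf("myTable"))
      it.foreach(rowKey => table.get(new Get(rowKey)))
      table.close()   // close the Table, but leave the Connection alone
    })
}
```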
An overloaded version of HBaseContext hbaseRDD that defines the type of the resulting RDD
the name of the table to scan
the HBase scan object to use to read data from HBase
New RDD with results from scan
This function will use the native HBase TableInputFormat with the given scan object to generate a new RDD
the name of the table to scan
the HBase scan object to use to read data from HBase
Function to convert a Result object from HBase into what the user wants in the final generated RDD
New RDD with results from scan
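A usage sketch of the typed hbaseRDD overload (hbase-spark module assumed; "myTable" and `scanRowKeys` are illustrative):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def scanRowKeys(sc: SparkContext): RDD[String] = {
  val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())
  val scan = new Scan().setCaching(100)   // Scan object drives TableInputFormat
  // The conversion function fixes the element type of the resulting RDD
  hbaseContext.hbaseRDD(TableName.valueOf("myTable"), scan,
    (r: (ImmutableBytesWritable, Result)) => Bytes.toString(r._2.getRow))
}
```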
A simple enrichment of the traditional Spark RDD mapPartition. This function differs from the original in that it offers the developer access to an already connected Connection object
Note: Do not close the Connection object. All Connection management is handled outside this method
Original RDD with data to iterate over
Function to be given an iterator to iterate through the RDD values and a Connection object to interact with HBase
Returns a new RDD generated by the user-defined function, just like normal mapPartition
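A usage sketch of the enriched mapPartitions (hbase-spark module assumed; "myTable" and `cellCounts` are illustrative):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, Get}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def cellCounts(sc: SparkContext, rowKeys: Seq[Array[Byte]]): RDD[Int] = {
  val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())
  hbaseContext.mapPartitions[Array[Byte], Int](sc.parallelize(rowKeys),
    (it: Iterator[Array[Byte]], connection: Connection) => {
      // One Table per partition; the Connection itself is managed for us
      val table = connection.getTable(TableName.valueOf("myTable"))
      it.map(rowKey => table.get(new Get(rowKey)).size())
    })
}
```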
A simple abstraction over the HBaseContext.streamBulkMutation method.
It allows additional support for a user to take a DStream, generate Deletes, and send them to HBase.
The complexity of managing the Connection is removed from the developer
Original DStream with data to iterate over
The name of the table to delete from
Function to convert a value in the DStream to an HBase Delete
The number of deletes to batch before sending to HBase
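A usage sketch of the streaming bulk delete (hbase-spark module assumed; "myTable" and `deleteStream` are illustrative):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Delete
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.spark.SparkContext
import org.apache.spark.streaming.dstream.DStream

def deleteStream(sc: SparkContext, keys: DStream[Array[Byte]]): Unit = {
  val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())
  hbaseContext.streamBulkDelete[Array[Byte]](
    keys,                            // original DStream to iterate over
    TableName.valueOf("myTable"),    // table to delete from
    rowKey => new Delete(rowKey),    // DStream record -> Delete
    4)                               // Deletes per batch
}
```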
A simple abstraction over the HBaseContext.streamMap method.
It allows additional support for a user to take a DStream and generate a new DStream based on Gets and the results they bring back from HBase
The name of the table to get from
The number of Gets to be sent in a single batch
Original DStream with data to iterate over
Function to convert a value in the DStream to an HBase Get
Function to convert the HBase Result object to whatever the user wants to put in the resulting DStream
A new DStream that is created by the Get to HBase
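A usage sketch of the streaming bulk get (hbase-spark module assumed; "myTable" and `getStream` are illustrative):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Get, Result}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext
import org.apache.spark.streaming.dstream.DStream

def getStream(sc: SparkContext, keys: DStream[Array[Byte]]): DStream[String] = {
  val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())
  hbaseContext.streamBulkGet[Array[Byte], String](
    TableName.valueOf("myTable"),      // table to get from
    2,                                 // Gets per batch
    keys,                              // original DStream to iterate over
    rowKey => new Get(rowKey),         // record -> Get
    (result: Result) =>                // Result -> element of the new DStream
      Bytes.toString(result.getRow))
}
```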
A simple abstraction over the HBaseContext.streamMapPartition method.
It allows additional support for a user to take a DStream, generate Puts, and send them to HBase.
The complexity of managing the Connection is removed from the developer
Original DStream with data to iterate over
The name of the table to put into
Function to convert a value in the DStream to an HBase Put
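A usage sketch of the streaming bulk put (hbase-spark module assumed; "myTable", "cf", "q", and `putStream` are illustrative):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext
import org.apache.spark.streaming.dstream.DStream

def putStream(sc: SparkContext, rows: DStream[(String, String)]): Unit = {
  val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())
  hbaseContext.streamBulkPut[(String, String)](
    rows,                             // original DStream to iterate over
    TableName.valueOf("myTable"),     // table to put into
    t => new Put(Bytes.toBytes(t._1)) // record -> Put
      .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(t._2)))
}
```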
A simple enrichment of the traditional Spark Streaming DStream foreachPartition.
This function differs from the original in that it offers the developer access to an already connected Connection object
Note: Do not close the Connection object. All Connection management is handled outside this method
Note: Make sure to partition correctly to avoid memory issue when getting data from HBase
Original DStream with data to iterate over
Function to be given an iterator to iterate through the DStream values and a Connection object to interact with HBase
A simple enrichment of the traditional Spark Streaming DStream mapPartition.
This function differs from the original in that it offers the developer access to an already connected Connection object
Note: Do not close the Connection object. All Connection management is handled outside this method
Note: Make sure to partition correctly to avoid memory issue when getting data from HBase
Original DStream with data to iterate over
Function to be given an iterator to iterate through the DStream values and a Connection object to interact with HBase
Returns a new DStream generated by the user-defined function, just like normal mapPartition
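A usage sketch of the enriched streaming mapPartitions (hbase-spark module assumed; "myTable" and `streamCellCounts` are illustrative):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, Get}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.spark.SparkContext
import org.apache.spark.streaming.dstream.DStream

def streamCellCounts(sc: SparkContext,
                     keys: DStream[Array[Byte]]): DStream[Int] = {
  val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())
  hbaseContext.streamMapPartitions[Array[Byte], Int](keys,
    (it: Iterator[Array[Byte]], connection: Connection) => {
      // The Connection arrives already open; do NOT close it here
      val table = connection.getTable(TableName.valueOf("myTable"))
      it.map(rowKey => table.get(new Get(rowKey)).size())
    })
}
```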
HBaseContext is a façade for HBase operations like bulk put, get, increment, delete, and scan
HBaseContext takes on the responsibilities of disseminating the configuration information to the workers and managing the life cycle of Connections.
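Constructing the façade is typically a one-liner (hbase-spark module assumed; the app name is illustrative):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("HBaseExample"))
val config = HBaseConfiguration.create()   // picks up hbase-site.xml from the classpath
// One HBaseContext per application: it ships the configuration to the workers
// and manages Connection life cycles on the executors
val hbaseContext = new HBaseContext(sc, config)
```

All of the bulk put/get/delete, scan, and streaming operations above are then invoked on this single instance.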