MultiTableSnapshotInputFormat generalizes TableSnapshotInputFormat, allowing a MapReduce
job to run over one or more table snapshots, with one or more scans configured for each.
Internally, the input format delegates to TableSnapshotInputFormat and thus has the same
performance advantages; see TableSnapshotInputFormat for more details. Usage is similar
to TableSnapshotInputFormat, with the following exception: initMultiTableSnapshotMapperJob
takes a map from snapshot name to a collection of scans. For each snapshot in the map, each
corresponding scan is applied; the overall dataset for the job is the concatenation of the
regions and tables included in each snapshot/scan pair.
TableMapReduceUtil.initMultiTableSnapshotMapperJob(Map, Class, Class, Class, org.apache.hadoop.mapreduce.Job, boolean, Path)
can be used to configure the job.
Job job = Job.getInstance(conf);
// Map each snapshot name to the scans to run against it.
Map<String, Collection<Scan>> snapshotScans = ImmutableMap.of(
    "snapshot1", ImmutableList.of(new Scan(Bytes.toBytes("a"), Bytes.toBytes("b"))),
    "snapshot2", ImmutableList.of(new Scan(Bytes.toBytes("1"), Bytes.toBytes("2")))
);
// Each snapshot is restored into a subdirectory of this directory.
Path restoreDir = new Path("/tmp/snapshot_restore_dir");
TableMapReduceUtil.initMultiTableSnapshotMapperJob(
    snapshotScans, MyTableMapper.class, MyMapKeyOutput.class,
    MyMapOutputValueWritable.class, job, true, restoreDir);
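To make the snapshot/scan pairing concrete, the following is a minimal conceptual sketch in plain Java of how a map from snapshot name to scans expands into the list of pairs that defines the job's dataset. It does not use HBase classes: ScanRange, Pair, and expand are hypothetical stand-ins introduced only for illustration, and the sketch makes no claim about how the input format is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual model only: plain-Java stand-ins, not HBase classes.
final class SnapshotScanPairs {

    // Hypothetical stand-in for a Scan: a [startRow, stopRow) row-key range.
    static final class ScanRange {
        final String startRow;
        final String stopRow;

        ScanRange(String startRow, String stopRow) {
            this.startRow = startRow;
            this.stopRow = stopRow;
        }
    }

    // Hypothetical pairing of one snapshot name with one of its scans.
    static final class Pair {
        final String snapshot;
        final ScanRange scan;

        Pair(String snapshot, ScanRange scan) {
            this.snapshot = snapshot;
            this.scan = scan;
        }
    }

    // Flattens snapshot -> scans into one pair per (snapshot, scan),
    // preserving the iteration order of the map and of each scan list.
    static List<Pair> expand(Map<String, List<ScanRange>> snapshotScans) {
        List<Pair> pairs = new ArrayList<>();
        for (Map.Entry<String, List<ScanRange>> entry : snapshotScans.entrySet()) {
            for (ScanRange scan : entry.getValue()) {
                pairs.add(new Pair(entry.getKey(), scan));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, List<ScanRange>> snapshotScans = new LinkedHashMap<>();
        snapshotScans.put("snapshot1",
            Arrays.asList(new ScanRange("a", "b"), new ScanRange("m", "n")));
        snapshotScans.put("snapshot2",
            Arrays.asList(new ScanRange("1", "2")));

        // Two scans on snapshot1 plus one on snapshot2 yield three pairs.
        for (Pair p : expand(snapshotScans)) {
            System.out.println(p.snapshot + " [" + p.scan.startRow + ", " + p.scan.stopRow + ")");
        }
    }
}
```

In this model a snapshot with several scans contributes one pair per scan, so the same snapshot can appear in the dataset more than once with different row ranges.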
Internally, this input format restores each snapshot into a subdirectory of the given
temporary restore directory. Input splits and record readers are created as described in
TableSnapshotInputFormat (one per region). See TableSnapshotInputFormat for more notes on
permissions; the same caveats apply here.
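The idea of restoring each snapshot into its own subdirectory can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the input format's actual code: the RestoreDirPlanner class, the plan method, and the "snapshot_" plus random UUID naming scheme are all hypothetical, chosen only to show that each snapshot gets a distinct location under the base restore directory.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Conceptual model only: plain-Java strings stand in for Hadoop Path objects.
final class RestoreDirPlanner {

    // Assigns every snapshot its own subdirectory of baseRestoreDir.
    // The "snapshot_<uuid>" naming is an assumption made for this sketch.
    static Map<String, String> plan(String baseRestoreDir, Collection<String> snapshotNames) {
        Map<String, String> restoreDirs = new LinkedHashMap<>();
        for (String name : snapshotNames) {
            restoreDirs.put(name, baseRestoreDir + "/snapshot_" + UUID.randomUUID());
        }
        return restoreDirs;
    }

    public static void main(String[] args) {
        Map<String, String> dirs =
            plan("/tmp/snapshot_restore_dir", Arrays.asList("snapshot1", "snapshot2"));
        // Prints one distinct subdirectory per snapshot; the UUIDs differ per run.
        for (Map.Entry<String, String> entry : dirs.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

Giving each snapshot a separate subdirectory keeps the restored files of different snapshots from colliding, even when several snapshots of the same table are read in one job.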