Added splitfile support to the Create table command #1

ravipesala · 2015-01-28T12:38:32Z

Users can pre-split the table at creation time by passing a splits file to the CREATE TABLE command, for example:
CREATE TABLE testrav4(bytecol BYTE, shortcol SHORT, intcol INTEGER, longcol LONG, floatcol FLOAT, PRIMARY KEY(intcol,shortcol)) MAPPED BY (testhbaseravi4, COLS=[bytecol=cf1.hbytecol, longcol=cf2.hlongcol, floatcol=cf2.hfloatcol]) SPLITSFILE = 'D:/1.txt'
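For readers unfamiliar with the idea, here is a minimal sketch of how a splits file could be turned into per-split key column values. The file format is not stated in this thread, so the layout used here (one split point per line, key column values separated by commas, in PRIMARY KEY order) is an assumption, and `SplitsFileSketch`, `parseLine` and `readSplitValues` are hypothetical names, not the PR's code.

```scala
import scala.io.Source

// Hypothetical helper, not part of the PR.
object SplitsFileSketch {
  // Assumed format: comma-separated key column values, in PRIMARY KEY order.
  def parseLine(line: String): List[String] =
    line.split(",", -1).map(_.trim).toList

  // Reads the file, skips blank lines, and returns one value list per split point.
  def readSplitValues(path: String): List[List[String]] = {
    val source = Source.fromFile(path, "UTF-8")
    try {
      // toList forces the lazy iterator before the file is closed.
      source.getLines().map(_.trim).filter(_.nonEmpty).map(parseLine).toList
    } finally {
      source.close()
    }
  }
}
```

Each returned value list would still have to be encoded into row-key bytes according to the key column types, which is the role `string2Key` appears to play in the diff.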

scwf · 2015-01-29T11:59:41Z

src/main/scala/org/apache/spark/sql/hbase/util/HBaseKVHelper.scala

+  def string2Key(values: Seq[String],
+                lineBuffer: Array[BytesUtils],
+                keyColumns: Seq[AbstractColumn],
+                keyBytes: Array[(Array[Byte], DataType)]) = {


This method is only used in createSplitKeys, right? If so, I don't think we need to put it in HBaseKVHelper.

scwf · 2015-01-29T12:02:15Z

Just a minor comment. @yzhou2001, can you take a look at this?

yzhou2001 · 2015-01-30T01:40:47Z

Basically my questions are:

1. What are the usage scenarios? Does HBase itself have a similar mechanism?
2. What is the typical size of a split file? If it fits in memory, can we skip reading it from an HDFS file and invoke client-only operations without going through the RDD and the Spark runtime?
3. We need test coverage for this new functionality.

Thanks.

scwf · 2015-01-30T01:57:14Z

@yzhou2001
1. This PR lets our CREATE command create a pre-split HBase table. Before this we could not do that; the CREATE command could only create a non-split table, right?

2. I think the split file should be very small and fit in memory, so maybe we can pass the splits inline in the command, like:
CREATE TABLE testrav(strcol STRING, bytecol String, shortcol String, PRIMARY KEY(strcol)) MAPPED BY (testravhbase, COLS=[bytecol=cf1.hbytecol, shortcol=cf1.hshortcol]) SPLITS=['a','b','c']

3. Yes, we need to test it.
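As a rough illustration of the inline `SPLITS=[...]` idea, the sketch below turns an in-memory list of split strings into sorted byte-array keys. It assumes string keys encoded as UTF-8; `InlineSplitsSketch` and its methods are hypothetical names. The comparison is on unsigned bytes because that is how HBase orders row keys.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper, not part of the PR.
object InlineSplitsSketch {
  // Lexicographic comparison on unsigned bytes, matching HBase row-key order.
  def compareUnsigned(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val diff = (a(i) & 0xff) - (b(i) & 0xff)
      if (diff != 0) return diff
      i += 1
    }
    a.length - b.length
  }

  // Rejects empty and duplicate keys, then returns the keys in row-key order.
  def toSplitKeys(splits: Seq[String]): Array[Array[Byte]] = {
    require(splits.forall(_.nonEmpty), "split keys must be non-empty")
    require(splits.distinct.size == splits.size, "split keys must be unique")
    splits
      .map(_.getBytes(StandardCharsets.UTF_8))
      .sortWith((a, b) => compareUnsigned(a, b) < 0)
      .toArray
  }
}
```

With keys this small, the whole list can be validated on the client before any table is created, which is the point of the in-memory suggestion above.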

PS:
Actually, my initial idea was to make the number of reducers configurable for bulk load into a non-split table; it accidentally turned into this :)

Also, a question: do you think it is necessary right now to control the number of reducers for bulk load into a non-split table?

yzhou2001 · 2015-01-30T03:00:30Z

For table creation, I think the focal point is how much we should build on top of the semantics of "creating an RDB table on a nonexistent HBase table". The problem is that the more functionality we build into these semantics, the harder it becomes to reconcile them with a possibly existing HBase table. In summary, this is good to have, but it has to be designed carefully so that the semantics are clear.

On the reducers: yes, a configurable number of reducers would be great. But, again, there is some complexity to it, mainly because we would probably need a "splitter" class like the one in HBase. It's feasible, but probably not a priority as of now. Right now we're anxious to get the basic functionality working and obtain some favorable performance data, in order to put some weight behind the push for our technology. All the value-adding features and optimizations can be put off until a future release.
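To make the "splitter" idea concrete, here is a minimal sketch of one possible strategy: evenly spaced split points over a fixed-width hexadecimal key space, one partition per reducer. This is only an illustration under the assumption that row keys are fixed-width lowercase hex strings; `UniformHexSplitter` is a hypothetical name and is neither HBase's splitter nor anything in this PR.

```scala
// Hypothetical helper, not part of the PR or of HBase.
object UniformHexSplitter {
  // Returns numPartitions - 1 split points evenly spaced over [0, 16^width),
  // each rendered as a zero-padded lowercase hex string of length `width`.
  def splitPoints(numPartitions: Int, width: Int = 8): Seq[String] = {
    require(numPartitions >= 1, "need at least one partition")
    require(width >= 1, "width must be positive")
    val range = BigInt(16).pow(width)
    (1 until numPartitions).map { i =>
      val hex = (range * i / numPartitions).toString(16)
      "0" * (width - hex.length) + hex
    }
  }

  // Index of the partition that owns `key`; a split point is the inclusive
  // start of the partition to its right. String comparison is valid here only
  // because keys are assumed to be fixed-width lowercase hex.
  def partitionOf(key: String, points: Seq[String]): Int =
    points.count(_ <= key)
}
```

For example, `UniformHexSplitter.splitPoints(4)` yields `40000000`, `80000000` and `c0000000`, so four reducers would each receive one quarter of the key space. Real data is rarely uniform, which is part of the complexity mentioned above.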

Added splitfile support to the Create table command

9f7432e

ravipesala assigned scwf Jan 28, 2015
