Added splitfile support to the Create table command #1

Open
wants to merge 1 commit into
base: master

Conversation

ravipesala

Users can pre-split a table at creation time by pointing the CREATE TABLE command at a file of split keys, for example: CREATE TABLE testrav4(bytecol BYTE, shortcol SHORT, intcol INTEGER, longcol LONG, floatcol FLOAT, PRIMARY KEY(intcol,shortcol)) MAPPED BY (testhbaseravi4, COLS=[bytecol=cf1.hbytecol, longcol=cf2.hlongcol, floatcol=cf2.hfloatcol]) SPLITSFILE = 'D:/1.txt'

def string2Key(values: Seq[String],
               lineBuffer: Array[BytesUtils],
               keyColumns: Seq[AbstractColumn],
               keyBytes: Array[(Array[Byte], DataType)]) = {

This method is only used in createSplitKeys, right? If so, I don't think we need to create it in HBaseKVHelper.

@scwf

scwf commented Jan 29, 2015

Just a minor comment, @yzhou2001 can you take a look at this?

@yzhou2001
Member

Basically my questions are:

  1. What are the use scenarios? Does HBase itself have a similar mechanism?
  2. What are typical sizes of the split file? If it fits into memory, could we skip reading it from an HDFS file and use client-only operations, without going through the RDD and the Spark runtime?
  3. We need test coverage for this new functionality.
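
To make question 2 concrete, here is a minimal sketch of reading a small split file entirely on the client, with no RDD involved. It assumes one split key per line, encoded as a plain UTF-8 string; the file format used in this PR may differ. `SplitFileReader` and `readSplitKeys` are hypothetical names, not part of the project.

```scala
import java.nio.charset.StandardCharsets
import scala.io.Source

// Hypothetical helper, not part of the PR.
object SplitFileReader {
  // Reads a small local file into memory and returns the split keys as byte arrays.
  // Assumption: one key per line; blank lines and duplicates are dropped.
  def readSplitKeys(path: String): Array[Array[Byte]] = {
    val source = Source.fromFile(path, "UTF-8")
    try {
      source.getLines()
        .map(_.trim)
        .filter(_.nonEmpty)
        .toVector
        .distinct
        // String order matches HBase's byte order for ASCII keys only.
        .sorted
        .map(_.getBytes(StandardCharsets.UTF_8))
        .toArray
    } finally {
      source.close()
    }
  }
}
```

The resulting `Array[Array[Byte]]` has the shape expected by the HBase admin `createTable(descriptor, splitKeys)` overload, so the whole operation could stay on the client.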

Thanks.

@scwf

scwf commented Jan 30, 2015

@yzhou2001
1. This PR lets our CREATE command create a pre-split HBase table. Before this we could not do that; the CREATE command could only create a non-split table, right?

2. I think the split file should be very small and fit into memory, so maybe we can put the splits in the command itself, like this:
CREATE TABLE testrav(strcol STRING, bytecol String, shortcol String, PRIMARY KEY(strcol)) MAPPED BY (testravhbase, COLS=[bytecol=cf1.hbytecol, shortcol=cf1.hshortcol]) SPLITS=['a','b','c']

3. Yes, we need to test it.
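
As an illustration of point 2, the inline `SPLITS=[...]` clause could be extracted roughly as below. This is a regex-based sketch only, not the project's SQL parser; `InlineSplits` and `parse` are hypothetical names, and it assumes single-quoted keys that contain neither `'` nor `]`.

```scala
// Hypothetical sketch, not the project's parser.
object InlineSplits {
  // Matches a clause such as SPLITS=['a','b','c']; the keyword is case-insensitive.
  // "SPLITSFILE = ..." does not match because "=" must follow "SPLITS" directly.
  private val Clause = """(?i)SPLITS\s*=\s*\[(.*?)\]""".r
  // Matches one single-quoted key inside the brackets.
  private val Quoted = """'([^']*)'""".r

  // Returns the split keys in the order written, or an empty list if no clause is present.
  def parse(sql: String): Seq[String] =
    Clause.findFirstMatchIn(sql) match {
      case Some(m) => Quoted.findAllMatchIn(m.group(1)).map(_.group(1)).toList
      case None    => Nil
    }
}
```

For the command above, `InlineSplits.parse` would yield the three keys `a`, `b` and `c`, which could then be converted to bytes and passed to table creation.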

PS:
Actually, my initial idea was to set the number of reducers for bulk load of a non-split table; it accidentally became this :)

Also a question: do you think it is necessary to control the number of reducers for bulk load of a non-split table now?

@yzhou2001
Member

For table creation, I think the focal point is how much we should build on top of the semantics of "creating an RDB table on a nonexistent HBase table". The problem is that the more functionality we build into these semantics, the harder it becomes to reconcile them with a possibly existing HBase table. In summary, this is good to have, but it has to be designed carefully so that the semantics are clear.

On the reducers: yes, a configurable number of reducers would be great. But again, there is some complexity to it, mainly because we would probably need a "splitter" class like the one in HBase. It's feasible but probably not a priority right now. At the moment we're anxious to get the basic functionality working and obtain some favorable performance data, in order to put some weight behind the push for our technology. All the value-adding features and optimizations can be put off until a future release.
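
For reference, the core of such a "splitter" could look like the sketch below: given sorted split keys, it maps a row key to the index of the key range (and hence the reducer) that should handle it. This is an assumption about what the class would do, not HBase's own implementation; `RangeSplitter`, `compareBytes` and `partitionFor` are hypothetical names.

```scala
// Hypothetical sketch, not HBase's splitter.
object RangeSplitter {
  // Unsigned lexicographic byte comparison, the order HBase uses for row keys.
  def compareBytes(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val d = (a(i) & 0xff) - (b(i) & 0xff)
      if (d != 0) return d
      i += 1
    }
    a.length - b.length
  }

  // n sorted split keys define n + 1 ranges. A row key equal to a split key
  // belongs to the range that starts at that split key.
  def partitionFor(rowKey: Array[Byte], splitKeys: IndexedSeq[Array[Byte]]): Int = {
    var lo = 0
    var hi = splitKeys.length
    // Binary search for the number of split keys that are <= rowKey.
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      if (compareBytes(splitKeys(mid), rowKey) <= 0) lo = mid + 1 else hi = mid
    }
    lo
  }
}
```

With this shape, the number of reducers follows from the number of split keys supplied, which is one way the reducer count could be made configurable for a non-split table.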

3 participants