
Indexing more than 250MLN from Hive to SolR #23

Open
disoardi opened this issue Mar 20, 2017 · 5 comments
Labels: documentation, question

Comments

@disoardi

Hi all,

We are trying to index more than 250 million rows from a Hive table (ORC format), but we have noticed that the indexing is too slow.

We have 9 Solr nodes (9 shards with 2 replicas per shard), and we have set the maxIndexingThreads parameter to 128 and ramBufferSizeMB to 60 MB.

While running the INSERT INTO on the external table (where the Hive SerDe is used), the server CPUs are idle and the indexing throughput is below 1 million documents per hour.

Since the servers are idle, how can we make this faster? We have plenty of CPU and RAM but are not able to use them for the indexing process.
Any suggestions?
Would it help to configure any parameters on the client side to use all the threads?
Thanks in advance.

PS: We have set the commits (auto and soft) to 10 minutes or 1 million documents.
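For reference, commit settings along these lines would normally live in solrconfig.xml; a minimal sketch matching the values described above (600000 ms = 10 minutes):

```xml
<!-- solrconfig.xml (sketch): hard commit every 10 minutes or 1M documents -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>600000</maxTime>          <!-- 10 minutes, in milliseconds -->
    <maxDocs>1000000</maxDocs>         <!-- or 1 million documents -->
    <openSearcher>false</openSearcher> <!-- hard commit flushes to disk only -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>600000</maxTime>          <!-- soft commit controls visibility -->
  </autoSoftCommit>
</updateHandler>
```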

@acesar
Contributor

acesar commented Mar 20, 2017

@disoardi Please share your Solr table configuration.
Please also share the versions of Solr, Hive/Tez, and YARN (how many YARN nodes do you have?)

@disoardi
Author

Hive external Table with serde:

ADD JAR /home/my_comp/solr-hive-serde-2.2.1.jar;

CREATE EXTERNAL TABLE IF NOT EXISTS my_comp_solr.my_comp_user_number(
id STRING,
cod_1 STRING,
cod_2 STRING,
cod_tipo STRING,
flg_delete int)
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
     LOCATION '/user/solr/my_comp_user_number'
     TBLPROPERTIES('solr.zkhost' = 'xxx.xxx.xxx.xxx:2181/solr',
                   'solr.collection' = 'my_comp_user_number',
                   'solr.query' = '*:*');

Hortonworks Data Platform 2.3.2
Solr version --> 5.5.0 with 20 GB of Xmx per node
Hive version --> 1.2.1
Tez version --> 0.7.0
YARN version --> 2.7.1

We have 9 YARN nodes with 96 GB per node (864 GB total YARN queue)

Thanks in advance

@acesar
Contributor

acesar commented Mar 21, 2017

Do you have only one ZooKeeper node? The recommended minimum is a 3-node ZooKeeper ensemble. The zk string should look something like:

   'solr.zkhost' = 'host1:2181,host2:2181,host3:2181/solr'

Can you share the output of the indexing? Are there any errors in the yarn/hive logs?

You can try increasing the Solr buffer size lww.buffer.docs.size; the default is 500 documents.

lww.buffer.docs.size can be set as a global Hive property or in TBLPROPERTIES:

hive> set lww.buffer.docs.size=5000;
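The per-table alternative mentioned above could look like this. A sketch: the table name matches the earlier CREATE statement, and whether the storage handler picks the property up from TBLPROPERTIES at insert time is an assumption based on the comment above.

```sql
-- Sketch: set the write buffer per table instead of per session
ALTER TABLE my_comp_solr.my_comp_user_number
  SET TBLPROPERTIES ('lww.buffer.docs.size' = '5000');
```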

Some tests with 3 Solr/YARN nodes (Solr and YARN installed on the same nodes), indexing 1,000,000 Hive records:

8 shards --> 101.096 seconds
2 shards --> 229.545 seconds

@disoardi
Author

Sorry for the delay, but I found the solution: I set solr.client.threads. The default is 1.

Do you have any documentation about this option?

Thanks in advance
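For anyone landing here, a hedged sketch of the fix described above. The property name solr.client.threads comes from the comment; the value 16 and the per-table TBLPROPERTIES variant are assumptions, mirroring how lww.buffer.docs.size is set earlier in the thread.

```sql
-- Sketch: raise client-side indexing threads (default is 1, per the comment above)
-- Session-level:
set solr.client.threads=16;

-- Or per table (assumption: the storage handler also reads it from TBLPROPERTIES):
ALTER TABLE my_comp_solr.my_comp_user_number
  SET TBLPROPERTIES ('solr.client.threads' = '16');
```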

@ctargett ctargett added the documentation Issue is with documentation label May 3, 2018
@NethajiRajamanickam

Hi,
Please share the JAR file.
