
Indexing more than 250MLN from Hive to SolR #23

Open
disoardi opened this issue Mar 20, 2017 · 5 comments
Labels: documentation, question

Comments

@disoardi

Hi all,

We are trying to index more than 250 million rows from a Hive table (ORC format), but we have noticed that the indexing is too slow.

We have 9 Solr nodes (9 shards with 2 replicas per shard), and we have set the maxIndexingThreads parameter to 128 and ramBufferSizeMB to 60 MB.

While running the INSERT INTO on the external table (where the Hive SerDe is used), the server CPUs are idle and the indexing throughput is below 1 million documents per hour.

Since the servers are idle, how can we make this faster? We have plenty of CPU and RAM but are not able to use them for the indexing process.
Any suggestions?
Would it help to configure any parameters on the client side to use all the threads?
Thanks in advance.

PS: We have set the commits (auto and soft) to 10 minutes or 1 million documents.
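For reference, commit settings along these lines would normally live in solrconfig.xml; a minimal sketch matching the values described above (600000 ms = 10 minutes):

```xml
<!-- solrconfig.xml (sketch): hard commit every 10 minutes or 1M documents -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>600000</maxTime>          <!-- 10 minutes, in milliseconds -->
    <maxDocs>1000000</maxDocs>         <!-- or 1 million documents -->
    <openSearcher>false</openSearcher> <!-- hard commit flushes to disk only -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>600000</maxTime>          <!-- soft commit controls visibility -->
  </autoSoftCommit>
</updateHandler>
```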

@acesar
Contributor

acesar commented Mar 20, 2017

@disoardi Please share your Solr table configuration.
Please also share the versions of Solr, Hive/Tez, and YARN (how many YARN nodes do you have?)

@disoardi
Author

Hive external Table with serde:

ADD JAR /home/my_comp/solr-hive-serde-2.2.1.jar;

CREATE EXTERNAL TABLE IF NOT EXISTS my_comp_solr.my_comp_user_number(
id STRING,
cod_1 STRING,
cod_2 STRING,
cod_tipo STRING,
flg_delete int)
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
     LOCATION '/user/solr/my_comp_user_number'
     TBLPROPERTIES('solr.zkhost' = 'xxx.xxx.xxx.xxx:2181/solr',
                   'solr.collection' = 'my_comp_user_number',
                   'solr.query' = '*:*');

Hortonworks Data Platform 2.3.2
Solr version --> 5.5.0 with 20 GB of Xmx per node
Hive version --> 1.2.1
Tez version --> 0.7.0
YARN version --> 2.7.1

We have 9 YARN nodes with 96 GB per node (864 GB total YARN queue)

Thanks in advance

@acesar
Contributor

acesar commented Mar 21, 2017

Do you have only one ZooKeeper node? The recommended minimum is a 3-node ZooKeeper ensemble. The zk string should look something like:

   'solr.zkhost' = 'host1:2181,host2:2181,host3:2181/solr'

Can you share the output of the indexing? Are there any errors in the yarn/hive logs?

You can try increasing the Solr buffer size lww.buffer.docs.size; the default is 500 documents.

lww.buffer.docs.size can be set as a global Hive property or in TBLPROPERTIES:

hive> set lww.buffer.docs.size=5000;
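The per-table alternative mentioned above could look like this. A sketch: the table name matches the earlier CREATE statement, and whether the storage handler picks the property up from TBLPROPERTIES at insert time is an assumption based on the comment above.

```sql
-- Sketch: set the write buffer per table instead of per session
ALTER TABLE my_comp_solr.my_comp_user_number
  SET TBLPROPERTIES ('lww.buffer.docs.size' = '5000');
```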

Some tests with 3 Solr/YARN nodes (Solr and YARN installed on the same nodes), indexing 1,000,000 Hive records:

8 shards --> 101.096 seconds
2 shards --> 229.545 seconds

@disoardi
Author

Sorry for the delay, but I found the solution: I set solr.client.threads. The default is 1.

Do you have any documentation about this option?

Thanks in advance
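For anyone landing here, a hedged sketch of the fix described above. The property name solr.client.threads comes from the comment; the value 16 and the per-table TBLPROPERTIES variant are assumptions, mirroring how lww.buffer.docs.size is set earlier in the thread.

```sql
-- Sketch: raise client-side indexing threads (default is 1, per the comment above)
-- Session-level:
set solr.client.threads=16;

-- Or per table (assumption: the storage handler also reads it from TBLPROPERTIES):
ALTER TABLE my_comp_solr.my_comp_user_number
  SET TBLPROPERTIES ('solr.client.threads' = '16');
```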

@ctargett ctargett added the documentation Issue is with documentation label May 3, 2018
@NethajiRajamanickam

Hi,
Please share the JAR file.
