Pagination not working more than once #956

Closed
ivmarkp opened this issue Jul 18, 2018 · 8 comments

ivmarkp commented Jul 18, 2018

Slicing seems to work only once for me. After debugging, I found that size is set to 0 in s even though I assign it to 1000 in the while loop below. Is this a bug, or is something wrong with my code?

s = Search(using=es, index=properties.es_index) \
    .query("match_phrase", field=text)

lefthits = s.count()
start = 0
s = s[0:1000]
results = s.execute()

count = 0
while len(results.hits.hits) > 0:
    for hit in results:
        count += 1
    print(count)
    lefthits = results.hits.total - count
    if lefthits > 0:
        start += 1000
        s = s[start:1000]
        results = s.execute()

ivmarkp commented Jul 18, 2018

In the console, count is printed as 1000 once, and the while condition len(results.hits.hits) > 0 fails thereafter because the hits list is empty after s is updated once in the loop.

ivmarkp closed this as completed Jul 18, 2018
honzakral commented

We use Python-style slicing, so it is [start:stop] and not [start:size].
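
In other words, a paging loop needs to advance both slice bounds on each pass. A minimal sketch (index and field names here are placeholders; the unsliced base search is re-sliced each iteration):

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

es = Elasticsearch()
base = Search(using=es, index="my-index").query("match_phrase", some_field="some text")

page_size = 1000
start = 0
while True:
    # Python-style slice: [start:stop], so stop must be start + page_size.
    results = base[start:start + page_size].execute()
    if not results.hits.hits:
        break
    for hit in results:
        ...  # process each hit
    start += page_size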


ivmarkp commented Jul 19, 2018

Yes, thanks @honzakral, I realized soon after opening this issue.

I have another query related to this: what are your thoughts on the estimated time for s.execute() to complete if, say, I set stop to 300K-400K?

honzakral commented

@ivmarkp that is not supported; by default the maximum size of a page is 10k. If you want to retrieve more documents, I recommend you look at the scan method, which will fetch all the results in batches by using the underlying scan helper from elasticsearch-py [0].

[0] http://elasticsearch-py.readthedocs.io/en/master/helpers.html#scan
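
A minimal usage sketch (index and field names are placeholders):

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

es = Elasticsearch()
s = Search(using=es, index="my-index").query("match_phrase", some_field="some text")

# scan() streams every matching document in batches via the scroll API,
# so it is not subject to the 10k from/size page limit.
for hit in s.scan():
    print(hit.meta.id)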


ivmarkp commented Jul 20, 2018

I've actually set "index.max_result_window": "1000000". So far I haven't faced any issues, but I couldn't find anything about the performance impact of doing so. Could you elaborate a bit in this regard?

I tried the scan method, but since it fetches all matching documents it takes forever when there are more than a few hundred thousand of them. My requirement is really to fetch even more, i.e. around 1-2 million. The pagination via slicing I mentioned works so far, but I haven't had a chance to experiment with a master index that contains millions of documents yet, so I'm not sure if this is a good idea.

honzakral commented

Pagination over millions of documents is definitely not a good idea, since the memory and computational cost of every consecutive page is higher than that of the previous one (for the last page, all of the millions of documents will need to be sorted in memory).

That is why we have the scan API, which circumvents this by storing the intermediate state between requests. You can also have a look at using search_after for more efficient pagination (see #806). For this use case, exporting a lot of documents, I would definitely recommend using the scan helper, as this is exactly what it is meant for. You can try tuning parameters like size and preserve_order to control the behavior and hopefully get the performance you need.
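
A rough sketch of search_after-style paging (field names are placeholders; it needs a deterministic sort, here with _id as a tiebreaker):

s = (Search(using=es, index="my-index")
     .query("match_phrase", some_field="some text")
     .sort("timestamp", "_id"))               # deterministic sort + tiebreaker
results = s[:1000].execute()
while results.hits.hits:
    for hit in results:
        ...                                   # process each hit
    last_sort = results.hits.hits[-1].sort    # sort values of the page's last hit
    results = s.extra(search_after=list(last_sort))[:1000].execute()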


ivmarkp commented Jul 21, 2018

Thanks @honzakral for your suggestions.

I recommend you look at the scan method which will fetch all the results in batches

Are you talking about this method? Its documentation says...

Use params method to specify any additional arguments you wish to pass to the underlying scan helper from elasticsearch-py

I don't find it very helpful, and example usage is also missing. I searched through the documentation but couldn't find anything that talks about using the params method to specify additional arguments. Could you show me with an example how this method may be used for my use case?

On a side note, I had another query: since we are going to export millions of docs, I was wondering if there is a way to continue a scan from a particular point, like we can easily do with pagination/slicing? (As you can imagine, continuing like this would save us a ton of time.)


honzakral commented Jul 21, 2018

I don't find it very helpful, and example usage is also missing. I searched through the documentation but couldn't find anything that talks about using the params method to specify additional arguments. Could you show me with an example how this method may be used for my use case?

Search().params(preserve_order=True).scan()
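
For example, to also tune the batch size (index and field names are placeholders; size is forwarded to the underlying scan helper as the per-batch size):

s = Search(using=es, index="my-index").query("match_phrase", some_field="some text")
for hit in s.params(size=5000, preserve_order=True).scan():
    ...  # process each hit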

On a side note, I had another query: since we are going to export millions of docs, I was wondering if there is a way to continue a scan from a particular point, like we can easily do with pagination/slicing? (As you can imagine, continuing like this would save us a ton of time.)

For that, use the search_after functionality that I pointed you to earlier.
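
Sketched on top of the search_after snippet above (how the checkpoint is persisted is up to you):

# Save the sort values of the last processed hit as a checkpoint...
checkpoint = list(results.hits.hits[-1].sort)
# ...and after a restart, resume the export from that point:
results = s.extra(search_after=checkpoint)[:1000].execute()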
