Pagination not working more than once #956

Closed
ivmarkp opened this issue Jul 18, 2018 · 8 comments

ivmarkp commented Jul 18, 2018

Slicing seems to work only once for me. After debugging, I found that size is set to 0 in s even though I assign it to 1000 in the while loop below. Is this a bug, or is something wrong with my code?

s = Search(using=es, index=properties.es_index) \
    .query("match_phrase", field=text)

lefthits = s.count()
start = 0
s = s[0:1000]
results = s.execute()

count = 0
while len(results.hits.hits) > 0:
    for hit in results:
        count += 1
    print(count)
    lefthits = results.hits.total - count
    if lefthits > 0:
        start += 1000
        s = s[start:1000]
        results = s.execute()

ivmarkp commented Jul 18, 2018

In the console, count is printed as 1000 once, and the while condition len(results.hits.hits) > 0 fails thereafter because the hits list is empty after s is updated once in the loop.

ivmarkp closed this as completed Jul 18, 2018
honzakral commented

We use Python-style slicing, so it is [start:stop] and not [start:size].
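
In other words, a paging loop needs to advance both slice bounds on each pass. A minimal sketch (index and field names here are placeholders; the unsliced base search is re-sliced each iteration):

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

es = Elasticsearch()
base = Search(using=es, index="my-index").query("match_phrase", some_field="some text")

page_size = 1000
start = 0
while True:
    # Python-style slice: [start:stop], so stop must be start + page_size.
    results = base[start:start + page_size].execute()
    if not results.hits.hits:
        break
    for hit in results:
        ...  # process each hit
    start += page_size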


ivmarkp commented Jul 19, 2018

Yes, thanks @honzakral, I realized soon after opening this issue.

I have another query related to this: what are your thoughts on the estimated time for s.execute() to complete if, say, I set stop to 300K-400K?

honzakral commented

@ivmarkp that is not supported; by default the maximum size of a page is 10k. If you want to retrieve more documents, I recommend you look at the scan method, which will fetch all the results in batches by using the underlying scan helper from elasticsearch-py [0].

[0] http://elasticsearch-py.readthedocs.io/en/master/helpers.html#scan
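
A minimal usage sketch (index and field names are placeholders):

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

es = Elasticsearch()
s = Search(using=es, index="my-index").query("match_phrase", some_field="some text")

# scan() streams every matching document in batches via the scroll API,
# so it is not subject to the 10k from/size page limit.
for hit in s.scan():
    print(hit.meta.id)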


ivmarkp commented Jul 20, 2018

I've actually set "index.max_result_window": "1000000". So far I haven't faced any issues, but I couldn't find anything about the performance impact of doing so. Could you elaborate a bit in this regard?

I tried the scan method, but since it fetches all matching documents it takes forever when there are more than a few hundred thousand of them. My requirement is really to fetch even more, i.e. around 1-2 million. The pagination via slicing I mentioned works so far, but I haven't had a chance to experiment with a master index that contains millions of documents yet, so I'm not sure if this is a good idea.

honzakral commented

Pagination over millions of documents is definitely not a good idea, since the memory and computational cost of every consecutive page is higher than that of the previous one (for the last page, all of the millions of documents will need to be sorted in memory).

That is why we have the scan API, which circumvents this by storing the intermediate state between requests. You can also have a look at using search_after for more efficient pagination (see #806). For this use case, exporting a lot of documents, I would definitely recommend using the scan helper, as this is exactly what it is meant for. You can try tuning parameters like size and preserve_order to control the behavior and hopefully get the performance you need.
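
A rough sketch of search_after-style paging (field names are placeholders; it needs a deterministic sort, here with _id as a tiebreaker):

s = (Search(using=es, index="my-index")
     .query("match_phrase", some_field="some text")
     .sort("timestamp", "_id"))               # deterministic sort + tiebreaker
results = s[:1000].execute()
while results.hits.hits:
    for hit in results:
        ...                                   # process each hit
    last_sort = results.hits.hits[-1].sort    # sort values of the page's last hit
    results = s.extra(search_after=list(last_sort))[:1000].execute()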


ivmarkp commented Jul 21, 2018

Thanks @honzakral for your suggestions.

I recommend you look at the scan method which will fetch all the results in batches

Are you talking about this method? Its documentation says...

Use params method to specify any additional arguments you wish to pass to the underlying scan helper from elasticsearch-py

I don't find it very helpful, and example usage is also missing. I searched through the documentation but couldn't find anything that talks about using the params method to specify additional arguments. Could you show me with an example how this method may be used for my use case?

On a side note, I had another query: since we are going to export millions of docs, I was wondering if there is a way to continue a scan from a particular point, like we can easily do with pagination/slicing? (As you can imagine, continuing like this would save us a ton of time.)


honzakral commented Jul 21, 2018

I don't find it very helpful, and example usage is also missing. I searched through the documentation but couldn't find anything that talks about using the params method to specify additional arguments. Could you show me with an example how this method may be used for my use case?

Search().params(preserve_order=True).scan()
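
For example, to also tune the batch size (index and field names are placeholders; size is forwarded to the underlying scan helper as the per-batch size):

s = Search(using=es, index="my-index").query("match_phrase", some_field="some text")
for hit in s.params(size=5000, preserve_order=True).scan():
    ...  # process each hit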

On a side note, I had another query: since we are going to export millions of docs, I was wondering if there is a way to continue a scan from a particular point, like we can easily do with pagination/slicing? (As you can imagine, continuing like this would save us a ton of time.)

For that, use the search_after functionality that I pointed you to earlier.
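
Sketched on top of the search_after snippet above (how the checkpoint is persisted is up to you):

# Save the sort values of the last processed hit as a checkpoint...
checkpoint = list(results.hits.hits[-1].sort)
# ...and after a restart, resume the export from that point:
results = s.extra(search_after=checkpoint)[:1000].execute()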
