IteratorSpliterator returns estimateSize which violates Spliterator's contract breaking parallel stream processing #3027

Closed
codesimplicity opened this issue Jan 15, 2024 · 7 comments
Labels
status: invalid An issue that we don't feel is valid

Comments

@codesimplicity

Not sure if I'm missing something, but it looks like elements from a stream returned from Spring Data (in my case it's a MongoRepository) will never be processed in parallel even if I instruct the stream to do so by converting it to a parallel stream.

After some debugging and digging into the details, the reason appears to be that IteratorSpliterator.estimateSize() returns -1 rather than a value permitted by Spliterator's JavaDoc. As a result, the fork-join implementation never forks.

JavaDoc of Spliterator.estimateSize():

Returns an estimate of the number of elements that would be encountered by a forEachRemaining traversal, or returns Long.MAX_VALUE if infinite, unknown, or too expensive to compute.
If this Spliterator is SIZED and has not yet been partially traversed or split, or this Spliterator is SUBSIZED and has not yet been partially traversed, this estimate must be an accurate count of elements that would be encountered by a complete traversal. Otherwise, this estimate may be arbitrarily inaccurate, but must decrease as specified across invocations of trySplit.

Is this simply something that was missed or does this have a particular reason?
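
A minimal sketch of the effect described above, using only the JDK. It assumes the stream is built from an iterator with a reported size of -1; `Spliterators.spliterator(iterator, -1, 0)` is used here as a stand-in for that, and the class and method names (`EstimateSizeDemo`, `negativeEstimate`, `unknownSize`, `threadsUsed`) are made up for illustration. It compares that spliterator with one from `Spliterators.spliteratorUnknownSize`, which reports `Long.MAX_VALUE`.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.StreamSupport;

class EstimateSizeDemo {

    // Fresh iterator over 1..10_000 for each experiment.
    static Iterator<Integer> numbers() {
        List<Integer> list = IntStream.rangeClosed(1, 10_000).boxed().collect(Collectors.toList());
        return list.iterator();
    }

    // Assumption: mirrors the reported behaviour by passing -1 as the size.
    static Spliterator<Integer> negativeEstimate() {
        return Spliterators.spliterator(numbers(), -1, 0);
    }

    // Size is declared unknown, so estimateSize() reports Long.MAX_VALUE.
    static Spliterator<Integer> unknownSize() {
        return Spliterators.spliteratorUnknownSize(numbers(), 0);
    }

    // Runs a parallel map/reduce and counts the distinct threads that touched an element.
    static int threadsUsed(Spliterator<Integer> spliterator) {
        Set<String> threads = ConcurrentHashMap.newKeySet();
        StreamSupport.stream(spliterator, true)
                .map(i -> {
                    threads.add(Thread.currentThread().getName());
                    return i;
                })
                .reduce(0, Integer::sum);
        return threads.size();
    }

    public static void main(String[] args) {
        System.out.println(negativeEstimate().estimateSize()); // -1
        System.out.println(unknownSize().estimateSize());      // 9223372036854775807
        System.out.println("threads with -1 estimate: " + threadsUsed(negativeEstimate()));
        System.out.println("threads with unknown size: " + threadsUsed(unknownSize()));
    }
}
```

On an OpenJDK build the first thread count should be 1, because the estimate never exceeds the split threshold and `trySplit` is never attempted. The second count depends on the machine and the common pool size.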

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Jan 15, 2024
@mp911de
Member

mp911de commented Jan 18, 2024

In the case of MongoDB we're adopting FindIterable via CloseableIterator. We specifically do not enable parallel streams (StreamSupport.stream(this.spliterator(), false).onClose(this::close)).

Have you tried a spike on parallelization by adapting a FindIterable into a Stream yourself?
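
For readers unfamiliar with the pattern quoted above, here is a rough sketch of adopting a closeable iterator into a sequential stream that closes the underlying resource when the stream is closed. `ClosingIterator` and `ListCursor` are hypothetical stand-ins written for this example and are not the Spring Data or MongoDB driver types; the use of `Spliterators.spliteratorUnknownSize` is a choice made here, not necessarily what Spring Data does.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class CloseableIteratorSketch {

    // Hypothetical stand-in for a cursor-backed iterator that must be closed.
    interface ClosingIterator<T> extends Iterator<T>, AutoCloseable {

        @Override
        void close(); // narrowed so it can be passed to onClose as a Runnable

        default Spliterator<T> spliterator() {
            // Assumption: size is declared unknown instead of passing a fixed value.
            return Spliterators.spliteratorUnknownSize(this, 0);
        }

        default Stream<T> stream() {
            // 'false' requests a sequential stream; closing the stream closes the iterator.
            return StreamSupport.stream(spliterator(), false).onClose(this::close);
        }
    }

    // In-memory cursor used only to make the sketch runnable.
    static final class ListCursor<T> implements ClosingIterator<T> {
        private final Iterator<T> delegate;
        private boolean closed;

        ListCursor(List<T> items) {
            this.delegate = items.iterator();
        }

        @Override
        public boolean hasNext() {
            return delegate.hasNext();
        }

        @Override
        public T next() {
            return delegate.next();
        }

        @Override
        public void close() {
            closed = true;
        }

        public boolean isClosed() {
            return closed;
        }
    }
}
```

Wrapping the stream in try-with-resources (or Kotlin's `use`) is what triggers the `onClose` handler; a terminal operation alone does not close the stream.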

@mp911de mp911de added the status: waiting-for-feedback We need additional information before we can continue label Jan 18, 2024
@spring-projects-issues

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

@spring-projects-issues spring-projects-issues added the status: feedback-reminder We've sent a reminder that we need additional information before we can continue label Jan 25, 2024
@codesimplicity
Author

Thanks for the hint, returning Iterable and converting that into a parallel Stream does the trick 👍

Just wondering whether it is documented anywhere that the returned Stream does not support parallel processing while other return types like Iterable do.
At least the "Repository query return types" section does not mention it, and my attempts to find anything before raising this issue didn't turn up anything either.
I'd honestly prefer an error when trying to switch the non-parallel stream to parallel, but these differences should at least be documented 😄


For reference, below is the code I used to test this. This is Kotlin code using a MongoRepository.

Repository method (if you'd use the magic findAll or findBy method names, you wouldn't need the @Query annotation):

    @Query("{}")
    fun all(): Iterable<Item>

Using it:

    fun testParallelItemProcessing() {
        StreamSupport.stream(itemDao.all().spliterator(), true).use { stream ->
            val sum = stream
                    .map {
                        val size = it.size
                        logger.info { "Item has size $size" }
                        size
                    }
                    .reduce(Int::plus)
                    .orElse(0)
            logger.info { "Total size: $sum" }
        }
    }

Logging shows that multiple threads are involved in 'processing' the items.

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue status: feedback-reminder We've sent a reminder that we need additional information before we can continue labels Jan 25, 2024
@codesimplicity
Author

Just realized that by using Iterable (or Streamable, as suggested in the docs), I will lose the streaming behavior, i.e. all results would be fetched at once 😱
Consuming Large Query Results

Does that mean that I can EITHER process results in chunks OR process results in parallel?
Is there no option to fetch results in chunks AND process them in parallel?
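
One possible way to get both with plain `java.util.stream`, sketched here as an illustration and not something suggested or confirmed by the maintainers in this thread: pull fixed-size chunks from the iterator on the calling thread and process each chunk as a parallel stream. The names `ChunkedParallel`, `nextChunk`, and `sumInChunks` are hypothetical, and the iterator stands in for whatever cursor the repository hands back.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.ToLongFunction;

class ChunkedParallel {

    // Pulls up to chunkSize elements from the source on the calling thread.
    static <T> List<T> nextChunk(Iterator<T> source, int chunkSize) {
        List<T> chunk = new ArrayList<>(chunkSize);
        while (chunk.size() < chunkSize && source.hasNext()) {
            chunk.add(source.next());
        }
        return chunk;
    }

    // Fetches chunk by chunk, processes each chunk in parallel, and sums the results.
    static <T> long sumInChunks(Iterator<T> source, int chunkSize, ToLongFunction<T> size) {
        long total = 0;
        List<T> chunk;
        while (!(chunk = nextChunk(source, chunkSize)).isEmpty()) {
            // Only the current chunk is held in memory while it is processed.
            total += chunk.parallelStream().mapToLong(size).sum();
        }
        return total;
    }

    public static void main(String[] args) {
        Iterator<String> items = List.of("a", "bb", "ccc", "dddd", "eeeee").iterator();
        System.out.println(sumInChunks(items, 2, String::length)); // 15
    }
}
```

The trade-off is that fetching and processing alternate instead of overlapping, so this only pays off when the per-item work dominates the fetch time.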

@mp911de
Member

mp911de commented Jan 26, 2024

Iterable and Streamable are backed by List, and that comes with full result materialization. If you truly want to parallelize result consumption, how about using the Reactive API? A Flux gives you a wide variety of options that are otherwise difficult to achieve with a normal Stream.

@codesimplicity
Author

So the answer is I can only get both when using the Reactive API?

This would mean setting up a second, reactive code path in parallel to the existing one (reactive MongoDB connection, reactive repository, reactive access & transformations). That's something we may need to consider 🤔

@mp911de
Member

mp911de commented Jan 29, 2024

For the time being, this is correct. Would you mind reaching out to the MongoDB team and asking them to make their MongoIterable friendlier to Java 8 Streams? A native stream() method would be all that is required to fully benefit from Mongo's underlying driver API.

@mp911de mp911de closed this as not planned Jan 29, 2024
@mp911de mp911de added status: invalid An issue that we don't feel is valid and removed status: waiting-for-triage An issue we've not yet triaged status: feedback-provided Feedback has been provided labels Jan 29, 2024