ReactorNettyClient gets stuck on a cancelled Conversation if that Conversation has more than 256 rows (size of reactor.bufferSize.small) #661

Open
alexeykurshakov opened this issue Sep 25, 2024 · 7 comments
Labels
status: waiting-for-triage An issue we've not yet triaged

Comments

@alexeykurshakov

Bug Report

Versions

  • Driver: 1.0.5
  • Database: PostgreSQL 13.12
  • Java: 17
  • OS: MacOS, Linux

Current Behavior

When a query is zipped in parallel with another function that fails, and that query returns more than 256 rows, you can end up with no real consumer because the chain was cancelled, while data still arrives from the database and is saved to ReactorNettyClient.buffer.
When this happens, every other attempt to get data from the database fails, because ReactorNettyClient.BackendMessageSubscriber.tryDrainLoop never calls drainLoop: the stuck conversation has no demand.

private void tryDrainLoop() {
    while (hasBufferedItems() && hasDownstreamDemand()) {
        if (!drainLoop()) {
            return;
        }
    }
}

The issue can be reproduced with https://github.com/agorbachenko/r2dbc-connection-leak-demo
If you increase the system property "reactor.bufferSize.small" to 350, the attached example starts working.
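To make the stall easier to reason about, here is a minimal sketch of the situation described above. It is a simplified model, not the driver's code: the class names, the `Conversation` type and the string messages are hypothetical, and only the shape of `tryDrainLoop` is taken from the snippet in this report. It assumes demand is checked against the conversation at the head of the queue.

```java
import java.util.ArrayDeque;
import java.util.Queue;

class DrainLoopStallSketch {

    /** Hypothetical stand-in for a conversation: tracks demand and delivered messages. */
    static final class Conversation {
        long demand;
        int received;

        Conversation(long demand) {
            this.demand = demand;
        }
    }

    /** Simplified model of a subscriber that buffers backend messages. */
    static final class MessageSubscriber {
        final Queue<Conversation> conversations = new ArrayDeque<>();
        final Queue<String> buffer = new ArrayDeque<>();

        boolean hasBufferedItems() {
            return !buffer.isEmpty();
        }

        // Assumption: only the head conversation's demand is considered.
        boolean hasDownstreamDemand() {
            Conversation head = conversations.peek();
            return head != null && head.demand > 0;
        }

        // Delivers one buffered message to the head conversation.
        boolean drainLoop() {
            Conversation head = conversations.peek();
            String message = buffer.poll();
            if (head == null || message == null) {
                return false;
            }
            head.demand--;
            head.received++;
            if (message.equals("ReadyForQuery")) {
                conversations.poll(); // the conversation is complete
            }
            return true;
        }

        void tryDrainLoop() {
            while (hasBufferedItems() && hasDownstreamDemand()) {
                if (!drainLoop()) {
                    return;
                }
            }
        }

        void onMessage(String message) {
            buffer.offer(message);
            tryDrainLoop();
        }
    }

    /** Returns {messages received by the second conversation, messages left in the buffer}. */
    static int[] run(long demandOfFirst) {
        MessageSubscriber subscriber = new MessageSubscriber();
        Conversation first = new Conversation(demandOfFirst);
        Conversation second = new Conversation(Long.MAX_VALUE);
        subscriber.conversations.add(first);
        subscriber.conversations.add(second);

        // Responses for the first conversation, then for the second one.
        for (int i = 0; i < 3; i++) {
            subscriber.onMessage("DataRow");
        }
        subscriber.onMessage("ReadyForQuery");
        subscriber.onMessage("DataRow");
        subscriber.onMessage("ReadyForQuery");
        return new int[] {second.received, subscriber.buffer.size()};
    }

    public static void main(String[] args) {
        int[] stuck = run(0);
        System.out.println("head without demand: received=" + stuck[0] + ", buffered=" + stuck[1]);
        int[] ok = run(Long.MAX_VALUE);
        System.out.println("head with demand: received=" + ok[0] + ", buffered=" + ok[1]);
    }
}
```

In this model, a head conversation with zero demand keeps every later message in the buffer, so the second conversation never receives anything even though its own demand is unbounded.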

@mp911de
Collaborator

mp911de commented Sep 25, 2024

Thanks a lot for chasing this issue down. Since you invested about 80% of the effort that is required to fix the issue, do you want to submit a pull request to clear out the cancelled conversations?

@alexeykurshakov
Author

I've never worked with the Reactor library (Mono, Flux) before. But I found that it's not easy to track down the source of a cancellation: an error in a parallel zip function, an ordinary cancel, or the cancellation from Mono.from(fluxPublisher).
For example

Mono.from(Flux.just(1, 2, 3).doOnCancel(() -> {
    System.out.println("fire");
})).subscribe();

prints "fire" on the first emitted item,
while

Flux.just(1, 2, 3).doOnCancel(() -> {
    System.out.println("fire");
}).subscribe();

never prints "fire".
If you can help me track down the type of cancellation, sure, I can make a pull request.

@chemicL

chemicL commented Sep 25, 2024

@alexeykurshakov these cancellations have reasonable explanations. A couple examples:

  • Mono.from(Publisher).subscribe() cancels the Publisher once the first item is emitted, as Mono expects at most one item to be emitted to the Subscriber.
  • Flux.just(T...).subscribe() has no reason to cancel at all, as multiple items adhere to the Flux specification.
  • Flux.zip(Publisher, Publisher).subscribe() will cancel the other Publisher once one of them completes/errors.
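The first two cases can be modelled without Reactor, using only the JDK's java.util.concurrent.Flow interfaces. This is a rough sketch of the semantics, not Reactor's implementation: `ThreeItems` and `Collector` are hypothetical names, and the publisher ignores the requested amount for brevity. A subscriber that only wants one item cancels its subscription after the first `onNext`, while a subscriber that accepts every item has no reason to cancel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Flow;
import java.util.concurrent.atomic.AtomicBoolean;

class CancellationSketch {

    /** Synchronous publisher of 1, 2, 3 that records whether it was cancelled. */
    static final class ThreeItems implements Flow.Publisher<Integer> {
        final AtomicBoolean cancelled = new AtomicBoolean();

        @Override
        public void subscribe(Flow.Subscriber<? super Integer> subscriber) {
            subscriber.onSubscribe(new Flow.Subscription() {
                private boolean started;

                @Override
                public void request(long n) {
                    // Simplification: any request is treated as unbounded demand.
                    if (started) {
                        return;
                    }
                    started = true;
                    for (int i = 1; i <= 3 && !cancelled.get(); i++) {
                        subscriber.onNext(i);
                    }
                    if (!cancelled.get()) {
                        subscriber.onComplete();
                    }
                }

                @Override
                public void cancel() {
                    cancelled.set(true);
                }
            });
        }
    }

    /** Collects items; with firstOnly it cancels upstream after the first item. */
    static final class Collector implements Flow.Subscriber<Integer> {
        final boolean firstOnly;
        final List<Integer> items = new ArrayList<>();
        private Flow.Subscription subscription;

        Collector(boolean firstOnly) {
            this.firstOnly = firstOnly;
        }

        @Override
        public void onSubscribe(Flow.Subscription subscription) {
            this.subscription = subscription;
            subscription.request(Long.MAX_VALUE);
        }

        @Override
        public void onNext(Integer item) {
            items.add(item);
            if (firstOnly) {
                subscription.cancel(); // a single-value consumer needs nothing more
            }
        }

        @Override
        public void onError(Throwable throwable) {
        }

        @Override
        public void onComplete() {
        }
    }

    /** Returns true if the source saw a cancel signal for the given consumer style. */
    static boolean cancelledWhen(boolean firstOnly) {
        ThreeItems source = new ThreeItems();
        source.subscribe(new Collector(firstOnly));
        return source.cancelled.get();
    }
}
```

Here `cancelledWhen(true)` corresponds to the Mono.from(...) style of consumption and `cancelledWhen(false)` to the plain Flux.just(...) subscription.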

For inspiration regarding test cases, perhaps you can use my examples with mocks. This was part of the investigation whether the r2dbc-pool is responsible for the connection leaks in r2dbc/r2dbc-pool#198 (comment).

@alexeykurshakov
Author

@mp911de

.as(source -> Operators.discardOnCancel(source, () -> {
If you look in SimpleQueryMessageFlow.exchange, the original cancellation is just ignored. I don't understand what the correct behaviour is.
Why do you discard the cancellation with Operators.discardOnCancel, and what should .doOnDiscard(ReferenceCounted.class, ReferenceCountUtil::release) do?

@mp911de
Collaborator

mp911de commented Sep 26, 2024

Operators.discardOnCancel is to drain protocol frames off the transport so that we can finalize the conversation with the server. If we just cancelled the consumption, then response frames from an earlier conversation would remain on the transport and feed into the next conversation.
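The idea of draining on cancel can be illustrated with a small sketch. It is a toy model under stated assumptions, not the driver's code: the transport is a plain queue of strings, the "A:" and "B:" prefixes are hypothetical tags marking which conversation a frame belongs to, and a frame ending in ReadyForQuery marks the end of a conversation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class DiscardOnCancelSketch {

    static final String END = "ReadyForQuery";

    /** Reads frames up to and including the next end-of-conversation frame. */
    static List<String> readUntilEnd(Queue<String> transport) {
        List<String> frames = new ArrayList<>();
        String frame;
        while ((frame = transport.poll()) != null) {
            frames.add(frame);
            if (frame.endsWith(END)) {
                break;
            }
        }
        return frames;
    }

    /** Returns the frames that conversation B observes after A was cancelled early. */
    static List<String> secondConversationSees(boolean drainOnCancel) {
        Queue<String> transport = new ArrayDeque<>(List.of(
                "A:DataRow", "A:DataRow", "A:" + END,
                "B:DataRow", "B:" + END));

        transport.poll(); // A consumes a single row, then its consumer cancels

        if (drainOnCancel) {
            readUntilEnd(transport); // discard the rest of A's response
        }
        return readUntilEnd(transport);
    }

    public static void main(String[] args) {
        System.out.println("without drain: " + secondConversationSees(false));
        System.out.println("with drain:    " + secondConversationSees(true));
    }
}
```

Without the drain step, the second conversation reads the leftover frames of the first one; with it, the second conversation starts at its own frames.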

@alexeykurshakov
Author

Sounds like it should work, but it doesn't 🤣. In the issue's example, badThread never consumes data and sends the cancel signal after real data has already been fed to ReactorNettyClient, which leads to those messages being saved in its internal buffer. So in that example the discard happens too late.

@alexeykurshakov
Author

I can provide a timeline of what happened. And then we'll figure out how to fix it.
