Fix S3Store retry might cause poisoned data #1383

Merged

Conversation

@allada (Member) commented on Oct 2, 2024

When S3Store and VerifyStore are used together, S3Store could receive data that VerifyStore had already deemed invalid, due to the retry logic.

This only affects S3Store + VerifyStore because of the way the AwsS3Sdk crate works: we need to hold recent data in the BufChannel. When VerifyStore detected an invalid hash, the retry logic in S3Store would trigger, but instead of surfacing the error, the channel would replay the data it had buffered and send it again, so S3 would still receive the invalid data. This PR makes the BufChannel logic set a flag so that the next read in S3Store always triggers the error condition.
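As a rough illustration of the mechanism described above (the struct, fields, and method names here are simplified stand-ins, not the actual buf_channel API): once the stream is reset after a failure, a flag is recorded so that any subsequent read surfaces the error instead of replaying previously buffered data.

```rust
use std::collections::VecDeque;
use bytes::Bytes;

// Simplified, hypothetical model of the read half of the buffer channel;
// field and method names are illustrative only.
struct BufChannelReader {
    queued_data: VecDeque<Bytes>,
    recent_data: VecDeque<Bytes>, // held so a retry can replay already-consumed data
    bytes_received: u64,
    last_err: Option<String>,     // the "poison" flag, set once the stream fails
}

impl BufChannelReader {
    /// Called when a downstream consumer (e.g. VerifyStore) rejects the stream.
    fn set_err(&mut self, err: String) {
        self.queued_data.clear();
        self.recent_data.clear();
        self.bytes_received = 0;
        // Remember the failure so a retrying reader cannot silently replay
        // the now known-bad buffered data.
        self.last_err = Some(err);
    }

    /// Every read checks the flag first, so the S3 retry path sees the error
    /// instead of re-reading stale data.
    fn recv(&mut self) -> Result<Option<Bytes>, String> {
        if let Some(err) = &self.last_err {
            return Err(err.clone());
        }
        Ok(self.queued_data.pop_front())
    }
}
```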



@allada force-pushed the fix-buf_channel-reset-stream-failure branch from d6801d2 to b71f04f on October 2, 2024 06:05
@allada (Member, Author) left a comment


+@adam-singer

Reviewable status: 0 of 1 LGTMs obtained, and 0 of 3 files reviewed, and pending CI: pre-commit-checks (waiting on @adam-singer)


nativelink-store/src/s3_store.rs line 437 at r1 (raw file):

                .retrier
                .retry(unfold(reader, move |mut reader| async move {
                    let UploadSizeInfo::ExactSize(sz) = upload_size else {

fyi: Just seemed weird to do this in retry. Not related to this PR though.
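To make the remark concrete, here is a self-contained sketch (with hypothetical types; the real Retrier and UploadSizeInfo APIs may differ) of the alternative being hinted at: the exact-size check does not change between attempts, so it could run once before the retry loop rather than inside the retried closure.

```rust
#[derive(Clone, Copy)]
enum UploadSizeInfo {
    ExactSize(u64),
    MaxSize(u64),
}

async fn upload_with_retries(upload_size: UploadSizeInfo, max_attempts: u32) -> Result<(), String> {
    // Hoisted out of the retry loop: this can only ever fail the same way on
    // every attempt, so re-evaluating it inside the retried closure adds nothing.
    let UploadSizeInfo::ExactSize(sz) = upload_size else {
        return Err("S3 upload requires an exact size".to_string());
    };

    // Stand-in for the retrier: each iteration is one upload attempt.
    for attempt in 0..max_attempts {
        if try_upload(sz, attempt).await {
            return Ok(());
        }
    }
    Err("all upload attempts failed".to_string())
}

// Hypothetical single-attempt upload, present only so the sketch is complete.
async fn try_upload(_sz: u64, attempt: u32) -> bool {
    attempt > 0 // pretend the first attempt fails and a later one succeeds
}
```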

@adam-singer (Member) left a comment


:lgtm:

Reviewed 2 of 3 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and all files reviewed, and pending CI: Installation / macos-13, Remote / large-ubuntu-22.04, and 1 discussion needs to be resolved


nativelink-util/src/buf_channel.rs line 222 at r2 (raw file):

                self.queued_data.clear();
                self.recent_data.clear();
                self.bytes_received = 0;

Should this be part of some sort of "reset stream" or "reset channel" type function? Would there be other cases where we need to reset/clear?

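For what the suggested extraction might look like (purely illustrative field names; the actual buf_channel internals differ), a single named helper would keep every future reset path clearing the same fields in the same way:

```rust
use std::collections::VecDeque;
use bytes::Bytes;

// Hypothetical holder of the channel's buffered state.
struct ChannelState {
    queued_data: VecDeque<Bytes>,
    recent_data: VecDeque<Bytes>,
    bytes_received: u64,
}

impl ChannelState {
    /// One "reset stream" entry point, so any code path that needs to discard
    /// buffered state goes through the same function.
    fn reset_stream(&mut self) {
        self.queued_data.clear();
        self.recent_data.clear();
        self.bytes_received = 0;
    }
}
```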

@allada (Member, Author) left a comment


Reviewable status: :shipit: complete! 1 of 1 LGTMs obtained, and all files reviewed


nativelink-util/src/buf_channel.rs line 222 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

Should this be part of some sort of "reset stream" or "reset channel" type function? Would there be other cases where we need to reset/clear?

I originally had it this way, but then realized the only way an error can happen is if it drops early. We don't expose a way for an error to be sent from the sender side.
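For context on that reply, a minimal sketch (assuming an ordinary tokio mpsc channel underneath; the real buf_channel differs) of why an early drop is the only error a receiver can observe when the sender has no way to send an explicit error:

```rust
use bytes::Bytes;
use tokio::sync::mpsc;

/// Hypothetical receiver-side helper: the sender can only send data chunks or
/// an empty chunk as an EOF marker, so the only detectable failure is the
/// sender half being dropped before EOF arrives.
async fn next_chunk(
    rx: &mut mpsc::Receiver<Bytes>,
    got_eof: &mut bool,
) -> Result<Option<Bytes>, String> {
    match rx.recv().await {
        Some(chunk) if chunk.is_empty() => {
            *got_eof = true; // clean end of stream
            Ok(None)
        }
        Some(chunk) => Ok(Some(chunk)),
        None if *got_eof => Ok(None), // sender dropped after finishing: fine
        None => Err("sender dropped before EOF; treat the stream as failed".to_string()),
    }
}
```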

@allada merged commit e6eb5f7 into TraceMachina:main on Oct 2, 2024
31 checks passed
@allada deleted the fix-buf_channel-reset-stream-failure branch on October 2, 2024 15:44