Add inner_split() methods for bootstrap #488

Merged
merged 2 commits into from
May 23, 2024

hfrick commented May 23, 2024

Unlike for the inner_split() methods in #483, it's not obvious what the splitting mechanism should be here.

Our main concern is data leakage:
The bootstrap sample in a boot_split or group_boot_split likely contains several copies of the same observation. We don't want those copies to be split between the inner analysis set and the inner assessment set, because the model would then be assessed on rows it was fitted on. Option 1 in the graph below is an example of that.
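
To make the leakage concern concrete, here is a minimal simulation in plain Python. It does not use rsample; the helper names bootstrap_rows and naive_inner_split are hypothetical, and the 75/25 proportion is just an assumed value for illustration. It cuts a bootstrap sample at random, ignoring duplicate row IDs, and counts how many row IDs end up on both sides.

```python
import random

def bootstrap_rows(n_rows, rng):
    """Draw a bootstrap sample of row IDs (with replacement, same size as the data)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def naive_inner_split(analysis_ids, prop, rng):
    """Option 1 style split: shuffle the bootstrap sample and cut it,
    ignoring the fact that some row IDs occur more than once."""
    shuffled = analysis_ids[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * prop)
    return shuffled[:cut], shuffled[cut:]

rng = random.Random(488)
outer_analysis = bootstrap_rows(100, rng)  # outer analysis set, with duplicates
inner_analysis, inner_assessment = naive_inner_split(outer_analysis, 0.75, rng)

# Row IDs present on both sides are the leaked observations.
leaked = set(inner_analysis) & set(inner_assessment)
print(f"{len(leaked)} row IDs appear on both sides of the inner split")
```

Any row ID in leaked is an observation the inner model would both see during fitting and be assessed on.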

Options 2 and 3 in the graph try to avoid this:

  • Option 2: do a grouped resampling with the row ID as the group (or, for group_boot_split, the original group combined with the row ID). As a result, rows in the inner assessment set are potentially not unique, unlike in the typical bootstrap OOB sample.
  • Option 3: sample with replacement from the pool of unique rows in the (outer) analysis set. This avoids both gripes: the leakage of option 1 and the non-unique assessment rows of option 2. The inner analysis set is not sampled from exactly the same distribution as the outer analysis set, but arguably it can't be, by definition? It should be fairly close, though?
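
The difference between options 2 and 3 can be sketched in plain Python as follows. This is not the rsample implementation; the function names are hypothetical, and the number of draws in option 3 (one per unique row) is an assumption made only for this illustration.

```python
import random
from collections import Counter

def option2_grouped(analysis_ids, rng):
    """Option 2 sketch: bootstrap the row IDs as groups, so all copies
    of a row travel together into the same inner set."""
    copies = Counter(analysis_ids)
    groups = sorted(copies)
    drawn = [rng.choice(groups) for _ in groups]  # resample groups with replacement
    inner_analysis = [g for g in drawn for _ in range(copies[g])]
    left_out = set(groups) - set(drawn)
    # Every copy of a left-out row lands here, so rows may repeat.
    inner_assessment = [i for i in analysis_ids if i in left_out]
    return inner_analysis, inner_assessment

def option3_unique_pool(analysis_ids, rng):
    """Option 3 sketch: bootstrap the pool of unique rows, so the
    inner assessment set contains each row at most once."""
    pool = sorted(set(analysis_ids))
    inner_analysis = [rng.choice(pool) for _ in pool]  # assumed size: len(pool)
    chosen = set(inner_analysis)
    inner_assessment = [i for i in pool if i not in chosen]
    return inner_analysis, inner_assessment

rng = random.Random(488)
outer_analysis = [rng.randrange(100) for _ in range(100)]  # outer bootstrap sample
for name, split in [("option 2", option2_grouped), ("option 3", option3_unique_pool)]:
    ana, ass = split(outer_analysis, rng)
    print(name, "| overlap:", len(set(ana) & set(ass)),
          "| assessment rows:", len(ass),
          "| unique assessment rows:", len(set(ass)))
```

In both sketches the overlap is empty by construction. Only option 3 guarantees that the number of assessment rows equals the number of unique assessment rows.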

Further thoughts:
Each bootstrap sampling essentially puts about 1/3 of the rows into the assessment set. This could hurt us quickly, especially for small data. "Small data" is a problem for all other sampling procedures as well, but they usually have a dial to turn to affect that proportion.
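
For reference, the "about 1/3" figure follows from the chance that a given row is never drawn in n draws with replacement from n rows, which is (1 - 1/n)^n and tends to 1/e. A small check in plain Python (the function name is hypothetical):

```python
import math

def expected_oob_fraction(n):
    """Probability that a given row is never drawn in n draws
    with replacement from n rows, i.e. ends up out-of-bag."""
    return (1 - 1 / n) ** n

for n in (10, 30, 100, 1000):
    print(n, round(expected_oob_fraction(n), 3))
print("limit 1/e:", round(math.exp(-1), 3))
```

The fraction stays just below 1/e (roughly 0.37) across these sizes, so there is indeed no dial: every bootstrap layer sets aside a bit more than a third of its distinct rows. If the inner split is itself a bootstrap of the unique rows, as in option 3, only about 0.63 × 0.63 ≈ 0.4 of the original rows would be expected to appear in the inner analysis set, assuming the inner sample has as many draws as there are unique rows.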
People could abandon fidelity to the bootstrap idea here by specifying a different sampling procedure for the inner split in add_tailor() when making their workflow.

[Figure: bootstrap-potato-split]

hfrick merged commit 7d35a8e into main May 23, 2024
12 checks passed
hfrick deleted the inner_split-bootstrap branch May 23, 2024 14:07

github-actions bot commented Jun 7, 2024

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

github-actions bot locked and limited conversation to collaborators Jun 7, 2024