Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Could make_strata() warn (or remove the strata attribute) when only returning a single strata? Or message when pooling at all? #441

Open
mikemahoney218 opened this issue Jul 27, 2023 · 1 comment
Labels
feature a feature request or enhancement

Comments

@mikemahoney218
Copy link
Member

Feature

I was reminded about #438 by the GitHub lock bot, an issue where a user was surprised that vfold_cv() (and eventually make_strata()) "didn't stratify" (or rather, treated the data as only having one stratum) when the stratification variable only had one class above the pooling threshold.

I think rsample is doing the right thing here, and behaving as documented, but this behavior is still a bit surprising. Would it be possible for make_strata() to warn when it only returns a single stratum? I imagine this is almost always unintentional, as users wouldn't specify a stratification variable if they thought it would go unused.

Another consideration here is that, even if only one stratum is created, the rset objects still contain a strata attribute. As a result, when printed these objects claim that they were created "using stratification":

data.frame(
  x = rnorm(100), 
  y = c(rep("a", 99), "b")
) |> 
  rsample::vfold_cv(strata = y)
#> #  10-fold cross-validation using stratification 
#> # A tibble: 10 × 2
#>    splits          id    
#>    <list>          <chr> 
#>  1 <split [90/10]> Fold01
#>  2 <split [90/10]> Fold02
#>  3 <split [90/10]> Fold03
#>  4 <split [90/10]> Fold04
#>  5 <split [90/10]> Fold05
#>  6 <split [90/10]> Fold06
#>  7 <split [90/10]> Fold07
#>  8 <split [90/10]> Fold08
#>  9 <split [90/10]> Fold09
#> 10 <split [90/10]> Fold10

Created on 2023-07-27 with reprex v2.0.2

This might be a bit misleading, as the sampling here didn't depend on the y value at all. Would it make sense to drop the strata attribute if only one stratum is created?

Finally, would it make sense for the categorical branch of make_strata to provide a message listing the categories that get "pooled" together, and which stratum they were pooled into? This might help users catch processing mistakes, if they weren't expecting to have any rare classes that would get automatically pooled. This might be too noisy though, and not as useful as warning about "single stratum" cases.

@hfrick
Copy link
Member

hfrick commented Nov 1, 2023

Re warning: We could definitely warn about "single stratum".

Re removing the attribute: I'll look into how that plays out later on and if it's more important to record "state" or "intention" (leaning towards "state").

Re messaging: uh yeah, that could get noisy. Maybe with something like tune's issue cataloger but we'd have to decide if that's worth the effort.

@hfrick hfrick added the feature a feature request or enhancement label Nov 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

2 participants