Weird edge case for check_heteroskedasticity plots #408

Open
Tracked by #643
mattansb opened this issue Mar 24, 2022 · 6 comments

@mattansb
Member

What's this weirdness?

library(performance)
#> Warning: package 'performance' was built under R version 4.1.3
library(see)

set.seed(1)
x <- rpois(360, 1.7)
y <- x + rnorm(length(x))

m <- lm(y ~ x)

plot(check_heteroskedasticity(m))
#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
#> parametric, : pseudoinverse used at -0.068209
#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
#> parametric, : neighborhood radius 2.0697
#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
#> parametric, : reciprocal condition number 1.9085e-015
#> Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
#> parametric, : There are other near singularities as well. 4.1579
#> Warning in predLoess(object$y, object$x, newx = if
#> (is.null(newdata)) object$x else if (is.data.frame(newdata))
#> as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
#> -0.068209
#> Warning in predLoess(object$y, object$x, newx = if
#> (is.null(newdata)) object$x else if (is.data.frame(newdata))
#> as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
#> 2.0697
#> Warning in predLoess(object$y, object$x, newx = if
#> (is.null(newdata)) object$x else if (is.data.frame(newdata))
#> as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
#> number 1.9085e-015
#> Warning in predLoess(object$y, object$x, newx = if
#> (is.null(newdata)) object$x else if (is.data.frame(newdata))
#> as.matrix(model.frame(delete.response(terms(object)), : There are other near
#> singularities as well. 4.1579

plot(x, y)

Created on 2022-03-24 by the reprex package (v2.0.1)

@bwiernik bwiernik self-assigned this Mar 24, 2022
@strengejacke strengejacke added the 3 investigators ❔❓ Need to look further into this issue label Mar 24, 2022
@strengejacke
Member

Could be related to #642, where @bwiernik suggested using a different smooth function for "non-continuous" scales.

@bwiernik
Contributor

Yeah

@strengejacke
Member

For these plots, we use fitted() for the x-axis and scaled residuals for the y-axis. But how do we decide whether the x-axis is "categorical"? E.g., adding a continuous variable to the model makes the plot look much more "usual":

set.seed(1)
d <- data.frame(x = rpois(360, 1.7), x2 = rnorm(360))
d$y <- d$x + rnorm(length(d$x))

m <- lm(y ~ x + x2, data = d)
performance::check_heteroscedasticity(m) |> plot()

fitted() in the above case returns 360 unique values, the same as the number of observations.

For Mattan's example, fitted() returns 27 unique values, far fewer than the 360 observations:

set.seed(1)
d <- data.frame(x = rpois(360, 1.7))
d$y <- d$x + rnorm(length(d$x))

m <- lm(y ~ x, data = d)
length(unique(fitted(m)))
#> [1] 27

We must either find a way to determine the "spread" of data points across the x-axis (even in the first example, they all "spread" around integer values), or decide that the fitted values need at least x% unique values relative to nobs.
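
The second option could be sketched roughly as follows. This is only an illustration, not what performance currently does: the helper name `prop_unique_fitted()` and its `digits` argument are hypothetical, and rounding the fitted values is an assumption made here so that values differing only by floating-point noise count as one value.

```r
# Hypothetical helper (not part of performance): share of distinct fitted
# values relative to the number of observations.
prop_unique_fitted <- function(model, digits = 8) {
  f <- stats::fitted(model)
  # Assumption: round first, so that fitted values differing only by
  # floating-point noise are treated as the same value.
  length(unique(round(f, digits))) / stats::nobs(model)
}

set.seed(1)
d <- data.frame(x = rpois(360, 1.7), x2 = rnorm(360))
d$y <- d$x + rnorm(nrow(d))

m1 <- lm(y ~ x, data = d)       # fitted values cluster on a few points
m2 <- lm(y ~ x + x2, data = d)  # fitted values are (nearly) all distinct

prop_unique_fitted(m1)
prop_unique_fitted(m2)
```

A cutoff on this proportion would still have to be chosen; nothing in this thread settles what a sensible x% is.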

@bwiernik
Contributor

I think that plot is fine, even though it's sort of clustered

I think if it's either of these cases:

  1. It's a discrete model like Poisson or Binomial or Negative Binomial or ordinal (though I think we already have a different plot for binomial)
  2. The number of discrete fitted values is "small", maybe 10 or fewer?

And maybe let's have an argument that can be set to force one form or the other?
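
A minimal sketch of that rule, to make the proposal concrete. Everything here is hypothetical: `use_discrete_plot()`, `max_unique`, and `discrete` are names invented for illustration and are not arguments of any performance function. The family check only covers models with a `family()` method (so ordinal models are not handled), and the rounding of fitted values is an assumption.

```r
# Hypothetical helper (not part of performance) sketching the rule above.
use_discrete_plot <- function(model, max_unique = 10, discrete = NULL) {
  # An explicit TRUE/FALSE from the user overrides the heuristic.
  if (!is.null(discrete)) {
    return(isTRUE(discrete))
  }

  # Case 1: a discrete model family. Models without a family() method
  # (e.g. ordinal models) simply fall through to case 2.
  fam <- tryCatch(stats::family(model)$family, error = function(e) NULL)
  discrete_family <- is.character(fam) && length(fam) == 1 &&
    grepl("poisson|binomial", fam, ignore.case = TRUE)

  # Case 2: a "small" number of distinct fitted values. Rounding is an
  # assumption, so floating-point noise does not inflate the count.
  n_unique <- length(unique(round(stats::fitted(model), 8)))

  discrete_family || n_unique <= max_unique
}

set.seed(1)
x <- rpois(360, 1.7)
y <- x + rnorm(length(x))

m_lm  <- lm(y ~ x)                       # few distinct fitted values
m_glm <- glm(x ~ y, family = poisson())  # discrete family

use_discrete_plot(m_lm)
use_discrete_plot(m_glm)
use_discrete_plot(m_lm, discrete = FALSE)  # forced by the user
```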

@mattansb
Member Author

> I think if it's either of these cases:
>
>   1. It's a discrete model like Poisson or Binomial or Negative Binomial or ordinal (though I think we already have a different plot for binomial)

But a discrete model ≠ discrete predictions (fitted values), so is this necessary?

@bwiernik
Contributor

Yeah actually thinking about it, the homogeneity of variance plot really only applies to Gaussian models.

Maybe we detect based on the predictors all being factors and/or binary?
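
A rough sketch of what such a predictor-based check might look like. The helper `all_predictors_categorical()` is hypothetical and not part of performance, and treating character and logical columns like factors is an assumption. It also ignores extra model-frame columns such as weights or offsets.

```r
# Hypothetical helper (not part of performance): TRUE if every predictor in
# the model frame is a factor, character, or logical, or takes at most two
# distinct values.
all_predictors_categorical <- function(model) {
  mf <- stats::model.frame(model)
  # Drop the response column; the remaining columns are the predictors.
  resp <- attr(stats::terms(model), "response")
  preds <- if (resp > 0) mf[-resp] else mf
  is_cat <- vapply(preds, function(v) {
    is.factor(v) || is.character(v) || is.logical(v) ||
      length(unique(stats::na.omit(v))) <= 2
  }, logical(1))
  all(is_cat)
}

set.seed(1)
d <- data.frame(
  x = rpois(360, 1.7),
  g = factor(sample(c("a", "b", "c"), 360, replace = TRUE)),
  b = rbinom(360, 1, 0.5)
)
d$y <- d$x + rnorm(nrow(d))

all_predictors_categorical(lm(y ~ g + b, data = d))  # factor + binary
all_predictors_categorical(lm(y ~ x, data = d))      # count predictor
```

Note that a count predictor like the `x` in the original example is neither a factor nor binary, so a check of this kind would not flag it on its own.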
