Great Idea #1

Open
azev77 opened this issue Apr 28, 2020 · 6 comments

Comments

@azev77

azev77 commented Apr 28, 2020

Hi @kleinschmidt,
I think the Julia ecosystem would benefit from something like this!
If we wanna do serious stats it should be easy to automatically generate all interactions (up to order n) etc.

Some things I find particularly useful in my other stats packages outside Julia:

  1. "i.x1" makes x1 into a factor variable in a formula
    Suppose x1 takes the values: 1.2, 5, 6.4
    reg y x1: treats x1 as continuous & returns 1 coef (assuming no intercept)
    reg y i.x1: creates 3 dummies, one for each level of x1, & returns 3 coefficients
    (if there is an intercept it drops one level, by default the lowest, unless the user chooses which level to drop)

  2. i.x1#(c.x2 i.x3)
    Interacts all dummies of x1 w/ x2 (continuous)
    Interacts all dummies of x1 w/ all dummies of x3

  3. Leads & Lags. Suppose D is at the state-year level.
    L.D: creates a 1 year lag of D
    L(4).D: creates a 4 year lag of D
    F(4).D: creates a 4 year lead of D. $D_{t+4}$
    reg y F(-1 0 1 2).D
    estimates: $y_t = b_{-1} D_{t-1} + b_{0} D_{t} + b_{1} D_{t+1} + b_{2} D_{t+2}$

If Julia is to be "as easy for statistics as R" these features should be in StatsModels.
I'd love to help if I can.

@kleinschmidt
Owner

Thanks for your kind words! I suspect that many of your needs are already met by StatsModels.jl, even if the syntax is a little different.

For instance, you can pass "hints" when constructing a schema to indicate that you want to treat a variable as a CategoricalTerm. You can also fit interactions like that using the R-style * or & (instead of : which is special in Julia).

The way that StatsModels.jl is designed, any kind of special syntax needs to be in the form of a function call. So using . or other reserved characters like # (or : as in R) is likely a non-starter.

Lead/lag are already supported by StatsModels.jl (you can do lead(x, 3) in a formula), see the docs here: https://juliastats.org/StatsModels.jl/stable/temporal_terms/
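
For readers following along, here is a minimal sketch of the schema-hint approach mentioned above. The data frame is made up for illustration; the calls (`schema`, `apply_schema`, `modelcols`) follow the StatsModels.jl docs as I understand them, so treat this as a sketch rather than the one recommended workflow.

```julia
using StatsModels, DataFrames

# Toy data (made up): x1 is numeric but takes only three distinct values.
df = DataFrame(y  = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
               x1 = [1.2, 5.0, 6.4, 1.2, 5.0, 6.4])

f = @formula(y ~ 1 + x1)

# Default schema: x1 is continuous, so the model matrix is intercept + 1 column.
f_cont = apply_schema(f, schema(f, df), StatisticalModel)
_, X_cont = modelcols(f_cont, df)

# With a hint, x1 is treated as categorical (roughly Stata's i.x1):
# intercept + one dummy per non-base level.
hints = Dict(:x1 => CategoricalTerm)
f_cat = apply_schema(f, schema(f, df, hints), StatisticalModel)
_, X_cat = modelcols(f_cat, df)

size(X_cont), size(X_cat)
```

The same pattern should carry over to interactions written with `&` or `*` in the formula.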

@azev77
Author

azev77 commented Apr 28, 2020

Thanks! I'm gonna have to study this.
Some early thoughts (recorded here):

  1. I don't see a minimalistic and convenient way of creating dummies here.

  2. I don't see easy ways to drop redundant dummies.
    In Stata: i.agegrp uses default base level of 1, b3.agegrp makes 3 the base level.
    In FixedEffectModels.jl
    reg(df, @formula(Sales ~ YearC); contrasts = Dict(:YearC => DummyCoding(base = 80)))

  3. It looks like
    Julia's & is Stata's #
    Julia's * is Stata's ##
    Julia's (a + b) & c is Stata's (a b) # c

  4. In FixedEffectModels.jl: fe(x1) is Stata's i.x1
    I feel like i.x1 is more convenient and parsimonious than fe(x1).
    I trust you if you say it's a non-starter.
    (does it still have to be a non-starter if it's inside formula()?)

  5. For lag(x, n) I can only do one lag at a time.
    Stata's F(-1 0 1 2).D
    Could be Julia's: lead(x, [-1 0 1 2]) if extended to allow multiple lags/leads
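
Until something like a vector argument exists, the Stata term F(-1 0 1 2).D can be spelled out one lead/lag at a time. This is a sketch on made-up data (no panel structure, just a single series), following the pattern in the temporal terms docs; a lead of -1 is written as lag(D, 1) and a lead of 0 as plain D.

```julia
using StatsModels, DataFrames

# Toy single time series (made up for illustration).
df = DataFrame(y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
               D = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

# One term per offset: D_{t-1}, D_t, D_{t+1}, D_{t+2}
f = @formula(y ~ lag(D, 1) + D + lead(D, 1) + lead(D, 2))
f = apply_schema(f, schema(f, df))

# Rows where a shifted value falls outside the data contain `missing`.
X = modelmatrix(f, df)
```

Note that the shifted columns contain `missing` at the edges, so those rows need to be dropped (or otherwise handled) before fitting.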

@kleinschmidt
Owner

I think it could be really useful to do a "rosetta stone" for people coming from different statistical software backgrounds. I've never used Stata or SAS for instance, and I (like many of the people involved with developing StatsModels.jl) have an R background, so many of the design decisions make sense to an R user but could be hard to translate...

  1. You're right, it's a bit clunky. Best way is to make the variable non-numeric in the data (for instance, using categorical! if it's stored in a dataframe...), or to specify contrasts (e.g., glm(f, data, contrasts = Dict(:my_var => DummyCoding()))).
  2. If I'm understanding correctly, redundant dummies are dropped by default, if the model context is <: StatisticalModel. See the bits of code in StatsModels.jl that handle FullRank schema types, and there's a bit about it in the docs (here, I think: https://juliastats.org/StatsModels.jl/stable/contrasts/#Further-details-1). Maybe a bit of terminology difference though: for whatever reason I find it more natural to think of "promoting" contrasts to full-dummy (k predictors for a k-valued variable), instead of "dropping" redundant dummies (dropping 1 predictor to have k-1 predictors)
  3. Again, don't know Stata so I'm trusting you here :) But there is a 'distributive rule' where (a+b)&c expands to a&c + b&c if that's what you mean...
  4. FixedEffectModels implements a lot of additional stuff on top of "base StatsModels", and fe has a special interpretation that (I think) is slightly different from just creating dummy predictors. But @matthieugomez can give you a better answer there than I can. The "dot" syntax is a non-starter because of a basic design choice, which is that all "special syntax" in a formula has to be a function call (as far as the Julia parser is concerned). That way we can always rely on the normal Julia mechanisms of multiple dispatch to overload syntax with special meaning in particular contexts.
  5. Yup. That's something we'd certainly consider as a PR though! If you really felt like digging into the StatsModels.jl internals :)
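
As an illustration of points 1 and 2, here is a sketch of choosing the base level through contrasts, roughly what b3.agegrp does in Stata. The data and the column name `agegrp` are made up; `DummyCoding(base = ...)` is the StatsModels.jl contrast coding, and the contrasts are passed here as schema hints rather than through a model-fitting function.

```julia
using StatsModels, DataFrames

# Toy data (made up): agegrp is an integer code with levels 1, 2, 3.
df = DataFrame(y      = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
               agegrp = [1, 2, 3, 1, 2, 3])

f = @formula(y ~ 1 + agegrp)

# Treat agegrp as categorical with level 3 as the base (omitted) level.
hints = Dict(:agegrp => DummyCoding(base = 3))
f3 = apply_schema(f, schema(f, df, hints), StatisticalModel)
_, X = modelcols(f3, df)

# Columns: intercept, dummy for level 1, dummy for level 2.
coefnames(f3.rhs)
```

Because the model has an intercept, only k-1 = 2 dummy columns are generated, and rows with agegrp == 3 are zero in both of them.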

@azev77
Author

azev77 commented Apr 30, 2020

  1. I do most of my stats in Stata, followed by R. I'm hoping to transition to Julia.
    I'd be happy to contribute to a "Rosetta stone".
    Here is the quantecon cheatsheet for Matlab/Python/Julia.
    The Stats cheatsheet for Stata/Pandas/Base R. This is where Julia can be added.
  2. I'm interested in working on this kind of PR.
    lead(x, [-1 0 1 2]) is very fundamental syntax that should be available to the whole ecosystem.

@azev77
Author

azev77 commented Sep 1, 2022

@kleinschmidt
It looks like @eirikbrandsaas implemented many of these in:

https://github.com/eirikbrandsaas/PanelDataTools.jl

@eirikbrandsaas

Thanks for the shout out, but to be clear the package only deals with the panel/time series issue of easy leads/lags/diffs creation. (I.e., only point 5 in #1 (comment))

Also, would be good if somebody other than me tried out the package so I know if it works :)


3 participants