Great Idea #1

azev77 · 2020-04-28T18:41:50Z

Hi @kleinschmidt,
I think the Julia ecosystem would benefit from something like this!
If we wanna do serious stats it should be easy to automatically generate all interactions (up to order n) etc.

Some things I find particularly useful in my other stats packages outside Julia:

"i.x1" makes x1 into a factor variable in a formula
Suppose x1 takes the values: 1.2, 5, 6.4
reg y x1: treats x1 as continuous & returns 1 coef (assuming no intercept)
reg y i.x1: creates 3 dummies, one for each level of x1, & returns 3 coefficients
(if there is an intercept it drops one level, by default the lowest, unless the user chooses which level to drop)
i.x1#(c.x2 i.x3)
Interacts all dummies of x1 w/ x2 (continuous)
Interacts all dummies of x1 w/ all dummies of x3
Leads & Lags. Suppose D is at the state-year level.
L.D: creates a 1 year lag of D
L(4).D: creates a 4 year lag of D
F(4).D: creates a 4 year lead of D. $D_{t+4}$
reg y F(-1 0 1 2).D
estimates: $y_t = b_{-1} D_{t-1} + b_{0} D_{t} + b_{1} D_{t+1} + b_{2} D_{t+2}$
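To make the factor-variable behaviour described above concrete, here is a minimal plain-Julia sketch of what an i.x1-style expansion produces. `dummy_columns` is a hypothetical helper written for this illustration; it is not part of StatsModels.jl or Stata, and it only assumes the levels of x1 can be sorted.

```julia
# Hypothetical helper (not a StatsModels.jl function): build one 0/1 column
# per level of x, optionally omitting a base level (as with an intercept).
function dummy_columns(x::AbstractVector; base = nothing)
    levels = sort(unique(x))
    base === nothing || base in levels ||
        throw(ArgumentError("base $base is not a level of x"))
    # Keep every level, or every level except the chosen base
    kept = base === nothing ? levels : filter(!=(base), levels)
    # Rows follow the observations, columns follow the kept levels
    X = [Float64(xi == lvl) for xi in x, lvl in kept]
    return kept, X
end

x1 = [1.2, 5.0, 6.4, 5.0]
levels_full, X_full = dummy_columns(x1)               # 3 columns (no intercept)
levels_drop, X_drop = dummy_columns(x1; base = 1.2)   # 2 columns, 1.2 is the base
```

With an intercept in the model, the second call is the one that avoids perfect collinearity; passing a different `base` plays the role of choosing which level to drop.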

If Julia is to be "as easy for statistics as R" these features should be in StatsModels.
I'd love to help if I can.

kleinschmidt · 2020-04-28T19:35:27Z

Thanks for your kind words! I suspect that many of your needs are already met by StatsModels.jl, even if the syntax is a little different.

For instance, you can pass "hints" when constructing a schema to indicate that you want to treat a variable as a CategoricalTerm. You can also fit interactions like that using the R-style * or & (instead of : which is special in Julia).

The way that StatsModels.jl is designed, any kind of special syntax needs to be in the form of a function call. So using . or other reserved characters like # (or : as in R) is likely a non-starter.

Lead/lag are already supported by StatsModels.jl (you can do lead(x, 3) in a formula); see the docs here: https://juliastats.org/StatsModels.jl/stable/temporal_terms/

azev77 · 2020-04-28T21:03:04Z

Thanks! I'm gonna have to study this.
Some early thoughts (recorded here):

1. I don't see a minimalistic and convenient way of creating dummies here.
2. I don't see easy ways to drop redundant dummies.
   In Stata: i.agegrp uses a default base level of 1, b3.agegrp makes 3 the base level.
   In FixedEffectModels.jl:
   reg(df, @formula(Sales ~ YearC); contrasts = Dict(:YearC => DummyCoding(base = 80)))
3. It looks like:
   Julia's & is Stata's #
   Julia's * is Stata's ##
   Julia's (a + b) & c is Stata's (a b) # c
4. In FixedEffectModels.jl: fe(x1) is Stata's i.x1
   I feel like i.x1 is more convenient and parsimonious than fe(x1).
   I trust you if you say it's a non-starter.
   (Does it still have to be a non-starter if it's inside formula()?)
5. For lag(x, n) I can only do one lag at a time.
   Stata's F(-1 0 1 2).D
   could be Julia's lead(x, [-1 0 1 2]) if extended to allow multiple lags/leads.
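As a rough sketch of what such a multi-offset lead could compute, here is a plain-Julia version. `lead_columns` is a hypothetical name used only for illustration; it is not an existing StatsModels.jl function, and it assumes a single series whose rows are consecutive time periods (panel data would need the shift applied within each unit).

```julia
# Hypothetical helper, not the StatsModels.jl API: one column per offset.
# Positive offsets are leads (x[t + k]), negative offsets are lags;
# rows that would reach outside the series are filled with `missing`.
function lead_columns(x::AbstractVector, offsets)
    n = length(x)
    X = Matrix{Union{Missing, eltype(x)}}(missing, n, length(offsets))
    for (j, k) in enumerate(offsets), t in 1:n
        if 1 <= t + k <= n
            X[t, j] = x[t + k]
        end
    end
    return X
end

D = [10, 20, 30, 40, 50]
X = lead_columns(D, [-1, 0, 1, 2])   # columns: D_{t-1}, D_t, D_{t+1}, D_{t+2}
```

Each column would then get its own coefficient, matching the b_{-1}, b_{0}, b_{1}, b_{2} terms of the Stata example.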

kleinschmidt · 2020-04-29T01:17:35Z

I think it could be really useful to do a "rosetta stone" for people coming from different statistical software backgrounds. I've never used Stata or SAS, for instance, and I (like many of the people involved with developing StatsModels.jl) have an R background, so many of the design decisions make sense to an R user but could be hard to translate...

1. You're right, it's a bit clunky. The best way is to make the variable non-numeric in the data (for instance, using categorical! if it's stored in a dataframe...), or to specify contrasts (e.g., glm(f, data, contrasts = Dict(:my_var => DummyCoding()))).
2. If I'm understanding correctly, redundant dummies are dropped by default if the model context is <: StatisticalModel. See the bits of code in StatsModels.jl that handle FullRank schema types, and there's a bit about it in the docs (here, I think: https://juliastats.org/StatsModels.jl/stable/contrasts/#Further-details-1). Maybe a bit of a terminology difference though: for whatever reason I find it more natural to think of "promoting" contrasts to full-dummy (k predictors for a k-valued variable), instead of "dropping" redundant dummies (dropping 1 predictor to have k-1 predictors).
3. Again, I don't know Stata so I'm trusting you here :) But there is a 'distributive rule' where (a+b)&c expands to a&c + b&c, if that's what you mean...
4. FixedEffectModels implements a lot of additional stuff on top of "base StatsModels", and fe has a special interpretation that (I think) is slightly different from just creating dummy predictors. But @matthieugomez can give you a better answer there than I can. The "dot" syntax is a non-starter because of a basic design choice, which is that all "special syntax" in a formula has to be a function call (as far as the Julia parser is concerned). That way we can always rely on the normal Julia mechanisms of multiple dispatch to overload syntax with special meaning in particular contexts.
5. Yup. That's something we'd certainly consider as a PR though! If you really felt like digging into the StatsModels.jl internals :)
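For the distributive rule mentioned above, a small plain-Julia illustration of the resulting columns may help. It assumes a, b and c are all continuous, and `interact` is a hypothetical stand-in for what `&` yields in that case; this is not how StatsModels.jl implements interaction terms.

```julia
# Plain-Julia illustration only: for two continuous predictors, the
# interaction column is their elementwise product.
interact(u, v) = u .* v

a = [1.0, 2.0, 3.0]
b = [0.5, 0.5, 2.0]
c = [2.0, 4.0, 6.0]

# (a + b) & c expands to a&c + b&c. In a formula, `+` separates terms,
# so the result is two columns (one per interaction), not a summed column.
X = hcat(interact(a, c), interact(b, c))
```

The point of the sketch is that (a + b) & c contributes two predictors, each with its own coefficient, rather than the single column (a .+ b) .* c.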

azev77 · 2020-04-30T19:24:49Z

I do most of my stats in STATA, followed by R. I'm hoping to transition to Julia.
I'd be happy to contribute to a "Rosetta stone".
Here is the quantecon cheatsheet for Matlab/Python/Julia.
The Stats cheatsheet for STATA/Pandas/Base R. This is where Julia can be added.
I'm interested in working on this kind of PR.
lead(x, [-1 0 1 2]) is very fundamental syntax that should be available to the whole ecosystem.

azev77 · 2022-09-01T05:10:23Z

@kleinschmidt
It looks like @eirikbrandsaas implemented many of these in:

https://github.com/eirikbrandsaas/PanelDataTools.jl

eirikbrandsaas · 2022-09-01T17:36:57Z

Thanks for the shout out, but to be clear the package only deals with the panel/time series issue of easy leads/lags/diffs creation. (I.e., only point 5 in #1 (comment))

Also, would be good if somebody other than me tried out the package so I know if it works :)
