Running regression with GLM.jl in Julia
Before we start, I encourage you to uncomment the code below and install any of the required packages for this tutorial that you do not already have.
using Plots, StatsPlots
#= uncomment below to install packages
using Pkg
Pkg.add(["RDatasets", "Pipe", "DataFrames", "GLM",
         "Plots", "StatsPlots", "Distributions"])
=#
Prior to running regression analyses, let me briefly go over three essential packages for handling data frames in Julia: Pipe, RDatasets, and DataFrames.
Pipe
The package Pipe improves on the base pipe operator (|>) in Julia. It allows you to represent the piped object with an underscore _ inside the functions that follow. For example, sqrt(sum([1, 2, 3])) involves two function calls – one applied to [1, 2, 3] and the other applied to sum([1, 2, 3]). We can pipe these calls, starting from the [1, 2, 3] object, using @pipe together with |>, as illustrated below.
using Pipe
@pipe [1, 2, 3] |> sum(_) |> sqrt(_)
2.449489742783178
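For comparison, the base operator alone can handle this particular chain, because sum and sqrt each take exactly one argument; @pipe earns its keep once the piped value needs to land somewhere other than the sole argument slot.
[1, 2, 3] |> sum |> sqrt    # base |>: each function receives the previous result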
The above code may be overkill for the simple task of taking the square root of a sum, but piping becomes very useful for handling a long series of commands. For more details on Pipe, please read the Julia tutorial on [pipe](https://syl1.gitbook.io/julia-language-a-concise-tutorial/useful-packages/pipe).
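As a small sketch of that underscore placement, using only functions from Base: here _ drops the piped vector into the second argument of map, which the base |> operator cannot do directly.
@pipe [1, 2, 3] |> map(x -> x^2, _) |> sum(_)    # square each element, then sum: 14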
RDatasets
The package RDatasets provides most of the base R datasets for Julia users to play around with. We will be using the dataset bfi from the R package psych in this tutorial to explore regression analyses in Julia.
Let's read in the dataset bfi and drop missing values.
using RDatasets
# show(RDatasets.datasets("psych"), allrows=true)
bfi = @pipe dataset("psych", "bfi") |> dropmissing(_)
println(names(bfi))
["Variable", "A1", "A2", "A3", "A4", "A5", "C1", "C2", "C3", "C4", "C5", "E1", "E2", "E3", "E4", "E5", "N1", "N2", "N3", "N4", "N5", "O1", "O2", "O3", "O4", "O5", "Gender", "Education", "Age"]
For the complete R documentation of this dataset, please refer to [this page](https://www.personality-project.org/r/html/bfi.html).
DataFrames
For simplicity, let's wrangle our data and sum the scores on the columns for agreeableness (A1 to A5) and for conscientiousness (C1 to C5).
Gender is a numeric variable in the original dataset, but it is better treated as a binary variable, so let's also convert it into a categorical array using categorical.
transform() serves a similar function to mutate() in R: it transforms a column and saves the result into a new column in a dataframe. The operator => is pretty intuitive. In the first transform statement below, we are asking Julia to take the columns between A1 and A5 in bfi, transform them by summing the scores in the same row (=> ByRow(+)), and then save the transformed data into a new column called Asum.
Note that to refer to a column in a dataframe, we place a colon : before the column name.
using DataFrames
bfi = @pipe bfi |>
    transform(_, Between(:A1, :A5) => ByRow(+) => :Asum) |>
    transform(_, Between(:C1, :C5) => ByRow(+) => :Csum) |>
    transform(_, :Gender => categorical => :Gender_bin)
describe(select(bfi, Between(:Asum, :Gender_bin)))
3×7 DataFrame
 Row │ variable    mean     min  median  max  nmissing  eltype
     │ Symbol      Union…   Any  Union…  Any  Int64     DataType
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Asum        21.1632  5    22.0    30   0         Int64
   2 │ Csum        19.0501  5    19.0    30   0         Int64
   3 │ Gender_bin           1            2    0         CategoricalValue{Int64,UInt32}
                                                                  1 column omitted
describe() provides summary statistics for the selected columns, here our newly created variables from Asum to Gender_bin. As can be seen in the summary table, Gender_bin has a data type of CategoricalValue{Int64,UInt32}. This way, in the following regression analyses, gender will be treated as a categorical variable and automatically dummy coded.
For data wrangling in Julia using the package DataFrames, this [cheatsheet](https://ahsmart.com/assets/pages/data-wrangling-with-data-frames-jl-cheat-sheet/DataFramesCheatSheet_v0.22_rev1.pdf) may come in handy. I am a beginner Julia and DataFrames user; if you have a better way to optimize my code, please let me know and I will be more than happy to learn it!
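One possible tightening, for what it's worth: transform() accepts several source => function => target pairs at once, so the three chained calls above can be collapsed into a single call (a sketch that should yield the same columns as the pipeline above).
bfi = transform(bfi,
    Between(:A1, :A5) => ByRow(+) => :Asum,
    Between(:C1, :C5) => ByRow(+) => :Csum,
    :Gender => categorical => :Gender_bin)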
Below are some summary statistics and a snapshot of the variables of interest for the following analyses.
describe(select(bfi, Between(:Education, :Gender_bin)))
5×7 DataFrame
 Row │ variable    mean     min  median  max  nmissing  eltype
     │ Symbol      Union…   Any  Union…  Any  Int64     DataType
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Education   3.19141  1    3.0     5    0         Int64
   2 │ Age         29.5103  3    26.0    86   0         Int64
   3 │ Asum        21.1632  5    22.0    30   0         Int64
   4 │ Csum        19.0501  5    19.0    30   0         Int64
   5 │ Gender_bin           1            2    0         CategoricalValue{Int64,UInt32}
                                                                  1 column omitted
first(select(bfi, Between(:Education, :Gender_bin)), 5)
5×5 DataFrame
Row │ Education Age Asum Csum Gender_bin
│ Int64 Int64 Int64 Int64 Cat…
─────┼────────────────────────────────────────────
1 │ 3 21 28 22 2
2 │ 2 19 14 15 1
3 │ 1 21 24 17 1
4 │ 1 17 14 19 1
5 │ 5 68 23 18 1
In our sample, do older people tend to be more agreeable than younger people? We can fit a linear model with Age as the predictor and Asum, the sum scores of agreeableness, as the outcome variable, and determine whether there is a positive association between these two variables.
The package GLM provides convenient functions, analogous to the R functions lm() and glm(), for specifying linear models in Julia.
To build a linear model, we use lm(), specify the model equation within @formula(), and indicate the dataset after the formula.
using GLM
lm1 = lm(@formula(Asum ~ Age), bfi)
StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Asum ~ 1 + Age
Coefficients:
───────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept) 20.4382 0.225823 90.51 <1e-99 19.9954 20.8811
Age 0.0245685 0.00719702 3.41 0.0007 0.010455 0.0386821
───────────────────────────────────────────────────────────────────────────
Unlike in R, we do not need to additionally summarize lm1 to get a summary table of the output.
Equivalent to lm(), we can use the fit() function with the LinearModel argument.
fit1 = fit(LinearModel, @formula(Asum ~ Age), bfi)
StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Asum ~ 1 + Age
Coefficients:
───────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept) 20.4382 0.225823 90.51 <1e-99 19.9954 20.8811
Age 0.0245685 0.00719702 3.41 0.0007 0.010455 0.0386821
───────────────────────────────────────────────────────────────────────────
lm(...) and fit(LinearModel, ...) do not differ in their type, output, or functionality. A quick check on their types:
typeof(lm1)
StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
typeof(fit1)
StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
which are essentially the same.
We can use coef(), stderror(), and vcov() to extract the coefficient estimates, the standard errors of the coefficients, and the estimated variance-covariance matrix of the coefficient estimates, respectively.
coef(lm1)
2-element Array{Float64,1}:
20.43821402633636
0.024568514618654096
stderror(lm1)
2-element Array{Float64,1}:
0.22582269777999403
0.007197020298763872
vcov(lm1)
2×2 Array{Float64,2}:
0.0509959 -0.00152855
-0.00152855 5.17971e-5
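As a quick sanity check tying these together (a sketch; diag comes from the standard-library LinearAlgebra package), the standard errors are the square roots of the diagonal of the variance-covariance matrix.
using LinearAlgebra
sqrt.(diag(vcov(lm1)))    # should reproduce stderror(lm1)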
To obtain statistics for model fit, we can use r2() for R^2 and deviance() for the weighted residual sum of squares.
r2(lm1)
0.005189310057134855
deviance(lm1)
29411.99403563621
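Since our model is unweighted, the deviance here is simply the residual sum of squares, which we can verify by hand (a sketch, assuming residuals() is available for the fitted model, as it is for GLM's linear models).
sum(abs2, residuals(lm1))    # should match deviance(lm1) above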
predict() computes the predicted value for each individual in our sample.
predict(lm1)
2236-element Array{Float64,1}:
20.9541528333281
20.905015804090787
20.9541528333281
20.85587877485348
22.10887302040484
21.10156392104002
20.880447289472134
20.92958431870944
21.69120827188772
21.248975008751945
⋮
21.15070095027733
21.248975008751945
21.003289862565406
20.978721347946752
20.978721347946752
21.02785837718406
21.15070095027733
21.199837979514637
21.666639757269067
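predict() also accepts new data as a table containing the predictor columns, which is handy for spot predictions; a sketch for a few hypothetical ages:
predict(lm1, DataFrame(Age = [20, 40, 60]))    # predicted Asum at ages 20, 40, and 60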
!!! note "Follow-up Practice (1)"
    Let's try to run a linear model to investigate whether conscientiousness is predicted by education. How much variance is explained by this model?
We may suspect that the change in agreeableness over years of age depends on someone's education level. To investigate this, we can add an interaction between Age and Education on Asum with Age & Education. Typically, when we add an interaction term to a model, we also include the main effects of the variables. A shorthand for specifying both the main effects and the interaction between the variables is Age * Education.
# the models below are equivalent
lm(@formula(Asum ~ Age + Education + Age & Education), bfi)
StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Asum ~ 1 + Age + Education + Age & Education
Coefficients:
────────────────────────────────────────────────────────────────────────────────────
                       Coef.  Std. Error      t  Pr(>|t|)   Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────────────
(Intercept)      19.4241       0.651563   29.81    <1e-99  18.1463     20.7018
Age               0.0831532    0.0211002   3.94     <1e-4   0.0417752   0.124531
Education         0.273547     0.200684    1.36    0.1730  -0.119999    0.667093
Age & Education  -0.016346     0.0061294  -2.67    0.0077  -0.0283659  -0.00432607
────────────────────────────────────────────────────────────────────────────────────
lm(@formula(Asum ~ Age * Education), bfi)
StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Asum ~ 1 + Age + Education + Age & Education
Coefficients:
────────────────────────────────────────────────────────────────────────────────────
                       Coef.  Std. Error      t  Pr(>|t|)   Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────────────
(Intercept)      19.4241       0.651563   29.81    <1e-99  18.1463     20.7018
Age               0.0831532    0.0211002   3.94     <1e-4   0.0417752   0.124531
Education         0.273547     0.200684    1.36    0.1730  -0.119999    0.667093
Age & Education  -0.016346     0.0061294  -2.67    0.0077  -0.0283659  -0.00432607
────────────────────────────────────────────────────────────────────────────────────
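To quantify what the interaction term adds, GLM also offers ftest() for comparing nested linear models; a sketch (note that ftest() takes the underlying LinearModel objects, here accessed via the .model field):
lm_main = lm(@formula(Asum ~ Age + Education), bfi)
lm_int  = lm(@formula(Asum ~ Age * Education), bfi)
ftest(lm_main.model, lm_int.model)    # F-test for adding Age & Education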
!!! note "Follow-up Practice (2)"
    Let's fit a model to explore whether there is an interaction between education and gender (Gender_bin) on conscientiousness. How much does the variance explained by the model increase compared to the model without the interaction (in reference to Practice (1))?
For illustrative purposes, I arbitrarily define a sum score of 20 or above (out of 30) as a high score on agreeableness and dichotomize Asum to 1 or 0 accordingly.
bfi = @pipe transform(bfi,
    :Asum => ByRow(x -> x >= 20 ? 1 : 0) => :Abin);
describe(select(bfi, :Abin))
1×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Float64 Int64 Float64 Int64 Int64 DataType
─────┼──────────────────────────────────────────────────────────────
1 │ Abin 0.70975 0 1.0 1 0 Int64
Now because our outcome variable is a binary variable, it is more appropriate to model it with a binomial distribution and a probit link.
lm_probit = glm(@formula(Abin ~ Age), bfi, Binomial(), ProbitLink())
StatsModels.TableRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Binomial{Float64},GLM.ProbitLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Abin ~ 1 + Age
Coefficients:
──────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept) 0.319024 0.0832706 3.83 0.0001 0.155816 0.482231
Age 0.00798517 0.00269249 2.97 0.0030 0.00270799 0.0132623
──────────────────────────────────────────────────────────────────────────
An equivalent way to specify this probit model is
fit(GeneralizedLinearModel, @formula(Abin ~ Age), bfi, Binomial(), ProbitLink())
StatsModels.TableRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Binomial{Float64},GLM.ProbitLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Abin ~ 1 + Age
Coefficients:
──────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept) 0.319024 0.0832706 3.83 0.0001 0.155816 0.482231
Age 0.00798517 0.00269249 2.97 0.0030 0.00270799 0.0132623
──────────────────────────────────────────────────────────────────────────
Our probit model is as follows. \[ \Phi^{-1}\big(P(\text{High Agreeableness})\big) = 0.319 + 0.008 \times \text{Age} \]
For a person of age 20, the predicted probit of scoring high on agreeableness is 0.479.
Note that the coefficients are on the probit (z-score) scale, and the prediction is a probit. To translate a probit back to a probability, we can use cdf() from the Distributions package. Normal() by default specifies a normal distribution with a mean of 0 and a standard deviation of 1.
using Distributions
cdf(Normal(), .479)
0.6840306856730872
The predicted probability of someone at age 20 scoring high on agreeableness is 68.4%.
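Putting the pieces together, a minimal sketch of this computation from the fitted coefficients:
b0, b1 = coef(lm_probit)
probit_20 = b0 + b1 * 20     # ≈ 0.479, the predicted probit at age 20
cdf(Normal(), probit_20)     # ≈ 0.684, the predicted probability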
An alternative way to handle binary outcome variables is through logistic regression.
lm_logit = glm(@formula(Abin ~ Age), bfi, Binomial(), LogitLink())
StatsModels.TableRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Binomial{Float64},GLM.LogitLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Abin ~ 1 + Age
Coefficients:
─────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept) 0.497558 0.139339 3.57 0.0004 0.224458 0.770657
Age 0.0135925 0.00455195 2.99 0.0028 0.00467084 0.0225141
─────────────────────────────────────────────────────────────────────────
Similarly, an equivalent way to specify this logistic model is
fit(GeneralizedLinearModel, @formula(Abin ~ Age), bfi, Binomial(), LogitLink())
StatsModels.TableRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Binomial{Float64},GLM.LogitLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Abin ~ 1 + Age
Coefficients:
─────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept) 0.497558 0.139339 3.57 0.0004 0.224458 0.770657
Age 0.0135925 0.00455195 2.99 0.0028 0.00467084 0.0225141
─────────────────────────────────────────────────────────────────────────
Now our logistic model is as follows. \[ \mathrm{logit}\big(P(\text{High Agreeableness})\big) = 0.498 + 0.014 \times \text{Age} \]
For a person of age 20, the predicted logit of scoring high on agreeableness is 0.778. The prediction is now represented as a logit. To make it more interpretable, we can simply exponentiate the logit to get the odds, or use the following formula to get the probability:
\[ \hat{\pi} = \frac{\exp(0.498 + 0.014 \times 20)}{1 + \exp(0.498 + 0.014 \times 20)} \]
where \(\hat{\pi}\) is the predicted probability of someone scoring high on agreeableness.
exp(.778)/(1 + exp(.778))
0.6852489081587367
Consistent with the result from the probit regression, the predicted probability of someone at age 20 scoring high on agreeableness is 68.5%.
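The same end-to-end check for the logistic model, as a sketch (with the full-precision coefficients the logit comes out near 0.769 rather than the hand-rounded 0.778, so the probability differs slightly in the third decimal):
b0, b1 = coef(lm_logit)
logit_20 = b0 + b1 * 20                  # predicted logit at age 20
exp(logit_20) / (1 + exp(logit_20))      # predicted probability, ≈ 0.68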
Let’s cover the answers of Follow-up Practice (1) and (2) here.
# Practice (1)
lm3 = fit(LinearModel, @formula(Csum ~ Education), bfi)
StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Csum ~ 1 + Education
Coefficients:
──────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept) 18.5942 0.176512 105.34 <1e-99 18.2481 18.9403
Education 0.14285 0.0522318 2.73 0.0063 0.0404219 0.245278
──────────────────────────────────────────────────────────────────────────
# Practice (2)
lm4 = fit(LinearModel, @formula(Csum ~ Education * Gender_bin), bfi)
StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Csum ~ 1 + Education + Gender_bin + Education & Gender_bin
Coefficients:
──────────────────────────────────────────────────────────────────────────────────────────
                                Coef.  Std. Error      t  Pr(>|t|)   Lower 95%   Upper 95%
──────────────────────────────────────────────────────────────────────────────────────────
(Intercept)                18.9793      0.287144   66.10    <1e-99  18.4162     19.5424
Education                   0.0693464   0.0844421   0.82    0.4116  -0.0962469   0.23494
Gender_bin: 2              -0.607298    0.364028   -1.67    0.0954  -1.32117     0.106571
Education & Gender_bin: 2   0.119855    0.10743     1.12    0.2647  -0.0908195   0.330529
──────────────────────────────────────────────────────────────────────────────────────────
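The variance questions in the two practices can be answered with r2(); a sketch using the models above (output omitted):
r2(lm3)              # Practice (1): proportion of variance explained
r2(lm4) - r2(lm3)    # Practice (2): increase over the model from Practice (1)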
Note that Gender_bin is dummy coded with male (1) as 0 and female (2) as 1. The default in GLM is dummy coding, but we can specify a different contrast coding. More details on contrast coding can be found [here](https://juliastats.org/StatsModels.jl/stable/contrasts/).
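As a hedged sketch of that option: EffectsCoding is one of the contrast types defined in StatsModels (add using StatsModels if it is not in scope after using GLM), and fit() accepts a contrasts dictionary.
fit(LinearModel, @formula(Csum ~ Education * Gender_bin), bfi,
    contrasts = Dict(:Gender_bin => EffectsCoding()))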
# extracting coefficients
lm3_b0, lm3_b1 = coef(lm3);
lm4_b0, lm4_b1, lm4_b2, lm4_b3 = coef(lm4);
We will use the package Plots for visualizing the relationships between the variables in the fitted models. The package StatsPlots is an extension of Plots that supports plotting with DataFrame objects.
@df bfi scatter(
    :Education,
    :Csum,
    group = :Gender,
    legend = :bottomright,
    label = ["Male" "Female"],
    xlabel = "Education",
    ylabel = "Conscientiousness")
plot!((x) -> lm3_b0 + lm3_b1 * x, 1, 5, label = "Linear Model")
@df bfi scatter(
    :Education,
    :Csum,
    group = :Gender,
    legend = :bottomright,
    label = ["Male" "Female"],
    xlabel = "Education",
    ylabel = "Conscientiousness",
    size = (690, 460))
plot!((x) -> lm4_b0 + lm4_b1 * x + lm4_b2 * 0 + lm4_b3 * x * 0,
    1, 5, label = "Male", linecolor = "blue")
plot!((x) -> lm4_b0 + lm4_b1 * x + lm4_b2 * 1 + lm4_b3 * x * 1,
    1, 5, label = "Female", linecolor = "red")
I also made reference to this webpage to create the plots above.
Perhaps because the "research questions" I posed here were not so interesting, there was not much of interest to observe in the analysis outputs and the plots. But I hope you enjoyed this tutorial on doing regression analyses with Julia :)
For attribution, please cite this work as
Tse (2021, Feb. 11). Measurement & Multilevel Modeling Lab: Regression with Julia!. Retrieved from https://mmmlab.rbind.io/posts/2021-02-09-julia-regression/
BibTeX citation
@misc{tse2021regression,
  author = {Tse, Winnie Wing-Yee},
  title = {Measurement & Multilevel Modeling Lab: Regression with Julia!},
  url = {https://mmmlab.rbind.io/posts/2021-02-09-julia-regression/},
  year = {2021}
}