An illustration of correlation attenuation when discretizing a continuous variable to an ordered categorical variable.
ρ(x,y)=COV(X,Y)σXσY,
where σ is the standard deviation.
# Set correlation between X and Y* to 0.5
rho <- 0.5
# Assume X and Y*~N(0,1) for now
sd_x <- 1
sd_y <- 1
cov_xy <- rho * sd_x * sd_y
# Simulate correlated X and Y*
df <- as.data.frame(
mvrnorm(
n = 1e4,
mu = c(0, 0),
Sigma = matrix(
c(
sd_x^2, cov_xy,
cov_xy, sd_y^2
),
ncol = 2
)
)
)
names(df) <- c("X", "Y*")
# Manually dichotomize Y* to 0 and 1
df <- df %>%
mutate(Y = ifelse(`Y*` > qnorm(0.7, mean(`Y*`), sd(`Y*`)), 1, 0))
# Show proportion of Y
knitr::kable(table(df$Y) / nrow(df),
col.names = c("Label", "Proportion"),
align = "c"
)
Label | Proportion |
---|---|
0 | 0.6988 |
1 | 0.3012 |
# Show correlations between X and Y*, and X and Y
knitr::kable(
cbind(cor(df$X, df$`Y*`), cor(df$X, df$Y)),
col.names = c("$\\rho_{(X, Y*)}$", "$\\rho_{(X, Y)}$"),
align = "c"
)
ρ(X,Y∗) | ρ(X,Y) |
---|---|
0.5110921 | 0.3819812 |
Attenuation Factor=COV(X,Y)COV(X,Y∗)∗√σ2Y∗σ2Y,
# Analytic calculation
thres <- qnorm(0.3)
var_ystar <- 1
var_y <- 0.7 * (1 - 0.7)
attenuation_bi <- dnorm(thres) * sqrt(var_ystar / var_y)
cor_xy_bi <- attenuation_bi * rho
# Simulated results
lat_cor <- cor(df$X, df$`Y*`)
obs_cor <- cor(df$X, df$Y)
att_fac <- (cov(df$X, df$Y) / cov(df$X, df$`Y*`)) * sqrt(var(df$`Y*`) / var(df$Y))
cal_cor <- att_fac * lat_cor
summary_1 <- round(c(rho, attenuation_bi, cor_xy_bi, att_fac, cal_cor), 3)
names(summary_1) <- c(
"Correlation_XY*", "Attenuation_Formula",
"Correlation_Formula", "Attenuation_Data",
"Correlation_Data"
)
knitr::kable(
summary_1,
align = "c",
col.names = " "
)
Correlation_XY* | 0.500 |
Attenuation_Formula | 0.759 |
Correlation_Formula | 0.379 |
Attenuation_Data | 0.747 |
Correlation_Data | 0.382 |
Attenuation Factor=E(XY)−E(X)E(Y)E(XY∗)−E(X)E(Y∗)∗√σ2Y∗σ2Y,
ρ(X,Y∗)=E(XY∗)√σ2Xσ2Y∗=E[(X−μxσX)(Y∗−μY∗σY∗)]=E(XY∗),
given that the correlation of X and Y∗ is their standardized covariance. Then E(XY) can be calculated using thresholds of the categorical variable.
fx,y∗(x,y∗)=12π√1−ρ2∗e−12(1−ρ2)∗[x2+y∗2−2ρxy∗],
and the expected value of XY∗ is:
E(XY∗)=∫∞−∞∫∞−∞xy∗fx,y∗(x,y∗)d(x)d(y∗),
E(XY)=E(X|Y∗>τ),
while E(XY)=E(X|Y∗<τ)=E(X∗0)=0. Further, the formula can be expanded as:
E(X|Y∗>τ)=∫∞−∞∫∞τxy∗fx,y∗(x,y∗)d(x)d(y∗)=∫∞−∞∫∞τx,y∗12π√1−ρ2∗e−12(1−ρ2)∗[x2+y∗2−2ρxy∗]d(x)d(y∗),
E(X|Y∗=a)=∫∞−∞xfX|Y∗(x|Y∗=y∗)d(x)=ρ∗a,
given that fX|Y∗(x|Y∗=y∗)=fX,Y∗(x,y∗)fY∗(y∗), and ρ is the correlation between X and Y∗.
Now we think of E(X|Y∗>a):
E(X|Y∗>a)=E(X−rY∗+rY∗|Y∗>a)COV(X−rY∗,Y∗)=E[(X−rY∗)Y∗]−E(X−rY∗)E(Y∗)COV(X−rY∗,Y∗)=E(XY∗)−rE(Y∗Y∗)−E(X)E(Y∗)+rE(Y∗)E(Y∗),
since we assume that X and Y∗ both follow N (0,1), therefore COV(X,Y∗)=E(XY∗)−E(X)E(Y∗)=r and VAR(Y∗)=E(Y∗)E(Y∗)−E(Y∗Y∗)=1. Then,
COV(X−rY∗,Y∗)=COV(X,Y∗)−r∗1=r−r=0,
We have shown that the correlation between X−rY∗ and Y∗ is 0, and thus E(X−rY∗) is independent of Y∗. Furthermore,
E(X|Y∗>a)=E(X−rY∗)+r∗E(Y∗|Y∗>a)=E(X)−r∗E(Y∗)+r∗E(Y∗|Y∗>a)=r∗E(Y∗|Y∗>a),
According to the definition of conditional expectation,
E(Y∗|Y∗>a)=∫∞ay∗ϕ0(y∗)d(y∗)=e−a22√2π,
note that ϕ0 is the p.d.f. of standard normal distribution.
dnorm()
function in R
calculates the p.d.f. of the normal distribution. For standard normal distribution, the R function dnorm(a)
would return e−a22√2π, which is the value of E(Y∗|Y∗>a). Thus, we are able to show that:E(X|Y∗>a)=ρ∗ϕ0(a)
# Assuming X and Y* ~ N(0,1)
# for standard bivariate normal distribution, E(XY*) = rho
rho <- 0.5
var_ystar <- 1
thres_1 <- qnorm(0.5)
thres_2 <- qnorm(0.5 + 0.3)
thres_3 <- qnorm(0.5 + 0.3 + 0.1)
p_less_than_thres1 <- pnorm(thres_1)
p_thres1_thres2 <- pnorm(thres_2) - pnorm(thres_1)
p_thres2_thres3 <- pnorm(thres_3) - pnorm(thres_2)
p_larger_than_thres3 <- pnorm(thres_3, lower.tail = F)
e_y2 <- 0 * p_less_than_thres1 +
1^2 * p_thres1_thres2 +
2^2 * p_thres2_thres3 +
3^2 * p_larger_than_thres3
e_y <- 1 * p_thres1_thres2 + 2 * p_thres2_thres3 + 3 * p_larger_than_thres3
var_y <- e_y2 - e_y^2
attenuation_cat <- 1 * (dnorm(thres_1) - dnorm(thres_2)) +
2 * (dnorm(thres_2) - dnorm(thres_3)) +
3 * dnorm(thres_3) * sqrt(var_ystar / var_y)
# attenuation_cat <- (0*pbnorm(-Inf, Inf, -Inf, thres_1, 0, 0, 1, 1, 0.5) +
# 1*pbnorm(-Inf, Inf, thres_1, thres_2, 0, 0, 1, 1, 0.5) +
# 2*pbnorm(-Inf, Inf, thres_2, thres_3, 0, 0, 1, 1, 0.5) +
# 3*pbnorm(-Inf, Inf, thres_3, Inf, 0, 0, 1, 1, 0.5))/rho * sqrt(var_ystar/var_y)
cor_xy_cat <- attenuation_cat * rho
rho <- 0.5
sd_x <- 1
sd_y <- 1
cov_xy <- rho * sd_x * sd_y
df3 <- as.data.frame(mvrnorm(
n = 1e4,
mu = c(0, 0),
Sigma = matrix(
c(
sd_x^2, cov_xy,
cov_xy, sd_y^2
),
ncol = 2
)
))
names(df3) <- c("y1", "y2")
# HL: An easier way to do the categorization:
# findInterval(
# df3$y2,
# rightmost.closed = TRUE,
# quantile(df3$y2, c(0, 0.5, 0.5 + 0.3, 0.5 + 0.3 + 0.1, 1))
# ) - 1 # if starting from 0
# The above is based on the sample quantiles without
# assuming normality. If you want to assume normality, try
# findInterval(
# df3$y2,
# rightmost.closed = TRUE,
# qnorm(c(0, 0.5, 0.5 + 0.3, 0.5 + 0.3 + 0.1, 1),
# mean = mean(df3$y2), sd = sd(df3$y2))
# ) - 1
# Could you update the following accordingly? Thanks.
df3$y2_mul <- findInterval(
df3$y2,
rightmost.closed = TRUE,
qnorm(c(0, 0.5, 0.5 + 0.3, 0.5 + 0.3 + 0.1, 1),
mean = mean(df3$y2), sd = sd(df3$y2))
) - 1
lat_cor <- cor(df3$y1, df3$y2)
obs_cor <- cor(df3$y1, df3$y2_mul)
att_fac <- (cov(df3$y1, df3$y2_mul) / cov(df3$y1, df3$y2)) * sqrt(var(df3$y2) / var(df3$y2_mul))
cal_cor <- att_fac * lat_cor
summary_2 <- round(c(rho, attenuation_cat, cor_xy_cat, att_fac, cal_cor), 3)
names(summary_2) <- c(
"Correlation_XY*", "Attenuation_Formula",
"Correlation_Formula", "Attenuation_Data",
"Correlation_Data"
)
knitr::kable(summary_2,
align = "c",
col.names = " "
)
Correlation_XY* | 0.500 |
Attenuation_Formula | 0.865 |
Correlation_Formula | 0.433 |
Attenuation_Data | 0.876 |
Correlation_Data | 0.430 |
E(X,Y)={0,if Y∗≤τ1E(X|Y=1),if τ1≤Y∗≤τ2E(X|Y=2),if τ2≤Y∗≤τ3E(X|Y=3),if Y∗>τ3
Take one category as one example, for E(X|Y=1) with τ1≤Y∗≤τ2 and c=1: E(X|Y=1)=∫∞−∞∫τ2τ1xy∗f(x,y∗)dxdy∗
fx,y∗(x,y∗)=12πσxσy∗√1−ρ2∗e−12(1−ρ2)∗[z2x+z2y∗−2ρzxzy∗],
and transform the formula to:
E(X|Y=1)=∫∞−∞∫τ2τ1xy∗dxdy∗2πσxσy∗√1−ρ2∗e−12(1−ρ2)∗[z2x+z2y∗−2ρzxzy∗]d(x)d(y∗),
Because of the property of derivation,
d(x)σx=d(x−μxσx)=d(zx),
and it is the same for d(zy∗).
E(X|Y=1)=∫∞−∞∫τ2−μy∗σy∗τ1−μy∗σy∗xy∗12π√1−ρ2∗e−12(1−ρ2)∗[z2x+z2y∗−2ρzxzy∗]d(zx)d(zy∗)
rho <- 0.5
sd_x <- rnorm(1, 1, 0.5)
sd_y <- rnorm(1, 1.5, 0.3)
cov_xy <- rho * sd_x * sd_y
df2 <- as.data.frame(mvrnorm(
n = 1e7,
mu = c(rnorm(1, 10, 2.1), rnorm(1, 8, 1.1)),
Sigma = matrix(
c(
sd_x^2, cov_xy,
cov_xy, sd_y^2
),
ncol = 2
)
))
names(df2) <- c("y1", "y2")
df2 <- df2 %>%
mutate(y2_cat = ifelse(y2 > qnorm(0.7, mean(df2$y2), sd(df2$y2)), 1, 0))
lat_cor <- cor(df2$y1, df2$y2)
obs_cor <- cor(df2$y1, df2$y2_cat)
att_fac <- (cov(df2$y1, df2$y2_cat) / cov(df2$y1, df2$y2)) * sqrt(var(df2$y2) / var(df2$y2_cat))
cal_cor <- att_fac * lat_cor
summary_3 <- round(c(rho, attenuation_bi, cor_xy_bi, att_fac, cal_cor), 3)
names(summary_3) <- c(
"Correlation_XY*", "Attenuation_Formula",
"Correlation_Formula", "Attenuation_Data",
"Correlation_Data"
)
knitr::kable(summary_3,
align = "c",
col.names = " "
)
Correlation_XY* | 0.500 |
Attenuation_Formula | 0.759 |
Correlation_Formula | 0.379 |
Attenuation_Data | 0.759 |
Correlation_Data | 0.379 |
rho <- 0.5
sd_x <- rnorm(1, 1, 0.5)
sd_y <- rnorm(1, 1.5, 0.3)
cov_xy <- rho * sd_x * sd_y
df3 <- as.data.frame(mvrnorm(
n = 1e7,
mu = c(rnorm(1, 10, 2.1), rnorm(1, 8, 1.1)),
Sigma = matrix(
c(
sd_x^2, cov_xy,
cov_xy, sd_y^2
),
ncol = 2
)
))
names(df3) <- c("y1", "y2")
df3 <- df3 %>%
mutate(y2_mul = ifelse(y2 < qnorm(0.5, mean(df3$y2), sd(df3$y2)), 0,
ifelse(qnorm(0.5, mean(df3$y2), sd(df3$y2)) < y2 & y2 < qnorm(0.5 + 0.3, mean(df3$y2), sd(df3$y2)), 1,
ifelse(qnorm(0.5 + 0.3, mean(df3$y2), sd(df3$y2)) < y2 & y2 < qnorm(0.5 + 0.3 + 0.1, mean(df3$y2), sd(df3$y2)), 2,
ifelse(y2 > qnorm(0.5 + 0.3 + 0.1, mean(df3$y2), sd(df3$y2)), 3, NA)
)
)
))
lat_cor <- cor(df3$y1, df3$y2)
obs_cor <- cor(df3$y1, df3$y2_mul)
att_fac <- (cov(df3$y1, df3$y2_mul) / cov(df3$y1, df3$y2)) * sqrt(var(df3$y2) / var(df3$y2_mul))
cal_cor <- att_fac * lat_cor
summary_4 <- round(c(rho, attenuation_cat, cor_xy_cat, att_fac, cal_cor), 3)
names(summary_4) <- c(
"Correlation_XY*", "Attenuation_Formula",
"Correlation_Formula", "Attenuation_Data",
"Correlation_Data"
)
knitr::kable(summary_4,
align = "c",
col.names = " "
)
Correlation_XY* | 0.500 |
Attenuation_Formula | 0.865 |
Correlation_Formula | 0.433 |
Attenuation_Data | 0.872 |
Correlation_Data | 0.436 |
For attribution, please cite this work as
Zhang (2023, April 19). Measurement & Multilevel Modeling Lab: Correlation Attenuation for Categorical Variables. Retrieved from https://mmmlab.rbind.io/posts/2023-04-19-correlation-attenuation-for-categorical-variables/
BibTeX citation
@misc{zhang2023correlation, author = {Zhang, Gengrui (Jimmy)}, title = {Measurement & Multilevel Modeling Lab: Correlation Attenuation for Categorical Variables}, url = {https://mmmlab.rbind.io/posts/2023-04-19-correlation-attenuation-for-categorical-variables/}, year = {2023} }