はじめに

Question: 研究参加者の約50％で発生しているbinaryのイベントをモデリングしたい。どんな手法を使えばよいだろうか？
問題意識: 思考停止の「ロジスティック回帰」はまずいかもしれない。オッズ比とリスク比の乖離が起こりうる。

共著の論文でこうした問題に出会ったので、そこでの勉強内容をシェアします。(要は前回の続き)

(2022/8/5追記: 関数を使わない書き方をしていたので、いくつか編集しました)

まとめると…

リスク比が1より大きく、2値のイベントの発生割合が大きい場合、オッズ比はリスク比よりも大きくなる
オッズ比よりもリスク比の方が疫学的な関心が高いので、オッズ比とリスク比の乖離は問題となる。したがって、イベントが多い状況でロジスティック回帰を用いることは、慎重に考えるべき。
修正ポアソン回帰とlog-binomial regression (対数二項回帰) が代替候補。

リスク比とオッズ比が乖離する様子を細かく知りたい方は、まずこちらをお読みください。
稀でないイベントにおけるリスク比とオッズ比

準備

pacman::p_load(tidyverse, epiR, epitools)

データ作り

自由にデータを操作できるように、このセクションで2×2表のデータを作っていきます。

デフォルトの設定は以下の通りです。

N: 2000人
イベント発生割合: 50%
曝露群の割合: 25%
曝露のリスク比: 2.0倍
共変量として性別が存在する

# データセットを作る関数
## n: 研究参加者数、p_event: イベント発生割合、p_exposure: 曝露群の割合、
## rr_exposure: 曝露によるリスク比 (横断研究なら有病率比ですが)、rr_sex: 女性の男性に対するリスク比
MakeData <- function(n=2000, p_event=0.5, p_exposure=0.25, rr_exposure=2.0, rr_sex=0.5){
  set.seed(123)
  baseline_risk <- n*p_event / (n*(1-p_exposure)+n*p_exposure*rr_exposure) # risk at non-exposed group
  ID <- c(1:n)
  sex <- c(rep(0, n*0.6), rep(1, n*0.4)) # male=0, female=1
  df <- data.frame(ID = ID, sex = sex)

  df <- df %>% 
    mutate(exposure = rbinom(n = nrow(df), size = 1, prob = p_exposure-sex*0.15)) %>% 
    mutate(risk = case_when(
      exposure == 0 & sex == 0 ~ baseline_risk,
      exposure == 0 & sex == 1 ~ baseline_risk*rr_sex,
      exposure == 1 & sex == 0 ~ baseline_risk*rr_exposure,
      exposure == 1 & sex == 1 ~ baseline_risk*rr_exposure*rr_sex
    )) %>% 
    mutate(outcome = rbinom(n = nrow(df), size = 1, prob = risk))
  
  df <- select(df, ID, sex, exposure, outcome)
  
  return(df)
}

df <- MakeData()

作成したデータセットはこの通りです。上から3行だけ提示しますが、指定した人数分入っています。

# データセットの確認
head(df, 3)

##   ID sex exposure outcome
## 1  1   0        0       0
## 2  2   0        1       1
## 3  3   0        0       0

イベント発生割合は0.39になりました。 (乱数の関係上、指定した値ぴったりではない)

# イベント発生割合の確認
mean(df$outcome)

## [1] 0.3895

データのチェック

データを整理しつつ、効果の指標を確認します。

tab <- xtabs(~ exposure + outcome + sex, data = df)

# epi.2by2で扱えるように変換...このあたり、詳しく知らないのでコードが汚い
men <- matrix(c(tab[2,2,1], tab[2,1,1], tab[1,2,1], tab[1,1,1]),
              nrow = 2, byrow = TRUE)
women <- matrix(c(tab[2,2,2], tab[2,1,2], tab[1,2,2], tab[1,1,2]),
              nrow = 2, byrow = TRUE)
dat <- array(c(men, women), dim = c(2,2,2))

crosstable <- epi.2by2(dat, method = "cohort.count")
crosstable

##              Outcome +    Outcome -      Total        Inc risk *        Odds
## Exposed +          250          113        363              68.9       2.212
## Exposed -          529         1108       1637              32.3       0.477
## Total              779         1221       2000              39.0       0.638
## 
## 
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio (crude)                         2.13 (1.93, 2.35)
## Inc risk ratio (M-H)                           1.88 (1.70, 2.07)
## Inc risk ratio (crude:M-H)                     1.14
## Odds ratio (crude)                             4.63 (3.63, 5.92)
## Odds ratio (M-H)                               3.99 (3.10, 5.14)
## Odds ratio (crude:M-H)                         1.16
## Attrib risk in the exposed (crude) *           36.56 (31.28, 41.83)
## Attrib risk in the exposed (M-H) *             31.71 (17.31, 46.11)
## Attrib risk (crude:M-H)                        1.15
## -------------------------------------------------------------------
##  M-H test of homogeneity of PRs: chi2(1) = 0.031 Pr>chi2 = 0.860
##  M-H test of homogeneity of ORs: chi2(1) = 3.323 Pr>chi2 = 0.068
##  Test that M-H adjusted OR = 1:  chi2(1) = 122.411 Pr>chi2 = <0.001
##  Wald confidence limits
##  M-H: Mantel-Haenszel; CI: confidence interval
##  * Outcomes per 100 population units

デフォルトの設定 (Mantel-Haenszel) だと、効果の指標は次のとおりです。 リスク比とオッズ比との間に大きな乖離がありますね。

risk ratio: 1.88 [95%CI: 1.70-2.07]
odds ratio: 3.99 [95%CI: 3.10-5.14]

Odds ratio vs Risk ratio

こちらのブログの指摘はなかなか的を射ていると感じます。

prevalenceが10％を超えると、ORが「X倍」に近似できないのはその通りだと思うが、ORはORのままで議論すればよい（例：ORが2倍でした、などと記述すればよい）気がしていた。 relative riskの議論に持ち込まなければORで議論しても問題ない気がするが、それで議論を進めて何かまずいものでしょうか。

しかしながら、疫学の流れを考えればrisk differenceやrisk ratioが根本にあり、odds ratioは代替的に使うものと考えることが自然でしょう。したがって、ORがrisk ratioを大きく超える (exaggerate) 状況なのであれば、効果や関連の指標としてORを使うことはあまり望ましくないと考えられます。 (i.e. イベントの発生が稀でない、など)

ということで、可能ならオッズ比ではなくリスク比を評価していきたい、ということが今後の流れです。

回帰分析による推定

さきほどの2×2表での計算で分析が済むなら楽なのですが、実際の研究では回帰分析を使うことの方が多いと思います。そこで今回は、どのような回帰分析手法を使えばよいのか？という課題を取り上げます。

ロジスティック回帰

はじめに、頻繁に使われているなロジスティック回帰について。これはオッズ比を導くので、今回の状況ではあまり望ましくない分析手法となります。

logistic <- glm(outcome ~ exposure + sex, data = df, family = binomial(link = "logit"))
summary(logistic)

## 
## Call:
## glm(formula = outcome ~ exposure + sex, family = binomial(link = "logit"), 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6204  -1.0265  -0.6963   1.3362   1.7527  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.36578    0.06564  -5.573 2.51e-08 ***
## exposure     1.36527    0.12794  10.671  < 2e-16 ***
## sex         -0.92773    0.10395  -8.924  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2674.1  on 1999  degrees of freedom
## Residual deviance: 2427.2  on 1997  degrees of freedom
## AIC: 2433.2
## 
## Number of Fisher Scoring iterations: 4

exp(coef(logistic)) # OR

## (Intercept)    exposure         sex 
##   0.6936556   3.9167791   0.3954505

exp(confint(logistic)) # 95%CI

##                 2.5 %    97.5 %
## (Intercept) 0.6096078 0.7885442
## exposure    3.0549176 5.0462776
## sex         0.3221285 0.4842474

ポアソン回帰

通常のポアソン回帰

一般に、ポアソン回帰は0以上の整数値を取るデータに対して適用される回帰モデルで、特に「稀なイベント」に使われています。例えば、交差点での交通事故発生件数や製品製造ラインでの故障品などが代表的でしょうか。また、故障品のモデリングなど「全体でいくつ作っているか」も重要になる場合は、 オフセット項を導入することで、割合もモデリングすることが可能になります。FYI: J-Stageの解説論文

また、ポアソン回帰は0/1のbinary outcomeに対しても適用することができます。イメージする上では、Petersen, et al. 2008のフレーズがわかりやすいように感じます。

It is well known that when the prevalence is low and the sample size is large, probabilities from the Poisson distribution can often be used to approximate probabilities from the binomial distribution. Similarly, one can think of an existing sample of binomial data (0 or 1) as being approximately Poisson, where the probability of a value of 2 or greater is low enough that no values greater than 1 occurred in the obtained sample.

もし、ポアソン回帰をbinary outcomeに対して用いる場合は、Zou, 2004を引用すると良さげです (鉄板論文らしい)。それでは、実際にすすめていきます。まずは通常のポアソン回帰から。

poisson <- glm(outcome ~ exposure + sex, data = df, family = poisson(link = "log"))
summary(poisson)

## 
## Call:
## glm(formula = outcome ~ exposure + sex, family = poisson(link = "log"), 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2286  -0.8966  -0.6704   0.7917   1.1979  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.91140    0.05074 -17.960  < 2e-16 ***
## exposure     0.62997    0.07837   8.038 9.14e-16 ***
## sex         -0.58137    0.08460  -6.872 6.34e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1469.0  on 1999  degrees of freedom
## Residual deviance: 1330.6  on 1997  degrees of freedom
## AIC: 2894.6
## 
## Number of Fisher Scoring iterations: 5

exp(coef(poisson)) # risk ratio

## (Intercept)    exposure         sex 
##   0.4019608   1.8775456   0.5591328

exp(confint(poisson)) # 95%CI, basic poisson regression

##                 2.5 %    97.5 %
## (Intercept) 0.3633534 0.4433375
## exposure    1.6080557 2.1867087
## sex         0.4726644 0.6586717

ここで注目していただきたいポイントは信頼区間です。今回のポアソン回帰では、RR = 1.88 [95%CI: 1.61-2.19] となりました。

一方、2×2表で得たリスク比は 1.88 [95%CI: 1.70-2.07] であり、ポアソン回帰で得た信頼区間の幅の方が大きくなっています。

ここが、Zou (2004) の言うところの “On the other hand, use of Poisson regression tends to provide conservative results” にあたるのでしょう。

ロバスト分散を使った修正ポアソン回帰

そこで、Zouの手法による修正ポアソン回帰のロバスト分散の計算を実行します。
参考1: Zou’s Modified Poisson Regression, 参考2: いろいろ情報盛りだくさん

pacman::p_load(lmtest, sandwich)

modified_poiss <- coeftest(poisson, vcov = sandwich)

係数は同じですね。一方、標準誤差は随分違います。修正ポアソン回帰の方が、小さな標準誤差を得ることに成功しました。

modified_poiss # 修正ポアソン回帰

## 
## z test of coefficients:
## 
##              Estimate Std. Error  z value  Pr(>|z|)    
## (Intercept) -0.911401   0.039205 -23.2469 < 2.2e-16 ***
## exposure     0.629965   0.050511  12.4718 < 2.2e-16 ***
## sex         -0.581368   0.070023  -8.3025 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

では、修正ポアソン回帰に基づく信頼区間を計算します。面倒なことに、自分で計算をしなくてはならないようです。

GetConfint <- function(res_coeftest, siglevel=0.05, digits=4){
  temp <- exp(cbind(
    RiskRatio = res_coeftest[,1], 
    LowerCI = res_coeftest[,1] + qnorm(siglevel/2)*res_coeftest[,2],
    UpperCI = res_coeftest[,1] - qnorm(siglevel/2)*res_coeftest[,2]
  ))
  p_value <- res_coeftest[,4]
  
  result <- cbind(temp, p_value = p_value)
  result <- round(result, digits)
  
  return(result)
}

GetConfint(modified_poiss)

##             RiskRatio LowerCI UpperCI p_value
## (Intercept)    0.4020  0.3722  0.4341       0
## exposure       1.8775  1.7006  2.0729       0
## sex            0.5591  0.4874  0.6414       0

2×2表で得たリスク比は 1.88 [95%CI: 1.70-2.07] ですから、通常のポアソン回帰で得た1.88 [95%CI: 1.61-2.19]よりも狭い信頼区間を得ることができました。

※ 2×2表に戻るなら→ データのチェック

log-binomial

log binomialの回帰について、勉強することは初めてのことです。そこで、モデルの式も含めて記していきます。

モデル式

GLM族なので、a) 線形予測子、b) リンク関数、c) 確率分布の3つで記述することができます。
やっていきましょう。

log-binomial regressionのモデルは以下の通りです（説明変数が2つとします）。
参考: Chen, et al. 2018

\[\log{(p_i)} = b_0 +b_1x_{i1}+b_2x_{i2} \] \[y_i \sim Binomial(n, p_i)\]

ただ、今回は個人が集計単位でy_iは0/1のみがアウトカムなので、n=1となります。
(二項分布なので、コイン投げを1回するイメージですね)

なぜ”log”?

ロジスティック回帰はとても良くできたモデルだと思います。

解釈が比較的簡単
確率のシグモイド的な振る舞いを考えることができる（「ある閾値を超えるとイベントが起こる」といった考え方）
0≤p≤1の確率pに対し、0≤ p/(1-p) <∞であり、log(p/(1-p))は負の無限大から無限大を取るので、被説明変数として適切

そんな中log-binomialモデルでは、なぜlog(p)を考えるのでしょうか？オッズではなく確率をモデルできることは強みです。ただし、0≤p≤1の確率pに対し、log(p)は負の無限大から0を取るので、実数全体を取りうるわけではありません。ここがどうにも妙なところです。

この妙な点は、収束が難しいことも引き起こしてしまうようです。

However, the Log link function in Log-Binomial models restricts the probabilities of an outcome to be greater than or equal to zero, that is, to fall within the bounds [0, ∞). Due to this mismatch between the bounds of the model and the allowable outcome, in practice, the Log-Binomial model will routinely fail to converge and will not provide the parameter estimates (Localio et al. 2007).

(from: SAS)

なんだか難しそうですが、とりあえずやってみましょう。

log-binomial regの実践

logbin <- glm(outcome ~ exposure + sex, data = df, family = binomial(link = "log"))

summary(logbin)

## 
## Call:
## glm(formula = outcome ~ exposure + sex, family = binomial(link = "log"), 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6735  -1.0139  -0.7141   1.3502   1.7271  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.91148    0.03919 -23.257   <2e-16 ***
## exposure     0.62844    0.04974  12.635   <2e-16 ***
## sex         -0.57996    0.06956  -8.338   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2674.1  on 1999  degrees of freedom
## Residual deviance: 2423.8  on 1997  degrees of freedom
## AIC: 2429.8
## 
## Number of Fisher Scoring iterations: 6

exp(coef(logbin)) # risk ratio

## (Intercept)    exposure         sex 
##   0.4019306   1.8746794   0.5599207

exp(confint(logbin)) # 95%CI

##                 2.5 %    97.5 %
## (Intercept) 0.3714111 0.4331468
## exposure    1.6988379 2.0656133
## sex         0.4868184 0.6397534

2×2表で得たリスク比は 1.88 [95%CI: 1.70-2.07] でしたから、普通にできてますね。

収束の問題が起こるのは、データが十分にない（サンプルサイズ・イベント数など）場合のようです。Williamson, et al. 2013

Recently there was a paper published in Stroke [4], where in the statistical methods section the authors indicated that: “As a first approach to the multivariable analysis, we used a log-binomial model, but owing to the sparseness of data, this failed to converge.

デフォルトの設定では、大きな問題にはなりませんでした。

トラブルシューティング

もし、以下のようなエラーで最尤推定の際の初期値を求められた場合、glm中のstartを設定することで前進できる可能性があります。

Error: cannot find valid starting values: please specify some
エラー: 係数の有効なセットが見出されませんでした: 初期値を与えてください

参考1, 対応について, 参考2, 考え方について

考え方としては、「切片をlog(mean(y)), それ以外の係数を0にしよう！」ということです。構造モデルを考えれば、帰無仮説下H0の状況での考えを初期値としている、というイメージがつかめると思います。もちろん、「x_iが0のとき」という意味合いを考えて、適宜調整してくださいね。

\[\log{(p_i)} = b_0 +b_1x_{i1}+b_2x_{i2} \]

なお、上で紹介した参考1で述べられているコードには誤りがあります。
startの引数として、“start=c(log(mean(y), rep(0, np-1))” とありますが、 “start = c(log(mean(y)), rep(0, np-1))”としましょう。 log(0)は負の無限大となってしまい、定義できないからです。（単純なミスでしょう）

先ほど実践したlog-binomial modelにstartの引数を指定すると、こんな形になります。 rep(0, np-1)のnpは切片を含む説明変数の数とありますが、factor型の説明変数を投入している場合は水準の数だけ増えてしまうことにも注意してください。

# startの引数設定
## 説明変数が切片と2つの因子 (曝露と性別) 
## したがって、切片用にlog(mean(y)), 投入した説明変数用にrep(0, 2)となる
logbin2 <- glm(outcome ~ exposure + sex, data = df, family = binomial(link = "log"),
               start = c(log(mean(df$outcome)), rep(0, 2))) 


summary(logbin2)

## 
## Call:
## glm(formula = outcome ~ exposure + sex, family = binomial(link = "log"), 
##     data = df, start = c(log(mean(df$outcome)), rep(0, 2)))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6735  -1.0139  -0.7141   1.3502   1.7271  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.91148    0.03919 -23.257   <2e-16 ***
## exposure     0.62844    0.04974  12.635   <2e-16 ***
## sex         -0.57996    0.06956  -8.338   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2674.1  on 1999  degrees of freedom
## Residual deviance: 2423.8  on 1997  degrees of freedom
## AIC: 2429.8
## 
## Number of Fisher Scoring iterations: 5

まとめ

2値のイベントの発生割合が大きい場合、オッズ比はリスク比よりも大きくなる (away from nullの程度が大きくなる)
オッズ比よりもリスク比の方が疫学的な関心が高いので、オッズ比とリスク比の乖離は問題となる。したがって、イベントが多い状況でロジスティック回帰を用いることは、慎重に考えるべき。
修正ポアソン回帰とlog-binomial regression (対数二項回帰) が代替候補。

参考文献

Zou, G. (2004). A modified poisson regression approach to prospective studies with binary data. American journal of epidemiology, 159(7), 702-706.
Petersen, M. R., & Deddens, J. A. (2008). A comparison of two methods for estimating prevalence ratios. BMC medical research methodology, 8(1), 1-9.
Chen, W., Qian, L., Shi, J., & Franklin, M. (2018). Comparing performance between log-binomial and robust Poisson regression models for estimating risk ratios under model misspecification. BMC medical research methodology, 18(1), 1-12.
Williamson, T., Eliasziw, M., & Fick, G. H. (2013). Log-binomial models: exploring failed convergence. Emerging themes in epidemiology, 10(1), 1-10.

稀でない2値イベントに対する回帰モデル