04 Dec 2022 2 min read Causal Inference

Average Treatment Effect

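The average treatment effect (ATE) quantifies the effect of a treatment by comparing a treatment group with a control group. As the name suggests, it is an effect averaged over a group rather than measured for an individual, because an individual's counterfactual can almost never be observed. Formally, $ATE = E[Y_1 - Y_0]$, where $Y_1$ and $Y_0$ are the potential outcomes with and without treatment. What the data gives us directly is the difference in observed group means, which decomposes as follows (see References).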

$$E[Y|T=1] - E[Y|T=0] = \underbrace{E[Y_1 - Y_0|T=1]}_{ATT} + \underbrace{\{ E[Y_0|T=1] - E[Y_0|T=0] \}}_{BIAS}$$
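The decomposition comes from adding and subtracting the unobserved counterfactual mean $E[Y_0|T=1]$. A short derivation, assuming only that the observed outcome is $Y = Y_1$ for treated units and $Y = Y_0$ for untreated units:

$$\begin{aligned}
E[Y|T=1] - E[Y|T=0] &= E[Y_1|T=1] - E[Y_0|T=0] \\
&= E[Y_1|T=1] - E[Y_0|T=1] + E[Y_0|T=1] - E[Y_0|T=0] \\
&= \underbrace{E[Y_1 - Y_0|T=1]}_{ATT} + \underbrace{\{ E[Y_0|T=1] - E[Y_0|T=0] \}}_{BIAS}
\end{aligned}$$

When treatment is assigned at random, $T$ is independent of $Y_0$ and the bias term vanishes; if $T$ is also independent of $Y_1$, the ATT equals the ATE $E[Y_1 - Y_0]$.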

In regression form, the relationship between the two potential outcomes can be written as follows.

$$Y_{1i} = Y_{0i} + \kappa $$

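Here $\kappa$ captures the change in the outcome of the $i$-th data point between the untreated and the treated state. Because $\kappa$ is the same for every $i$, $\kappa = ATE$. A brief look at this in Python follows.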
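Before turning to real data, the idea can be sketched on simulated data, where both potential outcomes are known. This is only an illustration: the effect size `KAPPA`, the sample size `N`, and the score distribution are hypothetical values chosen for the example, and only the standard library is used.

```python
import random
from statistics import mean

random.seed(0)

KAPPA = -4.0   # hypothetical constant treatment effect
N = 10_000     # hypothetical sample size

# Potential outcome without treatment (Y0) and with treatment (Y1 = Y0 + kappa).
y0 = [random.gauss(78.0, 10.0) for _ in range(N)]
y1 = [y + KAPPA for y in y0]

# Random assignment: T is independent of the potential outcomes.
t = [random.randint(0, 1) for _ in range(N)]

# Only one potential outcome is observed per unit.
y = [y1[i] if t[i] == 1 else y0[i] for i in range(N)]

treated = [y[i] for i in range(N) if t[i] == 1]
control = [y[i] for i in range(N) if t[i] == 0]
diff_in_means = mean(treated) - mean(control)

# The two pieces of the decomposition. They are computable here only
# because the data is simulated and both potential outcomes are known.
att = mean(y1[i] - y0[i] for i in range(N) if t[i] == 1)
bias = (mean(y0[i] for i in range(N) if t[i] == 1)
        - mean(y0[i] for i in range(N) if t[i] == 0))

print(f"difference in means: {diff_in_means:.3f}")
print(f"ATT: {att:.3f}, bias: {bias:.3f}")
```

With random assignment the bias term should be close to zero, so the difference in means lands near `KAPPA`. Replacing the assignment rule with one that depends on `y0` would make the bias term visibly non-zero.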

The data below describes how exam scores differ depending on whether a class was taught online.

import pandas as pd
import statsmodels.formula.api as smf

# Exam scores by class format, from Causal Inference for the Brave and True.
df = pd.read_csv("https://raw.githubusercontent.com/matheusfacure/python-causality-handbook/master/causal-inference-for-the-brave-and-true/data/online_classroom.csv")

# Regress the exam score on the online-format indicator.
result = smf.ols("falsexam ~ format_ol", data=df).fit()
result.summary().tables[1]

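The first five rows of the data look like this.

| | gender | white | format_ol | format_blended | falsexam |
|---|---|---|---|---|---|
| 0 | 0 | 1.0 | 0 | 0.0 | 63.29997 |
| 1 | 1 | 1.0 | 0 | 0.0 | 79.96000 |
| 2 | 1 | 1.0 | 0 | 1.0 | 83.37000 |
| 3 | 1 | 1.0 | 0 | 1.0 | 90.01994 |
| 4 | 1 | 1.0 | 1 | 0.0 | 83.30000 |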

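The regression output is shown below. The Intercept coefficient is 77.8555 and the format_ol coefficient is -4.2203. Since the model is roughly $Exam = \beta_0 + \beta_1 \, \text{format\_ol}$, taking the class online lowers the exam score by about 4.2203 points on average relative to all students with format_ol = 0, a group that also includes students with format_blended = 1. If the bias term is zero, as it is when the format is assigned at random, this coefficient would be the ATE.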

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 77.8555 | 0.762 | 102.235 | 0.000 | 76.357 | 79.354 |
| format_ol | -4.2203 | 1.412 | -2.990 | 0.003 | -6.998 | -1.443 |
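The t statistic and the interval for format_ol can be reproduced approximately from the coefficient and its standard error alone. The sketch below uses a normal approximation (critical value 1.96), which is an assumption of this example; statsmodels uses the t distribution, so its reported bounds are slightly wider.

```python
coef = -4.2203    # format_ol coefficient from the table
std_err = 1.412   # its standard error from the table

# t statistic: the coefficient measured in units of its standard error.
t_stat = coef / std_err

Z_CRIT = 1.96     # assumption: normal approximation to the t critical value
lower = coef - Z_CRIT * std_err
upper = coef + Z_CRIT * std_err

print(f"t = {t_stat:.2f}")
print(f"approx. 95% CI = [{lower:.2f}, {upper:.2f}]")
```

The interval excludes zero, which is consistent with the small p-value in the table.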

As a check, comparing the mean scores of students who took the class online with those who did not shows the same average difference of about 4.2203.

(df
 .groupby("format_ol")
 ["falsexam"]
 .mean())

| format_ol | falsexam |
|---|---|
| 0 | 77.855523 |
| 1 | 73.635263 |
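The match is not a coincidence: with a single 0/1 regressor, the OLS intercept is the mean of the group coded 0 and the slope is the difference between the two group means. A minimal sketch of this identity on the five rows displayed earlier, with the closed-form OLS formulas written out by hand. The helper name `ols_simple` is hypothetical, and five rows are far too few for inference.

```python
from statistics import mean

def ols_simple(x, y):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# format_ol and falsexam for the five rows displayed earlier in the post.
format_ol = [0, 0, 0, 0, 1]
falsexam = [63.29997, 79.96000, 83.37000, 90.01994, 83.30000]

intercept, slope = ols_simple(format_ol, falsexam)

# Group means computed directly, as the groupby does.
mean_0 = mean(s for f, s in zip(format_ol, falsexam) if f == 0)
mean_1 = mean(s for f, s in zip(format_ol, falsexam) if f == 1)

print(f"intercept = {intercept:.4f}, mean of format_ol == 0 group = {mean_0:.4f}")
print(f"slope = {slope:.4f}, difference in means = {mean_1 - mean_0:.4f}")
```

The numbers from these five rows differ from the full-sample estimates; only the identity between the regression coefficients and the group means carries over.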

References

Causal Inference for the Brave and True
