17 7월 2022 4 min read Data

Bayesian Count Estimation

Introudunction

파이썬을 활용한 베이지안 통계 2판을 보면서 제대로 이해하고 있는지 기록하고 싶은 목적에서 Chapter5 수량추정의 문제를 직접 풀어본다.

기관차 문제(The Train Problem)

Source: Fifty Challenging Problems in Probability with Solutions

각 철도에는 이를 지나가는 기관차에 1부터 N까지의 순서로 번호를 붙인다. 어느날 60호 기관차를 보았다. 이 때 이 철도에는 몇 개의 기관차가 지나가는지 추측해보자.

문제는 다음 두 단계로 나누어볼 수 있다.
- 데이터를 보기 전에 N에 관해서 무엇을 알고 있는가?
- 주어진 모든 N값에 대해서 관측한 데이터(60호 기관차)는 어떻게 되는가?
결국 이 문제는 총 기관차 N대가 있는 상황에서 우리가 보고 있는 노선에서 운행되고 있는 기관차의 수이다.

Prior Distribution & Likelihood Function

사전확률($P(H)$)은,최대 1,000대의 기차가 존재할 수 있다고 할 때 모든 기차의 출현확률은 동일하다고 하면 다음과 같이 표시할 수 있다.

import pandas as pd
prior = pd.DataFrame(np.ones(1000).astype(int),columns=['probs'],index=np.arange(1,1001))

60호 기관차를 보았다고 할 때 Likelihood Function($P(D|H)$), 각 기관차를 볼 확률도 우선은 $1/ N$로 생각할 수 있다.

Update Function

이제 데이터가 추가가 됨에 따라서 Posterior Distribution($P(H|D)$)가 어떻게 Update를 봐야할 필요가 있다.
60호 기관차를 봤다는 데이터만 존재하고 있기 때문에 60호 기관차에 대한 부분의 확률이 가장 높게 나올 것이다.
아래 코드를 보면 hypos = np.array(posterior.index)는 사전확률분포 값에서 Hypothesis를 가지고 오는 것이다. 현 시점에서 hypothesis는 1이 나올 사건, 2가 나올 사건, $...$, 1000이 나올 사건이다. 그리고 가능도는 각 사건의 확률분인데 이를 균등분포로 보고 1/hypos로 계산하였다.
posterior* likelihood는 $P(H)P(D|H)$이고 normalize는 이 값을 $P(D)$로 나눠주는 부분이다.

data = 60
posterior = prior.copy()
hypos = np.array(posterior.index)
likelihood = 1 / hypos
impossible = (data > hypos)
likelihood[impossible] = 0
posterior['unnormalized'] = posterior['probs'] * likelihood
posterior['normalized']  = posterior['unnormalized'] / posterior['unnormalized'].sum()

import matplotlib.pyplot as plt
def decorate(**options):
    """Decorate the current axes.
    
    Call decorate with keyword arguments like
    decorate(title='Title',
             xlabel='x',
             ylabel='y')
             
    The keyword arguments can be any of the axis properties
    https://matplotlib.org/api/axes_api.html
    """
    ax = plt.gca()
    ax.set(**options)
    
    handles, labels = ax.get_legend_handles_labels()
    if handles:
        ax.legend(handles, labels)

    plt.tight_layout()
    
posterior.plot(label='Posterior after train 60',y='normalized')
decorate(xlabel='Number of trains',
         ylabel='PMF',
         title='Posterior distribution')

사후확률분포의 평균

사후확률 분포의 평균을 이용해서 점추정을 시도할 수도 있다.PMF(Probability Mass Funcation)에서 평균은 다음과 같이 구할 수 있고, 333개의 기관차가 지나간다고 추정할 수도 있다.

$$mean = \sum\limits_ip_iq_i$$

np.sum(posterior['normalized'] * np.array(posterior.index))

데이터 양을 늘린다면?

점차 사후확률 분포 평균 값이 작아지고는 있으나, 60,30,90번호 기차 세 대 정도의 데이터로는 부족하다. 참고로 사후확률 분포의 평균은 164.30558642273346이었다.

import pandas as pd
N = 1000
prior = pd.DataFrame(np.ones(N).astype(int),columns=['probs'],index=np.arange(1,N+1))
posterior = prior.copy()

dataset = [60,30,90]
for data in dataset:
    update(posterior,data)