Data

Gaussian Processes are used to incorporate prior information over an infinite-dimensional space.

Bongho, Lee

July 22, 2024 · 5 min read


Definition

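A Gaussian Process is a stochastic process defined over a space of functions with continuous inputs: any finite set of function values follows a joint Gaussian distribution. It is a powerful nonparametric method for estimating a distribution over functions from observed data.
Gaussian Processes are mainly used for regression, classification, and optimization problems, and they are especially useful for quantifying uncertainty.
In Bayesian analytics, they serve as a way to place a prior over an infinite-dimensional space (a space of functions).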
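To make "a prior over functions" concrete, here is a minimal sketch that draws a few functions from a zero-mean Gaussian Process prior on a finite grid. The helper `rbf_kernel`, the grid, and the jitter value are illustrative assumptions, not part of the original post.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)

# Restricted to a finite grid, the GP prior is a multivariate normal
# with zero mean and covariance matrix K.
K = rbf_kernel(x, x)

# A small jitter on the diagonal keeps the Cholesky factorization stable.
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))

# Each row is one function drawn from the prior, evaluated on the grid.
prior_samples = (L @ rng.standard_normal((len(x), 3))).T

print(prior_samples.shape)  # (3, 50)
```

A smaller `length_scale` would give wigglier draws and a larger one smoother draws, which is how the kernel encodes prior beliefs about the function.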

Pros & Cons

Pros

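Nonparametric approach: the functional form of the model is not fixed in advance; it is learned from the data.
Uncertainty quantification: the model returns not only a predicted value but also the uncertainty of that prediction, so you can judge how much to trust it.
Flexibility: different kernel functions can be used to model complex data distributions.
Bayesian approach: prior information can be incorporated to make predictions more reliable.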
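The uncertainty-quantification point can be illustrated with the standard closed-form GP regression posterior. This is a minimal NumPy sketch under assumed settings (an RBF kernel with unit length scale, a noise variance of 0.01, and three toy training points); the function names are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale**2)

def gp_posterior(x_train, y_train, x_test, noise_var=1e-2, length_scale=1.0):
    """Posterior mean and standard deviation of the latent function at x_test."""
    K = rbf_kernel(x_train, x_train, length_scale) + noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test, length_scale)
    L = np.linalg.cholesky(K)
    # alpha = K^{-1} y, computed via two triangular solves
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    # Prior variance k(x, x) = 1 for this kernel, minus the variance explained by the data
    var = 1.0 - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

x_train = np.array([1.0, 2.0, 3.0])
y_train = np.sin(x_train)
x_test = np.array([2.0, 10.0])  # one point inside the data, one far away

mean, std = gp_posterior(x_train, y_train, x_test)
print(mean, std)
```

At the test point inside the training range the standard deviation is small, while far from the data it returns to roughly the prior standard deviation of 1 and the mean falls back to the prior mean of 0.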

Cons

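High computational cost: exact inference factorizes an n × n kernel matrix, which takes O(n³) time, so large datasets become expensive.
Memory consumption: the kernel matrix takes O(n²) memory, so usage grows quadratically with the number of data points.
Hyperparameter selection: choosing the best kernel function and its hyperparameters can be a complicated process.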
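A rough back-of-the-envelope sketch of the cost and memory points, assuming exact inference with a dense float64 kernel matrix and a Cholesky factorization (about n³/3 floating-point operations). The helper names are hypothetical, and real implementations need extra working memory on top of this.

```python
def kernel_matrix_gib(n_points, bytes_per_value=8):
    """Memory for one dense n x n kernel matrix, in GiB."""
    return n_points**2 * bytes_per_value / 2**30

def cholesky_flops(n_points):
    """Approximate floating-point operations for a Cholesky factorization."""
    return n_points**3 / 3

for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7,}: {kernel_matrix_gib(n):8.3f} GiB, {cholesky_flops(n):.2e} flops")
```

Going from 10,000 to 100,000 points multiplies the memory by 100 (from about 0.75 GiB to about 74.5 GiB) and the factorization work by 1,000, which is why sparse or approximate GP methods are usually considered at that scale.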

Sample

import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
  
# Generate synthetic data
np.random.seed(42)

# Delivery distance (km)
distance = np.random.uniform(1, 20, 100)
# Order count (items)
order_count = np.random.uniform(1, 10, 100)
# Time of day (0: other, 1: lunch, 2: dinner)
time_of_day = np.random.choice([0, 1, 2], size=100)
# Rider dispatch acceptance rate (%)
rider_acceptance_rate = np.random.uniform(50, 100, 100)
# Rider pool size (people)
rider_scale = np.random.uniform(5, 20, 100)

# Delivery time (minutes)
delivery_time = 30 + 3 * distance + 2 * order_count + 100 / rider_acceptance_rate + 100 / rider_scale + np.random.normal(0, 5, 100)
# Lunch (1: 11AM - 1PM) and dinner (2: 6PM - 8PM) get extra noise
delivery_time[time_of_day == 1] += np.random.normal(0, 20, np.sum(time_of_day == 1))  # lunch: high variance
delivery_time[time_of_day == 2] += np.random.normal(0, 10, np.sum(time_of_day == 2))  # dinner: moderate variance
  
# Feature matrix (one row per delivery) and target
X = np.vstack((distance, order_count, time_of_day, rider_acceptance_rate, rider_scale)).T
y = delivery_time

# Define the kernel (product of a constant kernel and an RBF kernel)
kernel = C(1.0, (1e-3, 1e3)) * RBF(1, (1e-2, 1e2))

# Fit the Gaussian Process model
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)
gp.fit(X, y)
  
# Build prediction inputs.
# A full 5-D grid (30 * 30 * 3 * 30 * 30 = 2,430,000 rows) is expensive to predict
# and cannot be drawn as one band, so vary distance only and hold the other
# continuous features at their training medians, once per time of day.
distance_pred = np.linspace(1, 20, 30)
time_of_day_pred = np.array([0, 1, 2])
X_median = np.median(X, axis=0)
X_pred = np.array([
    [d, X_median[1], t, X_median[3], X_median[4]]
    for t in time_of_day_pred
    for d in distance_pred
])

# Predict the mean and the standard deviation
y_pred, sigma = gp.predict(X_pred, return_std=True)

# Visualize the results
plt.figure(figsize=(14, 7))

# Training data
plt.subplot(1, 2, 1)
plt.scatter(distance, delivery_time, c=time_of_day, cmap='viridis', label='Training Data')
plt.colorbar(label='Time of Day')
plt.xlabel('Distance (km)')
plt.ylabel('Delivery Time (min)')
plt.title('Training Data Distribution')
plt.legend()

# Predictions with a 95% interval band for each time of day
plt.subplot(1, 2, 2)
plt.scatter(X_pred[:, 0], y_pred, c=X_pred[:, 2], cmap='viridis', s=10, label='Predictions')
for t in time_of_day_pred:
    m = X_pred[:, 2] == t  # rows for this time of day, already sorted by distance
    plt.fill_between(X_pred[m, 0], y_pred[m] - 1.96 * sigma[m], y_pred[m] + 1.96 * sigma[m], color='blue', alpha=0.2)
plt.colorbar(label='Time of Day')
plt.xlabel('Distance (km)')
plt.ylabel('Delivery Time (min)')
plt.title('Gaussian Process Regression Predictions')
plt.legend()

plt.tight_layout()
plt.show()
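The sample above uses a single shared length scale and no explicit noise term, even though the features live on different scales and the synthetic target is noisy. As one possible variant (an assumption of this sketch, not the original author's model), the kernel can use one length scale per feature plus a `WhiteKernel` for observation noise, with `normalize_y=True` so the zero-mean prior is reasonable. The bounds and restart count are illustrative, and the data is regenerated here so the block runs on its own.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C, WhiteKernel

# Regenerate synthetic data in the same spirit as the sample
rng = np.random.default_rng(42)
n = 100
distance = rng.uniform(1, 20, n)
order_count = rng.uniform(1, 10, n)
time_of_day = rng.choice([0, 1, 2], size=n)
rider_acceptance_rate = rng.uniform(50, 100, n)
rider_scale = rng.uniform(5, 20, n)

y = (30 + 3 * distance + 2 * order_count + 100 / rider_acceptance_rate
     + 100 / rider_scale + rng.normal(0, 5, n))
y[time_of_day == 1] += rng.normal(0, 20, np.sum(time_of_day == 1))  # lunch: high variance
y[time_of_day == 2] += rng.normal(0, 10, np.sum(time_of_day == 2))  # dinner: moderate variance
X = np.column_stack((distance, order_count, time_of_day, rider_acceptance_rate, rider_scale))

# One length scale per feature, plus a learned noise level (bounds are assumptions)
kernel_ard = (
    C(1.0, (1e-3, 1e3)) * RBF(length_scale=np.ones(X.shape[1]), length_scale_bounds=(1e-2, 1e3))
    + WhiteKernel(noise_level=1.0, noise_level_bounds=(1e-5, 1e1))
)
gp_ard = GaussianProcessRegressor(
    kernel=kernel_ard, normalize_y=True, n_restarts_optimizer=2, random_state=0
)
gp_ard.fit(X, y)

# Fitted hyperparameters and the log marginal likelihood they achieve
print(gp_ard.kernel_)
print(gp_ard.log_marginal_likelihood_value_)

# Predictive mean and standard deviation for the first three training rows
y_pred, sigma = gp_ard.predict(X[:3], return_std=True)
print(y_pred, sigma)
```

The log marginal likelihood printed here is one common criterion for comparing kernels fitted to the same targets, which partly addresses the hyperparameter-selection difficulty listed under Cons. Note that a single `WhiteKernel` still assumes the same noise level at every time of day, so it does not capture the larger lunchtime variance.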