Computing the Correlation Coefficient of Random Variables with Octave
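The correlation coefficient (also called the Pearson correlation coefficient) is a standardized measure of the strength of the linear relationship between two random variables. It removes the dependence of the covariance on units of measurement, and its value is confined to \(\left[-1,1\right]\), which makes it easier to interpret than the covariance.
Definition of the correlation coefficient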
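The correlation coefficient is defined as: \(\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sqrt{D(X)}\cdot\sqrt{D(Y)} }\)
where \(\text{Cov}(X,Y)\) is the covariance of the random variables \(X\) and \(Y\), and \(D(X)\) and \(D(Y)\) are their respective variances, both of which must be positive for the ratio to be defined.
Whether \(X,Y\) are discrete or continuous random variables, the core procedure is the same: first compute the covariance and the standard deviations of the two variables, then substitute them into the definition.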
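As a brief justification of the range \(\left[-1,1\right]\) and of the claim that units do not matter, here is a sketch based on the Cauchy-Schwarz inequality. It assumes both variances are finite and positive; the constants \(a, b, c, d\) are introduced here only for illustration.

```latex
\[
\text{Cov}(X,Y)^2
  = \bigl(E[(X-E(X))(Y-E(Y))]\bigr)^2
  \le E[(X-E(X))^2]\; E[(Y-E(Y))^2]
  = D(X)\,D(Y)
\]
\[
\Longrightarrow\quad
|\rho_{XY}| = \frac{|\text{Cov}(X,Y)|}{\sqrt{D(X)}\cdot\sqrt{D(Y)}} \le 1
\]
% Rescaling or shifting either variable (a, c nonzero) changes at most the sign:
\[
\rho_{aX+b,\;cY+d}
  = \frac{ac\,\text{Cov}(X,Y)}{|a|\sqrt{D(X)}\cdot|c|\sqrt{D(Y)}}
  = \frac{ac}{|ac|}\,\rho_{XY}
\]
```

The last line shows that changing the units of \(X\) or \(Y\) (a positive scale factor) leaves the correlation coefficient unchanged.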
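Correlation coefficient of discrete random variables
Let the discrete random variables \((X,Y)\) have the joint distribution \(P(X=x_i,Y=y_j)=p_{ij}\) (\(i,j=1,2,\dots\)). Then:
1. Compute the marginal expectations of \(X\) and \(Y\): \(E(X)=\sum_{i}x_i p_i \quad,\quad E(Y)=\sum_{j}y_j p_j\)
where \(p_i=\sum_{j}p_{ij}\) and \(p_j=\sum_{i}p_{ij}\) are the marginal distributions of \(X\) and \(Y\), respectively.
2. Compute the expectation of \(XY\):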
\(E(XY)=\sum_{i}\sum_{j}x_i y_j p_{ij}\)
3. Compute the covariance:
\(\text{Cov}(X,Y)= \sum_{i}\sum_{j}x_i y_j p_{ij} - \left( \sum_{i}x_i p_i\right) \left( \sum_{j}y_j p_j\right) \)
4. Compute the standard deviations:
\(\sigma_X = \sqrt{D(X)} = \sqrt{\sum_{i}x_i^2 p_i - \left( \sum_{i}x_i p_i\right)^2 }\)
\(\sigma_Y = \sqrt{D(Y)} = \sqrt{\sum_{j}y_j^2 p_j - \left( \sum_{j}y_j p_j\right)^2 }\)
5. Compute the correlation coefficient:
\(\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sqrt{D(X)}\cdot\sqrt{D(Y)} }\)
Example: let the joint distribution of \((X,Y)\) be as follows:
| \(X\setminus Y\) | 1 | 2 |
|----------------|---|---|
| 0 | 0.2 | 0.3 |
| 1 | 0.4 | 0.1 |
- Compute the marginal expectations:
\(E(X)=0\times(0.2+0.3)+1\times(0.4+0.1)=0.5\)
\(E(Y)=1\times(0.2+0.4)+2\times(0.3+0.1)=1\times0.6+2\times0.4=1.4\)
- Compute \(E(XY)\):
\(E(XY)=0\times1\times0.2+0\times2\times0.3+1\times1\times0.4+1\times2\times0.1=0.6\)
- Compute the covariance:
\(\text{Cov}(X,Y) = E(XY) - E(X)E(Y) = 0.6 - 0.5 \times 1.4 = 0.6 - 0.7 = -0.1\)
- Compute the standard deviations:
\(\sigma_X = \sqrt{D(X)} = \sqrt{0^2\times0.5 + 1^2\times0.5 - 0.5^2} = \sqrt{0.5 - 0.25} = \sqrt{0.25} = 0.5\)
\(\sigma_Y = \sqrt{D(Y)} = \sqrt{1^2\times0.6 + 2^2\times0.4 - 1.4^2} = \sqrt{0.6 + 1.6 - 1.96} = \sqrt{0.24} \approx 0.4899\)
- Compute the correlation coefficient:
\(\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_X \cdot \sigma_Y} = \frac{-0.1}{0.5 \times 0.4899} \approx -0.4082\)
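The hand calculation above can be checked numerically. The following is a minimal sketch in Octave; the variable names (p_x, p_y, sigma_x and so on) are chosen here for illustration and are not taken from the original code.

```octave
% Joint distribution from the example: rows are X = 0, 1; columns are Y = 1, 2.
p_ij = [0.2 0.3; 0.4 0.1];
x = [0; 1];            % values of X as a column vector (one per row of p_ij)
y = [1, 2];            % values of Y as a row vector (one per column of p_ij)

p_x = sum(p_ij, 2);    % marginal distribution of X (column vector)
p_y = sum(p_ij, 1);    % marginal distribution of Y (row vector)

EX  = sum(x .* p_x);   % E(X)
EY  = sum(y .* p_y);   % E(Y)
EXY = x' * p_ij * y';  % E(XY) = sum_i sum_j x_i y_j p_ij

cov_xy  = EXY - EX * EY;                   % Cov(X,Y) = E(XY) - E(X)E(Y)
sigma_x = sqrt(sum(x .^ 2 .* p_x) - EX ^ 2);
sigma_y = sqrt(sum(y .^ 2 .* p_y) - EY ^ 2);
rho     = cov_xy / (sigma_x * sigma_y);

printf("E(X) = %.4f, E(Y) = %.4f, E(XY) = %.4f\n", EX, EY, EXY);
printf("Cov = %.4f, sigma_X = %.4f, sigma_Y = %.4f, rho = %.4f\n", ...
       cov_xy, sigma_x, sigma_y, rho);
```

Each printed value should agree with the corresponding step of the worked example, ending with a correlation coefficient of about -0.4082.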
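The program code is as follows. It calls a helper function, get_covariance_discrete, which is not defined in this section.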
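function ret = get_pearson_correlation_coefficient_discrete(p_ij, x_list, y_list)
  % p_ij:   joint probability matrix (rows follow x_list, columns follow y_list)
  % x_list: row vector of the values of X; y_list: row vector of the values of Y
  cov_xy = get_covariance_discrete(p_ij, x_list, y_list);
  p_i = sum(p_ij, 2);                  % marginal distribution of X (column vector)
  EX = sum(x_list .* p_i');            % E(X)
  EX2 = sum((x_list .^ 2) .* p_i');    % E(X^2)
  DX = EX2 - EX^2;                     % D(X) = E(X^2) - E(X)^2
  p_j = sum(p_ij, 1);                  % marginal distribution of Y (row vector)
  EY = sum(y_list .* p_j);             % E(Y)
  EY2 = sum((y_list .^ 2) .* p_j);     % E(Y^2)
  DY = EY2 - EY^2;                     % D(Y) = E(Y^2) - E(Y)^2
  ret = cov_xy / (sqrt(DX) * sqrt(DY));
endfunction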
>> get_pearson_correlation_coefficient_discrete([0.2 0.3; 0.4 0.1], [0, 1], [1, 2])
ans = -0.4082
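The function above calls get_covariance_discrete, whose definition does not appear in this section. A minimal stand-in, assuming it takes the same arguments and returns \(E(XY)-E(X)E(Y)\), might look like the sketch below. This is an assumption for illustration, not necessarily the original helper.

```octave
1;  % marks this file as a script so the function below can be defined inline

% Hypothetical stand-in for get_covariance_discrete (not the author's version).
% p_ij: joint probability matrix, rows indexed by x_list, columns by y_list.
function ret = get_covariance_discrete(p_ij, x_list, y_list)
  x = x_list(:);              % force a column vector
  y = y_list(:);              % force a column vector
  EX  = x' * sum(p_ij, 2);    % E(X) from the row marginals
  EY  = sum(p_ij, 1) * y;     % E(Y) from the column marginals
  EXY = x' * p_ij * y;        % E(XY) = sum_i sum_j x_i y_j p_ij
  ret = EXY - EX * EY;        % Cov(X,Y) = E(XY) - E(X)E(Y)
endfunction

cov_xy = get_covariance_discrete([0.2 0.3; 0.4 0.1], [0, 1], [1, 2]);
printf("Cov(X,Y) = %.4f\n", cov_xy);
```

For the example table this should print a covariance of -0.1000, matching the hand calculation. To use the helper from other files, the function would normally be saved on its own as get_covariance_discrete.m, without the leading `1;` line.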
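Correlation coefficient of continuous random variables
Let the continuous random variables \((X,Y)\) have the joint probability density \(f(x,y)\). The computation proceeds as follows:
1. Compute the marginal expectations of \(X\) and \(Y\): \(E(X)=\int_{-\infty}^{+\infty}x f_X(x)dx \quad,\quad E(Y)=\int_{-\infty}^{+\infty}y f_Y(y)dy\)
where \(f_X(x)=\int_{-\infty}^{+\infty}f(x,y)dy\) and \(f_Y(y)=\int_{-\infty}^{+\infty}f(x,y)dx\) are the marginal probability densities.
2. Compute the expectation of \(XY\):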
\(E(XY)=\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}xy f(x,y)dxdy\)
3. Substitute into the computational formula \(\text{Cov}(X,Y)=E(XY)-E(X)E(Y)\) to obtain the covariance:
\(\text{Cov}(X,Y)=\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}xy f(x,y)dxdy - E(X)E(Y)\)
4. Compute the standard deviations:
\(\sigma_X = \sqrt{D(X)} = \sqrt{\int_{-\infty}^{+\infty}x^2 f_X(x)dx - \left( \int_{-\infty}^{+\infty}x f_X(x)dx\right)^2 }\)
\(\sigma_Y = \sqrt{D(Y)} = \sqrt{\int_{-\infty}^{+\infty}y^2 f_Y(y)dy - \left( \int_{-\infty}^{+\infty}y f_Y(y)dy\right)^2 }\)
5. Compute the correlation coefficient:
\(\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sqrt{D(X)}\cdot\sqrt{D(Y)} }\)
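When a closed-form result is not required, the five steps can also be approximated by numerical integration. The sketch below assumes the density is supported on a rectangle \([a,b]\times[c,d]\), is given as a vectorized function handle, and that integral2 is available (Octave 4.4 or later). The name pearson_numeric and the second test density \(f(x,y)=x+y\) are introduced here for illustration and do not come from the original text.

```octave
1;  % marks this file as a script so the function below can be defined inline

% Hypothetical helper: numeric Pearson correlation for a density f on [a,b] x [c,d].
% f must be a vectorized handle f(x, y) that integrates to 1 over the rectangle.
function rho = pearson_numeric(f, a, b, c, d)
  % m(g) approximates E[g(X,Y)] = double integral of g(x,y) * f(x,y)
  m = @(g) integral2(@(x, y) g(x, y) .* f(x, y), a, b, c, d);
  EX  = m(@(x, y) x);
  EY  = m(@(x, y) y);
  EXY = m(@(x, y) x .* y);
  DX  = m(@(x, y) x .^ 2) - EX ^ 2;   % D(X) = E(X^2) - E(X)^2
  DY  = m(@(x, y) y .^ 2) - EY ^ 2;   % D(Y) = E(Y^2) - E(Y)^2
  rho = (EXY - EX * EY) / sqrt(DX * DY);
endfunction

% Uniform density on the unit square: X and Y are independent.
rho_uniform = pearson_numeric(@(x, y) ones(size(x)), 0, 1, 0, 1);
% Density f(x,y) = x + y on the unit square: X and Y are weakly negatively correlated.
rho_sum = pearson_numeric(@(x, y) x + y, 0, 1, 0, 1);
printf("uniform: %.6f, x + y: %.6f\n", rho_uniform, rho_sum);
```

For the uniform density the result should be zero up to numerical error, and for \(f(x,y)=x+y\) it should be close to the exact value \(-\frac{1}{11}\approx-0.0909\). Unlike the symbolic approach, this version returns only floating-point approximations.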
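Example: the uniform distribution. Let \((X,Y)\) be uniformly distributed on the region \(0\le x\le1,0\le y\le1\), with joint density \(f(x,y)=1\).
- Compute the marginal expectations: \(E(X)=\int_0^1 x dx=\frac{1}{2}\), \(E(Y)=\frac{1}{2}\)
- Compute \(E(XY)\): \(E(XY)=\int_0^1\int_0^1 xy dxdy=\frac{1}{4}\)
- Compute the covariance: \(\text{Cov}(X,Y)=\frac{1}{4}-\frac{1}{2}\times\frac{1}{2}=0\)
- Compute the standard deviations: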
\(\sigma_X = \sqrt{D(X)} = \sqrt{\int_0^1 x^2 dx - \left( \int_0^1 x dx\right)^2 } = \sqrt{\frac{1}{3} - \left(\frac{1}{2}\right)^2} = \sqrt{\frac{1}{12} } \approx 0.2887\)
\(\sigma_Y = \sqrt{D(Y)} = \sqrt{\int_0^1 y^2 dy - \left( \int_0^1 y dy\right)^2 } = \sqrt{\frac{1}{3} - \left(\frac{1}{2}\right)^2} = \sqrt{\frac{1}{12} } \approx 0.2887\)
- Compute the correlation coefficient:
\(\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_X \cdot \sigma_Y} = \frac{0}{0.2887 \times 0.2887} = 0\)
The program code is as follows. It relies on the Octave symbolic package.
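function ret = get_pearson_correlation_coefficient_continuous(f, a, b, c, d)
  % f: joint density as an expression in the symbols x and y
  % [a, b] x [c, d]: rectangle on which the density is supported
  pkg load symbolic;
  syms x y;
  f_X = int(f, y, c, d);   % marginal density of X
  f_Y = int(f, x, a, b);   % marginal density of Y
  EX = 0;
  EY = 0;
  EXY = 0;
  try
    % Convert each moment to a floating-point number when possible
    EXY = double(int(int(x * y * f, x, a, b), y, c, d));
    EX = double(int(x * f_X, x, a, b));
    EX2 = double(int(x^2 * f_X, x, a, b));
    EY = double(int(y * f_Y, y, c, d));
    EY2 = double(int(y^2 * f_Y, y, c, d));
  catch
    % Otherwise keep the symbolic results
    EXY = int(int(x * y * f, x, a, b), y, c, d);
    EX = int(x * f_X, x, a, b);
    EX2 = int(x^2 * f_X, x, a, b);
    EY = int(y * f_Y, y, c, d);
    EY2 = int(y^2 * f_Y, y, c, d);
  end_try_catch
  DX = EX2 - EX^2;         % D(X) = E(X^2) - E(X)^2
  DY = EY2 - EY^2;         % D(Y) = E(Y^2) - E(Y)^2
  cov_xy = EXY - EX * EY;  % Cov(X,Y) = E(XY) - E(X)E(Y)
  ret = cov_xy / (sqrt(DX) * sqrt(DY));
endfunction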
>> a = 0;
>> b = 1;
>> c = 0;
>> d = 1;
>> f = 1;
>> get_pearson_correlation_coefficient_continuous(f, a, b, c, d)
ans = 0