Robust fit
这一篇差不多是作为这个的后续
scipy.optimize.least_squares
用来搞Robust fit的有好几个,这里先从scipy的这个函数讲起
先看一下scipy cookbook里面的这个例子
这个例子里拟合了一个正弦函数
1 | def generate_data(t, A, sigma, omega, noise=0, n_outliers=0, random_state=0): |
确定模型参数
1 | A = 2 |
将三个离群值放在fitting dataset里
1 | t_train = np.linspace(t_min, t_max, 30) |
定义损失函数
1 | def fun(x, t, y): |
剩下就是一些常规的过程
1 | x0 = np.ones(3) |
很清楚的可以看出来,这里robust lsq结果明显更接近那个true的线
这里只用了soft l1,在least_squares里还有另外几个
loss str or callable, optional
Determines the loss function. The following keyword values are allowed:
- ‘linear’ (default) :
rho(z) = z
. Gives a standard least-squares problem.- ‘soft_l1’ :
rho(z) = 2 * ((1 + z)**0.5 - 1)
. The smooth approximation of l1 (absolute value) loss. Usually a good choice for robust least squares.- ‘huber’ :
rho(z) = z if z <= 1 else 2*z**0.5 - 1
. Works similarly to ‘soft_l1’.- ‘cauchy’ :
rho(z) = ln(1 + z)
. Severely weakens outliers influence, but may cause difficulties in optimization process.- ‘arctan’ :
rho(z) = arctan(z)
. Limits a maximum loss on a single residual, has properties similar to ‘cauchy’.
比较难受的就是这次没法像curve_fit那个函数一样那么好用了
就 我还要自己手动写个cost func
另一个例子
这个例子来自于这儿。
感谢这个例子教会我写cost func
Define the model function as y = a + b * exp(c * t)
, where t is a predictor variable, y is an observation and a, b, c are parameters to estimate
First, define the function which generates the data with noise and outliers, define the model parameters, and generate data:
1 | def gen_data(t, a, b, c, noise=0, n_outliers=0, random_state=0): |
Define function for computing residuals and initial estimate of parameters.
1 | def fun(x, t, y): |
NOTE: x0,x1,x2 corresponding to a,b,c; and t is what we always treat as x
Compute a standard least-squares solution:
1 | res_lsq = least_squares(fun, x0, args=(t_train, y_train)) |
Now compute two solutions with two different robust loss functions. The parameter f_scale is set to 0.1, meaning that inlier residuals should not significantly exceed 0.1 (the noise level used).
1 | 'soft_l1', f_scale=0.1, res_soft_l1 = least_squares(fun, x0, loss= |
1 | t_test = np.linspace(t_min, t_max, n_points * 10) |
这个例子后面还有一个解决复数优化的问题,可真是牛逼
scikit learn
这个大体上相同,
1 | from matplotlib import pyplot as plt |
有个PolynomialFeatures(3)的意思是,可以从0阶多项式拟合到3阶多项式
Robust fitting is demoed in different situations:
- No measurement errors, only modelling errors (fitting a sine with a polynomial)
- Measurement errors in X
- Measurement errors in y
The median absolute deviation to non corrupt new data is used to judge the quality of the prediction.
What we can see that:
- RANSAC is good for strong outliers in the y direction
- TheilSen is good for small outliers, both in direction X and y, but has a break point above which it performs worse than OLS.
- The scores of HuberRegressor may not be compared directly to both TheilSen and RANSAC because it does not attempt to completely filter the outliers but lessen their effect.
iterative bi-square method
其实我最想找的是这个,这个来源于这篇文献
但是找来找去都没找到python的包
啊 又逼着我改用R了