A little closer to Cook’s distance

How do Cook’s distance and outliers relate to each other?


The cut-off values (controversial):

  1. If a data point has a Cook’s distance of more than three times the mean Cook’s distance, it is a possible outlier
  2. Any point over 4/n, where n is the number of observations, should be examined
  3. Find the potential outlier’s percentile value using the F-distribution; a percentile of over 50 indicates a highly influential point
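The three rules of thumb above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name `cooks_cutoff_flags`, the made-up distances, and the choice of F(p, n - p) degrees of freedom (p = number of fitted parameters, a common convention) are not from the original analysis.

```python
import numpy as np
from scipy import stats

def cooks_cutoff_flags(cooks_d, n_params):
    """Apply the three rule-of-thumb cut-offs to an array of Cook's distances.

    Hypothetical helper; n_params is the number of fitted coefficients
    (2 for a single predictor plus intercept).
    """
    d = np.asarray(cooks_d, dtype=float)
    n = d.size
    # Rule 1: more than three times the mean Cook's distance
    over_three_means = d > 3 * d.mean()
    # Rule 2: more than 4/n
    over_four_over_n = d > 4 / n
    # Rule 3: percentile of each distance under F(p, n - p); flag if above 50
    percentile = 100 * stats.f.cdf(d, n_params, n - n_params)
    over_f_median = percentile > 50
    return over_three_means, over_four_over_n, over_f_median

# Made-up distances for 20 observations: two of them stand out
d = np.full(20, 0.02)
d[7] = 0.25
d[15] = 0.9

rule1, rule2, rule3 = cooks_cutoff_flags(d, n_params=2)
print(np.flatnonzero(rule1))  # indices flagged by the 3 x mean rule
print(np.flatnonzero(rule2))  # indices flagged by the 4/n rule
print(np.flatnonzero(rule3))  # indices flagged by the F-percentile rule
```

In this toy example the first two rules flag both unusual points, while the F-percentile rule is stricter and keeps only the larger one, which is one reason the cut-offs are considered controversial.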

When to use Cook’s D

  1. When you suspect influence problems
  2. When graphical displays may not be adequate
  3. When performing a least-squares regression analysis
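
import statsmodels.formula.api as smf
model = smf.ols(formula='price~bedrooms', data=df).fit()  # df is the housing DataFrame; same formula as the refit further down
infl = model.get_influence()  # influence diagnostics for the fitted model
sm_fr = infl.summary_frame()  # one row per observation, including the cooks_d column
sm_fr[:10]  # first ten rows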
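
To get a feel for what a Cook’s distance value measures, here is a minimal NumPy sketch that computes it by hand with the standard formula D_i = e_i² / (p·s²) · h_ii / (1 − h_ii)², where e_i is the residual, h_ii the leverage, p the number of coefficients and s² the residual variance. The function name `cooks_distance` and the toy data are assumptions made for illustration; the x value of 33 is only meant to echo an extreme bedroom count, not to reproduce the real data set.

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's distance for a least-squares fit of y on x with an intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(x)), x])  # design matrix: intercept + slope
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
    h = np.diag(H)                             # leverage of each observation
    resid = y - H @ y                          # ordinary residuals
    s2 = resid @ resid / (n - p)               # residual variance estimate
    return resid**2 / (p * s2) * h / (1 - h)**2

# Toy data: nine points on the line y = 2x + 1 and one extreme point at x = 33
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 33], dtype=float)
y = np.append(2 * x[:9] + 1, 10.0)

d = cooks_distance(x, y)
print(d.round(3))          # Cook's distance per observation
print(int(np.argmax(d)))   # position of the most influential point
```

The extreme point combines high leverage with a response far from the trend of the others, so its distance dwarfs the rest, which is the pattern to look for in the cooks_d column.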

How to interpret Cook’s distance plots

import statsmodels.formula.api as smf
df.bedrooms.max()  # output: 33
df.bedrooms.idxmax()  # output: 15856
df.drop(df.loc[df['bedrooms']==33].index, inplace=True)  # dropping the row where bedrooms is 33
f = 'price~bedrooms'
model = smf.ols(formula=f, data=df).fit()
model.summary()  # running the OLS model again

Other influence statistics


