/Handling missing data with Pandas

Handling missing data with Pandas

Real data is always imperfect and requires cleaning. Our desired prediction algorithm can work well only if it is fed with the right data. Hence, preprocessing of data is of utmost importance and it also turns out to be the most time-consuming step.

We can deal with a dataset with missing values in many ways:

  1. Fill them with a constant value
    • Sometimes missing values are Missing Not at Random (MNAR) i.e. that they are missing for some reason that is reflected by what their value should be. For income, we assume that values are missing because those with very high incomes preferred not to state them. Then, we can make a reasonable guess for what “high” means and fill in the blanks. It will still be inaccurate, but more accurate than it was.
  2. Fill them with an average value
  3. Fill them with a median value
  4. Fill them with mode(the most occurring value)
  5. Drop rows having missing values
  6. Drop a column(feature) if it has too many missing values so that we don’t drop too many rows
  7. Imputation is a way of using features to model each other. That way, when one is missing, the others can be used to fill in the blank in a reasonable way.
    • Multivariate Imputation by Chained Equations (MICE) – This approach is particularly powerful when features are somewhat related. In this example, someone with many years seniority is likely to have a higher age. This means that the two features share information, and that when one is absent the other can be used to estimate it.
  8. Try with and without any change to a feature. Sometimes the algorithm can handle the absence of data
  9. If you are dealing with stock price data, fill the data of missing dates with data available at last date

Pandas has a very powerful missing data handling features

    1. df.fillna(0) OR df.fillna(‘missing’) – Filling with a constant value
    1. df.fillna(method = ‘pad’, limit = 1)  – Filling with the last known value
      1. df.fillna(method=’ffill’) – Fills forward (similar to pad)
      2. df.fillna(method=’bfill’) – Fills backward
    1. df.fillna(df.mean())
    1. df.dropna(axis=0) – Drops rows with missing data
    1. df.dropna(axis=1) – Drops columns with missing data
  1. Interpolation
    1. df.interpolate() – Linear interpolation along that feature
    2. By default, limit applies in a forward direction, so that only NaN values after a non-NaN value can be filled. If you provide ‘backward’ or ‘both’ for the limit_direction keyword argument, you can fill NaN values before non-NaN values, or both before and after non-NaN values, respectively. Ex. df.interpolate(limit=1, limit_direction=’both’)
    3. Like other pandas fill methods, interpolate accepts a limit keyword argument. Use this argument to limit the number of consecutive interpolations, keeping NaN values for interpolations that are too far from the last valid observation
    4. If you are dealing with a time series that is growing at an increasing rate, method=’quadratic‘ may be appropriate. If you have values approximating a cumulative distribution function, then method=’pchip‘ should work well. To fill missing values with goal of smooth plotting, use method=’akima
    5. df.interpolate(method=’polynomial’, order=2)
    6. df.interpolate(method = ‘values’) – Linear interpolation using values of other features
    7. df.interpolate(method = ‘time’) – Linear interpolation using the dates associated with that feature
  2. Replace using RegEx list
    1. df.replace([r’\.’, r'(a)’], [‘dot’, ‘\1stuff’], regex=True)

An AI evangelist and a multi-disciplinary engineer. Loves to read business and psychology during leisure time. Connect with him any time on LinkedIn for a quick chat on AI!