Data Wrangling – preprocessing


Data wrangling (also called preprocessing or data prep) is the most important and time-consuming part of any data science project. Depending on the quality of the data sources, 50%-60% of the initial effort is spent extracting, cleaning, formatting, standardising, encoding categorical data, imputing missing values, removing junk data, and slicing/dicing the data before it is ready to pass to the algorithms. Scikit-learn has a very good guide covering most of these steps, and it is worth going through. It covers:

  • Standardization, or mean removal and variance scaling
  • Non-linear transformation
  • Normalization
  • Encoding categorical features
  • Discretization
  • Imputation of missing values
  • Generating polynomial features
  • Custom transformers
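Several of the steps above can be chained together in scikit-learn. The sketch below is a minimal, illustrative example, assuming scikit-learn is installed: it imputes a missing numeric value, standardises the numeric column, and one-hot encodes a categorical column. The toy data and column indices are made up for demonstration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: column 0 is numeric (with one missing value),
# column 1 is categorical. Purely illustrative.
X = np.array([
    [25.0, "red"],
    [np.nan, "blue"],
    [35.0, "red"],
], dtype=object)

# Numeric columns: fill missing values with the column mean,
# then scale to zero mean and unit variance.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Apply the numeric pipeline to column 0 and one-hot
# encoding to column 1 in a single transformer.
prep = ColumnTransformer([
    ("num", numeric, [0]),
    ("cat", OneHotEncoder(), [1]),
])

Xt = prep.fit_transform(X)
print(Xt.shape)  # 1 scaled numeric column + 2 one-hot columns
```

Fitting the transformer once on training data and reusing it (via `transform`) on new data keeps the same means, variances, and category mappings, which avoids leakage between train and test sets.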

Imputation of missing values