Scikit-learn Pitfalls

#machine learning #pitfalls #scikit learn #programming
  • Use pipelines for both pre-processing transformations and models
  • Test set is only for testing, nothing else
    • Always split the data first
    • fit* functions can never see test set, but usage of transform should be consistent
  • Randomness
    • Use your own global RandomState to make sure it is reproducible, but also have some randomness between cross validations splits
    • Avoid setting global random seed, as it will fix any randomness for any code involved