Beautiful Machine Learning Pipeline with Scikit-Learn
Feature engineering is often the most complex part of applying machine learning to a product. This note shares practices for doing feature engineering and model training with scikit-learn, based on my personal experience.
Before introducing my strategies, let's review common feature engineering problems first:
- handling missing values
- normalization / standardization
- feature interaction
- label encoding
- one hot encoding
When starting the feature engineering part of developing a machine learning model, we usually need to try many possible solutions and iterate quickly over different combinations of feature tricks. Many articles explain how to do the feature engineering tasks above. But when you want to apply different approaches to different features, the code can become complicated: it may contain multiple numpy / scipy transformations whose results are fed back into scikit-learn pipelines. Such code is hard to maintain and hard to debug when a problem occurs. In this article, I want to introduce several tricks in scikit-learn for building a machine learning model pipeline that covers:
- feature engineering on different columns
- ensemble learning with customized transformers
- deep learning API with complicated feature pipeline
We define some code snippets for the input / output data here before we go into the details:
import pandas as pd

train_data = pd.read_csv("input_data.csv")
train_labels = pd.read_csv("input_labels.csv")
predict_data = pd.read_csv("predict_data.csv")
Idea 1. Naive Feature Engineering
Let's see how to do simple feature engineering without a pipeline.
Code Example
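import numpy as np
from sklearn.decomposition import NMF, PCA
from sklearn.ensemble import RandomForestClassifier

# Fit each transformer on the training data, then transform it.
pca_transform = PCA(n_components=10)
pca_transform.fit(train_data.values)
pca_transform_data = pca_transform.transform(train_data.values)
# Note: NMF requires non-negative input values.
nmf_transform = NMF()
nmf_transform.fit(train_data.values)
nmf_transform_data = nmf_transform.transform(train_data.values)
# np.hstack takes a single sequence (tuple or list) of arrays.
union_data = np.hstack((pca_transform_data, nmf_transform_data))
model = RandomForestClassifier()
model.fit(union_data, train_labels.values)
# Repeat the same transformations for the data to predict.
pca_transform_predict_data = pca_transform.transform(predict_data.values)
nmf_transform_predict_data = nmf_transform.transform(predict_data.values)
# The column order must match the order used for training.
union_predict_data = np.hstack(
  (pca_transform_predict_data, nmf_transform_predict_data))
predictions = model.predict(union_predict_data)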
As you can see, this is a pretty intuitive implementation if you want to apply some feature engineering tricks to your data. But you can imagine that the code will grow into a messy monster once you apply many tricks to different features.
Pros
- Naive implementation
Cons
- Need to take care of many details of the numpy / scipy interface
- A lot of duplicated code to do similar things
Idea 2. Scikit Learn Model Pipeline
To make the whole operation cleaner, scikit-learn provides a Pipeline API that lets users create a machine learning pipeline without worrying about the details of passing data between stages.
Code Example
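from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

model_pipeline = Pipeline(steps=[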
  ("dimension_reduction", PCA(n_components=10)),
  ("classifiers", RandomForestClassifier())
])
model_pipeline.fit(train_data.values, train_labels.values)
predictions = model_pipeline.predict(predict_data.values)
Pros
- Get rid of handling details between two stages.
- Code is easy to maintain
Cons
- With this implementation, you can only apply one type of transformation to the given features. But this is the first step toward making your pipeline more elegant.
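One practical consequence of wrapping both stages in a single object is that the whole pipeline can be passed to scikit-learn's model selection helpers, which refit every step on each training fold. The sketch below illustrates this with `cross_val_score`. The synthetic arrays and the small `n_components` / `n_estimators` values are assumptions made only to keep the example self-contained; they stand in for `train_data` and `train_labels`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for train_data / train_labels (assumption).
rng = np.random.RandomState(0)
X_train = rng.rand(60, 5)
y_train = (X_train[:, 0] > 0.5).astype(int)

pipe = Pipeline(steps=[
    ("dimension_reduction", PCA(n_components=3)),
    ("classifiers", RandomForestClassifier(n_estimators=10, random_state=0)),
])

# Each fold clones the pipeline and fits PCA and the classifier
# on that fold's training part only.
scores = cross_val_score(pipe, X_train, y_train, cv=3)
print(scores)

# A fitted step can be inspected by name after fitting the pipeline.
pipe.fit(X_train, y_train)
print(pipe.named_steps["dimension_reduction"].explained_variance_ratio_)
```

With the hand-written version from Idea 1, the same cross-validation would require repeating the fit / transform calls for every fold.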
Idea 3. Feature Union with Pipeline
If you want to apply different feature processing methods to your dataset, you can try the FeatureUnion API. It provides a simple way to merge the arrays produced by different types of transformations. Here is a code snippet that uses it:
Code Example
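from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline

model_pipeline = Pipeline(steps=[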
  ("feature_union", FeatureUnion([
    ("pca", PCA(n_components=1)),
    ("svd", TruncatedSVD(n_components=2))
  ])),
  ("classifiers", RandomForestClassifier())
])
model_pipeline.fit(train_data.values, train_labels.values)
predictions = model_pipeline.predict(predict_data.values)
Pros
- Use different feature transformers without separating your code into several parts and composing them by hand.
Cons
- Cannot apply different transformations to different features
- Cannot pass a pandas DataFrame directly and access its columns by name inside the pipeline.
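To make the merging behaviour concrete, here is a small sketch of what a FeatureUnion produces on its own. It uses a random matrix as an assumed stand-in for `train_data.values` and shows that every transformer receives the full input and that the outputs are concatenated column-wise.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.pipeline import FeatureUnion

# Random matrix standing in for train_data.values (assumption).
rng = np.random.RandomState(0)
X = rng.rand(20, 6)

union = FeatureUnion([
    ("pca", PCA(n_components=1)),
    ("svd", TruncatedSVD(n_components=2)),
])
merged = union.fit_transform(X)

# 1 PCA column followed by 2 SVD columns.
print(merged.shape)  # (20, 3)
```

Because both transformers see all six input columns, this also illustrates the first limitation listed above: there is no way here to send only some columns to one transformer.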
Idea 4. Idea 3 + Column Transformer
With Idea 3, you can easily implement a pipeline with different transformations. But the two problems mentioned above remain. While trying to solve them, I surveyed different materials and found the ColumnTransformer API. I really like this API because it lets you write your pipeline almost like a configuration and train / predict on your data with a single command.
Code Example
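import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import NMF
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder

model_pipeline = Pipeline(steps=[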
  ("features", FeatureUnion([
    (
      "numerical_features",
      ColumnTransformer([
        (
          "numerical",
          Pipeline(steps=[(
            "impute_stage",
            SimpleImputer(missing_values=np.nan, strategy="median")
          )]),
          ["feature_1"]
        )
      ])
    ), (
      "categorical_features",
      ColumnTransformer([
        (
          "country_encoding",
          Pipeline(steps=[
            ("ohe", OneHotEncoder(handle_unknown="ignore")),
            ("reduction", NMF(n_components=8)),
          ]),
          ["country"],
        ),
      ])
    ), (
      "text_features",
      ColumnTransformer([
        (
          "title_vec",
          Pipeline(steps=[
            ("tfidf", TfidfVectorizer()),
            ("reduction", NMF(n_components=50)),
          ]),
          "title"
        )
      ])
    )
  ])),
  ("classifiers", RandomForestClassifier())
])
model_pipeline.fit(train_data, train_labels.values)
predictions = model_pipeline.predict(predict_data)
Pros
- All data transformations can be integrated into a single model pipeline that is easy to maintain. You can separate different types of data, such as numerical and categorical data, and process each with different methods.
Cons
- I haven't found any difficulty with this kind of implementation for feature engineering so far.
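A ColumnTransformer concatenates the outputs of its own entries, so a flatter layout with a single ColumnTransformer is also possible. The sketch below is one such variant, not the configuration used above: it drops the NMF reduction steps, and the toy DataFrame is an assumption that only reuses the column names `feature_1`, `country`, and `title` from the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy data (assumption); column names follow the example above.
toy = pd.DataFrame({
    "feature_1": [1.0, np.nan, 3.0, 4.0],
    "country": ["TW", "US", "TW", "JP"],
    "title": ["red apple", "green apple", "red car", "blue car"],
})

features = ColumnTransformer([
    # A list of column names passes a 2D slice to the transformer.
    ("numerical", SimpleImputer(strategy="median"), ["feature_1"]),
    ("country_encoding", OneHotEncoder(handle_unknown="ignore"), ["country"]),
    # A plain string passes a 1D column, which text vectorizers expect.
    ("title_vec", TfidfVectorizer(), "title"),
])
transformed = features.fit_transform(toy)

# 1 numeric column + 3 country columns + 5 title terms.
print(transformed.shape)  # (4, 9)

# Unseen categories are encoded as all zeros instead of raising an error.
new_rows = pd.DataFrame({
    "feature_1": [np.nan], "country": ["FR"], "title": ["red apple"],
})
print(features.transform(new_rows).shape)  # (1, 9)
```

Note the difference between `["country"]` and `"title"`: the column specification controls whether the transformer sees a 2D or a 1D input.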
More tricks in your pipeline
With the tricks above, you can create a machine learning pipeline elegantly. Here I want to introduce some more advanced tricks, covering:
- how to do stacking ensemble learning in your pipeline
- how to integrate keras in your pipeline
Stacking ensemble methods in a pipeline
As you know, we usually use stacking to avoid the bias of one specific method. If you are still new to stacking, you can read this tutorial first. When implementing stacking, the question is: how do you make the stacking method one step in your pipeline? I read this material, and the key idea for creating such a step is to build a customized transformer class.
The implementation here is not a perfect one, but it is a good starting point to expand on.
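A customized transformer for stacking could look roughly like the sketch below. The class name `ClassifierAsTransformer` is hypothetical, and the synthetic data and choice of base models are assumptions for illustration. The idea is to wrap a classifier so that its predicted probabilities become features, combine several wrapped classifiers with a FeatureUnion, and put a meta-model as the last pipeline step.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


class ClassifierAsTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: exposes predicted probabilities as features."""

    def __init__(self, classifier):
        self.classifier = classifier

    def fit(self, X, y):
        # Fit a clone so the estimator passed to __init__ stays untouched.
        self.classifier_ = clone(self.classifier)
        self.classifier_.fit(X, y)
        return self

    def transform(self, X):
        return self.classifier_.predict_proba(X)


# Synthetic binary classification data (assumption).
rng = np.random.RandomState(0)
X = rng.rand(80, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

stacking_pipeline = Pipeline(steps=[
    ("base_models", FeatureUnion([
        ("forest", ClassifierAsTransformer(
            RandomForestClassifier(n_estimators=10, random_state=0))),
        ("logistic", ClassifierAsTransformer(LogisticRegression())),
    ])),
    ("meta_model", LogisticRegression()),
])
stacking_pipeline.fit(X, y)

# Two probability columns per base model.
meta_features = stacking_pipeline.named_steps["base_models"].transform(X)
predictions = stacking_pipeline.predict(X)
print(meta_features.shape)  # (80, 4)
```

This simplified version trains the meta-model on predictions the base models made for their own training data, which can overfit. Stacking is usually done with out-of-fold predictions; scikit-learn 0.22 and later ship `sklearn.ensemble.StackingClassifier`, which handles that internally.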
Quickly adapt neural network model with Keras API
The Keras scikit-learn API provides a simple way to integrate your neural network model with the scikit-learn API. You can quickly implement your Keras model and plug it into your custom pipeline as one step of your pipeline object.
One drawback is that the steps outside the neural network cannot be optimized by the neural network. You still need to optimize the feature engineering part yourself, but the pipeline can handle the data preprocessing if you want to use a neural network in it. Another point is that I haven't tried to implement the pipeline with a multi-input neural network, so I don't know how to integrate a multi-input neural network into a scikit-learn pipeline.
Conclusion
This is just a simple introduction to give some ideas on how to do feature engineering in an elegant way. I believe there are still many awesome tricks that can help us create machine learning pipelines with simple code. After surveying the documentation and API design of scikit-learn, I appreciate their thinking on machine learning development and think it is well worth following.


One more quick question, a general question about sklearn pipelines rather than a specific problem I’ve encountered. Theoretically, can you use sklearn pipelines in combination with grid search to optimize over a choice of strategy? For example, with encoding strategies, is it possible to define OrdinalEncoding OR TargetEncoding in the pipeline and then run GridSearch to find the best encoding method for the problem? Or is it only possible to use pipelines to optimize the parameters within?
Don’t worry about giving a specific example, unless of course you have time. I just need to understand if it is plausible.
You can define a step in the pipeline and use grid search to set the step to None, e.g. grid = {"encoding_step": [Encoder(*args), None]}
Pass this to gridsearch.
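As a rough sketch of that idea: a parameter grid can list whole estimators as candidate values for a named step, so grid search can compare encoding strategies and not only their parameters. The toy data, the step names, and the pairing of `OneHotEncoder` with `OrdinalEncoder` are assumptions for illustration. The string "passthrough" can also be listed as a candidate to skip a step, provided the next step can handle the raw input.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy categorical data (assumption).
toy = pd.DataFrame({"country": ["TW", "US", "JP", "TW", "US", "JP"] * 4})
labels = [0, 1, 1, 0, 1, 1] * 4

pipeline = Pipeline(steps=[
    ("encoding_step", OneHotEncoder(handle_unknown="ignore")),
    ("classifier", LogisticRegression()),
])

# The key is the step name; each value replaces the whole step.
param_grid = {
    "encoding_step": [
        OneHotEncoder(handle_unknown="ignore"),
        OrdinalEncoder(),
    ],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(toy, labels)

best_encoder = search.best_params_["encoding_step"]
print(type(best_encoder).__name__, search.best_score_)
```

Parameters of a candidate can be searched at the same time by using a list of grids, one per encoder, with keys such as `encoding_step__<parameter>`.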
Hi Bruce, thanks for the article. I’m new to sklearn pipelines and this was a really good introduction and overview. I am having some trouble with encoding using the method in this post. I was wondering if you would be able to help me. I wrote a question on stack overflow:
https://datascience.stackexchange.com/questions/61323/error-encoding-categorical-features-using-sklearn-pipelines
If you have a chance to take a look, feel free to answer here or on Stack Overflow!
Great article, thanks for posting.
I am using a similar approach with a Keras model that does take multiple inputs, to accommodate ordinal-encoded category embeddings where one code might show up in several features. It goes something like this:
At the end of the pipeline you have a list of arrays to feed to the Keras model.
Thanks Wade! Pretty interesting method, I will try it in my keras code!