Meta Models

Certain models in scikit-lego are “meta”. Meta models are models that depend on other estimators: an estimator goes in and the meta model adds extra behaviour on top of it. One way of thinking about a meta model is to consider it a way to “decorate” a model.

This part of the documentation will highlight a few of them.

Thresholder

The thresholder can help tweak the precision and recall of a model by moving the threshold value of predict_proba. For two classes this threshold is commonly set at 0.5. This meta-model decorates a binary classifier such that this threshold can be moved.
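Conceptually, for a binary classifier the effect is roughly equivalent to thresholding the positive-class probability yourself. Below is a simplified sketch of the idea (not the actual implementation), assuming clf is an already fitted binary classifier:

def predict_with_threshold(clf, X, threshold=0.5):
    # compare the probability of the second class against a custom
    # threshold instead of the implicit default of 0.5
    return (clf.predict_proba(X)[:, 1] >= threshold).astype(int)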

We demonstrate how this works below. First we’ll generate a skewed dataset.

[1]:
import numpy as np
import pandas as pd
import matplotlib.pylab as plt

from sklearn.pipeline import Pipeline
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, accuracy_score, make_scorer

from sklego.meta import Thresholder
[2]:
X, y = make_blobs(1000, centers=[(0, 0), (1.5, 1.5)], cluster_std=[1, 0.5])
plt.scatter(X[:, 0], X[:, 1], c=y, s=5);
[figure _images/meta_2_0.png: scatter plot of the skewed dataset]

Next we’ll make a cross validation pipeline to try out this thresholder.

[3]:
pipe = Pipeline([
    ("model", Thresholder(LogisticRegression(solver='lbfgs'), threshold=0.1))
])

mod = GridSearchCV(estimator=pipe,
                   param_grid = {"model__threshold": np.linspace(0.1, 0.9, 50)},
                   scoring={"precision": make_scorer(precision_score),
                            "recall": make_scorer(recall_score),
                            "accuracy": make_scorer(accuracy_score)},
                   refit="precision",
                   cv=5)

mod.fit(X, y);

With this cross validation trained, we’ll make a chart to show the effect of changing the threshold value.

[4]:
(pd.DataFrame(mod.cv_results_)
 .set_index("param_model__threshold")
 [['mean_test_precision', 'mean_test_recall', 'mean_test_accuracy']]
 .plot(figsize=(16, 4)));
[figure _images/meta_6_0.png: mean test precision, recall and accuracy as a function of the threshold]

Increasing the threshold increases the precision but, as expected, this comes at the cost of recall (and accuracy).

Grouped Estimation

To help explain what the GroupedEstimator can do, we’ll consider three methods to predict the chicken weight. The chicken data has 578 rows and 4 columns from an experiment on the effect of diet on the early growth of chicks. The body weights of the chicks were measured at birth and every second day thereafter until day 20. They were also measured on day 21. There were four groups of chicks on different protein diets.

Setup

Let’s first load a bunch of things to do this.

[5]:
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error

from sklego.datasets import load_chicken
from sklego.preprocessing import ColumnSelector

df = load_chicken(give_pandas=True)

def plot_model(model):
    df = load_chicken(give_pandas=True)
    model.fit(df[['diet', 'time']], df['weight'])
    metric_df = df[['diet', 'time', 'weight']].assign(pred=lambda d: model.predict(d[['diet', 'time']]))
    metric = mean_absolute_error(metric_df['weight'], metric_df['pred'])
    plt.figure(figsize=(12, 4))
    plt.scatter(df['time'], df['weight'])
    for i in [1, 2, 3, 4]:
        pltr = metric_df[['time', 'diet', 'pred']].drop_duplicates().loc[lambda d: d['diet'] == i]
        plt.plot(pltr['time'], pltr['pred'], color='.rbgy'[i])
    plt.title(f"linear model per group, MAE: {np.round(metric, 2)}");

This code will be used to explain the steps below.

Model 1: Linear Regression with Dummies

First we start with a baseline.

[6]:
feature_pipeline = Pipeline([
    ("datagrab", FeatureUnion([
         ("discrete", Pipeline([
             ("grab", ColumnSelector("diet")),
             ("encode", OneHotEncoder(categories="auto", sparse=False))
         ])),
         ("continous", Pipeline([
             ("grab", ColumnSelector("time")),
             ("standardize", StandardScaler())
         ]))
    ]))
])

pipe = Pipeline([
    ("transform", feature_pipeline),
    ("model", LinearRegression())
])

plot_model(pipe)
[figure _images/meta_10_0.png: linear model with dummy variables, predictions per diet]

Because the model is linear, the dummy variable causes the intercept to change but leaves the gradient untouched. This might not be what we want from a model. So let’s see how the grouped model can address this.
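In formula form the fitted model is

\[\hat{y} = \beta_0 + \beta_{\text{time}} \cdot \text{time} + \sum_d \beta_d \cdot \mathbb{1}[\text{diet} = d]\]

where the time coefficient \(\beta_{\text{time}}\) is shared by all diets, which is why every group ends up with the same slope.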

Model 2: Linear Regression in GroupedEstimation

The goal of the grouped estimator is to allow us to split up our data and train a model per group.

We train five models in total because the meta-model will also train a fallback model automatically (you can turn this off via use_fallback=False). The idea behind the fallback is that we can still predict something when a group appears at prediction time that was not seen during training.

Each model only receives the features in X that are not part of the grouping variables. In this case each group’s model is fitted on time alone, since weight is what we’re trying to predict.
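Conceptually the grouped estimator does something like the following simplified sketch (not the actual implementation), using the df loaded above:

from sklearn.base import clone
from sklearn.linear_model import LinearRegression

models = {}
for diet, subset in df.groupby("diet"):
    # every group gets its own fresh copy of the estimator, trained only on
    # the remaining (non-grouping) features, here just "time"
    models[diet] = clone(LinearRegression()).fit(subset[["time"]], subset["weight"])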

Applying this model to the dataframe is easy.

[7]:
from sklego.meta import GroupedEstimator
mod = GroupedEstimator(LinearRegression(), groups=["diet"])
plot_model(mod)
[figure _images/meta_12_0.png: grouped linear regression, one fit per diet]

And the model looks a bit better.

Model 3: Dummy Regression in GroupedEstimation

We could go a step further and train a DummyRegressor per diet per timestep. The code below works similarly to the previous example, but one difference is that the grouped model does not receive a dataframe but a numpy array, which is why the groups are specified by column index.

Note that we’re also grouping over more than one column here. The code that does this is listed below.

[8]:
from sklearn.dummy import DummyRegressor

feature_pipeline = Pipeline([
    ("datagrab", FeatureUnion([
         ("discrete", Pipeline([
             ("grab", ColumnSelector("diet")),
         ])),
         ("continous", Pipeline([
             ("grab", ColumnSelector("time")),
         ]))
    ]))
])

pipe = Pipeline([
    ("transform", feature_pipeline),
    ("model", GroupedEstimator(DummyRegressor(strategy="mean"), groups=[0, 1]))
])

plot_model(pipe)
[figure _images/meta_14_0.png: grouped DummyRegressor predictions, mean weight per diet and timestep]

Note that these predictions seem to yield the lowest error, but take that with a grain of salt since these errors are only based on the train set.

Decayed Estimation

Often you are interested in predicting the future. You use data from the past in an attempt to achieve this, and it could be said that data from the far past is less relevant than data from the recent past.

This is the idea behind the DecayEstimator meta-model. It looks at the order of the data going in and assigns a higher importance to rows that occurred recently and a lower importance to older rows. Recency is based on the order, so it is important that the dataset you pass in is ordered correctly beforehand.
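Roughly speaking, this is achieved by computing exponentially decaying sample weights and passing those to the wrapped estimator during fit. A small sketch of the weighting idea, assuming a decay of 0.9 and five rows:

import numpy as np

decay, n_rows = 0.9, 5
# the most recent row gets weight 1, older rows get exponentially less
weights = decay ** np.arange(n_rows - 1, -1, -1)
# array([0.6561, 0.729 , 0.81  , 0.9   , 1.    ])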

We’ll demonstrate how it works by applying it on a simulated timeseries problem.

[9]:
from sklearn.dummy import DummyRegressor
from sklego.meta import GroupedEstimator, DecayEstimator
from sklego.datasets import make_simpleseries

yt = make_simpleseries(seed=1)
df = (pd.DataFrame({"yt": yt,
                   "date": pd.date_range("2000-01-01", periods=len(yt))})
      .assign(m=lambda d: d.date.dt.month)
      .reset_index())

plt.figure(figsize=(12, 3))
plt.plot(make_simpleseries(seed=1));
[figure _images/meta_16_0.png: the simulated timeseries]

We will create two models on this dataset. One model calculates the average value per month in our timeseries and the other does the same thing but will decay the importance of making accurate predictions for the far history.

[10]:
mod1 = (GroupedEstimator(DummyRegressor(), groups=["m"])
        .fit(df[['m']], df['yt']))


mod2 = (GroupedEstimator(DecayEstimator(DummyRegressor(), decay=0.9), groups=["m"])
        .fit(df[['index', 'm']], df['yt']))

plt.figure(figsize=(12, 3))
plt.plot(df['yt'], alpha=0.5);
plt.plot(mod1.predict(df[['m']]), label="grouped")
plt.plot(mod2.predict(df[['index', 'm']]), label="decayed")
plt.legend();
[figure _images/meta_18_0.png: grouped versus decayed predictions plotted over the timeseries]

The decay parameter has a lot of influence on the behaviour of the model, but one can clearly see that the decayed variant shifts focus towards the more recent data.

Confusion Balancer

Disclaimer: This is an experimental feature.

We added an experimental feature to the meta estimators that can be used to force balance in the confusion matrix of an estimator. The approach works by using the model’s own confusion matrix to correct the probabilities it predicts; the idea is derived below.

[11]:
n1, n2, n3 = 100, 500, 50
np.random.seed(42)
X = np.concatenate([np.random.normal(0, 1, (n1, 2)),
                    np.random.normal(2, 1, (n2, 2)),
                    np.random.normal(3, 1, (n3, 2))],
                   axis=0)
y = np.concatenate([np.zeros((n1, 1)),
                    np.ones((n2, 1)),
                    np.zeros((n3, 1))],
                   axis=0).reshape(-1)
plt.scatter(X[:, 0], X[:, 1], c=y);
[figure _images/meta_20_0.png: scatter plot of the imbalanced dataset]

Let’s take this dataset and train a simple classifier against it.

[12]:
from sklearn.metrics import confusion_matrix
[13]:
mod = LogisticRegression(solver='lbfgs', multi_class='multinomial', max_iter=10000)
cfm = confusion_matrix(y, mod.fit(X, y).predict(X))
cfm
[13]:
array([[ 72,  78],
       [  4, 496]])

The confusion matrix is not ideal. This is in part because the dataset is slightly imbalanced, but in general it is also because of the way the algorithm works. Let’s see if we can learn something else from this confusion matrix. We might transform the counts into probabilities.

[14]:
cfm.T / cfm.T.sum(axis=1).reshape(-1, 1)
[14]:
array([[0.94736842, 0.05263158],
       [0.1358885 , 0.8641115 ]])

Let’s consider the number 0.1359 in the lower left corner. This number represents the probability that the actual class is 0 while the model predicts class 1. In math we might write this as \(P(C_0 | M_1)\) where \(C_i\) denotes the actual label while \(M_i\) denotes the label given by the algorithm.

The idea now is that we might rebalance our original predictions \(p(M_j)\) by weighing them with these conditional probabilities:

\[P_{\text{corrected}}(C_1) = P(C_1|M_0) p(M_0) + P(C_1|M_1) p(M_1)\]

In general this can be written as:

\[P_{\text{corrected}}(C_i) = \sum_j P(C_i|M_j) p(M_j)\]

In layman’s terms: we might be able to use the confusion matrix to learn from our mistakes. How much we correct is something that we can tune with a hyperparameter \(\alpha\):

\[P_{\text{corrected}}(C_i) = \alpha \sum_j P(C_i|M_j) p(M_j) + (1-\alpha) p(M_i)\]
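As a sketch of that correction (not necessarily how ConfusionBalancer implements it internally), the corrected probabilities can be computed from the conditional probability matrix above:

alpha = 0.5
p_m = mod.predict_proba(X)                        # p(M_j) per row, shape (n, 2)
cond = cfm.T / cfm.T.sum(axis=1).reshape(-1, 1)   # cond[j, i] = P(C_i | M_j)
p_corrected = alpha * (p_m @ cond) + (1 - alpha) * p_m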

We’ll perform an optimistic demonstration below.

[15]:
def false_positives(mod, x, y):
    # predictions of class 1 where the actual label is class 0
    return (mod.predict(x) != y)[y == 0].sum()

def false_negatives(mod, x, y):
    # predictions of class 0 where the actual label is class 1
    return (mod.predict(x) != y)[y == 1].sum()
[16]:
from sklego.meta import ConfusionBalancer
[17]:
cf_mod = ConfusionBalancer(LogisticRegression(solver='lbfgs', max_iter=1000), alpha=1.0)

grid = GridSearchCV(cf_mod,
                    param_grid={'alpha': np.linspace(-1.0, 3.0, 31)},
                    scoring={
                        "accuracy": make_scorer(accuracy_score),
                        "positives": false_positives,
                        "negatives": false_negatives
                    },
                    n_jobs=-1,
                    iid=True,
                    return_train_score=True,
                    refit="negatives",
                    cv=5)
[18]:
df = pd.DataFrame(grid.fit(X, y).cv_results_)
plt.figure(figsize=(12, 3))
plt.subplot(121)
plt.plot(df['param_alpha'], df['mean_test_positives'], label="false positives")
plt.plot(df['param_alpha'], df['mean_test_negatives'], label="false negatives")
plt.legend()
plt.subplot(122)
plt.plot(df['param_alpha'], df['mean_test_accuracy'], label="test accuracy")
plt.plot(df['param_alpha'], df['mean_train_accuracy'], label="train accuracy")
plt.legend();
[figure _images/meta_30_0.png: false positives/negatives and train/test accuracy as a function of alpha]

It seems that we can pick a value for \(\alpha\) such that the confusion matrix is balanced. There’s also a modest increase in accuracy around that point.

It should be emphasized though that this feature is experimental. There have been dataset/model combinations where this effect seems to work very well, while there have also been situations where this trick does not work at all. It also deserves mentioning that there might be alternatives for your problem. If your dataset is suffering from a huge class imbalance then you might be better off having a look at the imbalanced-learn project.