None

# Meta Models¶

Certain models in scikit-lego are “meta”. Meta models are models that depend on other estimators that go in and these models will add features to the input model. One way of thinking of a meta model is to consider it to be a way to “decorate” a model.

This part of the documentation will highlight a few of them.

## Thresholder¶

The thresholder can help tweak recall and precision of a model by moving the threshold value of predict_proba. Commonly this threshold is set at 0.5 for two classes. This meta-model can decorate an estimator with two classes such that the threshold moves.

We demonstrate the working below. First we’ll generate a skewed dataset.

[31]:

import numpy as np
import pandas as pd
import matplotlib.pylab as plt

from sklearn.pipeline import Pipeline
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, accuracy_score, make_scorer

from sklego.meta import Thresholder

[2]:

X, y = make_blobs(1000, centers=[(0, 0), (1.5, 1.5)], cluster_std=[1, 0.5])
plt.scatter(X[:, 0], X[:, 1], c=y, s=5);


Next we’ll make a cross validation pipeline to try out this thresholder.

[3]:

pipe = Pipeline([
("model", Thresholder(LogisticRegression(solver='lbfgs'), threshold=0.1))
])

mod = GridSearchCV(estimator=pipe,
param_grid = {"model__threshold": np.linspace(0.1, 0.9, 50)},
scoring={"precision": make_scorer(precision_score),
"recall": make_scorer(recall_score),
"accuracy": make_scorer(accuracy_score)},
refit="precision",
cv=5)

mod.fit(X, y);


With this cross validation trained, we’ll make a chart to show the effect of changing the threshold value.

[4]:

(pd.DataFrame(mod.cv_results_)
.set_index("param_model__threshold")
[['mean_test_precision', 'mean_test_recall', 'mean_test_accuracy']]
.plot(figsize=(16, 4)));


Increasing the threshold will increase the precision but as expected this is at the cost of recall (and accuracy).

## Grouped Prediction¶

To help explain what it can do we’ll consider three methods to predict the chicken weight. The chicken data has 578 rows and 4 columns from an experiment on the effect of diet on early growth of chicks. The body weights of the chicks were measured at birth and every second day thereafter until day 20. They were also measured on day 21. There were four groups on chicks on different protein diets.

### Setup¶

Let’s first load a bunch of things to do this.

[5]:

import numpy as np
import pandas as pd
import matplotlib.pylab as plt
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error

from sklego.preprocessing import ColumnSelector

def plot_model(model):
model.fit(df[['diet', 'time']], df['weight'])
metric_df = df[['diet', 'time', 'weight']].assign(pred=lambda d: model.predict(d[['diet', 'time']]))
metric = mean_absolute_error(metric_df['weight'], metric_df['pred'])
plt.figure(figsize=(12, 4))
plt.scatter(df['time'], df['weight'])
for i in [1, 2, 3, 4]:
pltr = metric_df[['time', 'diet', 'pred']].drop_duplicates().loc[lambda d: d['diet'] == i]
plt.plot(pltr['time'], pltr['pred'], color='.rbgy'[i])
plt.title(f"linear model per group, MAE: {np.round(metric, 2)}");


This code will be used to explain the steps below.

### Model 1: Linear Regression with Dummies¶

[6]:

feature_pipeline = Pipeline([
("datagrab", FeatureUnion([
("discrete", Pipeline([
("grab", ColumnSelector("diet")),
("encode", OneHotEncoder(categories="auto", sparse=False))
])),
("continuous", Pipeline([
("grab", ColumnSelector("time")),
("standardize", StandardScaler())
]))
]))
])

pipe = Pipeline([
("transform", feature_pipeline),
("model", LinearRegression())
])

plot_model(pipe)


Because the model is linear the dummy variable causes the intercept to change but leaves the gradient untouched. This might not be what we want from a model. So let’s see how the grouped model can address this.

### Model 2: Linear Regression in GroupedEstimation¶

The goal of the grouped estimator is to allow us to split up our data. The image below demonstrates what will happen.

We train 5 models in total because the model will also train a fallback automatically (you can turn this off via use_fallback=False). The idea behind the fallback is that we can predict something if the group does not appear in the prediction.

Each model will accept features that are in X that are not part of the grouping variables. In this case each group will model based on the time since weight is what we’re trying to predict.

Applying this model to the dataframe is easy.

[7]:

from sklego.meta import GroupedEstimator
mod = GroupedEstimator(LinearRegression(), groups=["diet"])
plot_model(mod)


And the model looks a bit better.

### Model 3: Dummy Regression in GroupedEstimation¶

We could go a step further and train a DummyRegressor per diet per timestep. The code below works similar as the previous example but one difference is that the grouped model does not receive a dataframe but a numpy array.

Note that we’re also grouping over more than one column here. The code that does this is listed below.

[8]:

from sklearn.dummy import DummyRegressor

feature_pipeline = Pipeline([
("datagrab", FeatureUnion([
("discrete", Pipeline([
("grab", ColumnSelector("diet")),
])),
("continuous", Pipeline([
("grab", ColumnSelector("time")),
]))
]))
])

pipe = Pipeline([
("transform", feature_pipeline),
("model", GroupedEstimator(DummyRegressor(strategy="mean"), groups=[0, 1]))
])

plot_model(pipe)


Note that these predictions seems to yield the lowest error but take it with a grain of salt since these errors are only based on the train set.

## Grouped Transformation¶

We can apply grouped prediction on estimators that have a .predict() implemented but we’re also able to do something similar for transfromers, like StandardScaler.

[49]:

from sklego.datasets import load_penguins

.dropna()
.drop(columns=['island', 'bill_depth_mm', 'bill_length_mm']))

[49]:

species flipper_length_mm body_mass_g sex

Let’s say that we’re interested in scaling the numeric data in this dataframe. If we apply a normal StandardScaler then we’ll likely get clusters appear for all the species and for the sex. It may be the case (for fairness reasons) that we don’t mind the clusters based on species but that we do mind the clusters based on sex. In these scenarios the GroupedTransformer can help out. We can specify a grouping column that the data needs to be split on before the transformation is applied.

[50]:

from sklearn.preprocessing import StandardScaler
from sklego.meta import GroupedTransformer

plt.figure(figsize=(12, 6))
# we'll drop species because we're not grouping on it
df_peng = df_peng.drop(columns=["species"])

plt.subplot(121)
X     = df_peng.drop(columns=['sex']).values
X_tfm = StandardScaler().fit_transform(X)
plt.scatter(X_tfm[:, 0], X_tfm[:, 1], c=df_peng['sex'] == 'male')
plt.xlabel("norm flipper len")
plt.ylabel("norm body mass")
plt.title("scaled data, not normalised by gender")

plt.subplot(122)
X_tfm = GroupedTransformer(StandardScaler(), groups=['sex']).fit_transform(df_peng)
plt.scatter(X_tfm[:, 0], X_tfm[:, 1], c=df_peng['sex'] == 'male')
plt.xlabel("norm flipper len")
plt.ylabel("norm body mass")
plt.title("scaled data *and* normalised by gender");


You can see that there are certainly still clusters. These are caused by the fact that there’s different species of penguin in our dataset. However you can see that when we apply our GroupedTransformer that we’re suddenly able to normalise towards sex as well.

This transformer also has use-cases beyond fairness. You could use this transformer to causally compensate for subgroups in your data. For example, for predicting house prices, using the surface of a house relatively to houses in the same neighborhood could be a more relevant feature than the surface relative to all houses.

## Decayed Estimation¶

Often you are interested in predicting the future. You use the data from the past in an attempt to achieve this and it could be said that perhaps data from the far history is less relevant than data from the recent past.

This is the idea behind the DecayEstimator meta-model. It looks at the order of data going in and it will assign a higher importance to recent rows that occurred recently and a lower importance to older rows. Recency is based on the order so it is imporant that the dataset that you pass in is correctly ordered beforehand.

We’ll demonstrate how it works by applying it on a simulated timeseries problem.

[9]:

from sklearn.dummy import DummyRegressor
from sklego.meta import GroupedEstimator, DecayEstimator
from sklego.datasets import make_simpleseries

yt = make_simpleseries(seed=1)
df = (pd.DataFrame({"yt": yt,
"date": pd.date_range("2000-01-01", periods=len(yt))})
.assign(m=lambda d: d.date.dt.month)
.reset_index())

plt.figure(figsize=(12, 3))
plt.plot(make_simpleseries(seed=1));


We will create two models on this dataset. One model calculates the average value per month in our timeseries and the other does the same thing but will decay the importance of making accurate predictions for the far history.

[10]:

mod1 = (GroupedEstimator(DummyRegressor(), groups=["m"])
.fit(df[['m']], df['yt']))

mod2 = (GroupedEstimator(DecayEstimator(DummyRegressor(), decay=0.9), groups=["m"])
.fit(df[['index', 'm']], df['yt']))

plt.figure(figsize=(12, 3))
plt.plot(df['yt'], alpha=0.5);
plt.plot(mod1.predict(df[['m']]), label="grouped")
plt.plot(mod2.predict(df[['index', 'm']]), label="decayed")
plt.legend();


The decay parameter has a lot of influence on the effect of the model but one can clearly see that we shift focus to the more recent data.

## Confusion Balancer¶

Disclaimer: This is an experimental feature.

We added an experimental feature to the meta estimators that can be used to force balance in the confusion matrix of an estimator. The approach works

[11]:

n1, n2, n3 = 100, 500, 50
np.random.seed(42)
X = np.concatenate([np.random.normal(0, 1, (n1, 2)),
np.random.normal(2, 1, (n2, 2)),
np.random.normal(3, 1, (n3, 2))],
axis=0)
y = np.concatenate([np.zeros((n1, 1)),
np.ones((n2, 1)),
np.zeros((n3, 1))],
axis=0).reshape(-1)
plt.scatter(X[:, 0], X[:, 1], c=y);


Let’s take this dataset and train a simple classifier against it.

[12]:

from sklearn.metrics import confusion_matrix

[13]:

mod = LogisticRegression(solver='lbfgs', multi_class='multinomial', max_iter=10000)
cfm = confusion_matrix(y, mod.fit(X, y).predict(X))
cfm

[13]:

array([[ 72,  78],
[  4, 496]])


The confusion matrix is not ideal. This is in part because the dataset is slightly imbalanced but in general it is also because of the way the algorithm works. Let’s see if we can learn something else from this confusion matrix. I might transform the counts into probabilities.

[14]:

cfm.T / cfm.T.sum(axis=1).reshape(-1, 1)

[14]:

array([[0.94736842, 0.05263158],
[0.1358885 , 0.8641115 ]])


Let’s consider the number 0.2346 in the lower left corner. This number represents the probability that the actually class 0 while the model predicts class 1. In math we might write this as $$P(C_1 | M_1)$$ where $$C_i$$ denotes the actual label while $$M_i$$ denotes the label given by the algorithm.

The idea now is that we might rebalance our original predictions $$P(M_i)$$ by multiplying them;

$P_{\text{corrected}}(C_1) = P(C_1|M_0) p(M_0) + P(C_1|M_1) p(M_1)$

In general this can be written as;

$P_{\text{corrected}}(C_i) = \sum_j P(C_i|M_j) p(M_j)$

In laymens terms; we might be able to use the confusion matrix to learn from our mistakes. By how much we correct is something that we can tune with a hyperparameter.

$P_{\text{corrected}}(C_i) = \alpha \sum_j P(C_i|M_j) p(M_j) + (1-\alpha) p(M_j)$

We’ll perform an optimistic demonstration below.

[15]:

def false_positives(mod, x, y):
return (mod.predict(x) != y)[y == 1].sum()

def false_negatives(mod, x, y):
return (mod.predict(x) != y)[y == 0].sum()

[16]:

from sklego.meta import ConfusionBalancer

[17]:

cf_mod = ConfusionBalancer(LogisticRegression(solver='lbfgs', max_iter=1000), alpha=1.0)

grid = GridSearchCV(cf_mod,
param_grid={'alpha': np.linspace(-1.0, 3.0, 31)},
scoring={
"accuracy": make_scorer(accuracy_score),
"positives": false_positives,
"negatives": false_negatives
},
n_jobs=-1,
iid=True,
return_train_score=True,
refit="negatives",
cv=5)

[18]:

df = pd.DataFrame(grid.fit(X, y).cv_results_)
plt.figure(figsize=(12, 3))
plt.subplot(121)
plt.plot(df['param_alpha'], df['mean_test_positives'], label="false positives")
plt.plot(df['param_alpha'], df['mean_test_negatives'], label="false negatives")
plt.legend()
plt.subplot(122)
plt.plot(df['param_alpha'], df['mean_test_accuracy'], label="test accuracy")
plt.plot(df['param_alpha'], df['mean_train_accuracy'], label="train accuracy")
plt.legend();


It seems that we can pick a value for $$\alpha$$ such that the confusion matrix is balanced. There’s also a modest increase in accuracy for this balancing moment.

It should be emphesized though that this feature is experimental. There have been dataset/model combinations where this effect seems to work very well while there have also been situations where this trick does not work at all. It also deserves mentioning that there might be alternative to your problem. If your dataset is suffering from a huge class imbalance then you might be better off by having a look at the imbalanced-learn project.