Model Selection

class sklego.model_selection.KlusterFoldValidation(cluster_method=None)[source]

Bases: object

KlusterFold cross validator

  • Creates folds based on the provided cluster method
Parameters: cluster_method – Clustering method with a fit_predict attribute
split(X, y=None, groups=None)[source]

Generator to iterate over the indices of each fold.

Parameters:
  • X – Array to split on
  • y – Always ignored, exists for compatibility
  • groups – Always ignored, exists for compatibility
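
As a rough illustration of cluster-based folds, the plain-Python sketch below holds out one cluster per fold and trains on the rest. It is not the sklego implementation: `ThresholdClusterer` and `cluster_folds` are hypothetical names, the one-cluster-per-fold behaviour is an assumption, and the toy clusterer only mimics the `fit_predict` interface that `cluster_method` is expected to expose.

```python
class ThresholdClusterer:
    """Hypothetical stand-in for a clustering estimator with fit_predict."""

    def __init__(self, threshold):
        self.threshold = threshold

    def fit_predict(self, X):
        # Label each row by whether its first feature exceeds the threshold.
        return [int(row[0] > self.threshold) for row in X]


def cluster_folds(X, cluster_method):
    """Yield (train_indices, test_indices), holding out one cluster per fold."""
    labels = cluster_method.fit_predict(X)
    for label in sorted(set(labels)):
        train = [i for i, lab in enumerate(labels) if lab != label]
        test = [i for i, lab in enumerate(labels) if lab == label]
        yield train, test


X = [[0.1], [0.2], [0.9], [1.1], [0.3]]
folds = list(cluster_folds(X, ThresholdClusterer(threshold=0.5)))
for train, test in folds:
    print("train:", train, "test:", test)
```

In practice the clusterer would be a real estimator; anything exposing `fit_predict` fits the same pattern.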
class sklego.model_selection.TimeGapSplit(date_serie, valid_duration, train_duration=None, gap_duration=datetime.timedelta(0), n_splits=None, window='rolling')[source]

Bases: object

Provides train/test indices to split time series data samples. This cross-validation object is a variation of TimeSeriesSplit with the following differences:

  • The splits are made based on datetime duration, instead of number of rows.
  • The user specifies valid_duration and either train_duration or n_splits.
  • The user can specify a ‘gap’ duration that is added after the training split and before the validation split.

The three duration parameters can be used to replicate how the model is going to be used in production in batch learning. The validation folds do not overlap. The entire ‘window’ moves by one valid_duration until there is not enough data. If this would lead to more splits than specified with n_splits, the ‘window’ instead moves by valid_duration times the ratio of possible splits to requested splits:

  – n_possible_splits = (total_length - train_duration - gap_duration) // valid_duration
  – time_shift = valid_duration * n_possible_splits / n_splits

so that the CV spans the whole dataset. If train_duration is not passed but n_splits is, the training duration is increased to

  – train_duration = total_length - (self.gap_duration + self.valid_duration * self.n_splits)

such that shifting the entire window by one validation duration spans the whole training set.
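
The split arithmetic above can be checked by hand with `datetime.timedelta` values. The numbers below are made up for illustration, and `implied_train_duration` is a hypothetical variable name for the training duration derived from n_splits; this is a worked example of the formulas, not code taken from the library.

```python
from datetime import timedelta

# Example values chosen for illustration only.
total_length = timedelta(days=100)
train_duration = timedelta(days=30)
gap_duration = timedelta(days=5)
valid_duration = timedelta(days=10)
n_splits = 3

# Number of validation windows that fit when shifting by one valid_duration.
n_possible_splits = (total_length - train_duration - gap_duration) // valid_duration

# Shift per fold when fewer splits are requested than are possible.
time_shift = valid_duration * n_possible_splits / n_splits

# Training duration implied by n_splits when train_duration is not given.
implied_train_duration = total_length - (gap_duration + valid_duration * n_splits)

print(n_possible_splits, time_shift, implied_train_duration)
```

With these values six splits are possible, so requesting three makes the window move by two validation durations per fold.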
Parameters:
  • date_serie (pandas.Series) – Series with the dates, which should have all the indices of X used in split()
  • train_duration (datetime.timedelta) – historical training data.
  • valid_duration (datetime.timedelta) – retraining period.
  • gap_duration (datetime.timedelta) – forward-looking window of the target. The period of the forward-looking window necessary to create your target variable. This period is dropped at the end of your training folds due to lack of recent data. In production you would not have been able to create the target for that period, and you would have to drop it from the training data.
  • n_splits (int) – number of splits
  • window (string) – either ‘rolling’ or ‘expanding’. A ‘rolling’ window has a fixed size and is shifted entirely; an ‘expanding’ window keeps its left side fixed while its right border increases with each fold.
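
To make the difference between the two window modes concrete, the sketch below computes the date boundaries of each fold. It assumes half-open intervals and a shift of exactly one valid_duration per fold; `window_bounds` is a hypothetical helper written for this illustration and is not part of sklego.

```python
from datetime import datetime, timedelta


def window_bounds(start, end, train_duration, valid_duration,
                  gap_duration=timedelta(0), window="rolling"):
    """Yield (train_start, train_end, valid_start, valid_end) per fold.

    Hypothetical helper: intervals are half-open, [start, end).
    """
    offset = timedelta(0)
    while start + offset + train_duration + gap_duration + valid_duration <= end:
        # A rolling window moves its left edge; an expanding one keeps it fixed.
        train_start = start + offset if window == "rolling" else start
        train_end = start + offset + train_duration
        # The gap sits between the training and validation periods.
        valid_start = train_end + gap_duration
        yield train_start, train_end, valid_start, valid_start + valid_duration
        offset += valid_duration


day = timedelta(days=1)
start = datetime(2024, 1, 1)
rolling = list(window_bounds(start, start + 10 * day, 4 * day, 2 * day, 1 * day))
expanding = list(window_bounds(start, start + 10 * day, 4 * day, 2 * day, 1 * day,
                               window="expanding"))
for fold in rolling:
    print([d.date().isoformat() for d in fold])
```

Both modes produce the same validation periods here; only the start of the training period differs from the second fold onward.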
get_n_splits(X=None, y=None, groups=None)[source]
split(X, y=None, groups=None)[source]

Generate indices to split data into training and test set.

Parameters:
  • X (pandas.DataFrame)
  • y – Always ignored, exists for compatibility
  • groups – Always ignored, exists for compatibility


Describe all folds.

Parameters:
  • X (pandas.DataFrame)
Returns: pd.DataFrame summary of all folds