Machine learning utilities

This module contains utilities for common machine learning tasks.

In particular, this module focuses on tasks “surrounding” machine learning, such as cross-fold splitting, performance evaluation, etc. It does not include helpers for use directly in sklearn.pipeline.Pipeline.

Creating and managing cross-validation

get_cv_folds(y, num_splits, use_stratified, …) Assign a split to each row based on the values of y
get_train_val_test_splits(df, …) Get the appropriate training, validation, and testing split masks
get_fold_data(df, target_field, m_train, …) Prepare a data frame for sklearn according to the given splits

Evaluating results

collect_binary_classification_metrics(…[, …]) Collect various binary classification performance metrics for the predictions
collect_multiclass_classification_metrics(…) Calculate various multi-class classification performance metrics
collect_regression_metrics(y_true, y_pred, …) Collect various regression performance metrics for the predictions
calc_hand_and_till_m_score(y_true, y_score) Calculate the (multi-class AUC) \(M\) score from Equation (7) of Hand and Till (2001).
calc_provost_and_domingos_auc(y_true, y_score) Calculate the multi-class AUC described by Provost and Domingos (2000).

Data structures

fold_data A named tuple for holding train, validation, and test datasets suitable for use in sklearn.
split_masks A named tuple for holding boolean masks for the train, validation, and test splits of a complete dataset.

Definitions

pyllars.ml_utils._calc_hand_and_till_a_value(y_true: numpy.ndarray, y_score: numpy.ndarray, i: int, j: int) → float[source]

Calculate the \(\hat{A}\) value in Equation (3) of [1]. Specifically,

\[\hat{A}(i|j) = \frac{S_i - n_i(n_i + 1)/2}{n_i n_j},\]

where \(n_i\), \(n_j\) are the count of instances of the respective classes and \(S_i\) is the (base-1) sum of the ranks of class \(i\).

Parameters:
  • y_true (numpy.ndarray) –

    The true label of each instance. The labels are assumed to be encoded with integers [0, 1, … n_classes-1]. The respective columns in y_score should give the probabilities of the matching label.

    This should have shape (n_samples,).

  • y_score (numpy.ndarray) –

    The score predictions for each class, e.g., from predict_proba, though they are not required to be probabilities.

    This should have shape (n_samples, n_classes).

  • {i,j} (int) – The class indices
Returns:

a_hat – The \(\hat{A}\) value from Equation (3) referenced above. Specifically, this is the probability that a randomly drawn member of class \(j\) will have a lower estimated score for belonging to class \(i\) than a randomly drawn member of class \(i\).

Return type:

float
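
One way to see Equation (3) in code: rank the class-\(i\) scores of the instances from classes \(i\) and \(j\), sum the ranks of the class-\(i\) instances, and normalize. The following is a minimal sketch of that computation (not the library's implementation), assuming numpy and scipy are available:

    import numpy as np
    from scipy.stats import rankdata

    def a_hat_sketch(y_true, y_score, i, j):
        # Restrict to the instances belonging to classes i and j.
        mask = np.isin(y_true, [i, j])
        labels = y_true[mask]
        scores = y_score[mask, i]          # scores for belonging to class i

        ranks = rankdata(scores)           # base-1 ranks; ties receive their average rank
        n_i = np.sum(labels == i)
        n_j = np.sum(labels == j)
        s_i = np.sum(ranks[labels == i])   # S_i: sum of the ranks of class i

        return (s_i - n_i * (n_i + 1) / 2) / (n_i * n_j)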

References

[1]Hand, D. & Till, R. A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Machine Learning, 2001, 45, 171-186. Springer link.
pyllars.ml_utils._train_and_evaluate(estimator, X_train, y_train, X_test, y_test, target_transform, target_inverse_transform, collect_metrics, collect_metrics_kwargs, use_predict_proba)[source]

Train and evaluate estimator on the given datasets

This function is a helper for evaluate_hyperparameters. It is not intended for external use.

pyllars.ml_utils.calc_hand_and_till_m_score(y_true: numpy.ndarray, y_score: numpy.ndarray) → float[source]

Calculate the (multi-class AUC) \(M\) score from Equation (7) of Hand and Till (2001).

This is typically taken as a good multi-class extension of the AUC score. Please see [2] for more details about this score in particular and [3] for multi-class AUC in general.

N.B. In case y_score contains any np.nan values, those will be removed before calculating the \(M\) score.

N.B. This function can handle unobserved labels, except for the label with the highest index. In particular, an error is raised if y_score.shape[1] != np.max(np.unique(y_true)) + 1.

Parameters:
  • y_true (numpy.ndarray) –

    The true label of each instance. The labels are assumed to be encoded with integers [0, 1, … n_classes-1]. The respective columns in y_score should give the scores of the matching label.

    This should have shape (n_samples,).

  • y_score (numpy.ndarray) –

    The score predictions for each class, e.g., from predict_proba, though they are not required to be probabilities.

    This should have shape (n_samples, n_classes).

Returns:

m – The “multi-class AUC” score referenced above

Return type:

float
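
Equation (7) averages the pairwise \(\hat{A}\) values over all unordered class pairs: \(M = \frac{2}{c(c-1)} \sum_{i<j} \hat{A}(i,j)\), where \(\hat{A}(i,j) = \frac{\hat{A}(i|j) + \hat{A}(j|i)}{2}\). A minimal sketch of that computation (not the library's implementation; here a_hat stands for a function such as _calc_hand_and_till_a_value above):

    import numpy as np
    from itertools import combinations

    def m_score_sketch(y_true, y_score, a_hat):
        classes = np.unique(y_true)
        c = len(classes)
        total = 0.0
        for i, j in combinations(classes, 2):
            # A(i, j) averages A-hat(i|j) and A-hat(j|i).
            total += (a_hat(y_true, y_score, i, j) + a_hat(y_true, y_score, j, i)) / 2
        return 2.0 * total / (c * (c - 1))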

See also

_calc_hand_and_till_a_value()
for calculating the \(\hat{A}\) value

References

[2]Hand, D. & Till, R. A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Machine Learning, 2001, 45, 171-186. Springer link.
[3]Fawcett, T. An introduction to ROC analysis. Pattern Recognition Letters, 2006, 27, 861 - 874. Elsevier link.
pyllars.ml_utils.calc_provost_and_domingos_auc(y_true: numpy.ndarray, y_score: numpy.ndarray) → float[source]

Calculate the multi-class AUC described by Provost and Domingos (2000).

This is typically taken as a good multi-class extension of the AUC score. Please see [4] for more details about this score in particular and [5] for multi-class AUC in general.

N.B. This function can handle unobserved labels, except for the label with the highest index. In particular, an error is raised if y_score.shape[1] != np.max(np.unique(y_true)) + 1.

Parameters:
  • y_true (numpy.ndarray) –

    The true label of each instance. The labels are assumed to be encoded with integers [0, 1, … n_classes-1]. The respective columns in y_score should give the scores of the matching label.

    This should have shape (n_samples,).

  • y_score (numpy.ndarray) –

    The score predictions for each class, e.g., from predict_proba, though they are not required to be probabilities.

    This should have shape (n_samples, n_classes).

Returns:

m – The “multi-class AUC” score referenced above

Return type:

float
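
The Provost and Domingos multi-class AUC is commonly formulated as the prevalence-weighted average of one-vs-rest AUCs, \(AUC_{total} = \sum_{c_k} AUC(c_k)\,p(c_k)\). A minimal sketch of that formulation (not necessarily identical to this implementation), using sklearn.metrics.roc_auc_score:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def provost_domingos_sketch(y_true, y_score):
        total = 0.0
        for k in np.unique(y_true):
            prevalence = np.mean(y_true == k)                  # p(c_k), the class prevalence
            auc_k = roc_auc_score(y_true == k, y_score[:, k])  # one-vs-rest AUC for class k
            total += prevalence * auc_k
        return total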

References

[4]Provost, F. & Domingos, P. Well-Trained PETs: Improving Probability Estimation Trees. Stern School of Business, NYU, 2000. CiteSeer link.
[5]Fawcett, T. An introduction to ROC analysis. Pattern Recognition Letters, 2006, 27, 861 - 874. Elsevier link.
pyllars.ml_utils.collect_binary_classification_metrics(y_true: numpy.ndarray, y_probas_pred: numpy.ndarray, threshold: float = 0.5, pos_label=1, k: int = 10, include_roc_curve: bool = True, include_pr_curve: bool = True, prefix: str = '') → Dict[source]

Collect various binary classification performance metrics for the predictions

Parameters:
  • y_true (numpy.ndarray) –

    The true class of each instance.

    This should have shape (n_samples,).

  • y_probas_pred (numpy.ndarray) –

    The score of each prediction for each instance.

    This should have shape (n_samples, n_classes).

  • threshold (float) – The score threshold to choose “positive” predictions
  • pos_label (str or int) – The “positive” class for some metrics
  • k (int) – The value of k to use for precision_at_k
  • include_roc_curve (bool) – Whether to include the fpr and tpr points necessary to draw a ROC curve
  • include_pr_curve (bool) – Whether to include details on the precision-recall curve
  • prefix (str) – An optional prefix for the keys in the metrics dictionary
Returns:

metrics – A mapping from each metric name to its value.

Return type:

dict
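
A usage sketch with made-up values (in practice, y_probas_pred would come from something like clf.predict_proba(X), and the exact set of keys in the returned dictionary depends on the library version):

    import numpy as np
    from pyllars import ml_utils

    y_true = np.array([0, 1, 1, 0, 1, 0])
    y_probas_pred = np.array([
        [0.8, 0.2], [0.3, 0.7], [0.4, 0.6],
        [0.9, 0.1], [0.2, 0.8], [0.6, 0.4],
    ])

    metrics = ml_utils.collect_binary_classification_metrics(
        y_true, y_probas_pred, threshold=0.5, pos_label=1, k=3, prefix="val_"
    )
    # all keys in the returned dictionary begin with "val_"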

pyllars.ml_utils.collect_multiclass_classification_metrics(y_true: numpy.ndarray, y_score: numpy.ndarray, prefix: str = '') → Dict[source]

Calculate various multi-class classification performance metrics

Parameters:
  • y_true (numpy.ndarray) –

    The true label of each instance. The labels are assumed to be encoded with integers [0, 1, … n_classes-1]. The respective columns in y_score should give the scores of the matching label.

    This should have shape (n_samples,).

  • y_score (numpy.ndarray) –

    The score predictions for each class, e.g., from predict_proba, though they are not required to be probabilities.

    This should have shape (n_samples, n_classes).

  • prefix (str) – An optional prefix for the keys in the metrics dictionary
Returns:

metrics – A mapping from each metric name to its value.

Return type:

typing.Dict
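
A usage sketch with made-up values (again, the exact keys in the returned dictionary depend on the library version):

    import numpy as np
    from pyllars import ml_utils

    y_true = np.array([0, 2, 1, 2, 0, 1])
    y_score = np.array([
        [0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3],
        [0.2, 0.2, 0.6], [0.6, 0.3, 0.1], [0.3, 0.4, 0.3],
    ])

    metrics = ml_utils.collect_multiclass_classification_metrics(y_true, y_score, prefix="test_")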

pyllars.ml_utils.collect_regression_metrics(y_true: numpy.ndarray, y_pred: numpy.ndarray, prefix: str = '') → Dict[source]

Collect various regression performance metrics for the predictions

Parameters:
  • y_true (numpy.ndarray) – The true value of each instance
  • y_pred (numpy.ndarray) – The prediction for each instance
  • prefix (str) – An optional prefix for the keys in the metrics dictionary
Returns:

metrics – A mapping from each metric name to its value.

Return type:

typing.Dict

class pyllars.ml_utils.estimators_predictions_metrics[source]

A named tuple for holding fit estimators, predictions on the respective datasets, and results.

estimator_{val,test}

Estimators fit on the respective datasets.

Type:sklearn.base.BaseEstimator
predictions_{val,test}

Predictions of the respective models.

Type:numpy.ndarray
metrics_{val,test}

Metrics for the respective datasets.

Type:typing.Dict
fold_{train,val,test}

The identifiers of the respective folds.

Type:typing.Any
hyperparameters{_str}

The hyperparameters (and their string representation) used to train the models.

Type:typing.Optional[typing.Dict]
_asdict()

Return a new OrderedDict which maps field names to their values.

classmethod _make(iterable)

Make a new estimators_predictions_metrics object from a sequence or iterable

_replace(**kwds)

Return a new estimators_predictions_metrics object replacing specified fields with new values

estimator_test

Alias for field number 1

estimator_val

Alias for field number 0

fold_test

Alias for field number 10

fold_train

Alias for field number 8

fold_val

Alias for field number 9

hyperparameters

Alias for field number 11

hyperparameters_str

Alias for field number 12

metrics_test

Alias for field number 7

metrics_val

Alias for field number 6

predictions_test

Alias for field number 3

predictions_val

Alias for field number 2

true_test

Alias for field number 5

true_val

Alias for field number 4

pyllars.ml_utils.evaluate_hyperparameters(estimator_template: sklearn.base.BaseEstimator, hyperparameters: Dict, validation_folds: Any, test_folds: Any, data: pandas.core.frame.DataFrame, collect_metrics: Callable, use_predict_proba: bool = False, train_folds: Optional[Any] = None, split_field: str = 'fold', target_field: str = 'target', target_transform: Optional[Callable] = None, target_inverse_transform: Optional[Callable] = None, collect_metrics_kwargs: Optional[Dict] = None, attribute_fields: Optional[Iterable[str]] = None, fields_to_ignore: Optional[Container[str]] = None, attributes_are_np_arrays: bool = False) → pyllars.ml_utils.estimators_predictions_metrics[source]

Evaluate the given hyperparameters on the given folds

N.B. This function is not particularly memory-efficient; it creates copies of the data.

This function performs the following steps:

  1. Create estimator_val and estimator_test based on estimator_template and hyperparameters
  2. Split data into training, validation, and test sets based on validation_folds and test_folds
  3. Transform target_field using the target_transform function
  4. Train estimator_val on the training set
  5. Evaluate the trained estimator_val on the validation set using collect_metrics
  6. Train estimator_test on the combined training and validation sets
  7. Evaluate the trained estimator_test on the test set using collect_metrics
Parameters:
  • estimator_template (sklearn.base.BaseEstimator) – The template for creating the estimator.
  • hyperparameters (typing.Dict) – The hyperparameters for the model. These should be compatible with estimator_template.set_params.
  • validation_folds (typing.Any) – The fold(s) to use for validation. The validation fold will be selected based on isin. If validation_folds is not a container, it will be wrapped in one.
  • test_folds (typing.Any) – The fold(s) to use for testing. The test fold will be selected based on isin. If test_folds is not a container, it will be wrapped in one.
  • data (pandas.DataFrame) – The data.
  • collect_metrics (typing.Callable) – The function for evaluating model performance. It should take at least two arguments, y_true and y_pred, in that order. Whatever this function returns is what evaluate_hyperparameters eventually returns as the metrics.
  • use_predict_proba (bool) – Whether to use predict (when False, the default) or predict_proba on the trained model.
  • train_folds (typing.Optional[typing.Any]) – The fold(s) to use for training. If not given, the training fold will be taken as all rows in data which are not part of the validation or testing set.
  • split_field (str) – The name of the column with the fold identifiers
  • target_field (str) – The name of the column with the target value
  • target_transform (typing.Optional[typing.Callable]) – A function for transforming the target before training models. Example: numpy.log1p()
  • target_inverse_transform (typing.Optional[typing.Callable]) – A function for transforming model predictions back to the original domain. This should be a mathematical inverse of target_transform. Example: numpy.expm1() is the inverse of numpy.log1p().
  • collect_metrics_kwargs (typing.Optional[typing.Dict]) – Additional keyword arguments for collect_metrics.
  • attribute_fields (typing.Optional[typing.Iterable[str]]) – The names of the columns to use for attributes (that is, X). If None (default), then all columns except the target_field will be used as attributes.
  • fields_to_ignore (typing.Optional[typing.Container[str]]) – The names of the columns to ignore.
  • attributes_are_np_arrays (bool) – Whether to stack the values from the individual rows. This should be set to True when some of the columns in attribute_fields contain numpy arrays.
Returns:

estimators_predictions_metrics – The fit estimators, predictions on the respective datasets, and results from collect_metrics.

Return type:

typing.NamedTuple
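
A usage sketch on synthetic data (the data frame, column names, and hyperparameter values here are made up for illustration):

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import Ridge
    from pyllars import ml_utils

    rng = np.random.default_rng(8675309)
    df = pd.DataFrame({
        "x1": rng.random(100),
        "x2": rng.random(100),
        "target": rng.random(100),
        "fold": rng.integers(0, 10, size=100),
    })

    results = ml_utils.evaluate_hyperparameters(
        estimator_template=Ridge(),
        hyperparameters={"alpha": 1.0},
        validation_folds=8,
        test_folds=9,
        data=df,
        collect_metrics=ml_utils.collect_regression_metrics,
        split_field="fold",
        target_field="target",
        fields_to_ignore=["fold"],   # do not use the fold id as a feature
    )

    results.metrics_val    # metrics on the validation fold
    results.metrics_test   # metrics on the test fold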

pyllars.ml_utils.get_cv_folds(y: numpy.ndarray, num_splits: int = 10, use_stratified: bool = True, shuffle: bool = True, random_state: int = 8675309) → numpy.ndarray[source]

Assign a split to each row based on the values of y

Parameters:
  • y (numpy.ndarray) – The target variable for each row in a data frame. This is used to determine the stratification.
  • num_splits (int) – The number of stratified splits to use
  • use_stratified (bool) – Whether to use stratified cross-validation. For example, this may be set to False if choosing folds for regression.
  • shuffle (bool) – Whether to shuffle during the split
  • random_state (int) – The state for the random number generator
Returns:

splits – The split of each row

Return type:

numpy.ndarray
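
For example (a minimal sketch; the labels are made up):

    import numpy as np
    from pyllars import ml_utils

    y = np.array([0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0])
    folds = ml_utils.get_cv_folds(y, num_splits=3)   # an array with one fold id in {0, 1, 2} per row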

pyllars.ml_utils.get_fold_data(df: pandas.core.frame.DataFrame, target_field: str, m_train: numpy.ndarray, m_test: numpy.ndarray, m_validation: Optional[numpy.ndarray] = None, attribute_fields: Optional[Iterable[str]] = None, fields_to_ignore: Optional[Iterable[str]] = None, attributes_are_np_arrays: bool = False) → pyllars.ml_utils.fold_data[source]

Prepare a data frame for sklearn according to the given splits

N.B. This function creates copies of the data, so it is not appropriate for very large datasets.

Parameters:
  • df (pandas.DataFrame) – A data frame
  • target_field (str) – The name of the column containing the target variable
  • m_{train,test,validation} (np.ndarray) – Boolean masks indicating the training, testing, and validation set rows. If m_validation is None (default), then no validation set will be included.
  • attribute_fields (typing.Optional[typing.Iterable[str]]) – The names of the columns to use for attributes (that is, X). If None (default), then all columns except the target_field will be used as attributes.
  • fields_to_ignore (typing.Optional[typing.Container[str]]) – The names of the columns to ignore.
  • attributes_are_np_arrays (bool) – Whether to stack the values from the individual rows. This should be set to True when some of the columns in attribute_fields contain numpy arrays.
Returns:

fold_data – A named tuple with the given splits

Return type:

pyllars.ml_utils.fold_data
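
A minimal sketch, assuming df holds plain numeric columns and the masks come from get_train_val_test_splits or are built by hand:

    import numpy as np
    import pandas as pd
    from pyllars import ml_utils

    df = pd.DataFrame({
        "x1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
        "x2": [1.0, 0.9, 0.8, 0.7, 0.6, 0.5],
        "target": [0, 1, 0, 1, 0, 1],
    })
    m_train = np.array([True, True, True, True, False, False])
    m_test = np.array([False, False, False, False, True, True])

    fold = ml_utils.get_fold_data(df, target_field="target", m_train=m_train, m_test=m_test)
    # fold.X_train has shape (4, 2); fold.y_train has shape (4,)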

pyllars.ml_utils.get_train_val_test_splits(df: pandas.core.frame.DataFrame, training_splits: Optional[Set] = None, validation_splits: Optional[Set] = None, test_splits: Optional[Set] = None, split_field: str = 'split') → pyllars.ml_utils.split_masks[source]

Get the appropriate training, validation, and testing split masks

The split_field column in df is used to assign each row to a particular split. Then, the splits specified in the parameters are assigned as indicated.

By default, all splits not in validation_splits and test_splits are assumed to belong to the training set. Thus, unless a particular training set is given, the returned masks will cover the entire dataset.

This function does not check whether the different splits overlap, so care should be taken, especially when specifying the training splits explicitly.

It is not necessary that the split_field values are numeric. They must be compatible with isin, however.

Parameters:
  • df (pandas.DataFrame) – A data frame. It must contain a column named split_field, but it is not otherwise validated.
  • training_splits (typing.Optional[typing.Set]) –

    The splits to use for the training set. By default, anything not in the validation_splits or test_splits will be placed in the training set.

    If given, this should be a container compatible with isin; a value that is not a container will be wrapped in a set.

  • {validation,test}_splits (typing.Optional[typing.Set]) –

    The splits to use for the validation and test sets, respectively.

    If given, this should be a container compatible with isin; a value that is not a container will be wrapped in a set.

  • split_field (str) – The name of the column indicating the split for each row.
Returns:

split_masks – Masks for the respective sets. True positions indicate the rows which belong to the respective sets. All three masks are always returned, but a mask may be all False if the corresponding split does not contain any rows.

Return type:

pyllars.ml_utils.split_masks
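
A minimal sketch (the column name and split identifiers are made up):

    import pandas as pd
    from pyllars import ml_utils

    df = pd.DataFrame({"x": range(6), "split": [0, 0, 1, 1, 2, 2]})

    masks = ml_utils.get_train_val_test_splits(
        df, validation_splits={1}, test_splits={2}, split_field="split"
    )

    df_train = df[masks.training]     # rows with split 0
    df_val = df[masks.validation]     # rows with split 1
    df_test = df[masks.test]          # rows with split 2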

pyllars.ml_utils.precision_at_k(y_true, y_score, k=10, pos_label=1)[source]

Precision at rank k

This code was adapted from this gist: https://gist.github.com/mblondel/7337391

Parameters:
  • y_true (array-like, shape = [n_samples]) – Ground truth (true relevance labels).
  • y_score (array-like, shape = [n_samples]) – Predicted scores.
  • k (int) – Rank.
  • pos_label (int) – The label for “positive” instances
Returns:

precision @k

Return type:

float
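
For example (a minimal sketch; the scores are made up):

    import numpy as np
    from pyllars import ml_utils

    y_true = np.array([1, 0, 1, 1, 0, 0])
    y_score = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1])

    # Of the k=3 highest-scoring instances, two are truly positive.
    ml_utils.precision_at_k(y_true, y_score, k=3, pos_label=1)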

class pyllars.ml_utils.fold_data[source]

A named tuple for holding train, validation, and test datasets suitable for use in sklearn.

This class can be more convenient than pyllars.ml_utils.split_masks for modest-sized datasets.

X_{train,test,validation}

The X data (features) for the respective dataset splits

Type:numpy.ndarray
y_{train,test,validation}

The y data (target) for the respective dataset splits

Type:numpy.ndarray
{train,test,validation}_indices

The row indices from the original dataset of the respective dataset splits

Type:numpy.ndarray
class pyllars.ml_utils.split_masks[source]

A named tuple for holding boolean masks for the train, validation, and test splits of a complete dataset.

These masks can be used to index numpy.ndarray or pandas.DataFrame objects to extract the relevant dataset split for sklearn. This class can be more appropriate than pyllars.ml_utils.fold_data for large objects since it avoids any copies of the data.

{training,test,validation}

Boolean masks for the respective dataset splits

Type:numpy.ndarray