Machine learning utilities
This module contains utilities for common machine learning tasks.
In particular, this module focuses on tasks “surrounding” machine learning,
such as cross-fold splitting, performance evaluation, etc. It does not include
helpers for use directly in sklearn.pipeline.Pipeline.
Creating and managing cross-validation
get_cv_folds(y, num_splits, use_stratified, …) | Assign a split to each row based on the values of y
get_train_val_test_splits(df, …) | Get the appropriate training, validation, and testing split masks
get_fold_data(df, target_field, m_train, …) | Prepare a data frame for sklearn according to the given splits
Evaluating results
collect_binary_classification_metrics(…[, …]) | Collect various binary classification performance metrics for the predictions
collect_multiclass_classification_metrics(…) | Calculate various multi-class classification performance metrics
collect_regression_metrics(y_true, y_pred, …) | Collect various regression performance metrics for the predictions
calc_hand_and_till_m_score(y_true, y_score) | Calculate the (multi-class AUC) \(M\) score from Equation (7) of Hand and Till (2001).
calc_provost_and_domingos_auc(y_true, y_score) | Calculate the (multi-class AUC) \(M\) score from Equation (7) of Provost and Domingos (2000).
Data structures
fold_data | A named tuple for holding train, validation, and test datasets suitable for use in sklearn.
split_masks | A named tuple for holding boolean masks for the train, validation, and test splits of a complete dataset.
Definitions
-
pyllars.ml_utils._calc_hand_and_till_a_value(y_true: numpy.ndarray, y_score: numpy.ndarray, i: int, j: int) → float
Calculate the \(\hat{A}\) value in Equation (3) of [1]. Specifically,
\[\hat{A}(i|j) = \frac{S_i - n_i (n_i + 1)/2}{n_i n_j},\]
where \(n_i\) and \(n_j\) are the counts of instances of the respective classes and \(S_i\) is the (base-1) sum of the ranks of class \(i\).
Parameters: - y_true (numpy.ndarray) –
The true label of each instance. The labels are assumed to be encoded with integers [0, 1, … n_classes-1]. The respective columns in y_score should give the probabilities of the matching label.
This should have shape (n_samples,).
- y_score (numpy.ndarray) –
The score predictions for each class, e.g., from predict_proba, though they are not required to be probabilities.
This should have shape (n_samples, n_classes).
- {i,j} (int) – The class indices
Returns: a_hat – The \(\hat{A}\) value from Equation (3) referenced above. Specifically, this is the probability that a randomly drawn member of class \(j\) will have a lower estimated score for belonging to class \(i\) than a randomly drawn member of class \(i\).
Return type: float
References
[1] Hand, D. & Till, R. A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Machine Learning, 2001, 45, 171-186.
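As a rough illustration of the rank formula above, the following sketch computes \(\hat{A}(i|j)\) with plain numpy. The function name a_hat_sketch and the toy data are hypothetical, and the sketch assumes there are no tied scores; it is not the library's implementation.

```python
import numpy as np


def a_hat_sketch(y_true, y_score, i, j):
    """Rank-based estimate of A-hat(i|j); assumes no tied scores."""
    # Keep only the instances of classes i and j, with their scores for class i.
    mask = (y_true == i) | (y_true == j)
    labels = y_true[mask]
    scores = y_score[mask, i]

    # Base-1 ranks of those scores in ascending order (ties are not handled).
    ranks = np.empty(len(scores), dtype=float)
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)

    n_i = np.sum(labels == i)
    n_j = np.sum(labels == j)
    s_i = ranks[labels == i].sum()  # sum of the ranks of the class-i instances
    return (s_i - n_i * (n_i + 1) / 2) / (n_i * n_j)


# Toy data: 5 instances, 3 classes, one score column per class.
y_true = np.array([0, 0, 1, 1, 2])
y_score = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
])

a_01 = a_hat_sketch(y_true, y_score, 0, 1)
a_10 = a_hat_sketch(y_true, y_score, 1, 0)
```

On this toy data, three of the four (class 0, class 1) pairs are ordered correctly by the class-0 score, so a_01 is 0.75.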
-
pyllars.ml_utils._train_and_evaluate(estimator, X_train, y_train, X_test, y_test, target_transform, target_inverse_transform, collect_metrics, collect_metrics_kwargs, use_predict_proba)
Train and evaluate estimator on the given datasets
This function is a helper for evaluate_hyperparameters. It is not intended for external use.
-
pyllars.ml_utils.calc_hand_and_till_m_score(y_true: numpy.ndarray, y_score: numpy.ndarray) → float
Calculate the (multi-class AUC) \(M\) score from Equation (7) of Hand and Till (2001).
This is typically taken as a good multi-class extension of the AUC score. Please see [2] for more details about this score in particular and [3] for multi-class AUC in general.
N.B. In case y_score contains any np.nan values, those will be removed before calculating the \(M\) score.
N.B. This function can handle unobserved labels, except for the label with the highest index. In particular, y_score.shape[1] != np.max(np.unique(y_true)) + 1 causes an error.
Parameters: - y_true (numpy.ndarray) –
The true label of each instance. The labels are assumed to be encoded with integers [0, 1, … n_classes-1]. The respective columns in y_score should give the scores of the matching label.
This should have shape (n_samples,).
- y_score (numpy.ndarray) –
The score predictions for each class, e.g., from predict_proba, though they are not required to be probabilities.
This should have shape (n_samples, n_classes).
Returns: m – The “multi-class AUC” score referenced above
Return type: float
See also
_calc_hand_and_till_a_value() - for calculating the \(\hat{A}\) value
References
[2] Hand, D. & Till, R. A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Machine Learning, 2001, 45, 171-186.
[3] Fawcett, T. An introduction to ROC analysis. Pattern Recognition Letters, 2006, 27, 861-874.
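In Hand and Till (2001), the \(M\) score averages the symmetric pairwise measure \(A(i,j) = [\hat{A}(i|j) + \hat{A}(j|i)]/2\) over all pairs of classes. A minimal sketch under that definition follows. The names pairwise_a and m_score_sketch are hypothetical, \(\hat{A}\) is computed by direct pair counting rather than ranks, and the sketch assumes every class is observed and no scores are np.nan, so it does not reproduce the special handling described above.

```python
import itertools

import numpy as np


def pairwise_a(y_true, y_score, i, j):
    """Fraction of (class i, class j) pairs in which the class-i member has
    the higher score for class i; tied pairs count one half."""
    s_i = y_score[y_true == i, i]
    s_j = y_score[y_true == j, i]
    diff = s_i[:, None] - s_j[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size


def m_score_sketch(y_true, y_score):
    """Average of A(i, j) over all unordered class pairs."""
    classes = np.unique(y_true)
    c = len(classes)
    total = 0.0
    for i, j in itertools.combinations(classes, 2):
        a_ij = pairwise_a(y_true, y_score, i, j)
        a_ji = pairwise_a(y_true, y_score, j, i)
        total += 0.5 * (a_ij + a_ji)
    return 2.0 * total / (c * (c - 1))


y_true = np.array([0, 0, 1, 1, 2])
y_score = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
])

m = m_score_sketch(y_true, y_score)
```

Here the pairwise values are 0.75 for classes (0, 1) and 1.0 for the other two pairs, so m is their mean.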
-
pyllars.ml_utils.calc_provost_and_domingos_auc(y_true: numpy.ndarray, y_score: numpy.ndarray) → float
Calculate the (multi-class AUC) \(M\) score from Equation (7) of Provost and Domingos (2000).
This is typically taken as a good multi-class extension of the AUC score. Please see [4] for more details about this score in particular and [5] for multi-class AUC in general.
N.B. This function can handle unobserved labels, except for the label with the highest index. In particular, y_score.shape[1] != np.max(np.unique(y_true)) + 1 causes an error.
Parameters: - y_true (numpy.ndarray) –
The true label of each instance. The labels are assumed to be encoded with integers [0, 1, … n_classes-1]. The respective columns in y_score should give the scores of the matching label.
This should have shape (n_samples,).
- y_score (numpy.ndarray) –
The score predictions for each class, e.g., from predict_proba, though they are not required to be probabilities.
This should have shape (n_samples, n_classes).
Returns: m – The “multi-class AUC” score referenced above
Return type: float
References
[4] Provost, F. & Domingos, P. Well-Trained PETs: Improving Probability Estimation Trees. Stern School of Business, NYU, 2000.
[5] Fawcett, T. An introduction to ROC analysis. Pattern Recognition Letters, 2006, 27, 861-874.
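Fawcett [5] describes the Provost and Domingos approach as a prevalence-weighted average of one-vs-rest AUC values, one per class. The sketch below illustrates that idea with plain numpy. The names binary_auc and prevalence_weighted_auc_sketch are hypothetical, the sketch assumes every class is observed, and it may differ in detail from what this function actually computes.

```python
import numpy as np


def binary_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    tied pairs count one half."""
    diff = pos_scores[:, None] - neg_scores[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size


def prevalence_weighted_auc_sketch(y_true, y_score):
    """Sum over classes of p(class) * AUC(class vs. rest)."""
    total = 0.0
    for c in np.unique(y_true):
        is_c = y_true == c
        auc_c = binary_auc(y_score[is_c, c], y_score[~is_c, c])
        total += is_c.mean() * auc_c  # is_c.mean() is the class prevalence
    return total


y_true = np.array([0, 0, 1, 1, 2])
y_score = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
])

auc = prevalence_weighted_auc_sketch(y_true, y_score)
```

Unlike the pairwise average used for the Hand and Till score, this weighting makes the result sensitive to the class distribution.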
-
pyllars.ml_utils.collect_binary_classification_metrics(y_true: numpy.ndarray, y_probas_pred: numpy.ndarray, threshold: float = 0.5, pos_label=1, k: int = 10, include_roc_curve: bool = True, include_pr_curve: bool = True, prefix: str = '') → Dict
Collect various binary classification performance metrics for the predictions
Parameters: - y_true (numpy.ndarray) –
The true class of each instance.
This should have shape (n_samples,).
- y_probas_pred (numpy.ndarray) –
The score of each prediction for each instance.
This should have shape (n_samples, n_classes).
- threshold (float) – The score threshold to choose “positive” predictions
- pos_label (str or int) – The “positive” class for some metrics
- k (int) – The value of k to use for precision_at_k
- include_roc_curve (bool) – Whether to include the fpr and tpr points necessary to draw a ROC curve
- include_pr_curve (bool) – Whether to include details on the precision-recall curve
- prefix (str) – An optional prefix for the keys in the metrics dictionary
Returns: metrics – A mapping from the metric name to the respective value. Currently, the following metrics are included:
- sklearn.metrics.cohen_kappa_score()
- sklearn.metrics.hinge_loss()
- sklearn.metrics.matthews_corrcoef()
- sklearn.metrics.accuracy_score()
- sklearn.metrics.f1_score() (binary)
- sklearn.metrics.f1_score() (macro)
- sklearn.metrics.f1_score() (micro)
- sklearn.metrics.hamming_loss()
- sklearn.metrics.jaccard_score()
- sklearn.metrics.log_loss()
- sklearn.metrics.precision_score() (binary)
- sklearn.metrics.precision_score() (macro)
- sklearn.metrics.precision_score() (micro)
- sklearn.metrics.recall_score() (binary)
- sklearn.metrics.recall_score() (macro)
- sklearn.metrics.recall_score() (micro)
- sklearn.metrics.zero_one_loss()
- sklearn.metrics.average_precision_score() (macro)
- sklearn.metrics.average_precision_score() (micro)
- sklearn.metrics.roc_auc_score() (macro)
- sklearn.metrics.roc_auc_score() (micro)
- pyllars.ml_utils.precision_at_k()
- auprc: area under the PR curve
- minpse: See [Harutyunyan et al., 2019] for details
- roc_{fpr,tpr,thresholds}: sklearn.metrics.roc_curve()
- pr_{precisions,recalls,thresholds}: sklearn.metrics.precision_recall_curve()
Return type: Dict
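The minpse entry above is commonly computed as the largest value, over all score thresholds, of the minimum of precision and recall. The following sketch illustrates that reading with plain numpy. The name minpse_sketch is hypothetical, y_score here is assumed to be a one-dimensional array of scores for the positive class, labels are assumed to be 0/1, and the exact thresholding used by this function may differ.

```python
import numpy as np


def minpse_sketch(y_true, y_score):
    """Max over thresholds of min(precision, recall) for 0/1 labels."""
    n_pos = np.sum(y_true == 1)
    best = 0.0
    for t in np.unique(y_score):
        # Predict "positive" for every instance scoring at least t.
        y_pred = y_score >= t
        tp = np.sum(y_pred & (y_true == 1))
        precision = tp / np.sum(y_pred)  # y_pred is never empty here
        recall = tp / n_pos
        best = max(best, min(precision, recall))
    return float(best)


y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

minpse = minpse_sketch(y_true, y_score)
```

On this toy data the best trade-off is at threshold 0.35, where precision is 2/3 and recall is 1.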
-
pyllars.ml_utils.collect_multiclass_classification_metrics(y_true: numpy.ndarray, y_score: numpy.ndarray, prefix: str = '') → Dict
Calculate various multi-class classification performance metrics
Parameters: - y_true (numpy.ndarray) –
The true label of each instance. The labels are assumed to be encoded with integers [0, 1, … n_classes-1]. The respective columns in y_score should give the scores of the matching label.
This should have shape (n_samples,).
- y_score (numpy.ndarray) –
The score predictions for each class, e.g., from predict_proba, though they are not required to be probabilities.
This should have shape (n_samples, n_classes).
- prefix (str) – An optional prefix for the keys in the metrics dictionary
Returns: metrics – A mapping from the metric name to the respective value. Currently, the following metrics are included:
- sklearn.metrics.cohen_kappa_score()
- sklearn.metrics.accuracy_score()
- sklearn.metrics.f1_score() (micro)
- sklearn.metrics.f1_score() (macro)
- sklearn.metrics.hamming_loss()
- sklearn.metrics.precision_score() (micro)
- sklearn.metrics.precision_score() (macro)
- sklearn.metrics.recall_score() (micro)
- sklearn.metrics.recall_score() (macro)
- pyllars.ml_utils.calc_hand_and_till_m_score()
- pyllars.ml_utils.calc_provost_and_domingos_auc()
Return type: Dict
-
pyllars.ml_utils.collect_regression_metrics(y_true: numpy.ndarray, y_pred: numpy.ndarray, prefix: str = '') → Dict
Collect various regression performance metrics for the predictions
Parameters: - y_true (numpy.ndarray) – The true value of each instance
- y_pred (numpy.ndarray) – The prediction for each instance
- prefix (str) – An optional prefix for the keys in the metrics dictionary
Returns: metrics – A mapping from the metric name to the respective value. Currently, the following metrics are included:
Return type: Dict
-
class pyllars.ml_utils.estimators_predictions_metrics
A named tuple for holding fit estimators, predictions on the respective datasets, and results.
-
estimator_{val,test}
Estimators fit on the respective datasets.
Type: sklearn.base.BaseEstimator
-
predictions_{val,test}
Predictions of the respective models.
Type: numpy.ndarray
-
metrics_{val,test}
Metrics for the respective datasets.
Type: typing.Dict
-
fold_{train,val,test}
The identifiers of the respective folds.
Type: typing.Any
-
hyperparameters{_str}
The hyperparameters (in a string format) for training the models.
Type: typing.Optional[typing.Dict]
-
_asdict()
Return a new OrderedDict which maps field names to their values.
-
classmethod _make(iterable)
Make a new estimators_predictions_metrics object from a sequence or iterable.
-
_replace(**kwds)
Return a new estimators_predictions_metrics object, replacing the specified fields with new values.
-
Field order (each field name is an alias for the given position in the tuple):
- 0: estimator_val
- 1: estimator_test
- 2: predictions_val
- 3: predictions_test
- 4: true_val
- 5: true_test
- 6: metrics_val
- 7: metrics_test
- 8: fold_train
- 9: fold_val
- 10: fold_test
- 11: hyperparameters
- 12: hyperparameters_str
-
-
pyllars.ml_utils.evaluate_hyperparameters(estimator_template: sklearn.base.BaseEstimator, hyperparameters: Dict, validation_folds: Any, test_folds: Any, data: pandas.core.frame.DataFrame, collect_metrics: Callable, use_predict_proba: bool = False, train_folds: Optional[Any] = None, split_field: str = 'fold', target_field: str = 'target', target_transform: Optional[Callable] = None, target_inverse_transform: Optional[Callable] = None, collect_metrics_kwargs: Optional[Dict] = None, attribute_fields: Optional[Iterable[str]] = None, fields_to_ignore: Optional[Container[str]] = None, attributes_are_np_arrays: bool = False) → pyllars.ml_utils.estimators_predictions_metrics
Evaluate hyperparameters for fold
N.B. This function is not particularly efficient: it creates copies of the data.
This function performs the following steps:
- Create estimator_val and estimator_test based on estimator_template and hyperparameters
- Split data into train, val, test based on validation_folds and test_folds
- Transform target_field using the target_transform function
- Train estimator_val using train
- Evaluate the trained estimator_val on val using collect_metrics
- Train estimator_test using both train and val
- Evaluate the trained estimator_test on test using collect_metrics
Parameters: - estimator_template (sklearn.base.BaseEstimator) – The template for creating the estimator.
- hyperparameters (typing.Dict) – The hyperparameters for the model. These should be compatible with estimator_template.set_params.
- validation_folds (typing.Any) – The fold(s) to use for validation. The validation folds will be selected based on isin. If validation_folds is not a container, it will be cast as one.
- test_folds (typing.Any) – The fold(s) to use for testing. The test folds will be selected based on isin. If test_folds is not a container, it will be cast as one.
- data (pandas.DataFrame) – The data.
- collect_metrics (typing.Callable) – The function for evaluating the model performance. It should have at least two arguments, y_true and y_pred, in that order. Whatever collect_metrics returns is passed back in the metrics_{val,test} fields of the result.
- use_predict_proba (bool) – Whether to use predict (when False, the default) or predict_proba on the trained model.
- train_folds (typing.Optional[typing.Any]) – The fold(s) to use for training. If not given, the training fold will be taken as all rows in data which are not part of the validation or testing set.
- split_field (str) – The name of the column with the fold identifiers
- target_field (str) – The name of the column with the target value
- target_transform (typing.Optional[typing.Callable]) – A function for transforming the target before training models. Example: numpy.log1p()
- target_inverse_transform (typing.Optional[typing.Callable]) – A function for transforming model predictions back to the original domain. This should be a mathematical inverse of target_transform. Example: numpy.expm1() is the inverse of numpy.log1p().
- collect_metrics_kwargs (typing.Optional[typing.Dict]) – Additional keyword arguments for collect_metrics.
- attribute_fields (typing.Optional[typing.Iterable[str]]) – The names of the columns to use for attributes (that is, X). If None (default), then all columns except the target_field will be used as attributes.
- fields_to_ignore (typing.Optional[typing.Container[str]]) – The names of the columns to ignore.
- attributes_are_np_arrays (bool) – Whether to stack the values from the individual rows. This should be set to True when some of the columns in attribute_fields contain numpy arrays.
Returns: estimators_predictions_metrics – The fit estimators, predictions on the respective datasets, and results from collect_metrics.
Return type: pyllars.ml_utils.estimators_predictions_metrics
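The steps listed above for evaluate_hyperparameters can be sketched roughly as follows. Everything here is illustrative: MeanRegressor is a hypothetical stand-in for an sklearn-style estimator, mae_metrics is a hypothetical collect_metrics function, copy.deepcopy stands in for however the library copies the template (an sklearn workflow would more likely use sklearn.base.clone), and scoring the predictions on the original target scale is an assumption.

```python
import copy

import numpy as np


class MeanRegressor:
    """Hypothetical stand-in for an sklearn-style estimator."""

    def __init__(self, shrink=1.0):
        self.shrink = shrink

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        self.mean_ = self.shrink * float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


def mae_metrics(y_true, y_pred):
    """Hypothetical collect_metrics: takes (y_true, y_pred), returns a dict."""
    return {"mae": float(np.mean(np.abs(y_true - y_pred)))}


# Toy data: 12 rows in 3 folds.
X = np.arange(12, dtype=float).reshape(-1, 1)
y = np.array([1.0, 2.0, 3.0, 4.0] * 3)
fold = np.repeat([0, 1, 2], 4)

template, hyperparameters = MeanRegressor(), {"shrink": 1.0}

# Create the two estimators from the template and the hyperparameters.
estimator_val = copy.deepcopy(template).set_params(**hyperparameters)
estimator_test = copy.deepcopy(template).set_params(**hyperparameters)

# Split: fold 1 is validation, fold 2 is test, everything else is training.
m_val, m_test = fold == 1, fold == 2
m_train = ~(m_val | m_test)

# Transform the target (log1p), keeping expm1 as the inverse.
y_transformed = np.log1p(y)

# Train on train, evaluate on val.
estimator_val.fit(X[m_train], y_transformed[m_train])
predictions_val = np.expm1(estimator_val.predict(X[m_val]))
metrics_val = mae_metrics(y[m_val], predictions_val)

# Train on train + val, evaluate on test.
m_train_val = m_train | m_val
estimator_test.fit(X[m_train_val], y_transformed[m_train_val])
predictions_test = np.expm1(estimator_test.predict(X[m_test]))
metrics_test = mae_metrics(y[m_test], predictions_test)
```

Keeping two separate estimators means the validation metrics never see the validation rows during training, while the test estimator can use them.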
-
pyllars.ml_utils.get_cv_folds(y: numpy.ndarray, num_splits: int = 10, use_stratified: bool = True, shuffle: bool = True, random_state: int = 8675309) → numpy.ndarray
Assign a split to each row based on the values of y
Parameters: - y (numpy.ndarray) – The target variable for each row in a data frame. This is used to determine the stratification.
- num_splits (int) – The number of splits (folds) to use
- use_stratified (bool) – Whether to use stratified cross-validation. For example, this may be set to False if choosing folds for regression.
- shuffle (bool) – Whether to shuffle during the split
- random_state (int) – The state for the random number generator
Returns: splits – The split of each row
Return type: numpy.ndarray
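To make the idea of "a split for each row" concrete, here is a simplified, numpy-only stand-in for stratified fold assignment. It shuffles the rows of each class and deals them out to the folds in turn. The name assign_folds_sketch is hypothetical, and this is not the procedure used by get_cv_folds or by sklearn's stratified splitters; it only shows the shape of the output.

```python
import numpy as np


def assign_folds_sketch(y, num_splits=3, random_state=0):
    """Return an array giving a fold index in [0, num_splits) for each row."""
    rng = np.random.default_rng(random_state)
    folds = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)
        # Deal the shuffled rows of this class out to the folds in turn,
        # so each fold gets a near-equal share of every class.
        folds[idx] = np.arange(len(idx)) % num_splits
    return folds


y = np.array([0] * 6 + [1] * 3)
folds = assign_folds_sketch(y, num_splits=3)
```

The resulting array can be stored as a column of a data frame and used as the split_field or fold column expected by the other helpers in this module.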
-
pyllars.ml_utils.get_fold_data(df: pandas.core.frame.DataFrame, target_field: str, m_train: numpy.ndarray, m_test: numpy.ndarray, m_validation: Optional[numpy.ndarray] = None, attribute_fields: Optional[Iterable[str]] = None, fields_to_ignore: Optional[Iterable[str]] = None, attributes_are_np_arrays: bool = False) → pyllars.ml_utils.fold_data
Prepare a data frame for sklearn according to the given splits
N.B. This function creates copies of the data, so it is not appropriate for very large datasets.
Parameters: - df (pandas.DataFrame) – A data frame
- target_field (str) – The name of the column containing the target variable
- m_{train,test,validation} (np.ndarray) – Boolean masks indicating the training, testing, and validation set rows. If m_validation is None (default), then no validation set will be included.
- attribute_fields (typing.Optional[typing.Iterable[str]]) – The names of the columns to use for attributes (that is, X). If None (default), then all columns except the target_field will be used as attributes.
- fields_to_ignore (typing.Optional[typing.Iterable[str]]) – The names of the columns to ignore.
- attributes_are_np_arrays (bool) – Whether to stack the values from the individual rows. This should be set to True when some of the columns in attribute_fields contain numpy arrays.
Returns: fold_data – A named tuple with the given splits
Return type: pyllars.ml_utils.fold_data
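The kind of extraction described above can be sketched with pandas as follows. The column names and masks are made up for illustration, and this is only an approximation of what get_fold_data does; in particular, it ignores the attributes_are_np_arrays option.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "f1": [0.1, 0.2, 0.3, 0.4],
    "f2": [1.0, 2.0, 3.0, 4.0],
    "id": ["a", "b", "c", "d"],
    "target": [0, 1, 0, 1],
})
target_field = "target"
fields_to_ignore = {"id"}

# Boolean masks over the rows of df.
m_train = np.array([True, True, False, True])
m_test = ~m_train

# With no explicit attribute_fields, use every column except the target
# and the ignored fields.
attribute_fields = [
    c for c in df.columns
    if c != target_field and c not in fields_to_ignore
]

# Selecting with the masks and converting to arrays creates copies of the data.
X_train = df.loc[m_train, attribute_fields].to_numpy()
y_train = df.loc[m_train, target_field].to_numpy()
X_test = df.loc[m_test, attribute_fields].to_numpy()
y_test = df.loc[m_test, target_field].to_numpy()
```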
-
pyllars.ml_utils.get_train_val_test_splits(df: pandas.core.frame.DataFrame, training_splits: Optional[Set] = None, validation_splits: Optional[Set] = None, test_splits: Optional[Set] = None, split_field: str = 'split') → pyllars.ml_utils.split_masks
Get the appropriate training, validation, and testing split masks
The split_field column in df is used to assign each row to a particular split. Then, the splits specified in the parameters are assigned as indicated.
By default, all splits not in validation_splits and test_splits are assumed to belong to the training set. Thus, unless a particular training set is given, the returned masks will cover the entire dataset.
This function does not check whether the different splits overlap. So care should be taken, especially if specifying the training splits explicitly.
It is not necessary that the split_field values are numeric. They must be compatible with isin, however.
Parameters: - df (pandas.DataFrame) – A data frame. It must contain a column named split_field, but it is not otherwise validated.
- training_splits (typing.Optional[typing.Set]) –
The splits to use for the training set. By default, anything not in the validation_splits or test_splits will be placed in the training set.
If given, this should be a container compatible with isin. A value that is not a container will be wrapped in a set.
- {validation,test}_splits (typing.Optional[typing.Set]) –
The splits to use for the validation and test sets, respectively.
If given, these should be containers compatible with isin. A value that is not a container will be wrapped in a set.
- split_field (str) – The name of the column indicating the split for each row.
Returns: split_masks – Masks for the respective sets. True positions indicate the rows which belong to the respective sets. All three masks are always returned, but a mask may be always False if the given split does not contain any rows.
Return type: pyllars.ml_utils.split_masks
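The default behaviour described above (everything not in the validation or test splits goes to training) can be sketched with pandas isin as follows. The data frame and split labels are made up, and this is an illustration of the masking logic, not the function's source.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "split": ["a", "b", "c", "a", "b", "c"],  # non-numeric labels are fine
})
split_field = "split"
validation_splits = {"b"}
test_splits = {"c"}

# Boolean masks: True marks the rows that belong to the respective set.
m_validation = df[split_field].isin(validation_splits).to_numpy()
m_test = df[split_field].isin(test_splits).to_numpy()

# No explicit training splits: take every row not used for validation or test.
m_training = ~(m_validation | m_test)

train_rows = df[m_training]
```

Because the training mask is defined as the complement, the three masks together cover every row; that is no longer guaranteed once training_splits is given explicitly.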
-
pyllars.ml_utils.precision_at_k(y_true, y_score, k=10, pos_label=1)
Precision at rank k
This code was adapted from this gist: https://gist.github.com/mblondel/7337391
Returns: precision @k
Return type:
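For orientation, the plain definition of precision at rank k is the fraction of the k highest-scored instances that carry the positive label. The sketch below implements that definition with numpy. The name precision_at_k_sketch is hypothetical, and the adapted code may normalize differently (for example, when there are fewer than k positive instances), so treat this as an assumption rather than a description of this function.

```python
import numpy as np


def precision_at_k_sketch(y_true, y_score, k=10, pos_label=1):
    """Fraction of the k highest-scored instances labeled pos_label."""
    k = min(k, len(y_score))
    # Indices of the k largest scores, highest first.
    top_k = np.argsort(y_score)[::-1][:k]
    return float(np.mean(y_true[top_k] == pos_label))


y_true = np.array([1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.2, 0.1])

p_at_3 = precision_at_k_sketch(y_true, y_score, k=3)
```

Two of the three highest-scored instances are positive here, so p_at_3 is 2/3.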
-
class pyllars.ml_utils.fold_data
A named tuple for holding train, validation, and test datasets suitable for use in sklearn.
This class can be more convenient than pyllars.ml_utils.split_masks for modest-sized datasets.
-
X_{train,test,validation}
The X data (features) for the respective dataset splits
Type: numpy.ndarray
-
y_{train,test,validation}
The y data (target) for the respective dataset splits
Type: numpy.ndarray
-
{train,test,validation}_indices
The row indices from the original dataset of the respective dataset splits
Type: numpy.ndarray
-
-
class pyllars.ml_utils.split_masks
A named tuple for holding boolean masks for the train, validation, and test splits of a complete dataset.
These masks can be used to index numpy.ndarray or pandas.DataFrame objects to extract the relevant dataset split for sklearn. This class can be more appropriate than pyllars.ml_utils.fold_data for large objects since it avoids any copies of the data.
-
training,test,validation
Boolean masks for the respective dataset splits
Type: numpy.ndarray
-