skll Package

We have made the most useful parts of our API available in the top-level skll namespace even though some of them actually live in subpackages. They are documented in both places for convenience.

From data Package

class skll.FeatureSet(name, ids, labels=None, features=None, vectorizer=None)[source]

Bases: object

Encapsulation of all of the features, values, and metadata about a given set of data. This replaces ExamplesTuple from older versions of SKLL.

Parameters:
  • name (str) – The name of this feature set.
  • ids (np.array) – Example IDs for this set.
  • labels (np.array, optional) – labels for this set. Defaults to None.
  • features (list of dict or array-like, optional) – The features for each instance represented as either a list of dictionaries or an array-like (if vectorizer is also specified). Defaults to None.
  • vectorizer (DictVectorizer or FeatureHasher, optional) – Vectorizer which will be used to generate the feature matrix. Defaults to None.

Warning

FeatureSets can only be equal if the order of the instances is identical because these are stored as lists/arrays. Since scikit-learn’s DictVectorizer automatically sorts the underlying feature matrix if it is sparse, we do not do any sorting before checking for equality. This is not a problem because we _always_ use sparse matrices with DictVectorizer when creating FeatureSets.

Notes

If ids, labels, and/or features are not None, the number of rows in each array must be equal.

filter(ids=None, labels=None, features=None, inverse=False)[source]

Removes or keeps features and/or examples from the FeatureSet depending on the parameters. Filtering is done in-place.

Parameters:
  • ids (list of str/float, optional) – Examples to keep in the FeatureSet. If None, no ID filtering takes place. Defaults to None.
  • labels (list of str/float, optional) – Labels that we want to retain examples for. If None, no label filtering takes place. Defaults to None.
  • features (list of str, optional) – Features to keep in the FeatureSet. To help with filtering string-valued features that were converted to sequences of boolean features when read in, any features in the FeatureSet that contain a = will be split on the first occurrence and the prefix will be checked to see if it is in features. If None, no feature filtering takes place. Cannot be used if FeatureSet uses a FeatureHasher for vectorization. Defaults to None.
  • inverse (bool, optional) – Instead of keeping the specified features and/or examples, remove them. Defaults to False.
Raises:

ValueError – If attempting to use features to filter a FeatureSet that uses a FeatureHasher vectorizer.

filtered_iter(ids=None, labels=None, features=None, inverse=False)[source]

A version of __iter__ that retains only the specified features and/or examples from the output.

Parameters:
  • ids (list of str/float, optional) – Examples to keep in the FeatureSet. If None, no ID filtering takes place. Defaults to None.
  • labels (list of str/float, optional) – Labels that we want to retain examples for. If None, no label filtering takes place. Defaults to None.
  • features (list of str, optional) – Features to keep in the FeatureSet. To help with filtering string-valued features that were converted to sequences of boolean features when read in, any features in the FeatureSet that contain a = will be split on the first occurrence and the prefix will be checked to see if it is in features. If None, no feature filtering takes place. Cannot be used if FeatureSet uses a FeatureHasher for vectorization. Defaults to None.
  • inverse (bool, optional) – Instead of keeping the specified features and/or examples, remove them. Defaults to False.
Yields:
  • id_ (str) – The ID of the example.
  • label_ (str) – The label of the example.
  • feat_dict (dict) – The feature dictionary, with feature name as the key and example value as the value.
Raises:

ValueError – If the vectorizer is not a DictVectorizer.

static from_data_frame(df, name, labels_column=None, vectorizer=None)[source]

Helper function to create a FeatureSet instance from a pandas.DataFrame. Will raise an Exception if pandas is not installed in your environment. The ids in the FeatureSet will be the index from the given frame.

Parameters:
  • df (pd.DataFrame) – The pandas.DataFrame object to use as a FeatureSet.
  • name (str) – The name of the output FeatureSet instance.
  • labels_column (str, optional) – The name of the column containing the labels (data to predict). Defaults to None.
  • vectorizer (DictVectorizer or FeatureHasher, optional) – Vectorizer which will be used to generate the feature matrix. Defaults to None.
Returns:

feature_set – A FeatureSet instance generated from the given data frame.

Return type:

skll.FeatureSet

has_labels

Check if FeatureSet has finite labels.

Returns:has_labels – Whether or not this FeatureSet has any finite labels.
Return type:bool

static split_by_ids(fs, ids_for_split1, ids_for_split2=None)[source]

Split the FeatureSet into two new FeatureSet instances based on the given IDs for the two splits.

Parameters:
  • fs (skll.FeatureSet) – The FeatureSet instance to split.
  • ids_for_split1 (list of int) – A list of example IDs which will be split out into the first FeatureSet instance. Note that the FeatureSet instance will respect the order of the specified IDs.
  • ids_for_split2 (list of int, optional) – An optional list of example IDs which will be split out into the second FeatureSet instance. Note that the FeatureSet instance will respect the order of the specified IDs. If this is not specified, then the second FeatureSet instance will contain the complement of the first set of IDs sorted in ascending order. Defaults to None.
Returns:

  • fs1 (skll.FeatureSet) – The first FeatureSet.
  • fs2 (skll.FeatureSet) – The second FeatureSet.

From experiments Package

skll.run_configuration(config_file, local=False, overwrite=True, queue='all.q', hosts=None, write_summary=True, quiet=False, ablation=0, resume=False, log_level=20)[source]

Takes a configuration file and runs the specified jobs on the grid.

Parameters:
  • config_file (str) – Path to the configuration file we would like to use.
  • local (bool, optional) – Should this be run locally instead of on the cluster? Defaults to False.
  • overwrite (bool, optional) – If the model files already exist, should we overwrite them instead of re-using them? Defaults to True.
  • queue (str, optional) – The DRMAA queue to use if we’re running on the cluster. Defaults to 'all.q'.
  • hosts (list of str, optional) – If running on the cluster, these are the machines we should use. Defaults to None.
  • write_summary (bool, optional) – Write a TSV file with a summary of the results. Defaults to True.
  • quiet (bool, optional) – Suppress printing of “Loading…” messages. Defaults to False.
  • ablation (int, optional) – Number of features to remove when doing an ablation experiment. If positive, we will perform repeated ablation runs for all combinations of features removing the specified number at a time. If None, we will use all combinations of all lengths. If 0, the default, no ablation is performed. If negative, a ValueError is raised. Defaults to 0.
  • resume (bool, optional) – If result files already exist for an experiment, do not overwrite them. This is very useful when doing a large ablation experiment and part of it crashes. Defaults to False.
  • log_level (int, optional) – The level for logging messages. Defaults to logging.INFO.
Returns:

result_json_paths – A list of paths to .json results files for each variation in the experiment.

Return type:

list of str

Raises:
  • ValueError – If value for "ablation" is not a positive int or None.
  • OSError – If the length of the FeatureSet name > 210.
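A minimal configuration file for such a run might look like the sketch below. The section names follow the SKLL configuration format, but the field values (directories, feature set, learner, objective) are illustrative, so consult the configuration documentation for the full set of options:

```ini
[General]
experiment_name = toy_experiment
task = cross_validate

[Input]
train_directory = train
featuresets = [["toy_features"]]
learners = ["LogisticRegression"]

[Tuning]
grid_search = true
objectives = ["f1_score_micro"]

[Output]
results = output
```

Passing such a file to run_configuration("toy_experiment.cfg", local=True) would run the experiment on the local machine instead of the grid and return the paths to the result JSON files.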

From learner Package

class skll.Learner(model_type, probability=False, pipeline=False, feature_scaling='none', model_kwargs=None, pos_label_str=None, min_feature_count=1, sampler=None, sampler_kwargs=None, custom_learner_path=None, logger=None)[source]

Bases: object

A simpler learner interface around many scikit-learn classification and regression estimators.

Parameters:
  • model_type (str) – Name of estimator to create (e.g., 'LogisticRegression'). See the skll package documentation for valid options.
  • probability (bool, optional) – Should learner return probabilities of all labels (instead of just label with highest probability)? Defaults to False.
  • pipeline (bool, optional) – Should the learner contain a pipeline attribute with a scikit-learn Pipeline object composed of all of the steps, including the vectorizer, the feature selector, the sampler, the feature scaler, and the actual estimator? Note that this will increase the size of the learner object in memory and also when it is saved to disk. Defaults to False.
  • feature_scaling (str, optional) – How to scale the features, if at all. Options are 'with_std' (scale features using the standard deviation), 'with_mean' (center features using the mean), 'both' (both scale and center), and 'none' (neither). Defaults to 'none'.
  • model_kwargs (dict, optional) – A dictionary of keyword arguments to pass to the initializer for the specified model. Defaults to None.
  • pos_label_str (str, optional) – A string denoting the label of the class to be treated as the positive class in a binary classification setting. If None, the class represented by the label that appears second when sorted is chosen as the positive class. For example, if the two labels in data are “A” and “B” and pos_label_str is not specified, “B” will be chosen as the positive class. Defaults to None.
  • min_feature_count (int, optional) – The minimum number of examples a feature must have a nonzero value in to be included. Defaults to 1.
  • sampler (str, optional) – The sampler to use for kernel approximation, if desired. Valid values are 'AdditiveChi2Sampler', 'Nystroem', 'RBFSampler', and 'SkewedChi2Sampler'. Defaults to None.
  • sampler_kwargs (dict, optional) – A dictionary of keyword arguments to pass to the initializer for the specified sampler. Defaults to None.
  • custom_learner_path (str, optional) – Path to module where a custom classifier is defined. Defaults to None.
  • logger (logging object, optional) – A logging object. If None is passed, get logger from __name__. Defaults to None.
cross_validate(examples, stratified=True, cv_folds=10, grid_search=True, grid_search_folds=3, grid_jobs=None, grid_objective=None, output_metrics=[], prediction_prefix=None, param_grid=None, shuffle=False, save_cv_folds=False, save_cv_models=False, use_custom_folds_for_grid_search=True)[source]

Cross-validates a given model on the training examples.

Parameters:
  • examples (skll.FeatureSet) – The FeatureSet instance to cross-validate learner performance on.
  • stratified (bool, optional) – Should we stratify the folds to ensure an even distribution of labels for each fold? Defaults to True.
  • cv_folds (int or dict, optional) – The number of folds to use for cross-validation, or a mapping from example IDs to folds. Defaults to 10.
  • grid_search (bool, optional) – Should we do grid search when training each fold? Note that this will make cross-validation take much longer. Defaults to True.
  • grid_search_folds (int or dict, optional) – The number of folds to use when doing the grid search, or a mapping from example IDs to folds. Defaults to 3.
  • grid_jobs (int, optional) – The number of jobs to run in parallel when doing the grid search. If None or 0, the number of grid search folds will be used. Defaults to None.
  • grid_objective (str, optional) – The name of the objective function to use when doing the grid search. Must be specified if grid_search is True. Defaults to None.
  • output_metrics (list of str, optional) – List of additional metric names to compute in addition to the metric used for grid search. Empty by default. Defaults to an empty list.
  • prediction_prefix (str, optional) – If saving the predictions, this is the prefix that will be used for the filename. It will be followed by "_predictions.tsv". Defaults to None.
  • param_grid (dict, optional) – The parameter grid to search. Defaults to None.
  • shuffle (bool, optional) – Shuffle examples before splitting into folds for CV. Defaults to False.
  • save_cv_folds (bool, optional) – Whether to save the cross-validation fold IDs. Defaults to False.
  • save_cv_models (bool, optional) – Whether to save the cross-validation models. Defaults to False.
  • use_custom_folds_for_grid_search (bool, optional) – If cv_folds is a custom dictionary, but grid_search_folds is not, perhaps due to user oversight, should the same custom dictionary automatically be used for the inner grid-search cross-validation? Defaults to True.
Returns:

  • results (list of 6-tuples) – The confusion matrix, overall accuracy, per-label PRFs, model parameters, objective function score, and evaluation metrics (if any) for each fold.
  • grid_search_scores (list of floats) – The grid search scores for each fold.
  • grid_search_cv_results_dicts (list of dicts) – A list of dictionaries of grid search CV results, one per fold, with keys such as “params”, “mean_test_score”, etc, that are mapped to lists of values associated with each hyperparameter set combination.
  • skll_fold_ids (dict) – A dictionary containing the test-fold number for each id if save_cv_folds is True, otherwise None.
  • models (list of skll.learner.Learner) – A list of skll.learner.Learners, one for each fold if save_cv_models is True, otherwise None.

Raises:

ValueError – If classification labels are not properly encoded as strings.

evaluate(examples, prediction_prefix=None, append=False, grid_objective=None, output_metrics=[])[source]

Evaluates a given model on a given dev or test FeatureSet.

Parameters:
  • examples (skll.FeatureSet) – The FeatureSet instance to evaluate the performance of the model on.
  • prediction_prefix (str, optional) – If not None, predictions will also be written out to a file with the name <prediction_prefix>_predictions.tsv. Note that the prefix can also contain a path. Defaults to None.
  • append (bool, optional) – Should we append the current predictions to the file if it exists? Defaults to False.
  • grid_objective (function, optional) – The objective function that was used when doing the grid search. Defaults to None.
  • output_metrics (list of str, optional) – List of additional metric names to compute in addition to grid objective. Empty by default. Defaults to an empty list.
Returns:

res – The confusion matrix, the overall accuracy, the per-label PRFs, the model parameters, the grid search objective function score, and the additional evaluation metrics, if any. For regressors, the first two elements are None.

Return type:

6-tuple

classmethod from_file(learner_path, logger=None)[source]

Load a saved Learner instance from a file path.

Parameters:
  • learner_path (str) – The path to a saved Learner instance file.
  • logger (logging object, optional) – A logging object. If None is passed, get logger from __name__. Defaults to None.
Returns:

learner – The Learner instance loaded from the file.

Return type:

skll.Learner

Raises:
  • ValueError – If the pickled object is not a Learner instance.
  • ValueError – If the pickled version of the Learner instance is out of date.
learning_curve(examples, metric, cv_folds=10, train_sizes=array([0.1, 0.325, 0.55, 0.775, 1. ]))[source]

Generates learning curves for a given model on the training examples via cross-validation. Adapted from the scikit-learn code for learning curve generation (cf. sklearn.model_selection.learning_curve).

Parameters:
  • examples (skll.FeatureSet) – The FeatureSet instance to generate the learning curve on.
  • cv_folds (int or dict, optional) – The number of folds to use for cross-validation, or a mapping from example IDs to folds. Defaults to 10.
  • metric (str) – The name of the metric function to use when computing the train and test scores for the learning curve.
  • train_sizes (list of float or int, optional) – Relative or absolute numbers of training examples that will be used to generate the learning curve. If the type is float, it is regarded as a fraction of the maximum size of the training set (that is determined by the selected validation method), i.e. it has to be within (0, 1]. Otherwise it is interpreted as absolute sizes of the training sets. Note that for classification the number of samples usually has to be big enough to contain at least one sample from each class. Defaults to np.linspace(0.1, 1.0, 5).
Returns:

  • train_scores (list of float) – The scores for the training set.
  • test_scores (list of float) – The scores on the test set.
  • num_examples (list of int) – The numbers of training examples used to generate the curve.

load(learner_path)[source]

Replace the current learner instance with a saved learner.

Parameters:learner_path (str) – The path to a saved learner object file to load.
model

The underlying scikit-learn model

model_kwargs

A dictionary of the underlying scikit-learn model’s keyword arguments

model_params

Model parameters (i.e., weights) for linear models (e.g., Ridge) and liblinear models. If the model was trained using feature hashing, then names of the form hashed_feature_XX are used instead.

Returns:
  • res (dict) – A dictionary of labeled weights.
  • intercept (dict) – A dictionary of intercept(s).
Raises:ValueError – If the instance does not support model parameters.
model_type

The model type (i.e., the class)

predict(examples, prediction_prefix=None, append=False, class_labels=True)[source]

Uses the given model to generate predictions on a given FeatureSet and, optionally, write them out to a file.

For regressors, the returned and written-out predictions are identical. However, for classifiers:
  • if class_labels is True, class labels are returned as well as written out.
  • if class_labels is False and the classifier is probabilistic (i.e., self.probability is True), class probabilities are returned as well as written out.
  • if class_labels is False and the classifier is non-probabilistic (i.e., self.probability is False), class indices are returned and class labels are written out.

TL;DR: for regressors, just ignore class_labels. For classifiers, set it to True to get class labels and False to get class probabilities.

Parameters:
  • examples (skll.FeatureSet) – The FeatureSet instance to predict labels for.
  • prediction_prefix (str, optional) – If not None, predictions will also be written out to a file with the name <prediction_prefix>_predictions.tsv. For classifiers, the predictions written out are class labels unless the learner is probabilistic AND class_labels is set to False. Note that this prefix can also contain a path. Defaults to None.
  • append (bool, optional) – Should we append the current predictions to the file if it exists? Defaults to False.
  • class_labels (bool, optional) – If False, return either the class probabilities (probabilistic classifiers) or the class indices (non-probabilistic ones). If True, return the class labels no matter what. Ignored for regressors. Defaults to True.
Returns:

yhat – The predictions returned by the Learner instance.

Return type:

array-like

Raises:
  • AssertionError – If invalid predictions are being returned or written out.
  • MemoryError – If process runs out of memory when converting to dense.
  • RuntimeError – If there is a mismatch between the learner vectorizer and the test set vectorizer.
probability

Should learner return probabilities of all labels (instead of just label with highest probability)?

save(learner_path)[source]

Save the Learner instance to a file.

Parameters:learner_path (str) – The path to save the Learner instance to.
train(examples, param_grid=None, grid_search_folds=3, grid_search=True, grid_objective=None, grid_jobs=None, shuffle=False)[source]

Train the model underlying the learner and return the grid search score and a dictionary of grid search results.

Parameters:
  • examples (skll.FeatureSet) – The FeatureSet instance to use for training.
  • param_grid (dict, optional) – The parameter grid to search through for grid search. If None, a default parameter grid will be used. Defaults to None.
  • grid_search_folds (int or dict, optional) – The number of folds to use when doing the grid search, or a mapping from example IDs to folds. Defaults to 3.
  • grid_search (bool, optional) – Should we do grid search? Defaults to True.
  • grid_objective (str, optional) – The name of the objective function to use when doing the grid search. Must be specified if grid_search is True. Defaults to None.
  • grid_jobs (int, optional) – The number of jobs to run in parallel when doing the grid search. If None or 0, the number of grid search folds will be used. Defaults to None.
  • shuffle (bool, optional) – Shuffle examples (e.g., for grid search CV.) Defaults to False.
Returns:

tuple – 1) The best grid search objective function score, or 0 if we’re not doing grid search, and 2) a dictionary of grid search CV results with keys such as “params”, “mean_test_score”, etc, that are mapped to lists of values associated with each hyperparameter set combination, or None if not doing grid search.

Return type:

(float, dict)

Raises:
  • ValueError – If grid_objective is not a valid grid objective or if one is not specified when necessary.
  • MemoryError – If process runs out of memory converting training data to dense.
  • ValueError – If FeatureHasher is used with MultinomialNB.