learner Module
Provides an easy-to-use wrapper around scikit-learn.
author:  Michael Heilman (mheilman@ets.org) 

author:  Nitin Madnani (nmadnani@ets.org) 
author:  Dan Blanchard (dblanchard@ets.org) 
author:  Aoife Cahill (acahill@ets.org) 
organization:  ETS 

class skll.learner.FilteredLeaveOneGroupOut(keep, example_ids)
Bases: sklearn.model_selection._split.LeaveOneGroupOut
Version of the LeaveOneGroupOut cross-validation iterator that only outputs indices of instances with IDs in a pre-specified set.
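The filtering idea can be sketched without scikit-learn. The following is only an illustration, not the SKLL implementation: the function name `filtered_leave_one_group_out` and the `groups` argument are assumptions made for the example, while `keep` and `example_ids` mirror the constructor arguments above.

```python
def filtered_leave_one_group_out(groups, example_ids, keep):
    """Yield (train, test) index lists, one split per distinct group.

    Hypothetical helper: indices whose example ID is not in `keep`
    are dropped from both sides of every split.
    """
    keep = set(keep)
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups)
                 if g != held_out and example_ids[i] in keep]
        test = [i for i, g in enumerate(groups)
                if g == held_out and example_ids[i] in keep]
        yield train, test


groups = ["a", "a", "b", "b", "c"]
example_ids = ["ex1", "ex2", "ex3", "ex4", "ex5"]
splits = list(filtered_leave_one_group_out(
    groups, example_ids, keep={"ex1", "ex3", "ex4", "ex5"}))
```

Here "ex2" never appears in any split because its ID is not in the kept set.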

class skll.learner.Learner(model_type, probability=False, feature_scaling=u'none', model_kwargs=None, pos_label_str=None, min_feature_count=1, sampler=None, sampler_kwargs=None, custom_learner_path=None)
Bases: object
A simpler learner interface around many scikit-learn classification and regression functions.
Parameters:  model_type (str) – Type of estimator to create (e.g., LogisticRegression). See the skll package documentation for valid options.
 probability (bool) – Should learner return probabilities of all labels (instead of just label with highest probability)?
 feature_scaling (str) – How to scale the features, if at all. Options are: ‘with_std’: scale features using the standard deviation, ‘with_mean’: center features using the mean, ‘both’: do both scaling and centering, ‘none’: do neither scaling nor centering.
 model_kwargs (dict) – A dictionary of keyword arguments to pass to the initializer for the specified model.
 pos_label_str (str) – The string for the positive label in the binary classification setting. If not specified, an arbitrary label is picked.
 min_feature_count (int) – The minimum number of examples a feature must have a nonzero value in to be included.
 sampler (str) – The sampler to use for kernel approximation, if desired. Valid values are: 'AdditiveChi2Sampler', 'Nystroem', 'RBFSampler', and 'SkewedChi2Sampler'.
 sampler_kwargs (dict) – A dictionary of keyword arguments to pass to the initializer for the specified sampler.
 custom_learner_path (str) – Path to module where a custom classifier is defined.
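To make the four feature_scaling options concrete, here is a minimal pure-Python sketch applied to a single feature column. The helper `scale_column` is hypothetical and is not part of SKLL; it assumes the population standard deviation is used for scaling, as is common in standardization.

```python
from statistics import mean, pstdev


def scale_column(values, feature_scaling="none"):
    """Center and/or scale one feature column (illustrative only)."""
    with_mean = feature_scaling in ("with_mean", "both")
    with_std = feature_scaling in ("with_std", "both")
    center = mean(values) if with_mean else 0.0
    spread = pstdev(values) if with_std else 1.0
    if spread == 0:
        spread = 1.0  # assumption: leave constant features unscaled
    return [(v - center) / spread for v in values]


column = [2.0, 4.0, 6.0]
centered = scale_column(column, "with_mean")
standardized = scale_column(column, "both")
untouched = scale_column(column, "none")
```

With 'with_mean' the column is only shifted so that its mean is zero; with 'both' it additionally has unit standard deviation.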

cross_validate(examples, stratified=True, cv_folds=10, grid_search=False, grid_search_folds=3, grid_jobs=None, grid_objective=u'f1_score_micro', prediction_prefix=None, param_grid=None, shuffle=False, save_cv_folds=False)
Cross-validates a given model on the training examples.
Parameters:  examples (FeatureSet) – The data to cross-validate learner performance on.
 stratified (bool) – Should we stratify the folds to ensure an even distribution of labels for each fold?
 cv_folds (int or dict) – The number of folds to use for cross-validation, or a mapping from example IDs to folds.
 grid_search (bool) – Should we do grid search when training each fold? Note: this makes cross-validation take much longer.
 grid_search_folds (int) – The number of folds to use when doing the grid search (ignored if cv_folds is set to a dictionary mapping examples to folds).
 grid_jobs (int) – The number of jobs to run in parallel when doing the grid search. If unspecified or 0, the number of grid search folds will be used.
 grid_objective (function) – The objective function to use when doing the grid search.
 param_grid (list of dicts mapping from strs to lists of parameter values) – The parameter grid to search through for grid search. If unspecified, a default parameter grid will be used.
 prediction_prefix (str) – If saving the predictions, this is the prefix that will be used for the filename. It will be followed by ".predictions".
 shuffle (bool) – Shuffle examples before splitting into folds for CV.
 save_cv_folds (bool) – Whether to save the CV fold IDs.
Returns: The confusion matrix, overall accuracy, per-label PRFs, and model parameters for each fold in one list, and another list with the grid search scores for each fold. Also returns a dictionary containing the test-fold number for each ID if save_cv_folds is True, otherwise None.
Return type: (list of 4-tuples, list of float, dict)
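When cv_folds is a dictionary, each example ID is assigned to a fold explicitly. The sketch below shows one way such a mapping could be turned into train/test splits; the helper `splits_from_fold_mapping` and the sample IDs are assumptions for illustration and do not reproduce SKLL's internals.

```python
from collections import defaultdict


def splits_from_fold_mapping(example_ids, fold_of):
    """Build one (train, test) split per fold named in `fold_of`."""
    by_fold = defaultdict(list)
    for index, example_id in enumerate(example_ids):
        by_fold[fold_of[example_id]].append(index)
    splits = []
    for fold in sorted(by_fold):
        test = by_fold[fold]
        held_out = set(test)
        # Everything not in the held-out fold is used for training.
        train = [i for i in range(len(example_ids)) if i not in held_out]
        splits.append((train, test))
    return splits


example_ids = ["e1", "e2", "e3", "e4"]
fold_of = {"e1": "A", "e2": "B", "e3": "A", "e4": "B"}
splits = splits_from_fold_mapping(example_ids, fold_of)
```

Each fold serves as the test set exactly once, so the number of splits equals the number of distinct fold names in the mapping.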

evaluate(examples, prediction_prefix=None, append=False, grid_objective=None)
Evaluates a given model on a given dev or test example set.
Parameters:  examples (FeatureSet) – The examples to evaluate the performance of the model on.
 prediction_prefix (str) – If saving the predictions, this is the prefix that will be used for the filename. It will be followed by ".predictions".
 append (bool) – Should we append the current predictions to the file if it exists?
 grid_objective (function) – The objective function that was used when doing the grid search.
Returns: The confusion matrix, the overall accuracy, the per-label PRFs, the model parameters, and the grid search objective function score.
Return type: 5-tuple
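As a rough illustration of the first three returned items, the following pure-Python sketch computes a confusion matrix, overall accuracy, and per-label precision, recall, and F1 from gold and predicted labels. The function `evaluate_labels` is hypothetical, and the exact layout of the values SKLL returns may differ.

```python
def evaluate_labels(gold, predicted):
    """Return (confusion matrix, accuracy, per-label (P, R, F1))."""
    labels = sorted(set(gold) | set(predicted))
    index = {label: i for i, label in enumerate(labels)}
    # Rows are gold labels, columns are predicted labels.
    matrix = [[0] * len(labels) for _ in labels]
    for g, p in zip(gold, predicted):
        matrix[index[g]][index[p]] += 1
    accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(gold)
    prf = {}
    for label, i in index.items():
        true_pos = matrix[i][i]
        predicted_total = sum(row[i] for row in matrix)
        gold_total = sum(matrix[i])
        precision = true_pos / predicted_total if predicted_total else 0.0
        recall = true_pos / gold_total if gold_total else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        prf[label] = (precision, recall, f1)
    return matrix, accuracy, prf


gold = ["cat", "cat", "dog", "dog"]
predicted = ["cat", "dog", "dog", "dog"]
matrix, accuracy, prf = evaluate_labels(gold, predicted)
```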

classmethod from_file(learner_path)
Returns: New instance of Learner from the pickle at the specified path.

learning_curve(examples, cv_folds=10, train_sizes=array([ 0.1, 0.325, 0.55, 0.775, 1. ]), objective=u'f1_score_micro')
Generates learning curves for a given model on the training examples via cross-validation. Adapted from the scikit-learn code for learning curve generation (cf. `sklearn.model_selection.learning_curve`).
Parameters:  examples (skll.data.FeatureSet) – The data to generate the learning curve on.
 cv_folds (int) – The number of folds to use for cross-validation with each training size.
 train_sizes (list of float or int) – Relative or absolute numbers of training examples that will be used to generate the learning curve. If the type is float, it is regarded as a fraction of the maximum size of the training set (which is determined by the selected validation method), i.e., it has to be within (0, 1]. Otherwise it is interpreted as absolute sizes of the training sets. Note that for classification the number of samples usually has to be big enough to contain at least one sample from each class. (default: np.linspace(0.1, 1.0, 5))
 objective (string) – The name of the objective function to use when computing the train and test scores for the learning curve. (default: ‘f1_score_micro’)
Returns: The scores on the training sets, the scores on the test set, and the numbers of training examples used to generate the curve.
Return type: (list of float, list of float, list of int)
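The float-versus-int rule for train_sizes can be sketched as follows. The helper `resolve_train_sizes` and its `max_train_examples` argument are hypothetical; truncating the fractional result toward zero is an assumption made for this example.

```python
def resolve_train_sizes(train_sizes, max_train_examples):
    """Turn relative (float) or absolute (int) sizes into example counts."""
    resolved = []
    for size in train_sizes:
        if isinstance(size, float):
            # Floats are fractions of the largest possible training set.
            if not 0.0 < size <= 1.0:
                raise ValueError("fractions must lie within (0, 1]")
            resolved.append(int(size * max_train_examples))
        else:
            # Integers are absolute numbers of training examples.
            if not 0 < size <= max_train_examples:
                raise ValueError("absolute sizes must fit the training set")
            resolved.append(size)
    return resolved


relative = resolve_train_sizes([0.25, 0.5, 1.0], max_train_examples=80)
absolute = resolve_train_sizes([10, 30], max_train_examples=80)
```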

load(learner_path)
Replace the current learner instance with a saved learner.
Parameters: learner_path (str) – The path to the file to load.

model
The underlying scikit-learn model.

model_kwargs
A dictionary of the underlying scikit-learn model’s keyword arguments.

model_params
Model parameters (i.e., weights) for LinearModel (e.g., Ridge) regression and liblinear models.
Returns: Labeled weights and (labeled if more than one) intercept value(s).
Return type: tuple of (weights, intercepts), where weights is a dict and intercepts is a dict

model_type
The model type (i.e., the class).

predict(examples, prediction_prefix=None, append=False, class_labels=False)
Uses a given model to generate predictions on a given data set.
Parameters:  examples (FeatureSet) – The examples to predict the labels for.
 prediction_prefix (str) – If saving the predictions, this is the prefix that will be used for the filename. It will be followed by ".predictions".
 append (bool) – Should we append the current predictions to the file if it exists?
 class_labels (bool) – For a classifier, should we convert class indices to their (str) labels?
Returns: The predictions returned by the learner.
Return type: array
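Converting class indices back to string labels relies on a mapping between the two. The sketch below builds such a pair of dictionaries and applies the inverse one; `build_label_dicts` is a hypothetical helper, and assigning indices in sorted label order is an assumption for the example.

```python
def build_label_dicts(labels):
    """Map each distinct string label to an integer index, and back."""
    label_dict = {label: i for i, label in enumerate(sorted(set(labels)))}
    inverse_label_dict = {i: label for label, i in label_dict.items()}
    return label_dict, inverse_label_dict


label_dict, inverse_label_dict = build_label_dicts(["spam", "ham", "spam"])

# Indices as a classifier might emit them, converted back to labels.
class_indices = [1, 0, 1]
class_labels = [inverse_label_dict[i] for i in class_indices]
```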

probability
Should learner return probabilities of all labels (instead of just label with highest probability)?

save(learner_path)
Save the learner to a file.
Parameters: learner_path (str) – The path to where you want to save the learner.

train(examples, param_grid=None, grid_search_folds=3, grid_search=True, grid_objective=u'f1_score_micro', grid_jobs=None, shuffle=False, create_label_dict=True)
Train a classification model and return the model, score, feature vectorizer, scaler, label dictionary, and inverse label dictionary.
Parameters:  examples (FeatureSet) – The examples to train the model on.
 param_grid (list of dicts mapping from strs to lists of parameter values) – The parameter grid to search through for grid search. If unspecified, a default parameter grid will be used.
 grid_search_folds (int or dict) – The number of folds to use when doing the grid search, or a mapping from example IDs to folds.
 grid_search (bool) – Should we do grid search?
 grid_objective (function) – The objective function to use when doing the grid search.
 grid_jobs (int) – The number of jobs to run in parallel when doing the grid search. If unspecified or 0, the number of grid search folds will be used.
 shuffle (bool) – Shuffle examples (e.g., for grid search CV).
 create_label_dict (bool) – Should we create the label dictionary? This dictionary is used to map between string labels and their corresponding numerical values. This should only be done once per experiment, so when cross_validate calls train, create_label_dict gets set to False.
Returns: The best grid search objective function score, or 0 if we’re not doing grid search.
Return type: float
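The param_grid format (a list of dicts mapping parameter names to lists of values) describes a union of grids, each expanded into every combination of its values. A minimal sketch of that expansion follows; `expand_param_grid` is a hypothetical helper and the parameter names `C` and `penalty` are only examples.

```python
from itertools import product


def expand_param_grid(param_grid):
    """List every parameter combination described by a list of grids."""
    candidates = []
    for grid in param_grid:
        names = sorted(grid)
        # Cartesian product of the value lists within one grid.
        for values in product(*(grid[name] for name in names)):
            candidates.append(dict(zip(names, values)))
    return candidates


param_grid = [{"C": [0.1, 1.0], "penalty": ["l1", "l2"]}, {"C": [10.0]}]
candidates = expand_param_grid(param_grid)
```

Grid search then scores each candidate with the grid objective and keeps the best one, which is why larger grids lengthen training.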

class skll.learner.SelectByMinCount(min_count=1)
Bases: sklearn.feature_selection.univariate_selection.SelectKBest
Select features occurring in more than (and/or fewer than) a specified number of examples in the training data (or a CV training fold).
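The minimum-count criterion can be illustrated on a small dense matrix. This is a sketch under assumptions: `select_by_min_count` is a hypothetical function, it treats the threshold as inclusive (a feature is kept when its nonzero count is at least min_count), and real feature matrices would typically be sparse.

```python
def select_by_min_count(rows, min_count=1):
    """Keep columns that are nonzero in at least `min_count` rows."""
    n_features = len(rows[0])
    counts = [sum(1 for row in rows if row[j] != 0)
              for j in range(n_features)]
    kept = [j for j, count in enumerate(counts) if count >= min_count]
    reduced = [[row[j] for j in kept] for row in rows]
    return kept, reduced


rows = [
    [1, 0, 3],
    [0, 0, 2],
    [4, 1, 0],
]
kept, reduced = select_by_min_count(rows, min_count=2)
```

The middle feature is nonzero in only one example, so it is dropped when min_count is 2.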

skll.learner.rescaled(cls)
Decorator to create regressors that store a min and a max for the training data and make sure that predictions fall within that range. It also stores the means and SDs of the gold standard and the predictions on the training set to rescale the predictions (e.g., as in e-rater).
Parameters: cls (BaseEstimator) – A regressor to add rescaling to.
Returns: Modified version of class with rescaled functions added.
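The rescaling described above can be sketched as a small standalone class rather than a decorator. Everything here is an assumption made for illustration: the class name `PredictionRescaler`, its attribute names, and the use of population standard deviations are not taken from SKLL. It also assumes the training predictions are not all identical, so their standard deviation is nonzero.

```python
from statistics import mean, pstdev


class PredictionRescaler:
    """Match predictions to the gold mean/SD, then clip to the gold range."""

    def fit(self, gold, train_predictions):
        self.y_min, self.y_max = min(gold), max(gold)
        self.y_mean, self.y_sd = mean(gold), pstdev(gold)
        self.yhat_mean = mean(train_predictions)
        self.yhat_sd = pstdev(train_predictions)
        return self

    def transform(self, predictions):
        rescaled_values = []
        for prediction in predictions:
            # Standardize against training predictions, then map onto
            # the distribution of the gold-standard scores.
            z = (prediction - self.yhat_mean) / self.yhat_sd
            value = z * self.y_sd + self.y_mean
            # Keep the result within the range seen in training.
            rescaled_values.append(min(max(value, self.y_min), self.y_max))
        return rescaled_values


rescaler = PredictionRescaler().fit(
    gold=[1.0, 2.0, 3.0, 4.0],
    train_predictions=[2.0, 2.5, 3.0, 3.5],
)
in_range = rescaler.transform([2.75, 2.0])
clipped = rescaler.transform([10.0, -10.0])
```

A prediction equal to the mean training prediction maps to the gold mean, and extreme predictions are clipped to the minimum or maximum gold score.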