# Running Experiments

The simplest way to use SKLL is to create configuration files that describe experiments you would like to run on pre-generated features. This document describes the supported feature file formats, how to create configuration files (and lay out your directories), and how to use run_experiment to get things going.

## Quick Example

If you don’t want to read the whole document, and just want an example of how things work, do the following from the command prompt:

$ cd examples
$ python make_example_iris_data.py          # download a simple dataset
$ cd iris
$ run_experiment --local evaluate.cfg        # run an experiment


## Feature file formats

The following feature file formats are supported:

### arff

The same file format used by Weka with the following added restrictions:

• Only simple numeric, string, and nominal values are supported.
• Nominal values are converted to strings.
• If the data has instance IDs, there should be an attribute with the name specified by id_col in the Input section of the configuration file you create for your experiment. This defaults to id. If there is no such attribute, IDs will be generated automatically.
• If the data is labelled, there must be an attribute with the name specified by label_col in the Input section of the configuration file you create for your experiment. This defaults to y. This must also be the final attribute listed (like in Weka).

### csv/tsv

A simple comma or tab-delimited format with the following restrictions:

• If the data is labelled, there must be a column with the name specified by label_col in the Input section of the configuration file you create for your experiment. This defaults to y.
• If the data has instance IDs, there should be a column with the name specified by id_col in the Input section of the configuration file you create for your experiment. This defaults to id. If there is no such column, IDs will be generated automatically.
• All other columns contain feature values, and every feature value must be specified (making this a poor choice for sparse data).
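
As a rough illustration of the layout described above, the following sketch writes a small dense CSV using the default id and y column names. The feature names (f1, f2), IDs, and labels are made up for the example.

```python
import csv
import io

# Hypothetical examples: every feature value is given explicitly,
# because the csv/tsv format has no way to leave a feature out.
rows = [
    {"id": "EXAMPLE_1", "f1": 1.5, "f2": 0.0, "y": "cat"},
    {"id": "EXAMPLE_2", "f1": 0.2, "f2": 3.0, "y": "dog"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "f1", "f2", "y"])
writer.writeheader()        # header row names the id, feature, and label columns
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

If your ID or label columns have other names, set id_col and label_col in the Input section accordingly.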

### libsvm

While we can process the standard input file format supported by LibSVM, LibLinear, and SVMLight, we also support specifying extra metadata usually missing from the format in comments at the end of each line. The comments are not mandatory, but without them, your labels and features will not have names. The comment is structured as follows:

ID | 1=ClassX | 1=FeatureA 2=FeatureB


The entire format would look like this:

2 1:2.0 3:8.1 # Example1 | 2=ClassY | 1=FeatureA 3=FeatureC
1 5:7.0 6:19.1 # Example2 | 1=ClassX | 5=FeatureE 6=FeatureF


Note

IDs, labels, and feature names cannot contain the following characters: | # =
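
To make the comment structure concrete, here is a minimal sketch of how one such line could be split into an ID, a named label, and named features. The function parse_libsvm_line is hypothetical and is not SKLL's own reader; it assumes integer class labels and well-formed input.

```python
def parse_libsvm_line(line):
    """Split one line of the commented LibSVM format into its parts."""
    data, _, comment = line.partition("#")
    fields = data.split()
    label = fields[0]
    # index:value pairs, keyed by the integer feature index
    values = {int(i): float(v) for i, v in (f.split(":") for f in fields[1:])}

    example_id, label_map, feature_map = None, {}, {}
    if comment.strip():
        # comment layout: ID | label mapping | feature mapping
        id_part, label_part, feature_part = (p.strip() for p in comment.split("|"))
        example_id = id_part
        label_map = dict(pair.split("=") for pair in label_part.split())
        feature_map = {
            int(k): v for k, v in (pair.split("=") for pair in feature_part.split())
        }

    return {
        "id": example_id,
        # fall back to the raw number when no name is given
        "label": label_map.get(label, label),
        "features": {feature_map.get(i, str(i)): v for i, v in values.items()},
    }


parsed = parse_libsvm_line(
    "2 1:2.0 3:8.1 # Example1 | 2=ClassY | 1=FeatureA 3=FeatureC"
)
print(parsed)
```

Because | # and = act as separators here, a name containing any of them would break the split, which is why those characters are not allowed.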

### megam

An expanded form of the input format for the MegaM classification package with the -fvals switch.

The basic format is:

# Instance1
CLASS1    F0 2.5 F1 3 FEATURE_2 -152000
# Instance2
CLASS2    F1 7.524


where the optional comments before each instance specify the ID for the following line, class names are separated from feature-value pairs with a tab, and feature-value pairs are separated by spaces. Any omitted features for a given instance are assumed to be zero, so this format is handy when dealing with sparse data. We also include several utility scripts for converting to/from this MegaM format and for adding/removing features from the files.
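
The rules above can be sketched as a small reader. parse_megam is a hypothetical helper written for this illustration, not one of the bundled utility scripts, and it assumes every feature name is followed by a numeric value.

```python
def parse_megam(text):
    """Read the expanded MegaM format into a list of example dicts."""
    examples, current_id = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # a comment line carries the ID for the next instance
            current_id = line[1:].strip()
            continue
        # the class name is separated from the features by a tab
        label, _, rest = line.partition("\t")
        tokens = rest.split()
        # tokens alternate: name value name value ...
        features = {
            name: float(value) for name, value in zip(tokens[0::2], tokens[1::2])
        }
        examples.append({"id": current_id, "label": label, "features": features})
        current_id = None
    return examples


sample = (
    "# Instance1\n"
    "CLASS1\tF0 2.5 F1 3 FEATURE_2 -152000\n"
    "# Instance2\n"
    "CLASS2\tF1 7.524\n"
)
examples = parse_megam(sample)

# Omitted features are treated as zero when looked up.
print(examples[1]["features"].get("F0", 0.0))
```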

## Creating configuration files

The experiment configuration files that run_experiment accepts are standard Python configuration files that are similar in format to Windows INI files. [1] There are four expected sections in a configuration file: General, Input, Tuning, and Output. A detailed description of each possible setting for each section is provided below, but to summarize:

• If you want to do cross-validation, specify a path to training feature files, and set task to cross_validate. Please note that the cross-validation currently uses StratifiedKFold. You also can optionally use predetermined folds with the cv_folds_file setting.
• If you want to train a model and evaluate it on some data, specify a training location, a test location, and a directory to store results, and set task to evaluate.
• If you want to just train a model and generate predictions, specify a training location, a test location, and set task to predict.
• If you want to just train a model, specify a training location, and set task to train.
• If you want to generate a learning curve for your data, specify a training location and set task to learning_curve. The learning curve is generated using essentially the same underlying process as in scikit-learn except that the SKLL feature pre-processing pipeline is used while training the various models and computing the scores.

Note

Ideally, one would first do cross-validation experiments with grid search and/or ablation and get a well-performing set of features and hyper-parameters for a set of learners. Then, one would explicitly specify those features (via featuresets) and hyper-parameters (via fixed_parameters) in the config file for the learning curve and explore the impact of the size of the training data.

Example configuration files are available here.
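
For orientation, the sketch below shows what an evaluate configuration with the four sections might look like, and checks that it parses as a standard Python configuration file. The experiment name, directory names, and feature file prefix are placeholders, and the exact value syntax SKLL accepts for each field should be checked against the field descriptions below.

```python
import configparser

# Hypothetical configuration; only the section and field names come from
# this document, all values are placeholders.
CONFIG_TEXT = """
[General]
experiment_name = Example_Evaluate
task = evaluate

[Input]
train_directory = train
test_directory = test
featuresets = [["example_features"]]
learners = ["SVC", "LogisticRegression"]
suffix = .jsonlines

[Tuning]
grid_search = True
objectives = ['f1_score_micro']

[Output]
results = output
log = output
predictions = output
"""

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

for section in config.sections():
    print(section, dict(config[section]))
```

Saved to a file, a configuration like this would be passed to run_experiment as its CONFIGFILE argument.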

### General

Both fields in the General section are required.

#### experiment_name

A string used to identify this particular experiment configuration. When generating result summary files, this name helps prevent overwriting previous summaries.

#### task

The type of experiment to run. Valid options are: cross_validate, evaluate, predict, train, learning_curve.

### Input

The Input section has only one required field, learners, but also must contain either train_file or train_directory.

#### learners

List of scikit-learn models to try using. A separate job will be run for each combination of classifier and feature-set. Acceptable values are described below. Custom learners can also be specified. See custom_learner_path.

Classifiers:

Regressors:

For all regressors you can also prepend Rescaled to the beginning of the full name (e.g., RescaledSVR) to get a version of the regressor where predictions are rescaled and constrained to better match the training set.

#### train_file (Optional)

Path to a file containing the features to train on. Cannot be used in combination with featuresets, train_directory, or test_directory.

Note

If train_file is not specified, train_directory must be.

#### train_directory (Optional)

Path to directory containing training data files. There must be a file for each featureset. Cannot be used in combination with train_file or test_file.

Note

If train_directory is not specified, train_file must be.

#### test_file (Optional)

Path to a file containing the features to test on. Cannot be used in combination with featuresets, train_directory, or test_directory.

#### test_directory (Optional)

Path to directory containing test data files. There must be a file for each featureset. Cannot be used in combination with train_file or test_file.

#### featuresets (Optional)

List of lists of prefixes for the files containing the features you would like to train/test on. Each list will end up being a job. IDs are required to be the same in all of the feature files, and a ValueError will be raised if this is not the case. Cannot be used in combination with train_file or test_file.

Note

If specifying train_directory or test_directory, featuresets is required.

#### suffix (Optional)

The file format the training/test files are in. Valid options are .arff, .csv, .jsonlines, .libsvm, .megam, .ndj, and .tsv.

If you omit this field, it is assumed that the “prefixes” listed in featuresets are actually complete filenames. This can be useful if you have feature files that are all in different formats that you would like to combine.

#### id_col (Optional)

If you’re using ARFF, CSV, or TSV files, the IDs for each instance are assumed to be in a column with this name. If no column with this name is found, the IDs are generated automatically. Defaults to id.

#### label_col (Optional)

If you’re using ARFF, CSV, or TSV files, the class labels for each instance are assumed to be in a column with this name. If no column with this name is found, the data is assumed to be unlabelled. Defaults to y. For ARFF files only, this must also be the final column to count as the label (for compatibility with Weka).

#### ids_to_floats (Optional)

If you have a dataset with lots of examples, and your input files have IDs that look like numbers (can be converted by float()), then setting this to True will save you some memory by storing IDs as floats. Note that this will cause IDs to be printed as floats in prediction files (e.g., 4.0 instead of 4 or 0004 or 4.000).

#### shuffle (Optional)

If True, shuffle the examples in the training data before using them for learning. This happens automatically when doing a grid search but it might be useful in other scenarios as well, e.g., online learning. Defaults to False.

#### class_map (Optional)

If you would like to collapse several labels into one, or otherwise modify your labels (without modifying your original feature files), you can specify a dictionary mapping from new class labels to lists of original class labels. For example, if you wanted to collapse the labels beagle and dachshund into a dog class, you would specify the following for class_map:

{'dog': ['beagle', 'dachshund']}


Any labels not included in the dictionary will be left untouched.
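
The effect of such a mapping can be sketched in a few lines. This is only an illustration of the new-label-to-old-labels direction and the pass-through rule, not SKLL's internal code; the cat label is added here to show an unmapped label.

```python
# Mapping shaped like the class_map setting: new label -> original labels.
class_map = {'dog': ['beagle', 'dachshund']}

# Invert it so each original label points at its replacement.
old_to_new = {old: new for new, olds in class_map.items() for old in olds}

labels = ['beagle', 'cat', 'dachshund']
# Labels missing from the mapping are left untouched.
mapped = [old_to_new.get(label, label) for label in labels]
print(mapped)  # ['dog', 'cat', 'dog']
```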

#### num_cv_folds (Optional)

The number of folds to use for cross validation. Defaults to 10.

#### random_folds (Optional)

Whether to use random folds for cross-validation. Defaults to False.

#### cv_folds_file (Optional)

Path to a csv file specifying folds for cross-validation. The first row must be a header. This header row is ignored, so it doesn’t matter what the header row contains, but it must be there. If there is no header row, whatever row is in its place will be ignored. The first column should consist of training set IDs and the second should be a string for the fold ID (e.g., 1 through 5, A through D, etc.). If specified, the CV and grid search will leave one fold ID out at a time. [2]
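
A folds file of this shape could be produced as in the sketch below. The example IDs, fold names, and header text are invented; the header contents are arbitrary because that row is skipped.

```python
import csv
import io

# Hypothetical assignment of training-set IDs to fold IDs.
folds = {
    "EXAMPLE_1": "A",
    "EXAMPLE_2": "B",
    "EXAMPLE_3": "C",
    "EXAMPLE_4": "A",
    "EXAMPLE_5": "B",
    "EXAMPLE_6": "C",
}

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "fold"])  # header row: must be present, contents ignored
for example_id, fold_id in folds.items():
    writer.writerow([example_id, fold_id])
folds_csv = buffer.getvalue()
print(folds_csv)
```

The sketch uses three distinct fold IDs, in line with the minimum noted in footnote [2].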

#### learning_curve_cv_folds_list (Optional)

List of integers specifying the number of folds to use for cross-validation at each point of the learning curve (training size), one per learner. For example, if you specify the following learners: ["SVC", "LogisticRegression"], specifying [10, 100] as the value of learning_curve_cv_folds_list will tell SKLL to use 10 cross-validation folds at each point of the SVC curve and 100 cross-validation folds at each point of the logistic regression curve. Although more folds will generally yield more reliable results, a smaller number of folds may be better for learners that are slow to train. Defaults to 10 for each learner.

#### learning_curve_train_sizes (Optional)

List of floats or integers representing relative or absolute numbers of training examples, respectively, that will be used to generate the learning curve. If the type is float, it is regarded as a fraction of the maximum size of the training set (which is determined by the selected validation method), i.e., it has to be within (0, 1]. Otherwise, it is interpreted as the absolute size of the training set. Note that for classification, the number of samples usually has to be big enough to contain at least one sample from each class. Defaults to [0.1, 0.325, 0.55, 0.775, 1.0].
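
The float-versus-integer distinction can be sketched as follows. resolve_train_sizes is a hypothetical helper, and truncating fractional results to whole examples is an assumption made for this illustration rather than a statement about how SKLL or scikit-learn rounds.

```python
def resolve_train_sizes(train_sizes, max_train_size):
    """Turn a mixed list of fractions and counts into absolute training sizes."""
    resolved = []
    for size in train_sizes:
        if isinstance(size, float):
            # floats are fractions of the largest available training set
            if not 0.0 < size <= 1.0:
                raise ValueError(f"fraction {size} is outside (0, 1]")
            # Assumption: fractional sizes are truncated to whole examples.
            resolved.append(int(size * max_train_size))
        else:
            # integers are already absolute sizes
            resolved.append(int(size))
    return resolved


print(resolve_train_sizes([0.5, 1.0], 200))      # [100, 200]
print(resolve_train_sizes([10, 50, 150], 200))   # [10, 50, 150]
```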

#### custom_learner_path (Optional)

Path to a .py file that defines a custom learner. This file will be imported dynamically. This is only required if a custom learner is specified in the list of learners.

All custom learners must implement the fit and predict methods. Custom classifiers must either (a) inherit from an existing scikit-learn classifier, or (b) inherit from both sklearn.base.BaseEstimator and sklearn.base.ClassifierMixin.

Similarly, custom regressors must either (a) inherit from an existing scikit-learn regressor, or (b) inherit from both sklearn.base.BaseEstimator and sklearn.base.RegressorMixin.

Learners that require dense matrices should implement a method requires_dense that returns True.
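
A minimal sketch of a custom classifier following option (b) is shown below. The class name MajorityClassLearner and its behavior are hypothetical and chosen only to keep the example short; it requires NumPy and scikit-learn, and SKLL may expect more of a custom learner than this sketch covers.

```python
from collections import Counter

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class MajorityClassLearner(BaseEstimator, ClassifierMixin):
    """Hypothetical learner that always predicts the most frequent training label."""

    def fit(self, X, y):
        # remember the label seen most often during training
        self.classes_ = np.unique(y)
        self.majority_ = Counter(list(y)).most_common(1)[0][0]
        return self

    def predict(self, X):
        # one prediction per row; works for dense arrays and sparse matrices
        return np.array([self.majority_] * X.shape[0])

    def requires_dense(self):
        # this learner never looks at feature values, so sparse input is fine
        return False


learner = MajorityClassLearner().fit(np.zeros((3, 2)), ["a", "b", "a"])
print(learner.predict(np.zeros((2, 2))))
```

A file containing such a class would be referenced through custom_learner_path, with the class name listed in learners.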

#### sampler (Optional)

Performs a non-linear transformation of the input, which can serve as a basis for linear classification or other algorithms. Valid options are: Nystroem, RBFSampler, SkewedChi2Sampler, and AdditiveChi2Sampler. For additional information see the scikit-learn documentation.

#### sampler_parameters (Optional)

dict containing parameters you want to have fixed for the sampler. Any empty ones will be ignored (and the defaults will be used).

The default fixed parameters (beyond those that scikit-learn sets) are:

Nystroem
{'random_state': 123456789}

RBFSampler
{'random_state': 123456789}

SkewedChi2Sampler
{'random_state': 123456789}


#### feature_hasher (Optional)

If “true”, this enables a high-speed, low-memory vectorizer that uses feature hashing for converting feature dictionaries into NumPy arrays instead of using a DictVectorizer. This flag will drastically reduce memory consumption for data sets with a large number of features. If enabled, the user should also specify the number of features in the hasher_features field. For additional information see the scikit-learn documentation.

#### hasher_features (Optional)

The number of features used by the FeatureHasher if the feature_hasher flag is enabled.

Note

To avoid collisions, you should always use the next power of two larger than the number of features in the data set for this setting. For example, if you had 17 features, you would want to set hasher_features to 32.
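
One way to compute that value is sketched below. The helper name hasher_size is made up, and it returns the smallest power of two strictly greater than the feature count, which is one reading of the rule above.

```python
def hasher_size(num_features):
    """Smallest power of two strictly larger than num_features."""
    # bit_length() is the number of bits needed to write num_features,
    # so shifting 1 by that amount gives the next power of two above it.
    return 1 << num_features.bit_length()


print(hasher_size(17))  # 32
```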

#### featureset_names (Optional)

Optional list of names for the feature sets. If omitted, then the prefixes will be munged together to make names.

#### fixed_parameters (Optional)

List of dicts containing parameters you want to have fixed for each learner in the learners list. Any empty ones will be ignored (and the defaults will be used).

The default fixed parameters (beyond those that scikit-learn sets) are:

LogisticRegression
{'random_state': 123456789}

LinearSVC
{'random_state': 123456789}

SVC
{'cache_size': 1000}

DecisionTreeClassifier and DecisionTreeRegressor
{'random_state': 123456789}

RandomForestClassifier and RandomForestRegressor
{'n_estimators': 500, 'random_state': 123456789}

{'n_estimators': 500, 'random_state': 123456789}

SVR
{'cache_size': 1000, 'kernel': b'linear'}


Note

This option allows us to deal with imbalanced data sets by using the parameter class_weight for the classifiers: SVC, LogisticRegression, LinearSVC and SGDClassifier.

Two possible options are available. The first one is balanced, which automatically adjusts weights inversely proportional to class frequencies, as shown in the following code:

{'class_weight': 'balanced'}


The second option allows you to assign a specific weight to each class. The default weight per class is 1. For example:

{'class_weight': {1: 10}}


Additional examples and information can be seen here.
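
Since fixed_parameters holds one dict per entry in learners, the two lists line up by position. The sketch below illustrates that pairing with made-up choices; an empty dict leaves a learner at its defaults.

```python
# Hypothetical settings: one dict per learner, in the same order.
learners = ["SVC", "LogisticRegression", "LinearSVC"]
fixed_parameters = [
    {"class_weight": "balanced"},   # weights from class frequencies
    {},                             # empty: keep the defaults
    {"class_weight": {1: 10}},      # class 1 weighted 10, others stay at 1
]

per_learner = dict(zip(learners, fixed_parameters))
for name, params in per_learner.items():
    print(name, params if params else "(defaults)")
```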

#### feature_scaling (Optional)

Whether to scale features by their mean and/or their standard deviation. If you scale by mean, your data will automatically be converted to dense, so use caution when you have a very large dataset. Valid options are:

none
Perform no feature scaling at all.
with_std
Scale feature values by their standard deviation.
with_mean
Center features by subtracting their mean.
both
Perform both centering and scaling.

Defaults to none.
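
The arithmetic behind the four options can be sketched for a single feature column as follows. The scale function is hypothetical, and using the population standard deviation is an assumption of this sketch.

```python
import statistics


def scale(values, mode="none"):
    """Apply one of the feature_scaling options to a single feature column."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    center = mode in ("with_mean", "both")
    rescale = mode in ("with_std", "both") and std > 0
    scaled = []
    for value in values:
        if center:
            value = value - mean   # centering turns zeros into non-zeros
        if rescale:
            value = value / std
        scaled.append(value)
    return scaled


column = [1.0, 2.0, 3.0]
for mode in ("none", "with_std", "with_mean", "both"):
    print(mode, scale(column, mode))
```

Centering is what forces a dense representation: after the mean is subtracted, entries that were zero generally no longer are.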

### Tuning

#### grid_search (Optional)

Whether or not to perform grid search to find optimal parameters for the learner. Defaults to False. Note that for the learning_curve task, grid search is not allowed and setting it to True will generate a warning and be ignored.

#### grid_search_folds (Optional)

The number of folds to use for grid search. Defaults to 3.

#### grid_search_jobs (Optional)

Number of folds to run in parallel when using grid search. Defaults to number of grid search folds.

#### min_feature_count (Optional)

The minimum number of examples for which the value of a feature must be nonzero to be included in the model. Defaults to 1.
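
As an illustration of this threshold, the sketch below counts, for each feature, the examples in which it is nonzero and keeps the features that reach the minimum. The function frequent_features and the toy data are hypothetical.

```python
from collections import Counter


def frequent_features(examples, min_feature_count=1):
    """Return the features that are nonzero in at least min_feature_count examples."""
    counts = Counter()
    for features in examples:
        # count each feature once per example in which it is nonzero
        counts.update(name for name, value in features.items() if value != 0)
    return {name for name, count in counts.items() if count >= min_feature_count}


examples = [
    {"a": 1.0, "b": 2.0},
    {"a": 0.5, "c": 0.0},
    {"a": 3.0},
]
print(sorted(frequent_features(examples, min_feature_count=1)))  # ['a', 'b']
print(sorted(frequent_features(examples, min_feature_count=2)))  # ['a']
```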

#### objectives (Optional)

The objective functions to use for tuning. This is a list of one or more objective functions. Valid options are:

Classification:

• accuracy: Overall accuracy
• precision: Precision
• recall: Recall
• f1: The default scikit-learn F1 score (F1 of the positive class for binary classification, or the weighted average F1 for multiclass classification)
• f1_score_micro: Micro-averaged F1 score
• f1_score_macro: Macro-averaged F1 score
• f1_score_weighted: Weighted average F1 score
• f1_score_least_frequent: F1 score of the least frequent class. The least frequent class may vary from fold to fold for certain data distributions.
• average_precision: Area under PR curve (for binary classification)
• roc_auc: Area under ROC curve (for binary classification)

Regression or classification with integer labels:

• unweighted_kappa: Unweighted Cohen’s kappa (any floating point values are rounded to ints)
• linear_weighted_kappa: Linear weighted kappa (any floating point values are rounded to ints)
• quadratic_weighted_kappa: Quadratic weighted kappa (any floating point values are rounded to ints)
• uwk_off_by_one: Same as unweighted_kappa, but all ranking differences are discounted by one. In other words, a ranking of 1 and a ranking of 2 would be considered equal.
• lwk_off_by_one: Same as linear_weighted_kappa, but all ranking differences are discounted by one.
• qwk_off_by_one: Same as quadratic_weighted_kappa, but all ranking differences are discounted by one.

Regression or classification with binary labels:

Regression:

• r2: R2
• neg_mean_squared_error: The negative of the mean squared error regression loss. Since scikit-learn recommends using negated loss functions as scorer functions, SKLL does the same for the sake of consistency.

Defaults to ['f1_score_micro'].

Note: Using objective=x instead of objectives=['x'] is also acceptable, for backward-compatibility.

#### param_grids (Optional)

List of parameter grids to search for each learner. Each parameter grid should be a list of dictionaries mapping from strings to lists of parameter values. When you specify an empty list for a learner, the default parameter grid for that learner will be searched.

The default parameter grids for each learner are:

[{'learning_rate': [0.01, 0.1, 1.0, 10.0, 100.0]}]

DecisionTreeClassifier and DecisionTreeRegressor
[{'max_features': ["auto", None]}]

ElasticNet, Lasso, and Ridge
[{'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}]

[{'max_depth': [1, 3, 5]}]

KNeighborsClassifier and KNeighborsRegressor
[{'n_neighbors': [1, 5, 10, 100],
'weights': ['uniform', 'distance']}]

LinearSVC
[{'C': [0.01, 0.1, 1.0, 10.0, 100.0]}]

LogisticRegression
[{'C': [0.01, 0.1, 1.0, 10.0, 100.0]}]

MultinomialNB
[{'alpha': [0.1, 0.25, 0.5, 0.75, 1.0]}]

RandomForestClassifier and RandomForestRegressor
[{'max_depth': [1, 5, 10, None]}]

SGDClassifier and SGDRegressor
[{'alpha': [0.000001, 0.00001, 0.0001, 0.001, 0.01],
'penalty': ['l1', 'l2', 'elasticnet']}]

SVC
[{'C': [0.01, 0.1, 1.0, 10.0, 100.0],
'gamma': ['auto', 0.01, 0.1, 1.0, 10.0, 100.0]}]

SVR
[{'C': [0.01, 0.1, 1.0, 10.0, 100.0]}]


#### pos_label_str (Optional)

The string label for the positive class in the binary classification setting. If unspecified, an arbitrary class is picked.

### Output

#### probability (Optional)

Whether or not to output probabilities for each class instead of the most probable class for each instance. Only really makes a difference when storing predictions. Defaults to False.

#### results (Optional)

Directory to store result files in. If omitted, the current working directory is used.

#### log (Optional)

Directory to store log files in. If omitted, the current working directory is used.

#### models (Optional)

Directory to store trained models in. Can be omitted to not store models.

#### predictions (Optional)

Directory to store prediction files in. Can be omitted to not store predictions.

Note

You can use the same directory for results, log, models, and predictions.

## Using run_experiment

Once you have created the configuration file for your experiment, you can usually just get your experiment started by running run_experiment CONFIGFILE. That said, there are a few options that are specified via command-line arguments instead of in the configuration file:

-a <num_features>, --ablation <num_features>

Runs an ablation study where repeated experiments are conducted with the specified number of feature files in each featureset in the configuration file held out. For example, if you have three feature files (A, B, and C) in your featureset and you specify --ablation 1, there will be three experiments conducted with the following featuresets: [[A, B], [B, C], [A, C]]. Additionally, since every ablation experiment includes a run with all the features as a baseline, the following featureset will also be run: [[A, B, C]].

If you would like to try all possible combinations of feature files, you can use the run_experiment --ablation_all option instead.
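
The featuresets produced by --ablation can be enumerated with a short sketch. ablation_featuresets is a hypothetical helper, and the order in which it lists the featuresets is not meant to match the order in which run_experiment schedules them.

```python
from itertools import combinations


def ablation_featuresets(featureset, num_features):
    """Featuresets with num_features files held out, plus the full baseline."""
    keep = len(featureset) - num_features
    ablated = [list(combo) for combo in combinations(featureset, keep)]
    # every ablation study also includes a run with all feature files
    return ablated + [list(featureset)]


for fs in ablation_featuresets(["A", "B", "C"], 1):
    print(fs)
```

With more feature files or a larger held-out count, the number of jobs grows quickly, which is worth checking before submitting to a cluster.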

-A, --ablation_all

Runs an ablation study where repeated experiments are conducted with all combinations of feature files in each featureset.

Warning

This can create a huge number of jobs, so please use with caution.

-k, --keep-models

If trained models already exist for any of the learner/featureset combinations in your configuration file, just load those models and do not retrain/overwrite them.

-r, --resume

If result files already exist for an experiment, do not overwrite them. This is very useful when doing a large ablation experiment and part of it crashes.

-v, --verbose

Print more status information. For every additional time this flag is specified, output gets more verbose.

--version

Show program’s version number and exit.

### GridMap options

If you have GridMap installed, run_experiment will automatically schedule jobs on your DRMAA-compatible cluster. You can use the following options to customize this behavior.

-l, --local

Run jobs locally instead of using the cluster. [3]

-q <queue>, --queue <queue>

Use this queue for GridMap. (default: all.q)

-m <machines>, --machines <machines>

Comma-separated list of machines to add to GridMap’s whitelist. If not specified, all available machines are used.

Note

Full names must be specified (e.g., nlp.research.ets.org).

### Output files

For most of the tasks, the result, log, model, and prediction files generated by run_experiment will all share the automatically generated prefix EXPERIMENT_FEATURESET_LEARNER_OBJECTIVE, where the following definitions hold:

EXPERIMENT
The name specified as experiment_name in the configuration file.
FEATURESET
The feature set we’re training on joined with “+”.
LEARNER
The learner the current results/model/etc. was generated using.
OBJECTIVE
The objective function the current results/model/etc. was generated using.

However, if objectives contains only one objective function, the result, log, model, and prediction files will share the prefix EXPERIMENT_FEATURESET_LEARNER. For backward-compatibility, the same applies when a single objective is specified using objective=x.
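
The naming rule can be summarized in a small sketch. output_prefix is a hypothetical helper that follows the description above and is not taken from SKLL's code; the experiment and feature file names are placeholders.

```python
def output_prefix(experiment, featureset, learner, objective, num_objectives):
    """Build the shared file prefix for one learner/featureset/objective job."""
    parts = [experiment, "+".join(featureset), learner]
    # the objective is only part of the name when several were requested
    if num_objectives > 1:
        parts.append(objective)
    return "_".join(parts)


print(output_prefix("Example", ["lexical", "syntax"], "SVC", "accuracy", 2))
# Example_lexical+syntax_SVC_accuracy
print(output_prefix("Example", ["lexical", "syntax"], "SVC", "accuracy", 1))
# Example_lexical+syntax_SVC
```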

For every experiment you run, there will also be a result summary file generated that is a tab-delimited file summarizing the results for each learner-featureset combination you have in your configuration file. It is named EXPERIMENT_summary.tsv. For learning_curve experiments, this summary file will contain training set sizes and the averaged scores for all combinations of featuresets, learners, and objectives.

If pandas and seaborn are available when running a learning_curve experiment, actual learning curves are also generated as PNG files - one for each feature set specified in the configuration file. Each PNG file is named EXPERIMENT_FEATURESET.png and contains a faceted learning curve plot for the featureset with objective functions on rows and learners on columns. Here’s an example of such a plot.

Footnotes

 [1] We are considering adding support for YAML configuration files in the future, but we have not added this functionality yet.
 [2] K-1 folds will be used for grid search within CV, so there should be at least 3 fold IDs.
 [3] This will happen automatically if GridMap cannot be imported.