Tutorial¶

Before doing anything below, you’ll want to install SKLL.

Workflow¶

In general, there are four steps to using SKLL:

1. Get some data in a SKLL-compatible format.
2. Create a small configuration file describing the machine learning experiment you would like to run.
3. Run that configuration file with run_experiment.
4. Examine the results.

Titanic Example¶

Let’s see how we can apply the basic workflow above to a simple example using the Titanic: Machine Learning from Disaster data from Kaggle.

Get your data into the correct format¶

The first step to getting the Titanic data is logging into Kaggle and downloading train.csv and test.csv. Once you have those files, you’ll also want to grab the examples folder on our GitHub page and put train.csv and test.csv in examples.

The provided script, make_titanic_example_data.py, will split the training and test data files from Kaggle up into groups of related features and store them in dev, test, train, and train+dev subdirectories. The development set that gets created by the script is 20% of the data that was in the original training set, and train contains the other 80%.
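To make the split concrete, here is a minimal sketch of the same idea: shuffle the rows, hold out 20% as a development set, and write one CSV per group of related features. This is not the actual make_titanic_example_data.py. The column-to-file mapping, the helper names split_rows and write_group, and the fixed random seed are all assumptions made for illustration.

```python
import csv
import io
import random

# Hypothetical mapping of feature-group file names to Kaggle column names;
# the grouping used by the real script may differ.
FEATURE_GROUPS = {
    "family.csv": ["SibSp", "Parch"],
    "vitals.csv": ["Sex", "Age"],
}


def split_rows(rows, dev_fraction=0.2, seed=0):
    """Shuffle rows and return (train, dev) with dev_fraction held out."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


def write_group(rows, columns, id_col="PassengerId", label_col="Survived"):
    """Return CSV text holding the ID, the label, and one group's columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[id_col, label_col] + columns,
                            extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Ten made-up passengers stand in for the rows of train.csv.
rows = [{"PassengerId": i, "Survived": i % 2, "SibSp": 0, "Parch": 1,
         "Sex": "female", "Age": 20 + i} for i in range(1, 11)]
train, dev = split_rows(rows)
train_files = {name: write_group(train, cols)
               for name, cols in FEATURE_GROUPS.items()}
```

In a real run you would write each string to a file of the same name under the train and dev directories, so that every feature file exists in both places.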

Create a configuration file for the experiment¶

For this tutorial, we will refer to an “experiment” as having a single data set split into training and testing portions. As part of each experiment, we can train and test several models, either simultaneously or sequentially, depending on whether we’re using GridMap or not. This will be described in more detail later on, when we are ready to run our experiment.

You can consult the full list of learners currently available in SKLL to get an idea for the things you can do. As part of this tutorial, we will use the following classifiers:

• Decision Tree
• Multinomial Naïve Bayes
• Random Forest
• Support Vector Machine

[General]
experiment_name = Titanic_Evaluate_Tuned
task = evaluate

[Input]
# this could also be an absolute path instead (and must be if you're not
# running things in local mode)
train_directory = train
test_directory = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
label_col = Survived
id_col = PassengerId

[Tuning]
grid_search = true
objectives = ['accuracy']

[Output]
# again, these can be absolute paths
metrics = ['roc_auc']
log = output
results = output
predictions = output
models = output

Let’s take a look at the options specified in titanic/evaluate_tuned.cfg. Here, we are only going to train a model and evaluate its performance on the development set, because in the General section, task is set to evaluate. We will explore the other options for task later.

In the Input section, we have specified relative paths to the training and testing directories via the train_directory and test_directory settings respectively. featuresets lists the names of the feature files, which must be the same in both the training and testing directories. learners must always be specified in between [ ] brackets, even if you only want to use one learner. This is similar to the featuresets option, which requires two sets of brackets, since multiple sets of different-yet-related features can be provided. We will keep our examples simple, however, and only use one set of features per experiment. The label_col and id_col settings name the columns in the CSV files that contain the class label and the instance ID for each example.
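The bracket rules are easier to see once the values are parsed. The sketch below reads the Input section with Python's standard configparser and converts the list-valued options with ast.literal_eval. That parsing approach is an assumption for illustration and is not necessarily how SKLL reads its configuration files.

```python
import ast
import configparser

CONFIG_TEXT = """
[Input]
train_directory = train
test_directory = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
label_col = Survived
id_col = PassengerId
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# List-valued options are stored as strings; literal_eval turns them into
# Python lists (an assumption for this sketch, not SKLL's own parser).
featuresets = ast.literal_eval(parser["Input"]["featuresets"])
learners = ast.literal_eval(parser["Input"]["learners"])

# featuresets is a list of lists: one inner list per set of feature files.
for featureset in featuresets:
    for learner in learners:
        print(learner, "on", "+".join(featureset))
```

Here featuresets holds a single inner list of four file names, which is why the option needs two sets of brackets even when there is only one feature set.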

The Tuning section defines how we want our model to be tuned. Setting grid_search to true here employs scikit-learn’s GridSearchCV class, which is an implementation of the standard, brute-force approach to hyperparameter optimization.
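If "brute-force" is unfamiliar, the idea is simply to try every combination of candidate hyperparameter values and keep the best-scoring one. The sketch below shows that loop in plain Python. It is not GridSearchCV itself, which additionally scores each combination with cross-validation; the grid_search helper, the parameter grid, and the toy scoring function are all hypothetical.

```python
from itertools import product


def grid_search(param_grid, score):
    """Score every combination in param_grid and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(**params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Toy stand-in for "train a model and measure it on held-out data";
# by construction it peaks at C=10, gamma=0.1.
def toy_score(C, gamma):
    return 1.0 - abs(C - 10) / 100 - abs(gamma - 0.1)


grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
best_params, best_score = grid_search(grid, toy_score)
```

The cost grows with the product of the number of values per parameter (nine evaluations here), which is why grid search is called brute-force.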

objectives refers to the desired objective functions; here, accuracy will optimize for overall accuracy. You can see a list of all the available objective functions here.
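As a reminder of what the accuracy objective measures, here is a minimal sketch that computes the fraction of predictions matching the true labels. The accuracy function and the labels below are made up for illustration; in practice the metric is computed for you.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Made-up Survived labels and predictions for five passengers.
survived = [0, 1, 1, 0, 1]
predicted = [0, 1, 0, 0, 1]
score = accuracy(survived, predicted)  # 4 of the 5 predictions are correct
```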

In the Output section, we first use the metrics option to list the evaluation metrics we want to compute in addition to the tuning objective. The other options are directories where you’d like all of the relevant output from your experiment to go. results refers to the results of the experiment in both human-readable and JSON forms. log specifies where to put log files containing any status, warning, or error messages generated during model training and evaluation. predictions refers to where to store the individual predictions generated for the test set. models specifies the directory in which to serialize the trained models.

Running your configuration file through run_experiment¶

Getting your experiment running is the simplest part of using SKLL. You just need to type the following into a terminal:

$ run_experiment titanic/evaluate_tuned.cfg

Creating sparse files¶

skll_convert can also create sparse data files in .jsonlines, .libsvm, .megam, or .ndj formats. This is very useful for saving disk space and memory when you have a large data set with mostly zero-valued features.
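To see why a sparse format saves space, consider the sketch below, which writes one JSON object per line and leaves out zero-valued features. The "id", "y", and "x" keys reflect our understanding of the .jsonlines layout and should be treated as an assumption to check against the file format documentation; the to_jsonlines helper is hypothetical and is not part of skll_convert.

```python
import json


def to_jsonlines(examples):
    """Serialize examples one JSON object per line, dropping zero features."""
    lines = []
    for example_id, label, features in examples:
        nonzero = {name: value for name, value in features.items()
                   if value != 0}
        lines.append(json.dumps({"id": example_id, "y": label, "x": nonzero}))
    return "\n".join(lines)


# Two made-up examples; most of the second one's features are zero.
examples = [
    ("p1", 1, {"Age": 38.0, "SibSp": 1, "Parch": 0}),
    ("p2", 0, {"Age": 35.0, "SibSp": 0, "Parch": 0}),
]
sparse_text = to_jsonlines(examples)
```

Only the non-zero features are stored for each example, so the savings grow with the proportion of zeros in the data.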

Training and testing directories¶

At minimum you will probably want to work with a training set and a testing set. If you have multiple feature files that you would like SKLL to join together for you automatically, you will need to create feature files with the exact same names and store them in training and testing directories. You can specify these directories in your config file using train_directory and test_directory. The list of files is specified using the featuresets setting.
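Conceptually, joining feature files means merging the columns from each file for every instance ID. The sketch below does this for two small in-memory CSVs. It illustrates the idea rather than SKLL's own joining code: the helper names are hypothetical, and keeping only IDs present in every file is an assumption of this sketch.

```python
import csv
import io


def read_features(csv_text, id_col="PassengerId"):
    """Map each instance ID to that file's feature columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_col]: {k: v for k, v in row.items() if k != id_col}
            for row in reader}


def join_feature_files(csv_texts, id_col="PassengerId"):
    """Merge per-file features for IDs that appear in every file."""
    tables = [read_features(text, id_col) for text in csv_texts]
    shared_ids = set(tables[0]).intersection(*tables[1:])
    merged = {}
    for instance_id in sorted(shared_ids):
        merged[instance_id] = {}
        for table in tables:
            merged[instance_id].update(table[instance_id])
    return merged


# Made-up contents for two feature files that share the same IDs.
family = "PassengerId,SibSp,Parch\n1,1,0\n2,0,0\n"
vitals = "PassengerId,Sex,Age\n1,male,22\n2,female,38\n"
joined = join_feature_files([family, vitals])
```

Note that csv.DictReader returns every value as a string, so a real pipeline would still need to convert numeric columns.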

Single-file training and testing sets¶

If you’re conducting a simpler experiment, where you have a single training file with all of your features and a similar single testing file, you should use the train_file and test_file settings in your config file.

If you would like to split an existing file up into a training set and a testing set, you can employ the filter_features tool to select instances you would like to include in each file.

Creating a configuration file¶

Now that you’ve seen a basic configuration file, you should look at the extensive options available in our config file reference.

Running your configuration file through run_experiment¶

There are a few meta-options for experiments that are specified directly to the run_experiment command rather than in a configuration file. For example, if you would like to run an ablation experiment, which conducts repeated experiments using different combinations of the features in your config, you should use the run_experiment --ablation option. A complete list of options is available here.
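To give a feel for what an ablation experiment iterates over, the sketch below enumerates the feature sets you get by holding out one feature file at a time. It only illustrates the combinatorics: the ablated_featuresets helper is hypothetical, and exactly which combinations run_experiment evaluates is documented with the command-line options, not here.

```python
from itertools import combinations


def ablated_featuresets(feature_files, num_held_out=1):
    """Yield each feature set obtained by holding out num_held_out files."""
    keep = len(feature_files) - num_held_out
    for kept in combinations(feature_files, keep):
        yield list(kept)


files = ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]
variants = list(ablated_featuresets(files))
for variant in variants:
    print(variant)
```

With four feature files and one held out at a time, this yields four reduced feature sets, each of which would be trained and evaluated separately.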