Contributing

Thank you for your interest in contributing to SKLL! We welcome any and all contributions.

Guidelines

The SKLL contribution guidelines can be found in our GitHub repository here. We strongly encourage all SKLL contributions to follow these guidelines.

SKLL Code Overview

This section will help you get oriented with the SKLL codebase by describing how it is organized, the various SKLL entry points into the code, and what the general code flow looks like for each entry point.

Organization

The main Python code for the SKLL package lives inside the skll sub-directory of the repository. It contains the following files and sub-directories:

config/ : Code to parse SKLL experiment configuration files.
experiments/ : Code that is related to creating and running SKLL experiments. It also contains code that collects the various evaluation metrics and predictions for each SKLL experiment and writes them out to disk.
learner/ : Code for the Learner and VotingLearner classes. The former is instantiated for all learner names specified in the experiment configuration file except VotingClassifier and VotingRegressor, for which the latter is instantiated instead.
metrics.py : Code for any custom metrics that are not in sklearn.metrics, e.g., kappa, kendall_tau, and spearman. This module also contains the code that powers user-defined custom metrics.
data/
- __init__.py : Code used to initialize the skll.data Python package.
- featureset.py : Code for the FeatureSet class, which holds the feature names, values, and metadata for a given set of instances.
- readers.py : Code for classes that can read various file formats and create FeatureSet objects from them.
- writers.py : Code for classes that can write FeatureSet objects to files on disk in various formats.
- dict_vectorizer.py : Code for a DictVectorizer class that subclasses sklearn.feature_extraction.DictVectorizer to add an __eq__() method that we need for vectorizer equality.
utils/ : Code for different utility scripts, functions, and classes used throughout SKLL. The most important ones are the command line scripts in the utils.commandline submodule.
- compute_eval_from_predictions.py : See documentation.
- filter_features.py : See documentation.
- generate_predictions.py : See documentation.
- join_features.py : See documentation.
- plot_learning_curves.py : See documentation.
- print_model_weights.py : See documentation.
- run_experiment.py : See documentation.
- skll_convert.py : See documentation.
- summarize_results.py : See documentation.
version.py : Code to define the SKLL version. Only changed for new releases.
tests/
- test_*.py : These files contain the code for the unit tests and regression tests.
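
As a rough illustration of the kind of function that metrics.py holds, the sketch below computes an unweighted Cohen's kappa in plain Python. It is not SKLL's implementation, which may differ in options and edge-case handling. The name simple_kappa is hypothetical, and the (y_true, y_pred) argument order simply follows the convention used by sklearn.metrics.

```python
from collections import Counter


def simple_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa for two equal-length label sequences.

    Illustrative only: a hypothetical helper, not part of SKLL.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    if n == 0:
        raise ValueError("cannot compute kappa on empty input")

    # Observed agreement: fraction of positions where the labels match.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n

    # Chance agreement, estimated from the two marginal label distributions.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(
        true_counts[label] * pred_counts[label] for label in true_counts
    ) / (n * n)

    # Degenerate case: both sequences consist of one and the same label.
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A value of 1.0 means perfect agreement and 0.0 means agreement no better than chance. Any new metric of this shape would be expected to come with matching tests in tests/.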

Entry Points & Workflow

There are three main entry points into the SKLL codebase:

Experiment configuration files. The primary way to interact with SKLL is by writing configuration files and then passing them to the run_experiment script. When you run the command run_experiment <config_file>, the following happens (at a high level):
- the configuration file is handed off to the run_configuration() function in the experiments package.
- a SKLLConfigParser object from the config package is instantiated, and it parses all of the relevant fields out of the given configuration file.
- the configuration fields are then passed to the _classify_featureset() function in the experiments package, which instantiates the learners (using code from the learner package) and the featuresets (using code from readers.py and featureset.py), runs the experiments, collects the results, and writes them out to disk.
SKLL API. Another way to interact with SKLL is via the SKLL API directly in your Python code rather than using configuration files. For example, you could use the Learner.from_file() or VotingLearner.from_file() methods to load saved models of those types from disk and make predictions on new data. The documentation for the SKLL API can be found here.
Utility scripts. The scripts listed in the section above under utils are also entry points into the SKLL code. These scripts are convenient wrappers that use the SKLL API for commonly used tasks, e.g., generating predictions on new data from an already trained model.
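
The first entry point above can be sketched with the standard library alone. The example below is a simplified illustration, not SKLL's actual code: it uses configparser directly instead of SKLLConfigParser, json.loads as a stand-in for SKLL's own value parsing, and the hypothetical helpers read_experiment_fields() and plan_jobs(). The section and field names are assumed to match a typical SKLL configuration file, and the values are made up.

```python
import configparser
import itertools
import json

# A minimal SKLL-style configuration, inlined for illustration.
# The experiment name, directories, and featureset names are made up.
EXAMPLE_CONFIG = """
[General]
experiment_name = example
task = cross_validate

[Input]
train_directory = train
featuresets = [["family", "misc"]]
learners = ["LogisticRegression", "SVC"]

[Output]
results = output
"""


def read_experiment_fields(config_text):
    """Parse the fields this sketch cares about (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return {
        "experiment_name": parser.get("General", "experiment_name"),
        "task": parser.get("General", "task"),
        # List-valued fields are decoded with json here as a stand-in.
        "featuresets": json.loads(parser.get("Input", "featuresets")),
        "learners": json.loads(parser.get("Input", "learners")),
        "results": parser.get("Output", "results"),
    }


def plan_jobs(fields):
    """Yield one job description per (featureset, learner) pair."""
    pairs = itertools.product(fields["featuresets"], fields["learners"])
    for featureset, learner in pairs:
        yield {
            "task": fields["task"],
            "featureset": featureset,
            "learner": learner,
        }


if __name__ == "__main__":
    for job in plan_jobs(read_experiment_fields(EXAMPLE_CONFIG)):
        print(job)
```

In this sketch, each yielded job corresponds loosely to one call that would instantiate a learner, load a featureset, and run the requested task. The real code paths handle many more fields and also write results to disk.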