Contributing
Thank you for your interest in contributing to SKLL! We welcome any and all contributions.
Guidelines
The SKLL contribution guidelines can be found in our GitHub repository. We strongly encourage all SKLL contributions to follow these guidelines.
SKLL Code Overview
This section will help you get oriented with the SKLL codebase by describing how it is organized, the various SKLL entry points into the code, and what the general code flow looks like for each entry point.
Organization
The main Python code for the SKLL package lives inside the skll sub-directory of the repository. It contains the following files and sub-directories:
config/ : Code to parse SKLL experiment configuration files.
experiments/ : Code that is related to creating and running SKLL experiments. It also contains code that collects the various evaluation metrics and predictions for each SKLL experiment and writes them out to disk.
learner/ : Code for the Learner and VotingLearner classes. The former is instantiated for all learner names specified in the experiment configuration file except VotingClassifier and VotingRegressor, for which the latter is instantiated instead.
metrics.py : Code for any custom metrics that are not in sklearn.metrics, e.g., kappa, kendall_tau, and spearman. This module also contains the code that powers user-defined custom metrics.
data/ -
__init__.py : Code used to initialize the skll.data Python package.
featureset.py : Code for the FeatureSet class, which encapsulates the features and metadata for a given set of instances.
readers.py : Code for classes that can read various file formats and create FeatureSet objects from them.
writers.py : Code for classes that can write FeatureSet objects to files on disk in various formats.
dict_vectorizer.py : Code for a DictVectorizer class that subclasses sklearn.feature_extraction.DictVectorizer to add an __eq__() method that we need for vectorizer equality.
utils/ : Code for different utility scripts, functions, and classes used throughout SKLL. The most important ones are the command line scripts in the utils.commandline submodule.
filter_features.py : See documentation.
join_features.py : See documentation.
run_experiment.py : See documentation.
skll_convert.py : See documentation.
version.py : Code to define the SKLL version. Only changed for new releases.
tests/ -
test_*.py : These files contain the code for the unit tests and regression tests.
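The dict_vectorizer.py entry above mentions adding an __eq__() method for vectorizer equality. The sketch below illustrates why such a method is needed: without it, Python compares objects by identity, so two vectorizers fitted on the same data are not considered equal. To keep the example dependency-free, BaseVectorizer is a hypothetical stand-in rather than the scikit-learn class, and comparing only the vocabulary is an assumption made for illustration, not necessarily what SKLL compares.

```python
class BaseVectorizer:
    """Hypothetical stand-in for a vectorizer base class that defines no __eq__."""

    def __init__(self):
        self.vocabulary_ = {}

    def fit(self, rows):
        # Map each feature name to a column index, in sorted order.
        names = sorted({name for row in rows for name in row})
        self.vocabulary_ = {name: index for index, name in enumerate(names)}
        return self


class ComparableVectorizer(BaseVectorizer):
    """Subclass that adds value-based equality on the learned vocabulary."""

    def __eq__(self, other):
        if not isinstance(other, BaseVectorizer):
            return NotImplemented
        return self.vocabulary_ == other.vocabulary_

    # Note: defining __eq__ without __hash__ makes instances unhashable,
    # which is acceptable for this sketch.


rows = [{"height": 1.0, "weight": 2.0}]

# Default comparison falls back to identity, so two separate objects differ.
base_equal = BaseVectorizer().fit(rows) == BaseVectorizer().fit(rows)

# With __eq__ defined, objects with the same vocabulary compare equal.
comparable_equal = ComparableVectorizer().fit(rows) == ComparableVectorizer().fit(rows)

print(base_equal, comparable_equal)
```

The same pattern applies when subclassing a third-party class: the subclass inherits all existing behavior and only overrides the comparison.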
Entry Points & Workflow
There are three main entry points into the SKLL codebase:
Experiment configuration files. The primary way to interact with SKLL is by writing configuration files and then passing them to the run_experiment script. When you run the command run_experiment <config_file>, the following happens (at a high level):
The configuration file is handed off to the run_configuration() function in experiments.py.
A SKLLConfigParser object from config.py is instantiated to parse all of the relevant fields out of the given configuration file.
The configuration fields are then passed to the _classify_featureset() function in experiments.py, which instantiates the learners (using code from learner.py) and the featuresets (using code from readers.py & featureset.py), runs the experiments, collects the results, and writes them out to disk.
SKLL API. Another way to interact with SKLL is via the SKLL API directly in your Python code rather than using configuration files. For example, you could use the Learner.from_file() or VotingLearner.from_file() methods to load saved models of those types from disk and make predictions on new data. The documentation for the SKLL API can be found here.
Utility scripts. The scripts listed in the section above under utils are also entry points into the SKLL code. These scripts are convenient wrappers that use the SKLL API for commonly used tasks, e.g., generating predictions on new data from an already trained model.
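To make the configuration-file workflow described above more concrete, here is a toy model of its three steps written with only the Python standard library. It is not SKLL's implementation: the functions merely borrow names from the description, the configuration text is a simplified illustration rather than a complete SKLL configuration, and running one job per (featureset, learner) pair is an assumption made for this sketch.

```python
import configparser
import json
from itertools import product

# Illustrative configuration text; the sections and fields shown here are a
# simplified stand-in, not an authoritative SKLL configuration.
CONFIG_TEXT = """
[General]
experiment_name = demo
task = evaluate

[Input]
featuresets = [["family", "misc"], ["vitals"]]
learners = ["LogisticRegression", "SVC"]
"""


def parse_config(text):
    """Step 2 stand-in: parse the relevant fields out of the configuration."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "experiment_name": parser.get("General", "experiment_name"),
        "task": parser.get("General", "task"),
        "featuresets": json.loads(parser.get("Input", "featuresets")),
        "learners": json.loads(parser.get("Input", "learners")),
    }


def classify_featureset(experiment_name, task, featureset, learner_name):
    """Step 3 stand-in: a real job would load data, train, and evaluate here."""
    job_name = f"{experiment_name}_{'+'.join(featureset)}_{learner_name}"
    return {"job_name": job_name, "task": task}


def run_configuration(text):
    """Step 1 stand-in: parse the configuration, run the jobs, collect results."""
    fields = parse_config(text)
    results = []
    # Assumption for this sketch: one job per (featureset, learner) pair.
    for featureset, learner_name in product(fields["featuresets"], fields["learners"]):
        results.append(
            classify_featureset(
                fields["experiment_name"], fields["task"], featureset, learner_name
            )
        )
    return results


results = run_configuration(CONFIG_TEXT)
for result in results:
    print(result["job_name"])
```

When reading the real code, the same shape is a useful guide: start at run_configuration(), follow the parsed fields, and then step into _classify_featureset() to see where learners and featuresets are created.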