experiments Package
Functions for running and interacting with SKLL experiments.
- author:
Nitin Madnani (nmadnani@ets.org)
- author:
Dan Blanchard (dblanchard@ets.org)
- author:
Michael Heilman (mheilman@ets.org)
- author:
Chee Wee Leong (cleong@ets.org)
- skll.experiments.generate_learning_curve_plots(experiment_name, output_dir, learning_curve_tsv_file)[source]
Generate learning curve plots from the TSV output file of a learning curve experiment.
This function generates both the score plots and the fit time plots.
- Parameters:
experiment_name (str) – The name of the experiment.
output_dir (skll.types.PathOrStr) – Path to the output directory for the plots.
learning_curve_tsv_file (skll.types.PathOrStr) – The path to the learning curve TSV file.
- Return type:
None
- skll.experiments.load_featureset(dir_path, feat_files, suffix, id_col='id', label_col='y', ids_to_floats=False, quiet=False, class_map=None, feature_hasher=False, num_features=None, logger=None)[source]
Load a list of feature files and merge them.
- Parameters:
dir_path (skll.types.PathOrStr) – Path to the directory that contains the feature files.
feat_files (List[str]) – A list of feature file prefixes.
suffix (str) – The suffix to add to feature file prefixes to get the full filenames.
id_col (str, default="id") – Name of the column which contains the instance IDs. If no column with that name exists, or None is specified, example IDs will be automatically generated.
label_col (str, default="y") – Name of the column which contains the class labels. If no column with that name exists, or None is specified, the data is considered to be unlabeled.
ids_to_floats (bool, default=False) – Whether to convert the IDs to floats to save memory. An error is raised if any non-numeric IDs are encountered.
quiet (bool, default=False) – Do not print “Loading…” status message to stderr.
class_map (Optional[skll.types.ClassMap], default=None) – Mapping from original class labels to new ones. This is mainly used for collapsing multiple labels into a single class. Anything not in the mapping will be kept the same.
feature_hasher (bool, default=False) – Should we use a FeatureHasher when vectorizing features?
num_features (Optional[int], default=None) – The number of features to use with the FeatureHasher. This should always be set to the power of 2 greater than the actual number of features you’re using.
logger (Optional[logging.Logger], default=None) – A logger instance to use to log messages instead of creating a new one by default.
- Returns:
merged_set – A FeatureSet instance containing the specified labels, IDs, features, and feature vectorizer.
- Return type:
FeatureSet
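The dir_path, feat_files, suffix, and num_features parameters above can be illustrated with a minimal standard-library sketch. It is not SKLL's own code: the helper names feature_file_paths and hasher_size, the directory "train", and the prefixes "unigrams" and "bigrams" are hypothetical. It assumes, as the descriptions state, that each full filename is a prefix plus the suffix, and that num_features should be the power of 2 greater than the actual feature count.

```python
from pathlib import Path


def feature_file_paths(dir_path, feat_files, suffix):
    """Hypothetical helper: join each prefix with the suffix under dir_path."""
    return [Path(dir_path) / f"{prefix}{suffix}" for prefix in feat_files]


def hasher_size(actual_num_features):
    """Hypothetical helper: smallest power of 2 strictly greater than the count."""
    if actual_num_features < 0:
        raise ValueError("feature count must be non-negative")
    # For n >= 0, 1 << n.bit_length() is the first power of 2 above n.
    return 1 << int(actual_num_features).bit_length()


# Example values only; these files are not assumed to exist.
paths = feature_file_paths("train", ["unigrams", "bigrams"], ".jsonlines")
print([p.as_posix() for p in paths])
print(hasher_size(1000))
```

With SKLL installed, the corresponding call would presumably look like load_featureset("train", ["unigrams", "bigrams"], ".jsonlines", feature_hasher=True, num_features=1024), following the signature documented above.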
- skll.experiments.run_configuration(config_file, local=False, overwrite=True, queue='all.q', hosts=None, write_summary=True, quiet=False, ablation=0, resume=False, log_level=20)[source]
Run jobs specified in the configuration file locally or on the grid.
- Parameters:
config_file (skll.types.PathOrStr) – Path to the configuration file we would like to use.
local (bool, default=False) – Should this be run locally instead of on the cluster?
overwrite (bool, default=True) – If the model files already exist, should we overwrite them instead of re-using them?
queue (str, default="all.q") – The DRMAA queue to use if we’re running on the cluster.
hosts (Optional[List[str]], default=None) – If running on the cluster, these are the machines we should use.
write_summary (bool, default=True) – Write a TSV file with a summary of the results.
quiet (bool, default=False) – Suppress printing of “Loading…” messages.
ablation (int, default=0) – Number of features to remove when doing an ablation experiment. If positive, we will perform repeated ablation runs for all combinations of features removing the specified number at a time. If None, we will use all combinations of all lengths. If 0, the default, no ablation is performed. If negative, a ValueError is raised.
resume (bool, default=False) – If result files already exist for an experiment, do not overwrite them. This is very useful when doing a large ablation experiment and part of it crashes.
log_level (int, default=logging.INFO) – The level for logging messages.
- Returns:
result_json_paths – A list of paths to .json results files for each variation in the experiment.
- Return type:
List[str]
- Raises:
ValueError – If the value for "ablation" is neither None nor a non-negative int.
OSError – If the length of the FeatureSet name is greater than 210.
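The ablation parameter of run_configuration described above can be pictured with a small sketch of the combinatorics. This is an illustration only, not SKLL's internal implementation: the function ablation_subsets and the feature names are hypothetical, and it assumes the requested number of features to remove is smaller than the number of features available.

```python
from itertools import combinations


def ablation_subsets(features, ablation):
    """Hypothetical helper: feature subsets left after removing features.

    ablation > 0 : remove every combination of exactly that many features.
    ablation None: remove every combination of 1 .. len(features) - 1 features.
    ablation == 0: no ablation, so only the full feature list is returned.
    ablation < 0 : raise ValueError, mirroring the documented behaviour.
    """
    if ablation is not None and ablation < 0:
        raise ValueError("ablation must be None or a non-negative int")
    if ablation == 0:
        return [tuple(features)]
    sizes = range(1, len(features)) if ablation is None else [ablation]
    subsets = []
    for size in sizes:
        for removed in combinations(features, size):
            subsets.append(tuple(f for f in features if f not in removed))
    return subsets


# Example feature names are placeholders.
for subset in ablation_subsets(["a", "b", "c"], 1):
    print(subset)
```

Removing one feature at a time from three features leaves three two-feature subsets; passing None also adds the three single-feature subsets, which is why experiments using all combinations grow quickly with the number of features.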