experiments Package

Functions for running and interacting with SKLL experiments.

author: Nitin Madnani (nmadnani@ets.org)
author: Dan Blanchard (dblanchard@ets.org)
author: Michael Heilman (mheilman@ets.org)
author: Chee Wee Leong (cleong@ets.org)

skll.experiments.generate_learning_curve_plots(experiment_name, output_dir, learning_curve_tsv_file)

Generate the learning curve plots given the TSV output file from a learning curve experiment.

Parameters:
  • experiment_name (str) – The name of the experiment.
  • output_dir (str) – Path to the output directory for the plots.
  • learning_curve_tsv_file (str) – The path to the learning curve TSV file.

skll.experiments.load_featureset(dir_path, feat_files, suffix, id_col='id', label_col='y', ids_to_floats=False, quiet=False, class_map=None, feature_hasher=False, num_features=None, logger=None)

Load a list of feature files and merge them.

Parameters:
  • dir_path (str) – Path to the directory that contains the feature files.
  • feat_files (list of str) – A list of feature file prefixes.
  • suffix (str) – The suffix to add to feature file prefixes to get the full filenames.
  • id_col (str, optional) – Name of the column which contains the instance IDs. If no column with that name exists, or None is specified, example IDs will be automatically generated. Defaults to 'id'.
  • label_col (str, optional) – Name of the column which contains the class labels. If no column with that name exists, or None is specified, the data is considered to be unlabeled. Defaults to 'y'.
  • ids_to_floats (bool, optional) – Whether to convert the IDs to floats to save memory. Raises an error if any non-numeric IDs are encountered. Defaults to False.
  • quiet (bool, optional) – Do not print “Loading…” status message to stderr. Defaults to False.
  • class_map (dict, optional) – Mapping from original class labels to new ones. This is mainly used for collapsing multiple labels into a single class. Anything not in the mapping will be kept the same. Defaults to None.
  • feature_hasher (bool, optional) – Should we use a FeatureHasher when vectorizing features? Defaults to False.
  • num_features (int, optional) – The number of features to use with the FeatureHasher. This should always be set to the next power of 2 greater than the actual number of features you’re using. Defaults to None.
  • logger (logging.Logger, optional) – A logger instance to use to log messages instead of creating a new one by default. Defaults to None.
Returns:

merged_set – A FeatureSet instance containing the specified labels, IDs, features, and feature vectorizer.

Return type:

skll.FeatureSet
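
As an illustration of the class_map behaviour described above, the following minimal sketch collapses several labels into one class and leaves unmapped labels unchanged. It uses plain Python only and is not SKLL's implementation; the helper name collapse_labels and the sample labels are hypothetical.

```python
def collapse_labels(labels, class_map=None):
    """Return labels with class_map applied; unmapped labels are kept as-is."""
    if class_map is None:
        return list(labels)
    # dict.get falls back to the original label when it is not in the mapping.
    return [class_map.get(label, label) for label in labels]


# Hypothetical labels: collapse "dog" and "cat" into a single "pet" class.
class_map = {"dog": "pet", "cat": "pet"}
labels = ["dog", "cat", "bird", "dog"]
collapsed = collapse_labels(labels, class_map)
print(collapsed)  # ['pet', 'pet', 'bird', 'pet']
```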

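The num_features guidance for load_featureset can be sketched in plain Python as follows. This assumes that the recommended value is the smallest power of 2 strictly greater than the actual feature count; the helper name next_power_of_two is hypothetical and not part of SKLL.

```python
def next_power_of_two(n_features):
    """Return the smallest power of 2 strictly greater than n_features."""
    if n_features < 0:
        raise ValueError("n_features must be non-negative")
    # int.bit_length() is the number of bits needed to represent the value,
    # so shifting 1 left by that amount gives the next power of 2 above it.
    return 1 << n_features.bit_length()


print(next_power_of_two(1000))  # 1024
print(next_power_of_two(1024))  # 2048
```
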
skll.experiments.run_configuration(config_file, local=False, overwrite=True, queue='all.q', hosts=None, write_summary=True, quiet=False, ablation=0, resume=False, log_level=20)

Takes a configuration file and runs the specified jobs, either on the grid or locally.

Parameters:
  • config_file (str) – Path to the configuration file we would like to use.
  • local (bool, optional) – Should this be run locally instead of on the cluster? Defaults to False.
  • overwrite (bool, optional) – If the model files already exist, should we overwrite them instead of re-using them? Defaults to True.
  • queue (str, optional) – The DRMAA queue to use if we’re running on the cluster. Defaults to 'all.q'.
  • hosts (list of str, optional) – If running on the cluster, these are the machines we should use. Defaults to None.
  • write_summary (bool, optional) – Write a TSV file with a summary of the results. Defaults to True.
  • quiet (bool, optional) – Suppress printing of “Loading…” messages. Defaults to False.
  • ablation (int, optional) – Number of features to remove when doing an ablation experiment. If positive, we will perform repeated ablation runs for all combinations of features removing the specified number at a time. If None, we will use all combinations of all lengths. If 0, the default, no ablation is performed. If negative, a ValueError is raised. Defaults to 0.
  • resume (bool, optional) – If result files already exist for an experiment, do not overwrite them. This is very useful when doing a large ablation experiment and part of it crashes. Defaults to False.
  • log_level (int, optional) – The level for logging messages. Defaults to logging.INFO (20).
Returns:

result_json_paths – A list of paths to .json results files for each variation in the experiment.

Return type:

list of str

Raises:
  • ValueError – If the value for "ablation" is neither a non-negative int nor None.
  • OSError – If the length of the FeatureSet name exceeds 210 characters.
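
The ablation settings of run_configuration can be illustrated with itertools.combinations. This is a sketch of the combinatorics only, not SKLL's implementation: it enumerates just the reduced feature sets, and it assumes that "all combinations of all lengths" always leaves at least one feature in a run. The helper name ablation_subsets and the feature names are hypothetical.

```python
from itertools import combinations


def ablation_subsets(features, ablation=0):
    """Return the feature subsets that remain after each ablation run."""
    if ablation is not None and ablation < 0:
        raise ValueError("ablation must be a non-negative int or None")
    if ablation == 0:
        # No ablation: a single run that uses every feature.
        return [tuple(features)]
    if ablation is None:
        # Remove 1, 2, ... features at a time, always keeping at least one.
        sizes = range(1, len(features))
    else:
        sizes = [ablation]
    subsets = []
    for k in sizes:
        for removed in combinations(features, k):
            subsets.append(tuple(f for f in features if f not in removed))
    return subsets


features = ["unigrams", "bigrams", "pos"]  # hypothetical feature names
print(ablation_subsets(features, ablation=1))
# [('bigrams', 'pos'), ('unigrams', 'pos'), ('unigrams', 'bigrams')]
print(len(ablation_subsets(features, ablation=None)))  # 6
```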