experiments Package

Functions for running and interacting with SKLL experiments.

Authors:
  • Nitin Madnani (nmadnani@ets.org)

  • Dan Blanchard (dblanchard@ets.org)

  • Michael Heilman (mheilman@ets.org)

  • Chee Wee Leong (cleong@ets.org)

skll.experiments.generate_learning_curve_plots(experiment_name, output_dir, learning_curve_tsv_file)

Generate learning curves using the TSV output file from a learning curve experiment.

This function generates both the score plots and the fit time plots.

Parameters:
  • experiment_name (str) – The name of the experiment.

  • output_dir (skll.types.PathOrStr) – Path to the output directory for the plots.

  • learning_curve_tsv_file (skll.types.PathOrStr) – The path to the learning curve TSV file.

Return type:

None
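As an illustration of how this function might be called, the following minimal sketch wraps it in a small helper. The helper name `plot_learning_curves`, the early check that the TSV file exists, and the creation of the output directory are assumptions added for this example; they are not part of SKLL.

```python
from pathlib import Path


def plot_learning_curves(experiment_name, output_dir, tsv_file):
    """Check the inputs, then delegate to SKLL (hypothetical wrapper)."""
    tsv_path = Path(tsv_file)
    if not tsv_path.is_file():
        # Fail early with a clear message; this check is our own addition.
        raise FileNotFoundError(f"Learning curve TSV not found: {tsv_path}")

    # Create the output directory in case it does not exist yet
    # (also our own addition, not something the documentation requires).
    out_path = Path(output_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    # Imported lazily so this module loads even where SKLL is not installed.
    from skll.experiments import generate_learning_curve_plots

    generate_learning_curve_plots(experiment_name, out_path, tsv_path)
```

A call such as `plot_learning_curves("my_experiment", "plots", "output/my_experiment_summary.tsv")` would then write the plots under `plots`; the experiment name and both paths here are placeholders.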

skll.experiments.load_featureset(dir_path, feat_files, suffix, id_col='id', label_col='y', ids_to_floats=False, quiet=False, class_map=None, feature_hasher=False, num_features=None, logger=None)

Load a list of feature files and merge them.

Parameters:
  • dir_path (skll.types.PathOrStr) – Path to the directory that contains the feature files.

  • feat_files (List[str]) – A list of feature file prefixes.

  • suffix (str) – The suffix to add to feature file prefixes to get the full filenames.

  • id_col (str, default="id") – Name of the column which contains the instance IDs. If no column with that name exists, or None is specified, example IDs will be automatically generated.

  • label_col (str, default="y") – Name of the column which contains the class labels. If no column with that name exists, or None is specified, the data is considered to be unlabeled.

  • ids_to_floats (bool, default=False) – Whether to convert the IDs to floats to save memory. An error is raised if any non-numeric IDs are encountered.

  • quiet (bool, default=False) – Do not print “Loading…” status message to stderr.

  • class_map (Optional[skll.types.ClassMap], default=None) – Mapping from original class labels to new ones. This is mainly used for collapsing multiple labels into a single class. Anything not in the mapping will be kept the same.

  • feature_hasher (bool, default=False) – Should we use a FeatureHasher when vectorizing features?

  • num_features (Optional[int], default=None) – The number of features to use with the FeatureHasher. This should always be set to the next power of 2 greater than the actual number of features you’re using.

  • logger (Optional[logging.Logger], default=None) – A logger instance to use to log messages instead of creating a new one by default.

Returns:

merged_set – A FeatureSet instance containing the specified labels, IDs, features, and feature vectorizer.

Return type:

skll.data.featureset.FeatureSet
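The guidance for num_features can be sketched as a small helper that picks a power of 2 and passes it to load_featureset. This is only a sketch: the helper names `next_power_of_two` and `load_hashed_featureset` are hypothetical, and reading "greater than" as strictly greater is an assumption of this example.

```python
def next_power_of_two(n_features: int) -> int:
    """Return the smallest power of 2 strictly greater than n_features."""
    if n_features < 0:
        raise ValueError("n_features must be non-negative")
    # bit_length() is the number of bits needed to represent n_features,
    # so shifting 1 by that amount always lands just above it.
    return 1 << n_features.bit_length()


def load_hashed_featureset(dir_path, prefixes, suffix, n_features):
    """Load and merge feature files with a FeatureHasher (hypothetical wrapper)."""
    # Imported lazily so this sketch can be loaded without SKLL installed.
    from skll.experiments import load_featureset

    return load_featureset(
        dir_path,
        prefixes,
        suffix,
        feature_hasher=True,
        num_features=next_power_of_two(n_features),
    )
```

For roughly 1000 distinct features, `next_power_of_two(1000)` gives 1024. A call such as `load_hashed_featureset("train", ["unigrams", "pos"], ".jsonlines", 1000)` uses placeholder directory, prefix, and suffix values.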

skll.experiments.run_configuration(config_file, local=False, overwrite=True, queue='all.q', hosts=None, write_summary=True, quiet=False, ablation=0, resume=False, log_level=20)

Run jobs specified in the configuration file locally or on the grid.

Parameters:
  • config_file (skll.types.PathOrStr) – Path to the configuration file we would like to use.

  • local (bool, default=False) – Should this be run locally instead of on the cluster?

  • overwrite (bool, default=True) – If the model files already exist, should we overwrite them instead of re-using them?

  • queue (str, default="all.q") – The DRMAA queue to use if we’re running on the cluster.

  • hosts (Optional[List[str]], default=None) – If running on the cluster, these are the machines we should use.

  • write_summary (bool, default=True) – Write a TSV file with a summary of the results.

  • quiet (bool, default=False) – Suppress printing of “Loading…” messages.

  • ablation (int, default=0) – Number of features to remove when doing an ablation experiment. If positive, we will perform repeated ablation runs for all combinations of features, removing the specified number at a time. If None, we will use all combinations of all lengths. If 0, the default, no ablation is performed. If negative, a ValueError is raised.

  • resume (bool, default=False) – If result files already exist for an experiment, do not overwrite them. This is very useful when doing a large ablation experiment and part of it crashes.

  • log_level (int, default=logging.INFO) – The level for logging messages. The default, logging.INFO, has the numeric value 20.

Returns:

result_json_paths – A list of paths to .json results files for each variation in the experiment.

Return type:

List[str]

Raises:
  • ValueError – If the value for "ablation" is not a non-negative int or None.

  • OSError – If the length of the FeatureSet name exceeds 210 characters.
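To make the ablation semantics described above more concrete, here is a minimal sketch that enumerates which features would remain in each ablation run. It is an illustration only, not SKLL's implementation: the function names are hypothetical, and treating None as "remove 1 up to one fewer than the number of features" is an assumption of this example.

```python
from itertools import combinations


def ablated_feature_sets(features, ablation=0):
    """List the features that remain in each ablation run (illustrative sketch)."""
    if ablation is not None and (not isinstance(ablation, int) or ablation < 0):
        raise ValueError("ablation must be a non-negative int or None")
    if ablation == 0:
        return []  # the default: no ablation is performed

    # Assumption: None means removing 1 .. len(features) - 1 features at a time.
    sizes = range(1, len(features)) if ablation is None else [ablation]

    remaining = []
    for size in sizes:
        for removed in combinations(features, size):
            remaining.append([f for f in features if f not in removed])
    return remaining


def run_local_ablation(config_file, ablation=1):
    """Run an experiment locally with ablation (hypothetical wrapper)."""
    # Imported lazily so this sketch can be loaded without SKLL installed.
    from skll.experiments import run_configuration

    return run_configuration(config_file, local=True, ablation=ablation)
```

With three features and `ablation=1`, the sketch yields three runs, each missing one feature. `run_local_ablation("experiment.cfg")` would pass the same setting to run_configuration; the configuration file name is a placeholder.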