3 min read

model tuning with automl in auto-sklearn: Part 1

I went to a talk at Bristech on the subject of autoML in Azure. I have to say I was pretty convinced the autoML is the future. Azure’s autoML is built on sklearn (and therefore it’s a python-only SDK), so I did a big of googling and it turns out the autoML with sklearn is actually open-source. You just need to install the ‘auto’sklearn’ package onto your computer and you’re good to go.

A few provisos with installing auto-sklean.

  1. You need a linux environment to run it. My laptop runs Ubuntu, so I didn’t need to do anything but if you’re not on linux then you’ll need a VM (or a container).
  2. The dependencies might clash with other versions of packages you have installed. E.g. I had a different version of pandas or numpy or something and, although I went through the auto-sklearn intallation step detailed here (yes, auto-sklearn has a very useful website) it wouldn’t run. The way to solve this is to create a fresh environment (I use conda) with
conda create -n automl python=3.6
conda activate automl

and then follow the installation instructions from the command line and you’ll get an environment excactly as auto-sklearn wants it. Once that’s done, start a python session from within your automl environment and run the auto-sklearn example.

As a side point, since I’m actually running this example from within an rmarkdown document, and the python engine for rmakdown is provided by the reticulate package, I’ll tell reticulate with conda env to use to the run the python code.

use_condaenv(condaenv = "automl")

import autosklearn.classification
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics
X, y = sklearn.datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = \
    sklearn.model_selection.train_test_split(X, y, random_state=1)
automl = autosklearn.classification.AutoSklearnClassifier()
automl.fit(X_train, y_train)
## [WARNING] [2019-03-17 18:25:11,588:smac.intensification.intensification.Intensifier] Challenger was the same as the current incumbent; Skipping challenger
## [WARNING] [2019-03-17 18:25:11,588:smac.intensification.intensification.Intensifier] Challenger was the same as the current incumbent; Skipping challenger
## AutoSklearnClassifier(delete_output_folder_after_terminate=True,
##            delete_tmp_folder_after_terminate=True,
##            disable_evaluator_output=False, ensemble_memory_limit=1024,
##            ensemble_nbest=50, ensemble_size=50, exclude_estimators=None,
##            exclude_preprocessors=None, get_smac_object_callback=None,
##            include_estimators=None, include_preprocessors=None,
##            initial_configurations_via_metalearning=25, logging_config=None,
##            ml_memory_limit=3072, n_jobs=None, output_folder=None,
##            per_run_time_limit=360, resampling_strategy='holdout',
##            resampling_strategy_arguments=None, seed=1, shared_mode=False,
##            smac_scenario_args=None, time_left_for_this_task=3600,
##            tmp_folder=None)
y_hat = automl.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, y_hat))
## Accuracy score 0.9866666666666667

The automl.fit takes about an hour to run, and I’m actually running it in rmarkdown, for the blog page, so the page take ages to render. Now, the model you’ll get is a pretty large object (the one I got is roughly 500MB), and since it took an hour to build, you’ll probably want to save the model object with.

filename = "automl_test_model.sav"
pickle.dump(automl, open(filename, 'wb'))

and it’s that simple! Actually auto-sklearn has a lot of other options for specficying which cross-validation method to use, and how long to test different models for etc. But like I said, this page already takes and hour to render, so I think other examples with have to be or another post.