A Performance Benchmark of Different AutoML Frameworks

  • Expert Fabian Müller
  • Date 21. September 2018
  • Topic Machine Learning, Python, R
  • Format Blog
  • Category Technology

In a recent blog post, our CEO Sebastian Heinz wrote about Google’s newest stroke of genius, AutoML Vision: a cloud service “that is able to build deep learning models for image recognition completely fully automated and from scratch”. AutoML Vision is part of the current trend towards the automation of machine learning tasks. This trend started with the automation of hyperparameter optimization for single models (including services and tools like SigOpt, Hyperopt, and SMAC), continued with automated feature engineering and selection (see, for example, our bounceR package), and is now moving towards full automation of complete data pipelines, including automated model stacking (a common model ensembling technique).

One company at the frontier of this development is certainly h2o.ai. They developed both a free Python/R library (H2O AutoML) and an enterprise-ready software solution called Driverless AI. But H2O is far from the only player in the field. This blog post provides a short comparison of two freely available AutoML solutions, H2O AutoML and auto-sklearn, in terms of predictive performance and general usability.

H2O AutoML

H2O AutoML is an extension to H2O’s popular Java-based open source machine learning framework with APIs for Python and R. It automatically trains, tunes, and cross-validates models (including Generalized Linear Models [GLM], Gradient Boosting Machines [GBM], Random Forest [RF], Extremely Randomized Forest [XRF], and Neural Networks). Hyperparameter optimization is done using a random search over a list of reasonable parameters (both RF and XRF are currently not tuned). In the end, H2O produces a leaderboard of models and builds two types of stacked ensembles from the base models: one including all base models, the other including only the best base model of each family.

Model training can be controlled by either the number of models to be trained or the total training time. The latter in particular makes model training quite transparent. One of the big advantages of H2O is that all models are parallelized out-of-the-box.
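To make the two ideas above more concrete, here is a minimal sketch of a random search that stops after either a fixed number of models or a fixed runtime and returns a sorted leaderboard. It is not H2O's implementation: the function `random_search`, the search space `toy_space`, and the objective `toy_objective` (a stand-in for a cross-validated error) are all hypothetical names chosen for illustration.

```python
import random
import time


def random_search(objective, search_space, max_models=None,
                  max_runtime_secs=None, seed=0):
    """Evaluate random configurations until one of the budgets is exhausted.

    Returns a leaderboard: a list of (score, config) sorted by ascending score.
    """
    if max_models is None and max_runtime_secs is None:
        raise ValueError("Set max_models, max_runtime_secs, or both.")

    rng = random.Random(seed)
    start = time.monotonic()
    leaderboard = []

    while True:
        # Budget 1: number of trained models
        if max_models is not None and len(leaderboard) >= max_models:
            break
        # Budget 2: total runtime in seconds
        if (max_runtime_secs is not None
                and time.monotonic() - start >= max_runtime_secs):
            break
        # Draw one value per hyperparameter from its list of candidates
        config = {name: rng.choice(values)
                  for name, values in search_space.items()}
        leaderboard.append((objective(config), config))

    # Lower score (e.g. a validation error) is better
    return sorted(leaderboard, key=lambda entry: entry[0])


# Hypothetical search space; the toy objective replaces real model training
toy_space = {"max_depth": [3, 5, 7, 9], "learn_rate": [0.01, 0.05, 0.1, 0.3]}


def toy_objective(config):
    return (config["max_depth"] - 5) ** 2 + (config["learn_rate"] - 0.1) ** 2


leaderboard = random_search(toy_objective, toy_space, max_models=20)
best_score, best_config = leaderboard[0]
print(best_score, best_config)
```

In a real setting, the objective would train a model and return its cross-validated error, which is where almost all of the runtime budget would be spent.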


auto-sklearn

auto-sklearn is an automated machine learning toolkit based on Python’s scikit-learn library. A detailed explanation of auto-sklearn can be found in Feurer et al. (2015). In H2O AutoML, each model is tuned independently and added to a leaderboard. In auto-sklearn, the authors combine model selection and hyperparameter optimization in what they call “Combined Algorithm Selection and Hyperparameter optimization” (CASH). This joint optimization problem is then solved using a tree-based Bayesian optimization method called “Sequential Model-based Algorithm Configuration” (SMAC) (see Bergstra 2011).

So, contrary to H2O AutoML, auto-sklearn optimizes a complete modeling pipeline, including various data and feature preprocessing steps as well as model selection and hyperparameter optimization. Data preprocessing includes one-hot encoding, scaling, imputation, and balancing. Feature preprocessing includes, among others, feature agglomeration, ICA, and PCA. The algorithms included in auto-sklearn are similar to those in H2O AutoML, but auto-sklearn additionally includes more traditional methods like k-Nearest-Neighbors (kNN), Naive Bayes, and Support Vector Machines (SVM).
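The idea of a joint search space can be sketched in a few lines. In the sketch below, the algorithm is itself treated as a hyperparameter, and only the hyperparameters of the chosen algorithm are active. Note the assumptions: `PIPELINE_SPACE` and `sample_pipeline` are hypothetical, the names and value ranges are not auto-sklearn's actual configuration space, and plain random sampling is used here in place of SMAC purely for illustration.

```python
import random

# Hypothetical joint search space (not auto-sklearn's real configuration space):
# preprocessing choices, the algorithm, and algorithm-specific hyperparameters.
PIPELINE_SPACE = {
    "scaling": ["none", "standardize", "minmax"],
    "feature_preprocessor": ["none", "pca", "ica", "feature_agglomeration"],
    "regressor": {
        "random_forest": {"n_estimators": [100, 300],
                          "max_features": [0.3, 0.6, 1.0]},
        "gradient_boosting": {"learning_rate": [0.01, 0.1],
                              "max_depth": [3, 5, 8]},
        "k_nearest_neighbors": {"n_neighbors": [1, 5, 15]},
    },
}


def sample_pipeline(space, rng):
    """Draw one complete pipeline configuration from the joint space."""
    config = {
        "scaling": rng.choice(space["scaling"]),
        "feature_preprocessor": rng.choice(space["feature_preprocessor"]),
    }
    algorithm = rng.choice(sorted(space["regressor"]))
    config["regressor"] = algorithm
    # Conditional part: only the chosen algorithm's hyperparameters are sampled
    for name, values in space["regressor"][algorithm].items():
        config[f"{algorithm}:{name}"] = rng.choice(values)
    return config


rng = random.Random(42)
for _ in range(3):
    print(sample_pipeline(PIPELINE_SPACE, rng))
```

A Bayesian optimizer would replace the random draws with proposals based on the results of previously evaluated pipelines, but the structure of the search space stays the same.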

Similar to H2O AutoML, auto-sklearn includes a final model ensemble step. Whereas H2O AutoML uses simple but efficient model stacking, auto-sklearn uses ensemble selection, a greedy method that adds individual models iteratively to the ensemble if and only if they increase the validation performance. Like H2O, auto-sklearn allows model training to be controlled by the total training time.
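The greedy ensemble selection described above can be illustrated with a small, self-contained sketch. It assumes that validation predictions of all candidate models are already available, averages the selected models with equal weight per selection (models may be picked more than once), and uses the mean squared error as the validation metric. The function names and toy data are hypothetical, and this is not auto-sklearn's actual code.

```python
def mse(y_true, y_pred):
    """Mean squared error of two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def ensemble_selection(predictions, y_valid, max_size=10):
    """Greedy forward selection with replacement.

    predictions: dict mapping model name -> list of validation predictions.
    Returns (weights, validation_error), where weights maps model name -> weight.
    """
    selected = []                          # chosen model names, repeats allowed
    ensemble_sum = [0.0] * len(y_valid)    # running sum of selected predictions
    best_error = float("inf")

    for _ in range(max_size):
        best_candidate = None
        for name, preds in predictions.items():
            size = len(selected) + 1
            averaged = [(s + p) / size for s, p in zip(ensemble_sum, preds)]
            error = mse(y_valid, averaged)
            # Accept a candidate only if it improves the validation error
            if error < best_error:
                best_error, best_candidate = error, name
        if best_candidate is None:
            break  # no model improves the ensemble any further
        selected.append(best_candidate)
        ensemble_sum = [s + p for s, p
                        in zip(ensemble_sum, predictions[best_candidate])]

    weights = {name: selected.count(name) / len(selected)
               for name in set(selected)}
    return weights, best_error


# Toy validation data: two models with opposite bias and one poor model
y_valid = [1.0, 2.0, 3.0, 4.0]
predictions = {
    "biased_high": [1.5, 2.5, 3.5, 4.5],
    "biased_low": [0.5, 1.5, 2.5, 3.5],
    "constant": [4.0, 4.0, 4.0, 4.0],
}

weights, error = ensemble_selection(predictions, y_valid)
print(weights, error)
```

In this toy example the two biased models cancel each other out, so the ensemble combines them and ignores the constant model.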


In order to compare the predictive performance of H2O’s AutoML with auto-sklearn, one can conduct a small simulation study. My colleague André’s R package Xy offers a straightforward way to simulate regression datasets with linear, non-linear, and noisy relationships. Using multiple (ten in total) simulation runs makes the whole simulation a bit more robust. The following R code was used to simulate the data:

```r
library(Xy)
library(caret)
library(dplyr)
library(data.table)

# Number of datasets
n_data_set <- 10

for (i in seq(n_data_set)) {

  # Sim settings
  n <- floor(runif(1, 1000, 5000))
  n_num_vars <- c(sample(2:10, 1), sample(2:10, 1))
  n_cat_vars <- c(0, 0)
  n_noise_vars <- sample(1:5, 1)
  inter_degree <- sample(2:3, 1)

  # Simulate data
  sim <- Xy(n = n,
            numvars = n_num_vars,
            catvars = n_cat_vars,
            noisevars = n_noise_vars,
            task = Xy_task(),
            nlfun = function(x) {x^2},
            interactions = 1,
            sig = c(1, 4),
            cor = c(0),
            weights = c(-10, 10),
            intercept = TRUE,
            stn = 4)

  # Get data and DGP
  df <- sim$data
  dgp <- sim$dgp

  # Remove Intercept
  df[, "(Intercept)"] <- NULL

  # Rename columns
  names(df) <- gsub("(?<![0-9])0+", "", names(df), perl = TRUE)

  # Create test/train split
  df <- dplyr::rename(df, label = y)
  in_train <- createDataPartition(y = df$label, p = 0.7, list = FALSE)
  df_train <- df[in_train, ]
  df_test <- df[-in_train, ]

  # Path names
  path_train <- paste0("../data/Xy/", i, "_train.csv")
  path_test <- paste0("../data/Xy/", i, "_test.csv")

  # Export
  fwrite(df_train, file = path_train)
  fwrite(df_test, file = path_test)

}
```

Since auto-sklearn is only available in Python, switching languages is necessary. Therefore, loading the raw data in Python is the next step:

```python
import pandas as pd

# Load data
df_train = pd.read_csv("../data/Xy/1_train.csv")
df_test = pd.read_csv("../data/Xy/1_test.csv")

# Columns
cols_train = df_train.columns.tolist()
cols_test = df_test.columns.tolist()

# Target and features
y_train = df_train.loc[:, "label"]
X_train = df_train.drop("label", axis=1)

y_test = df_test.loc[:, "label"]
X_test = df_test.drop("label", axis=1)
```

Having the data in Python, the training procedure can start.
In order to make the results comparable, both frameworks were run with similar settings where possible. This included 60 minutes of training for each dataset, 5-fold cross-validation for model evaluation and ensemble building, no preprocessing (not available in H2O AutoML and therefore deactivated in auto-sklearn), and a restriction to similar algorithms (namely GLM, RF, XRF, and GBM).

As previously noted, H2O supports out-of-the-box parallelization. By default, auto-sklearn only uses two cores, although it supports more cores, at least in theory. While there is a manual (https://automl.github.io/auto-sklearn/stable/manual.html#parallel-computation) on how to do that, I was not able to get it working on my system (OSX 10.13, Python 3.6.2 Anaconda). Therefore, H2O was also limited to two cores.

```python
from autosklearn.regression import AutoSklearnRegressor
from autosklearn.metrics import mean_squared_error

# Settings
estimators_to_use = ["random_forest", "extra_trees", "gradient_boosting", "ridge_regression"]
preprocessing_to_use = ["no_preprocessing"]

# Init auto-sklearn (60 minutes in total, at most 6 minutes per model)
auto_sklearn = AutoSklearnRegressor(time_left_for_this_task=60*60,
                                    per_run_time_limit=360,
                                    include_estimators=estimators_to_use,
                                    exclude_estimators=None,
                                    include_preprocessors=preprocessing_to_use,
                                    exclude_preprocessors=None,
                                    ml_memory_limit=6156,
                                    resampling_strategy="cv",
                                    resampling_strategy_arguments={"folds": 5})

# Train models
auto_sklearn.fit(X=X_train.copy(), y=y_train.copy(), metric=mean_squared_error)
it_fits = auto_sklearn.refit(X=X_train.copy(), y=y_train.copy())

# Predict
y_hat = auto_sklearn.predict(X_test)

# Show results
auto_sklearn.cv_results_
auto_sklearn.sprint_statistics()
auto_sklearn.show_models()
auto_sklearn.get_models_with_weights()
```

```python
import h2o
from h2o.automl import H2OAutoML

# Start h2o cluster (limited to two threads for comparability)
h2o.init(max_mem_size="8G", nthreads=2)

# Upload to h2o
df_train_h2o = h2o.H2OFrame(pd.concat([X_train, pd.DataFrame({"target": y_train})], axis=1))
df_test_h2o = h2o.H2OFrame(X_test)

features = X_train.columns.values.tolist()
target = "target"

# Training
auto_h2o = H2OAutoML(max_runtime_secs=60*60)
auto_h2o.train(x=features,
               y=target,
               training_frame=df_train_h2o)

# Leaderboard
auto_h2o.leaderboard
auto_h2o = auto_h2o.leader

# Testing
df_test_hat = auto_h2o.predict(df_test_h2o)
y_hat = h2o.as_list(df_test_hat["predict"])

# Close cluster
h2o.cluster().shutdown()
```

The complete code, including all simulation runs and the visualization of results, can be found in my GitHub repo (https://github.com/fabianmax/ML-Automation).

Results

First, some words of caution: the results presented in the next sections are by no means representative. Both H2O and the authors of auto-sklearn recommend running their frameworks for hours, if not days. Given ten different datasets, this was beyond the scope of a blog post. For the same reason of feasibility, the datasets are restricted to a rather small size. For a more elaborate performance comparison, see for example Balaji and Allen (2018).

Figure 1 shows the Mean Squared Error of both frameworks on the test sample. The horizontal line, indicating the result from a vanilla Random Forest (from scikit-learn), serves as a benchmark. As one can see, the results are pretty similar for both frameworks across all datasets. In terms of wins it is a tie, with five wins for H2O and five wins for auto-sklearn.

Figure 1: results ml benchmark (https://www.statworx.com/wp-content/uploads/results-ml-benchmark.png)

The percentage difference between the average errors is 1.04% in favor of auto-sklearn. Thus, auto-sklearn is on average about 1% better than H2O. Compared with the vanilla RF, H2O's AutoML is on average 23.4% better than the benchmark, while auto-sklearn is 24.6% better.
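As a rough illustration of how such percentages can be derived, the following sketch computes the mean squared error per dataset, averages it per framework, and expresses the difference relative to a baseline. The exact formula used for the numbers above is not stated in this post, so the relative error reduction below is an assumption, and all error values are made-up toy numbers, not the benchmark results.

```python
def mse(y_true, y_pred):
    """Mean squared error on a test sample."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pct_improvement(baseline_error, candidate_error):
    """Relative error reduction of the candidate versus the baseline, in percent."""
    return 100.0 * (baseline_error - candidate_error) / baseline_error


def average(values):
    return sum(values) / len(values)


# Hypothetical per-dataset test errors (toy numbers for illustration only)
rf_errors = [4.0, 2.0]        # vanilla Random Forest benchmark
h2o_errors = [3.0, 1.5]       # framework A
sklearn_errors = [2.0, 1.0]   # framework B

print(pct_improvement(average(rf_errors), average(h2o_errors)))
print(pct_improvement(average(rf_errors), average(sklearn_errors)))
print(pct_improvement(average(h2o_errors), average(sklearn_errors)))
```

Averaging the errors first and then comparing them is only one possible choice; averaging the per-dataset percentage differences instead would generally give different numbers.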

The sheer closeness of the results can be further illustrated by taking a look at the predicted values. Figure 2 shows, as an example, the predicted values for one particular dataset against all feature values (linear, non-linear, and noise features). As one can see, the estimated effects for both frameworks are almost identical and pretty close to the actual relationship.

Figure 2: visualization ml benchmark


Automatic machine learning frameworks can provide promising results for standard machine learning tasks while keeping manual effort to a minimum. This blog post compared two popular frameworks, namely H2O's AutoML and auto-sklearn. Both reached comparable results on ten simulated datasets and reduced the average test error of a vanilla Random Forest by roughly a quarter. Besides predictive performance, H2O's AutoML offers some additional features, such as native parallelization, an API for R, support for XGBoost, and GPU training, which make it even more attractive.

