
A Performance Benchmark of Google AutoML Vision Using Fashion-MNIST

  • Expert Sebastian Heinz
  • Date 20. August 2018
  • Topic Cloud Technology, Deep Learning, Machine Learning
  • Format Blog
  • Category Technology

Google AutoML Vision is a state-of-the-art cloud service from Google that builds deep learning models for image recognition fully automatically and from scratch. In this post, Google AutoML Vision is used to build an image classification model on the Zalando Fashion-MNIST dataset, a recent variant of the classical MNIST dataset that is considered more difficult for ML models to learn than digit MNIST.

During the benchmark, both AutoML Vision training modes, “free” (0 $, limited to 1 hour computing time) and “paid” (approx. 500 $, 24 hours computing time) were used and evaluated:

The free AutoML model achieved a macro AUC of 96.4% and an accuracy score of 88.9% on the test set at a computing time of approx. 30 minutes (early stopping). The paid AutoML model achieved a macro AUC of 98.5% on the test set with an accuracy score of 93.9%.


Recently, there has been growing interest in automated machine learning solutions. Products like H2O Driverless AI or DataRobot, just to name a few, aim at corporate customers and continue to make their way into professional data science teams and environments. For many use cases, AutoML solutions can significantly shorten time-2-model cycles and therefore allow for faster iteration and deployment of models (so that they actually start saving / making money in production).

Automated machine learning solutions will transform the data science and ML landscape substantially in the next 3-5 years. Many ML models or applications that nowadays require human input or expertise will likely be partly or fully automated by AI / ML models themselves. This will probably also lead to a decline in overall demand for “classical” data science profiles in favor of more engineering and operations related data science roles that bring models into production.

A recent example of the rapid advancements in automated machine learning is the development of deep learning image recognition models. Not too long ago, building an image classifier was a very challenging task that only few people were actually capable of doing. Due to computational, methodological and software advances, barriers have been lowered dramatically, to the point where you can build your first deep learning model with Keras in 10 lines of Python code and get “okayish” results.

Undoubtedly, there will still be many ML applications and cases that cannot be (fully) automated in the near future. Those cases will likely be the more complex ones, because basic ML tasks, such as fitting a classifier to a simple dataset, can and will easily be automated by machines.

At this point, the first steps towards machine learning automation are being made. Google as well as other companies are investing in AutoML research and product development. One of the first professional automated ML products on the market is Google AutoML Vision.

Google AutoML Vision

Google AutoML Vision (at this point in beta) is Google’s cloud service for automated machine learning for image classification tasks. Using AutoML Vision, you can train and evaluate deep learning models without any knowledge of coding, neural networks or anything else of that kind.

AutoML Vision operates in the Google Cloud and can be used either through a graphical user interface or via REST, the command line or Python. AutoML Vision implements strategies from Neural Architecture Search (NAS), currently a field of high interest in deep learning research. NAS is based on the idea that another model, typically a neural network or reinforcement learning model, designs the architecture of the neural network that is supposed to solve the machine learning task. Cornerstones of NAS research were the papers by Zoph et al. (2017) and Pham et al. (2018). The latter has also been implemented in the Python package autokeras (currently in pre-release phase) and makes neural architecture search feasible on desktop computers with a single GPU, as opposed to the 500 GPUs used in Zoph et al.
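
To make the idea of a search loop a bit more tangible, here is a deliberately tiny sketch. It is not what AutoML Vision does internally (that is not documented): the controller is replaced by plain random sampling, the search space is hypothetical, and `evaluate` is a made-up stand-in for "train the candidate network and measure its validation accuracy".

```python
import random

# Hypothetical search space: each architecture is a (layers, units) choice.
SEARCH_SPACE = {"layers": [1, 2, 3, 4], "units": [16, 32, 64, 128]}


def sample_architecture(rng):
    """Controller stand-in: draw one candidate architecture at random."""
    return {key: rng.choice(values) for key, values in SEARCH_SPACE.items()}


def evaluate(arch):
    """Stand-in for 'train the candidate and return validation accuracy'.

    A made-up score that peaks at 3 layers and 64 units, so the example
    runs in milliseconds without any real training.
    """
    return 1.0 - 0.05 * abs(arch["layers"] - 3) - 0.001 * abs(arch["units"] - 64)


def search(n_trials, seed=0):
    """Sample n_trials architectures and keep the best-scoring one."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(n_trials):
        arch = sample_architecture(rng)
        score = evaluate(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score


best_arch, best_score = search(n_trials=50)
print(best_arch, round(best_score, 3))
```

In real NAS, every call to `evaluate` means training a network, which is where the enormous GPU budgets come from; smarter controllers try to reach good architectures with fewer such evaluations.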

The idea that an algorithm is able to discover neural network architectures seems very promising, but it is still somewhat limited by computational constraints (I hope you don’t mind that I consider a 500-1000 GPU cluster a computational constraint). But how well does neural architecture search actually work in a pre-market-ready product?


In the following section, Google AutoML Vision is used to build an image recognition model based on the Fashion-MNIST dataset.


The Fashion-MNIST dataset is meant to serve as a “drop-in replacement” for the traditional MNIST dataset and has been open-sourced by the research department of Zalando, Europe’s online fashion giant (check the Fashion-MNIST GitHub repo and the Zalando research website). It contains 60,000 training and 10,000 test images of 10 different clothing categories (tops, pants, shoes etc.). Just like in MNIST, each image is a 28×28 grayscale image, and the dataset shares the same image size and the same structure of training and test images. Below are some examples from the dataset:

The makers of Fashion-MNIST argue that the traditional MNIST dataset has become too simple a task: even simple convolutional neural networks achieve >99% accuracy on the test set, whereas classical ML algorithms easily score >97%. For this and other reasons, Fashion-MNIST was created.

The Fashion-MNIST repo contains helper functions for loading the data as well as some scripts for benchmarking and testing your models. Also, there’s a neat visualization of an embedding of the data on the repo. After cloning, you can import the Fashion-MNIST data using a simple Python function (check the code in the next section) and start to build your model.
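
For orientation, the following small sketch shows how one entry of the dataset can be interpreted: a flat row of 784 pixel values becomes a 28×28 grid, and the integer label maps to a clothing category. The class names follow the listing in the Fashion-MNIST repository README; `to_grid`, `fake_row` and `fake_label` are illustrative names and fake data, not part of the repo.

```python
# Class names in label order (0-9), as listed in the Fashion-MNIST README.
CLASS_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def to_grid(flat_pixels, side=28):
    """Turn a flat row of side*side pixel values into a list of rows."""
    if len(flat_pixels) != side * side:
        raise ValueError("expected {} pixel values".format(side * side))
    return [list(flat_pixels[r * side:(r + 1) * side]) for r in range(side)]


# Fake image row and label, standing in for one entry of the dataset.
fake_row = [i % 256 for i in range(784)]
fake_label = 9

grid = to_grid(fake_row)
print(CLASS_NAMES[fake_label], len(grid), len(grid[0]))
```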

Using Google AutoML Vision

Preparing the data

AutoML offers two ways of data ingestion: (1) upload a zip file that contains the training images in different folders, corresponding to the respective labels, or (2) upload a CSV file that contains the Google Cloud Storage (GS) filepaths, the labels and optionally the data partition for training, validation and test set. I decided to go with the CSV file because you can define the data partition (flag names are TRAIN, VALIDATION and TEST) and thereby keep control over the experiment. Below is the required structure of the CSV file that needs to be uploaded to AutoML Vision (without the header!).

partition   file                                       label
TRAIN       gs://bucket-name/folder/image_0.jpg        0
TRAIN       gs://bucket-name/folder/image_1.jpg        2
VALIDATION  gs://bucket-name/folder/image_22201.jpg    7
VALIDATION  gs://bucket-name/folder/image_22202.jpg    9
TEST        gs://bucket-name/folder/image_69998.jpg    4
TEST        gs://bucket-name/folder/image_69999.jpg    1
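
As a minimal sketch of how such a file can be produced, the snippet below writes a few rows in this layout using only the standard library. The helper `build_rows` and the bucket path are placeholders for illustration; the column order (partition, file, label) and the missing header follow the description above.

```python
import csv
import io

PARTITIONS = {"TRAIN", "VALIDATION", "TEST"}


def build_rows(partition, gs_folder, start_index, labels):
    """Create (partition, gs_path, label) rows for consecutive image files."""
    if partition not in PARTITIONS:
        raise ValueError("unknown partition: " + partition)
    return [
        (partition, "{}/image_{}.jpg".format(gs_folder, start_index + i), label)
        for i, label in enumerate(labels)
    ]


rows = (
    build_rows("TRAIN", "gs://bucket-name/folder", 0, [0, 2])
    + build_rows("VALIDATION", "gs://bucket-name/folder", 2, [7])
    + build_rows("TEST", "gs://bucket-name/folder", 3, [4])
)

# Write the rows without a header line, as the import expects.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```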

Just like MNIST, the Fashion-MNIST data contains the pixel values of the respective images. To actually upload image files, I developed a short Python script that takes care of the image creation, export and upload to GCP. The script iterates over each row of the Fashion-MNIST dataset, exports the image and uploads it into a Google Cloud Storage bucket.

import os
import gzip
import numpy as np
import pandas as pd
from google.cloud import storage
from keras.preprocessing.image import array_to_img

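def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    # File names follow the naming used in the Fashion-MNIST repository
    labels_path = os.path.join(path,
                               '%s-labels-idx1-ubyte.gz'
                               % kind)
    images_path = os.path.join(path,
                               '%s-images-idx3-ubyte.gz'
                               % kind)

    # Labels start after an 8-byte header
    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8,
                               offset=8)

    # Pixels start after a 16-byte header; one row of 784 values per image
    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8,
                               offset=16).reshape(len(labels), 784)

    return images, labels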

# Import training data
X_train, y_train = load_mnist(path='data', kind='train')
X_test, y_test = load_mnist(path='data', kind='t10k')

# Split validation data
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=10000)

# Dataset placeholder
files = pd.DataFrame({'part': np.concatenate([np.repeat('TRAIN', 50000),
                                              np.repeat('VALIDATION', 10000),
                                              np.repeat('TEST', 10000)]),
                      'file': np.repeat('file', 70000),
                      'label': np.repeat('label', 70000)})

# Stack training and test data into single arrays
X_data = np.vstack([X_train, X_valid, X_test])
y_data = np.concatenate([y_train, y_valid, y_test])

# GS path
gs_path = 'gs://secret/fashionmnist'

# Storage client
storage_client = storage.Client.from_service_account_json(json_credentials_path='secret.json')
bucket = storage_client.get_bucket('secret-bucket')

# Make sure the local export folder exists
os.makedirs('fashionmnist', exist_ok=True)

# Fill matrix
for i, x in enumerate(X_data):
    # Console print
    if i % 1000 == 0:
        print('Uploading image {image}'.format(image=i))
    # Reshape and export image
    img = array_to_img(x=x.reshape(28, 28, 1))
    img.save(fp='fashionmnist' + '/' + 'image_' + str(i) + '.jpg')
    # Add info to data frame
    files.iloc[i, 1] = gs_path + '/' + 'image_' + str(i) + '.jpg'
    files.iloc[i, 2] = y_data[i]
    # Upload to GCP
    blob = bucket.blob('fashionmnist/' + 'image_' + str(i) + '.jpg')
    blob.upload_from_filename('fashionmnist/' + 'image_' + str(i) + '.jpg')
    # Delete image file
    os.remove('fashionmnist/' + 'image_' + str(i) + '.jpg')

# Export CSV file
files.to_csv(path_or_buf='fashionmnist.csv', header=False, index=False)

The function load_mnist is from the Fashion-MNIST repository and imports the training and test arrays into Python. After importing the training set, 10,000 examples are sampled and stored as validation data using train_test_split from sklearn.model_selection. The training, validation and test arrays are then stacked into X_data in order to have a single object for iteration. A placeholder DataFrame is initialized to store the information required by AutoML Vision (partition, filepath and label). storage from google.cloud connects to GCP using a service account JSON file (which I will, of course, not share here). Finally, the main process takes place: iterating over X_data, generating an image for each row, saving it to disk, uploading it to GCP and deleting the local image since it is no longer needed. Lastly, I uploaded the exported CSV file into the Google Cloud Storage bucket of the project.
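
The offsets used when reading the files come from the IDX file format: an image file starts with a 16-byte header of four big-endian integers (magic number, image count, rows, columns), and only then do the pixel bytes begin. The sketch below illustrates this with a tiny fake file built in memory; `make_idx_images` and `read_idx_images` are illustrative helpers, not functions from the Fashion-MNIST repo.

```python
import gzip
import struct


def make_idx_images(images, rows, cols):
    """Build a gzip-compressed IDX image file in memory (toy data only)."""
    header = struct.pack(">IIII", 2051, len(images), rows, cols)
    body = bytes(pixel for image in images for pixel in image)
    return gzip.compress(header + body)


def read_idx_images(blob):
    """Parse the 16-byte header, then slice the pixel bytes per image."""
    raw = gzip.decompress(blob)
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:
        raise ValueError("not an IDX image file")
    size = rows * cols
    pixels = raw[16:]
    return [list(pixels[i * size:(i + 1) * size]) for i in range(count)]


# Two fake 2x2 "images" instead of the real 28x28 ones.
toy = [[0, 255, 128, 64], [1, 2, 3, 4]]
blob = make_idx_images(toy, rows=2, cols=2)
decoded = read_idx_images(blob)
print(decoded)
```

Label files work the same way, except that their header consists of only two integers (magic number and count), hence the shorter offset of 8 bytes.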

Getting into AutoML

AutoML Vision is currently in beta, which means that you have to apply before trying it out. Since my colleagues and I are currently exploring the usage of automated machine learning in a computer vision project for one of our customers, I already have access to AutoML Vision through the GCP console.

The start screen looks pretty unspectacular at this point. You can start by clicking on “Get started with AutoML” or read the documentation, which is pretty basic so far but informative, especially if you’re not familiar with basic machine learning concepts such as train-test splits, overfitting, precision / recall etc.

After you have started, Google AutoML takes you to the dataset dialog, which is the first step on the road to the final AutoML model. So far, nothing to report here. Later, this is where you will find all of your imported datasets.

Generating the dataset

After hitting “+ NEW DATASET” AutoML takes you to the “Create dataset” dialog. As mentioned before, new datasets can be added using two different methods, shown in the next image.

I’ve already uploaded the images from my computer into the GS bucket, together with the CSV file containing the GS filepaths, the partition information and the corresponding labels. In order to add the dataset to AutoML Vision, you must specify the filepath to that CSV file.

In the “Create dataset” dialog, you can also enable multi-label classification if you have multiple labels per image, which is a very helpful feature. After hitting “CREATE DATASET”, AutoML iterates over the provided file names and builds the dataset for modeling. What exactly it does is neither visible nor documented. This import process may take a while, so it shows you the funky “Knight Rider” progress bar.

After the import is finished, you will receive an email from GCP informing you that the import of the dataset is completed. I find this helpful because you don’t have to keep the browser window open and stare at the progress bar all the time.

The email looks a bit weird, but hey, it’s still beta…

Training a model

Back to AutoML. The first thing you see after building your dataset are the imported images. In this example, the images are a bit pixelated because they are only 28×28 in resolution. You can navigate through the different labels using the nav bar on the left side and also manually add labels to images that are still unlabeled. Furthermore, you can request a human labeling service if your images do not come with any labels. Additionally, you can create new labels if you need to add a category etc.

Now let’s get serious. After going to the “TRAIN” dialog, AutoML informs you about the frequency distribution of your labels. It recommends a minimum count of $n=100$ labels per class (which I find quite low). Also, it seems to show the frequencies of the whole dataset (train, validation and test together). A frequency plot grouped by data partition would be more informative at this point, in my opinion.
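
Such a breakdown by partition is easy to compute yourself from the import CSV. The following sketch counts labels per partition for a handful of made-up rows in the (partition, file, label) layout; the function names and the `minimum` threshold parameter are illustrative choices, not something AutoML provides.

```python
from collections import Counter, defaultdict

# Made-up (partition, gs_path, label) rows in the import CSV layout.
rows = [
    ("TRAIN", "gs://bucket-name/folder/image_0.jpg", 0),
    ("TRAIN", "gs://bucket-name/folder/image_1.jpg", 2),
    ("TRAIN", "gs://bucket-name/folder/image_2.jpg", 0),
    ("VALIDATION", "gs://bucket-name/folder/image_3.jpg", 7),
    ("TEST", "gs://bucket-name/folder/image_4.jpg", 0),
]


def label_counts_by_partition(rows):
    """Return {partition: Counter({label: count})}."""
    counts = defaultdict(Counter)
    for partition, _path, label in rows:
        counts[partition][label] += 1
    return counts


def underrepresented(counts, minimum):
    """List (partition, label) pairs with fewer than `minimum` examples."""
    return sorted(
        (partition, label)
        for partition, counter in counts.items()
        for label, n in counter.items()
        if n < minimum
    )


counts = label_counts_by_partition(rows)
for partition in ("TRAIN", "VALIDATION", "TEST"):
    print(partition, dict(counts[partition]))
```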

A click on “start training” takes you to a popup window where you can define the model name and allocate the training budget (computing time / money) you are willing to invest. You can choose between “1 compute hour”, which is free for 10 models every month, and “24 compute hours (higher quality)”, which comes with a price tag of approx. 480 $ (1 hour of AutoML computing costs 20 $). However, if the architecture search converges earlier, you only pay for the computing time consumed up to that point, which I find reasonable and fair. Lastly, there is also the option to define a custom training time, e.g. 5 hours.
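
The pricing logic described here boils down to simple arithmetic, sketched below. It only uses the 20 $ hourly rate quoted above and the rule that unused budget is not charged; the free monthly quota is deliberately ignored, and the function name and parameters are illustrative.

```python
HOURLY_RATE_USD = 20  # price per compute hour quoted in the training dialog


def training_cost(budget_hours, hours_until_convergence=None):
    """Estimate the bill: only the compute time actually used is charged."""
    used = budget_hours
    if hours_until_convergence is not None:
        used = min(budget_hours, hours_until_convergence)
    return used * HOURLY_RATE_USD


print(training_cost(24))      # full 24-hour budget
print(training_cost(24, 10))  # search converged after 10 hours
print(training_cost(5))       # custom 5-hour budget
```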

In this experiment, I tried both: the “free” version of AutoML, and the “all-in” 24 hours option to achieve the best model possible (“paid model”). Let’s see what you can expect from a 480 $ cutting edge AutoML solution. After hitting “START TRAINING”, the familiar Knight Rider screen appears, telling you that you can close the browser window and let AutoML do the rest. Naise.

Results and evaluation

First, let’s start with the free model. It took approx. 30 minutes of training and seemed to have converged to a solution very quickly. I am not sure what exactly AutoML does when it evaluates convergence criteria, but it seems to differ between the free and the paid model, because the free model had already converged after about 30 minutes of computing and the paid model had not.

The overall model metrics of the free model look pretty decent: an average precision of 96.4% on the test set at a macro class 1 precision of 90.9% and a recall of 87.7%. The current accuracy benchmark on the Fashion-MNIST dataset is at 96.7% (WRN40-4, 8.9M parameters), followed by 96.3% (WRN-28-10 + Random Erasing), while the accuracy of the low budget model is only at 89.0%. So the free AutoML model is pretty far away from the current Fashion-MNIST benchmark. Below, you’ll find the screenshot of the free model’s metrics.

The model metrics of the paid model look significantly better. It achieved an average precision of 98.5% on the test set at a macro class 1 precision of 95.0% and a recall of 92.8%, as well as an accuracy score of 93.9%. Those results are close to the current benchmark, though not as close as I had hoped. Below, you’ll find the screenshot of the paid model’s metrics.

The “EVALUATE” tab also shows you further detailed metrics such as precision / recall curves as well as sliders for classification cutoffs that change the model metrics accordingly. At the bottom of the page you’ll find the confusion matrix with relative frequencies of correct and misclassified examples. Furthermore, you can check images of false positives and negatives per class (which is very helpful if you want to understand why and when your model is doing something wrong). Overall, the model evaluation functionalities are limited but user friendly. As a more advanced user, of course, I would like to see more advanced features, but considering the target group and the status of development I think it is pretty decent.
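
To show how accuracy, precision and recall relate to a confusion matrix, here is a small sketch with a made-up 3-class matrix. It assumes each example is assigned to exactly one predicted class, so it does not reproduce the cutoff-based numbers AutoML reports; the counts are toy values, not the results from this benchmark.

```python
def metrics_from_confusion(matrix):
    """Compute accuracy, macro precision and macro recall.

    matrix[i][j] = number of examples with true class i predicted as class j.
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n_classes))
    precisions, recalls = [], []
    for k in range(n_classes):
        predicted_k = sum(matrix[i][k] for i in range(n_classes))
        actual_k = sum(matrix[k])
        precisions.append(matrix[k][k] / predicted_k if predicted_k else 0.0)
        recalls.append(matrix[k][k] / actual_k if actual_k else 0.0)
    return {
        "accuracy": correct / total,
        "macro_precision": sum(precisions) / n_classes,
        "macro_recall": sum(recalls) / n_classes,
    }


# Toy 3-class confusion matrix (made-up counts, not the AutoML results).
toy_matrix = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
metrics = metrics_from_confusion(toy_matrix)
print(metrics)
```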


After fitting and evaluating your model, you can use several methods to predict new images. First, you can use the AutoML user interface to upload new images from your local machine. This is a great way for inexperienced users to apply their model to new images and get predictions. For advanced users and developers, AutoML Vision exposes the model through an API on GCP while taking care of all the technical infrastructure in the background. A simple Python script shows the basic usage of the API:

import sys
from google.cloud import automl_v1beta1

# Define client from service account json
client = automl_v1beta1.PredictionServiceClient.from_service_account_json(filename='automl-XXXXXX-d3d066fe6f8c.json')

# Endpoint
name = 'projects/automl-XXXXXX/locations/us-central1/models/ICNXXXXXX'

# Import a single image
with open('image_10.jpg', 'rb') as ff:
    img = ff.read()

# Define payload
payload = {'image': {'image_bytes': img}}

# Prediction
response = client.predict(name=name, payload=payload, params={})

# Console output
print(response)
# payload {
#   classification {
#     score: 0.9356002807617188
#   }
#   display_name: "a_0"
# }

As a third method, it is also possible to curl the API from the command line, if you want to go full nerdcore. I think the automated API exposure is a great feature because it lets you integrate your model into all kinds of scripts and applications. Furthermore, Google takes care of all the nitty-gritty things that come into play when you want to scale the model to hundreds or thousands of simultaneous API requests in a production environment.
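
For the REST route, the request essentially consists of a URL ending in `:predict` and a JSON body carrying the base64-encoded image. The sketch below only assembles both and sends nothing; the endpoint pattern and the `imageBytes` field name are my understanding of the v1beta1 REST interface and should be checked against the current documentation, and the model path and image bytes are placeholders.

```python
import base64
import json


def build_predict_request(model_name, image_bytes):
    """Return (url, json_body) for a predict call; nothing is sent."""
    # Assumed endpoint pattern of the v1beta1 REST API
    url = "https://automl.googleapis.com/v1beta1/{}:predict".format(model_name)
    body = {
        "payload": {
            "image": {"imageBytes": base64.b64encode(image_bytes).decode("ascii")}
        }
    }
    return url, json.dumps(body)


# Placeholder model path and fake image bytes, for illustration only.
url, body = build_predict_request(
    "projects/automl-XXXXXX/locations/us-central1/models/ICNXXXXXX",
    b"\xff\xd8fake-jpeg-bytes",
)
print(url)
print(body)
```

An actual call would additionally need an access token in the Authorization header, which is omitted here.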

Conclusion and outlook

In a nutshell, even the free model achieved pretty good results on the test set, given that the time invested in the model was only a fraction of what it would have taken to build the model manually. The paid model achieved significantly better results, however at a cost of 480 $. Obviously, the paid service is targeted at data science professionals and companies.

AutoML Vision is only a part of a set of new AutoML applications that come to the Google Cloud (check these announcements from Google Next 18), further shaping the positioning of the platform in the direction of machine learning and AI.

Personally, I am confident that automated machine learning solutions will continue to make their way into professional data science projects and applications. With automated machine learning, you can (1) build baseline models for benchmarking your custom solutions, (2) iterate use cases and data products faster and (3) get more quickly to the point where you actually start to make money with your data: in production.
