Introduction
At STATWORX we love beautiful plots. One of my favorite plotting libraries is Plotly. It has been developed by the company of the same name since 2012. Plotly.js is a high-level JavaScript library for interactive graphics and offers wrappers for a diverse range of languages, such as Python, R, or Matlab. Furthermore, it is open source and licensed under the MIT license, so it can be used in a commercial context. Plotly offers more than 30 different chart types. Another reason we at STATWORX use Plotly extensively is that it can be easily integrated into web-based frameworks like Dash or R Shiny.
How does it work?
A Plotly plot is based on the following three main elements: Data, Layout and Figure.
Data
The Data object can contain several traces. For example, in a line chart with several lines, each line is represented by a different trace. Accordingly, the data object contains not only the data to be plotted but also the specification of how it should be plotted.
Layout
The Layout object defines everything that is not related to the data. It contains elements like the title, axis titles, or background color. You can also add annotations or shapes with the layout object.
Figure
The Figure object includes both data and layout. It creates our final figure for plotting, which is just a simple dictionary-like object. All figures are built with plotly.js, so in the end, the Python API only interacts with the plotly.js library.
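To make the three elements concrete, here is a minimal sketch that writes a figure by hand as plain Python dictionaries. The trace name and the sample values are made up for illustration and are not taken from the dataset used later.

```python
# Data: a list of traces; each trace holds the values and how to draw them.
data = [
    {
        "type": "scatter",
        "mode": "lines",
        "name": "Example line",   # hypothetical trace name
        "x": [1, 2, 3],           # made-up sample values
        "y": [10, 15, 13],
    }
]

# Layout: everything that is not related to the data itself.
layout = {
    "title": {"text": "Example figure"},
    "xaxis": {"title": {"text": "x"}},
    "yaxis": {"title": {"text": "y"}},
}

# Figure: data and layout combined in one dictionary-like object.
figure = {"data": data, "layout": layout}

print(sorted(figure))
print(figure["data"][0]["type"])
```

A dictionary of this shape should be accepted by the Python API in place of a figure object, for example by the display functions shown further below.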
Application
Let’s visualize some data. For that purpose, we will use the LA Metro Bike Share dataset, which is hosted by the city of Los Angeles and contains anonymized Metro Bike Share trip data. In the following section, we will use Plotly for Python and compare it later on with the R implementation.
Creating our first line plot with Plotly
First, we will generate a line plot that shows the number of rented bikes per day, broken down by passholder type. Thus, we first have to aggregate our data before we can plot it. As described above, we define our different traces. Each trace contains the number of rented bikes for a specific passholder type. For line plots, we use the Scatter() function from plotly.graph_objs. It is used for both scatter and line plots; we define how the data is displayed by setting the mode parameter accordingly. Those traces are collected as a list in our data object. Our layout object consists of a dictionary, where we define the main title and the axis titles. Finally, we put our data and layout objects together as a figure object.
import pandas as pd
import plotly.graph_objs as go
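import plotly.plotly as py  # plotly 3.x; moved to chart_studio.plotly in plotly 4
# Pass the file name positionally so the call works across pandas versions.
df = pd.read_pickle("LA_bike_share.pkl")
# Count the trips per start date and passholder type.
rental_count = df.groupby(["Start_Date", "Passholder_Type"]).size().reset_index(name="Total_Count")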
trace0 = go.Scatter(
    x=rental_count.query("Passholder_Type=='Flex Pass'").Start_Date,
    y=rental_count.query("Passholder_Type=='Flex Pass'").Total_Count,
    name="Flex Pass",
    mode="lines",
    line=dict(color="#013848")
)
trace1 = go.Scatter(
    x=rental_count.query("Passholder_Type=='Monthly Pass'").Start_Date,
    y=rental_count.query("Passholder_Type=='Monthly Pass'").Total_Count,
    name="Monthly Pass",
    mode="lines",
    line=dict(color="#0085AF")
)
trace2 = go.Scatter(
    x=rental_count.query("Passholder_Type=='Walk-up'").Start_Date,
    y=rental_count.query("Passholder_Type=='Walk-up'").Total_Count,
    name="Walk-up",
    mode="lines",
    line=dict(color="#00A378")
)
data = [trace0, trace1, trace2]
layout = go.Layout(
    title="Number of rented bikes over time",
    yaxis=dict(title="Number of rented bikes", zeroline=False),
    xaxis=dict(title="Date", zeroline=False)
)
fig = go.Figure(data=data, layout=layout)
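The three traces above differ only in the passholder type and the color, so they could also be generated in a loop. The following is a rough sketch of that idea using only the standard library: the `trips` sample is hypothetical, `Counter` stands in for the pandas aggregation, and the traces are written as plain dictionaries rather than `go.Scatter` objects. The helper `build_traces` is an illustrative name, not part of Plotly.

```python
from collections import Counter

# Hypothetical sample of (start date, passholder type) pairs, one per trip.
trips = [
    ("2016-07-07", "Flex Pass"),
    ("2016-07-07", "Walk-up"),
    ("2016-07-07", "Walk-up"),
    ("2016-07-08", "Flex Pass"),
    ("2016-07-08", "Monthly Pass"),
    ("2016-07-08", "Walk-up"),
]

# Same idea as groupby(...).size(): count trips per date and passholder type.
rental_count = Counter(trips)

# One color per passholder type, as in the plot above.
colors = {
    "Flex Pass": "#013848",
    "Monthly Pass": "#0085AF",
    "Walk-up": "#00A378",
}


def build_traces(counts, colors):
    """Return one line-trace dictionary per passholder type."""
    traces = []
    for passholder_type, color in colors.items():
        # Keep only the dates on which this passholder type occurs, sorted.
        dates = sorted(d for (d, p) in counts if p == passholder_type)
        traces.append({
            "type": "scatter",
            "mode": "lines",
            "name": passholder_type,
            "x": dates,
            "y": [counts[(d, passholder_type)] for d in dates],
            "line": {"color": color},
        })
    return traces


data = build_traces(rental_count, colors)
for trace in data:
    print(trace["name"], trace["x"], trace["y"])
```

With pandas and graph_objs, the same loop would call `go.Scatter(...)` once per passholder type instead of building a dictionary.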
Understanding the structure behind graph_objs
If we output the figure-object, we will get the following dictionary-like object.
Figure({
'data': [{'line': {'color': '#013848'},
'mode': 'lines',
'name': 'Flex Pass',
'type': 'scatter',
'uid': '5d8c0781-4592-4d19-acd9-a13a22431ccd',
'x': array([datetime.date(2016, 7, 7), datetime.date(2016, 7, 8),
datetime.date(2016, 7, 9), ..., datetime.date(2017, 3, 29),
datetime.date(2017, 3, 30), datetime.date(2017, 3, 31)], dtype=object),
'y': array([ 61, 93, 113, ..., 52, 36, 40])},
{'line': {'color': '#0085AF'},
'mode': 'lines',
'name': 'Monthly Pass',
'type': 'scatter',
'uid': '4c4c76b9-c909-44b7-8e8b-1b0705fa2491',
'x': array([datetime.date(2016, 7, 7), datetime.date(2016, 7, 8),
datetime.date(2016, 7, 9), ..., datetime.date(2017, 3, 29),
datetime.date(2017, 3, 30), datetime.date(2017, 3, 31)], dtype=object),
'y': array([128, 251, 308, ..., 332, 312, 301])},
{'line': {'color': '#00A378'},
'mode': 'lines',
'name': 'Walk-up',
'type': 'scatter',
'uid': '8303cfe0-0de8-4646-a256-5f3913698bd9',
'x': array([datetime.date(2016, 7, 7), datetime.date(2016, 7, 8),
datetime.date(2016, 7, 12), ..., datetime.date(2017, 3, 29),
datetime.date(2017, 3, 30), datetime.date(2017, 3, 31)], dtype=object),
'y': array([ 1, 1, 1, ..., 122, 133, 176])}],
'layout': {'title': {'text': 'Number of rented bikes over time'},
'xaxis': {'title': {'text': 'Date'}, 'zeroline': False},
'yaxis': {'title': {'text': 'Number of rented bikes'}, 'zeroline': False}}
})
In theory, we could build those dictionaries or change the entries by hand without using plotly.graph_objs. However, it is much more convenient to use graph_objs than to write dictionaries. In addition, we can call help on those functions to see which parameters are available for each chart type, and graph_objs raises a more detailed error if something goes wrong. It is also possible to export the figure object as JSON and import it, for example, in R.
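As a small illustration of editing entries by hand and exporting to JSON, the sketch below works on a hand-written figure dictionary with made-up values. It uses only the standard `json` module, which is an assumption that holds for plain lists, strings, and numbers.

```python
import json

# A small hand-written figure; the values are made up for illustration.
fig_dict = {
    "data": [{"type": "scatter", "mode": "lines",
              "x": [1, 2, 3], "y": [4, 1, 2]}],
    "layout": {"title": {"text": "Draft title"}},
}

# Change entries by hand, exactly as with any nested dictionary.
fig_dict["layout"]["title"]["text"] = "Number of rented bikes over time"
fig_dict["data"][0]["line"] = {"color": "#013848"}

# Serialize to JSON text; this string could be written to a file.
fig_json = json.dumps(fig_dict)

# Reading it back yields an equal dictionary.
restored = json.loads(fig_json)
print(restored == fig_dict)
```

A real figure usually holds NumPy arrays and dates, which the plain `json` module cannot serialize on its own. For that case, plotly ships its own encoder, `plotly.utils.PlotlyJSONEncoder`.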
Displaying our plot
Nonetheless, we don’t want a JSON file but rather an interactive graph. We now have two options: either we publish it online, as Plotly provides a web service for hosting graphs that includes a free plan, or we create the graphs offline. Offline, we can display them in a Jupyter notebook or save them as standalone HTML.
To display our plot in a Jupyter notebook, we need to execute the following code at the beginning of each notebook:
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)
Finally, we can display our plot with iplot(fig).
Before publishing it online, we first need to set our credentials with
plotly.tools.set_credentials_file(username='user.name', api_key='api.key')
and use py.plot(fig, filename='basic-plot', auto_open=True) instead of iplot(fig). The following graph is published online on Plotly’s platform and embedded as an inline frame.
The chart above is fully interactive, which has multiple advantages:
- Select and deselect different lines
- Automatic rescaling of the y-axis when lines are deselected
- Hover information with the exact numbers and dates
- Zoom in and out with self-adjusting date ticks
- Different chart-modes and the ability to toggle additional options, like spike lines
- Possibility to include a range-slider or buttons
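Some of the options in this list are switched on in the layout. The following sketch shows a layout dictionary with a range slider, range-selector buttons, and spike lines on the x-axis. The button labels and time spans are illustrative choices, not the settings of the published chart.

```python
# Layout with a range slider, range-selector buttons, and spike lines.
layout = {
    "title": {"text": "Number of rented bikes over time"},
    "xaxis": {
        "title": {"text": "Date"},
        "showspikes": True,                # spike line on hover
        "rangeslider": {"visible": True},  # slider below the plot
        "rangeselector": {
            "buttons": [
                # Zoom to the last month and to the last three months.
                {"count": 1, "label": "1m", "step": "month", "stepmode": "backward"},
                {"count": 3, "label": "3m", "step": "month", "stepmode": "backward"},
                {"step": "all"},           # reset to the full range
            ]
        },
    },
    "yaxis": {"title": {"text": "Number of rented bikes"}},
}

buttons = layout["xaxis"]["rangeselector"]["buttons"]
print([b.get("label", "all") for b in buttons])
```

Such a dictionary can be used in place of the layout object when assembling a figure.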
The graph shows a fairly clear weekly pattern: Monthly Passholders peak during the workweek, while Walk-ups are more active on the weekend. Apart from some unusual spikes, the number of rented bikes is higher for Monthly Passholders than for Walk-ups.
Visualizing the data as a pie chart
The next question is: how does the total duration look for the different passholder types? First, we need to aggregate our data accordingly. This time we will build a pie chart in order to get the share of the total duration for each passholder type. As we previously did with the line chart, we must first generate a trace object, this time using Pie() from graph_objs. The arguments we use are different now: we have labels and values instead of x and y. We can also determine which hover information we want to display, add custom information with hovertext, or completely customize it with hovertemplate. Afterward, the trace object goes into go.Figure() in the form of a list.
# Sum only the Duration column so non-numeric columns do not break the aggregation.
share_duration = df.groupby("Passholder_Type")["Duration"].sum().reset_index()
colors = ["#013848", "#0085AF", "#00A378"]
trace = go.Pie(
    labels=share_duration.Passholder_Type,
    values=share_duration.Duration,
    marker=dict(colors=colors,
                line=dict(color='white', width=1)),
    hoverinfo="label+percent"
)
fig = go.Figure(data=[trace])
The pie chart shows that Walk-ups account for 59% of the total duration. Thus, we could assume that the average duration for Walk-ups is higher than for Monthly Passholders.
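The step from a share of the total duration to an average duration also depends on the number of trips, which the pie chart does not show. The small worked example below uses invented totals (not the dataset's numbers) to show how the two quantities relate.

```python
# Hypothetical totals per passholder type (hours ridden, number of trips).
# These numbers are invented for illustration and are not from the dataset.
totals = {
    "Flex Pass":    {"duration": 50.0,  "trips": 250},
    "Monthly Pass": {"duration": 400.0, "trips": 2000},
    "Walk-up":      {"duration": 550.0, "trips": 1000},
}

total_duration = sum(t["duration"] for t in totals.values())

shares = {}
averages = {}
for name, t in totals.items():
    shares[name] = t["duration"] / total_duration  # what the pie chart displays
    averages[name] = t["duration"] / t["trips"]    # what the pie chart cannot show
    print(f"{name}: share={shares[name]:.0%}, average={averages[name]:.2f} h")
```

In this made-up case the group with the largest share also has the highest average, but only because it makes fewer trips than its share would suggest. That is why the assumption needs to be checked against the data.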
There is one more thing: figure factory
Now, let’s plot the distribution of the average daily duration. For that we use the create_distplot() function from figure_factory. The figure factory module contains wrapper functions that create unique chart types which are not implemented in the native plotly.js library, like bullet charts, dendrograms, or quiver plots. Thus, they are not available for other languages, like R or Matlab. However, those functions also deviate from the structure for building a Plotly graph we discussed above and are not consistent within figure_factory either. By default, create_distplot() creates a plot with a KDE curve, a histogram, and a rug; each of these can be removed by setting show_curve, show_hist, or show_rug to False. First, we create a list with our data as hist_data, in which every entry is displayed as a distribution plot on its own. Optionally, we can define group labels, colors, or a rug text, which is displayed as hover information on every rug entry.
import plotly.figure_factory as ff
# Average only the Duration column so non-numeric columns do not break the aggregation.
mean_duration = df.groupby(["Start_Date", "Passholder_Type"])["Duration"].mean().reset_index()
hist_data = [mean_duration.query("Passholder_Type=='Flex Pass'").Duration,
             mean_duration.query("Passholder_Type=='Monthly Pass'").Duration,
             mean_duration.query("Passholder_Type=='Walk-up'").Duration]
group_labels = ["Flex Pass", "Monthly Pass", "Walk-up"]
rug_text = [mean_duration.query("Passholder_Type=='Flex Pass'").Start_Date,
            mean_duration.query("Passholder_Type=='Monthly Pass'").Start_Date,
            mean_duration.query("Passholder_Type=='Walk-up'").Start_Date]
colors = ["#013848", "#0085AF", "#00A378"]
fig = ff.create_distplot(hist_data, group_labels, show_hist=False,
                         rug_text=rug_text, colors=colors)
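To give an intuition for what a KDE curve represents, here is a rough standard-library sketch of a Gaussian kernel density estimate. It assumes a Gaussian kernel with Scott's rule-of-thumb bandwidth; the curve drawn by create_distplot() may be computed differently in detail. The function name `kde_curve` and the sample durations are made up for illustration.

```python
import math
import statistics


def kde_curve(samples, grid):
    """Evaluate a Gaussian kernel density estimate of samples on grid."""
    n = len(samples)
    # Scott's rule of thumb for the bandwidth (an assumed choice for this sketch).
    bandwidth = statistics.stdev(samples) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]


# Made-up average daily durations in hours, not taken from the dataset.
durations = [0.15, 0.17, 0.18, 0.18, 0.20, 0.22, 0.25, 0.60]

# Evaluate the density on an evenly spaced grid from 0.0 to 1.0 hours.
grid = [i / 100 for i in range(101)]
density = kde_curve(durations, grid)

# Location of the highest point of the curve (the "peak" read off the plot).
peak = grid[density.index(max(density))]
print(peak)
```

Each observation contributes one small bell curve, and the KDE is their average. A single outlier such as the 0.60 value therefore shows up as a bump in the right tail rather than shifting the main peak.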
As we assumed, Walk-ups have a higher average duration than Monthly or Flex Passholders. The average daily duration for Walk-ups peaks at around 0.6 hours, whereas for Monthly and Flex Passholders it already peaks at around 0.18 and 0.2 hours, respectively. Also, the distribution for Walk-ups is much flatter, with a fat right tail. Thanks to the rug, we can see that for Flex Pass there are some days with a very high average duration, and thanks to the hover information, we can immediately detect which days have an unusually high average renting duration. The average duration on February 2, 2017, was 1.57 hours. Next, we could dig deeper and look into possible reasons for such unusual activity, for example, a special event or the weather.
Plotly with R
As mentioned in the beginning, Plotly is available for many languages. At STATWORX, we mainly use Plotly in R, especially when we create a dashboard with R Shiny. However, the syntax is slightly different, as the R implementation utilizes R’s pipe operator. Below, we create the same bar plot in Python and in R. In Python, we aggregate our data with pandas, create a separate trace for every unique value of Trip Route Category, specify that we want to create a stacked bar chart with our different traces, and assemble our data and layout objects with go.Figure().
total_count = df.groupby(["Passholder_Type", "Trip_Route_Category"]).size().reset_index(name="Total_count")
trace0 = go.Bar(
    x=total_count.query("Trip_Route_Category=='Round Trip'").Passholder_Type,
    y=total_count.query("Trip_Route_Category=='Round Trip'").Total_count,
    name="Round Trip",
    marker=dict(color="#09557F"))
trace1 = go.Bar(
    x=total_count.query("Trip_Route_Category=='One Way'").Passholder_Type,
    y=total_count.query("Trip_Route_Category=='One Way'").Total_count,
    name="One Way",
    marker=dict(color="#FF8000"))
data = [trace0, trace1]
layout = dict(barmode="stack")
fig = go.Figure(data=data, layout=layout)
With R, we can aggregate the data with dplyr and already start our pipe there. Afterward, we pipe the result into the plot_ly() function, which thereby already knows which data frame to use. Within plot_ly(), we can directly address the column names. We don’t have to create several traces and add them with add_trace(), but can define the separation between the different Trip Route Categories with the color argument. In the end, we pipe the result into the layout() function and define the plot as a stacked bar chart. Thus, by using the pipe operator, the code looks slightly tidier. However, in comparison to the Python implementation, we lose the neat functions of the figure factory.
basic_bar_chart <- df %>%
  group_by(Passholder_Type, Trip_Route_Category) %>%
  summarise(Total_count = n()) %>%
  plot_ly(x = ~Passholder_Type,
          y = ~Total_count,
          color = ~Trip_Route_Category,
          type = 'bar',
          marker = list(color = c(rep("#FF8000", 3), rep("#09557F", 3)))) %>%
  layout(barmode = 'stack')
The bar plot shows that Walk-ups use their rented bikes more often for Round Trips in comparison to Monthly Passholders, which could be a reason for their higher average duration.
Conclusion
I hope I have motivated you to have a look at interactive graphs with Plotly instead of using static seaborn or ggplot plots, especially for hands-on sessions or dashboards. It is also possible to create an interactive Plotly chart from a ggplot or Matplotlib object with one additional line of code.
Version 3.0 of plotly.py introduced many interesting new features, like Jupyter Widgets, imperative methods for creating a plot, and the possibility to use datashader. Soon you’ll find a blog post here by a colleague of mine on how to implement zoomable histograms with Plotly and Jupyter Widgets and why automatic rebinning makes sense.