Python Pandas Plotting from a Java perspective

Peter Andersson
6 min read · Dec 31, 2020

Python is, despite its somewhat dated language syntax, an incredibly efficient tool for data analysis. The effectiveness with which you can analyze large datasets with tools like numpy and pandas is nothing short of amazing.

However, I was very hesitant to approach the Python environment for many years. The main problem I had was the “indenting syntax”, where blocks are separated not by begin/end or {} or even semicolons, but by the indentation itself. Also, you could not mix tabs and spaces, and had to choose one or the other. In short, I felt it was awkward and stupid, and there were so many alternatives anyway, like Ruby, which I used for a long time (and still do).

But when I got to try pandas, and specifically the DataFrame class, I was hooked. Nothing I had used made working with structured data this easy. It took incredibly little effort to load and analyze large amounts of data.

Then, when someone showed me that it was also possible to do this interactively using Jupyter Notebook, I was sold. The indenting soon felt natural, with the help of plugins and other IDE support.

I guess I felt like a non-programmer feels when they realize what a tool Excel can be and start building really complex worksheets without programming. With Excel, however, you inevitably run into problems with data access and the general limits of the GUI. With Python you have no such limits. The data can easily be accessed from local files, databases, remote APIs or even HTML tables.

To show a little about what pandas and matplotlib can do, here is a simple script that calculates the daily % return for the Apple stock:

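import pandas as pd

df = pd.read_csv("AAPL.csv")
# The df DataFrame contains the columns "date", "open", "close", "high", "low" and "volume", so let's drop a few
df = df.drop(columns=["open", "high", "low", "volume"])
# calculate the daily change for every row
# NOTE: this is the change in close divided by 100, which is what the output below shows;
# a true percentage return would be df["close"].pct_change() * 100
df["pct"] = (df["close"] - df["close"].shift(1)) / 100.0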

The dataframe now looks like:

          date    close       pct
0     20151228   26.705       NaN
1     20151229   27.185   0.00480
2     20151230   26.830  -0.00355
3     20151231   26.315  -0.00515
4     20160104   26.340   0.00025
...        ...      ...       ...
1254  20201218  126.655  -0.02045
1255  20201221  128.230   0.01575
1256  20201222  131.880   0.03650
1257  20201223  130.960  -0.00920
1258  20201224  131.970   0.01010
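Note that the script above divides the day-over-day change by a constant 100 rather than by the previous close. Here is a minimal sketch of a return measured relative to the previous close. It assumes only the first five closing prices from the table above, typed in by hand so that it runs without AAPL.csv. The column names pct_manual and pct_builtin are made up for illustration.

```python
import pandas as pd

# First five rows of the table above, entered by hand (no CSV needed).
df = pd.DataFrame({
    "date": [20151228, 20151229, 20151230, 20151231, 20160104],
    "close": [26.705, 27.185, 26.830, 26.315, 26.340],
})

# Return relative to the previous day's close, expressed in percent.
prev_close = df["close"].shift(1)
df["pct_manual"] = (df["close"] - prev_close) / prev_close * 100.0

# pandas has a built-in for the same ratio; it returns a fraction, so scale by 100.
df["pct_builtin"] = df["close"].pct_change() * 100.0

print(df)
```

Both columns start with NaN, since the first row has no previous close to compare against.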

You can easily add a couple of moving averages to the “close” column by:

df["ma50"] = df["close"].rolling(50).mean()
df["ma200"] = df["close"].rolling(200).mean()

giving the result:

          date    close       pct      ma50      ma200
0     20151228   26.705       NaN       NaN        NaN
1     20151229   27.185   0.00480       NaN        NaN
2     20151230   26.830  -0.00355       NaN        NaN
3     20151231   26.315  -0.00515       NaN        NaN
4     20160104   26.340   0.00025       NaN        NaN
...        ...      ...       ...       ...        ...
1254  20201218  126.655  -0.02045  118.9493  97.887800
1255  20201221  128.230   0.01575  119.1745  98.196225
1256  20201222  131.880   0.03650  119.3241  98.498950
1257  20201223  130.960  -0.00920  119.5213  98.809450
1258  20201224  131.970   0.01010  119.7369  99.159025
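The NaN values at the top appear because a rolling window of N rows has nothing to report until N rows are available. Here is a small sketch of that behaviour, using a window of 3 instead of 50 or 200 so that it fits on five hand-typed closing prices from the table. The variable names are only for illustration.

```python
import pandas as pd

# Five closing prices taken from the table above.
close = pd.Series([26.705, 27.185, 26.830, 26.315, 26.340])

# A window of 3 needs 3 rows, so the first 2 results are NaN.
ma3 = close.rolling(3).mean()

# min_periods=1 allows partial windows, so no NaN at the start.
ma3_partial = close.rolling(3, min_periods=1).mean()

print(pd.DataFrame({"close": close, "ma3": ma3, "ma3_partial": ma3_partial}))
```

Whether partial windows are acceptable depends on the analysis. For a plotted moving average, the default (leading NaN) is usually the more honest choice.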

Wonderful! Now it would be interesting to plot this. And this is the section where I got into trouble.

Coming from Java, with strict type checking and OOP stamped on my forehead, I expected to find examples like

# My naive expectation of the API!
graph = Graph()
graph.add_data(df)
graph.format_axis("some formatting options", "some formatting options")
some_window.add(graph)

and so on: some sort of object-oriented environment where each graph is an object and you have to do a lot of “gluing” to use and show the graphs.

Instead, when searching, I found examples like:

df["ma50"].plot()
df["ma200"].plot()
plt.show()

giving the result:

Ehhhh what?! How…..?

OK, so a DataFrame column (a pandas Series) apparently has a plot() method. But to which graph does it plot? And the second plot() automatically gets added to the first? And what is plt.show() doing (well, it's obvious, but…)? How does it access the plots? How does it know what to do?

This confused me a lot!

The reason is that DataFrame (and other pandas classes) have wrappers for calls to matplotlib, a very capable MATLAB-inspired plotting library that is installed separately and imported like this:

import matplotlib.pyplot as plt
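To see what the wrapper hands back, here is a small sketch with made-up numbers (the Series values are placeholders, not the AAPL data). It assumes the non-interactive "Agg" backend so that it runs without opening a window.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.axes import Axes

plt.close("all")  # start from a clean slate

# Placeholder data standing in for two moving-average columns.
s1 = pd.Series([1.0, 2.0, 3.0, 4.0], name="ma50")
s2 = pd.Series([2.0, 2.5, 3.0, 3.5], name="ma200")

# Series.plot() returns the matplotlib Axes object it drew on.
ax1 = s1.plot()
ax2 = s2.plot()

same_axes = ax1 is ax2            # the second call reused the first plot area
n_lines = len(ax1.get_lines())    # both lines live on that one Axes
print(type(ax1).__name__, same_axes, n_lines)
```

So there is an object after all. pandas just does not force you to hold on to it.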

Also, there are a lot of “reasonable defaults” at work behind the curtain. Most often you only have one plot, or you are at least only working on one plot at a time. So why keep a reference to it yourself when matplotlib can do it for you?

So we can safely assume that matplotlib keeps a reference to the current Figure, which by default is the most recently created one.

Proof:

# This creates one window with one plot
# Since no figure has been created, DataFrame will create one for us.
df["pct"].hist(bins=100)
# This creates another window with one plot
fig1 = plt.figure()
df["close"].plot();
# This creates a third window with two plots
fig2 = plt.figure()
df["ma50"].plot();
df["ma200"].plot();

Since we didn’t specify anywhere which window should contain the plot, it’s obvious that it was done for us!
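This implicit bookkeeping can also be inspected directly. Here is a minimal sketch, assuming the non-interactive "Agg" backend so that nothing pops up on screen. plt.gcf() ("get current figure") returns the figure that pyplot is currently tracking.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import matplotlib.pyplot as plt

plt.close("all")  # forget any figures from earlier code

fig1 = plt.figure()
fig2 = plt.figure()

# The most recently created figure is the current one.
latest_is_current = plt.gcf() is fig2

# An existing figure can be made current again by its number.
plt.figure(fig1.number)
switched_back = plt.gcf() is fig1

open_figures = plt.get_fignums()
print(latest_is_current, switched_back, open_figures)
```

plt.gca() does the same for the current plot area inside the current figure, which is what an unqualified plot() call ends up drawing on.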

On the other hand, if we were to code everything using matplotlib functions, it would look like this. (I’m creating a different plot here: one window with 4 subplots, each with a different chart.)

fig, axes = plt.subplots(nrows=2, ncols=2)
axes[0,1].plot(df["close"])
plt.show()

Looks good, doesn’t it? The plot ended up in subplot [0,1] as intended. axes[0,1] is an AxesSubplot instance, and has several utility methods.
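Here is a short sketch of a few of those utility methods, using hand-typed closing prices from the earlier table instead of the CSV. The title and label texts are made up, and the "Agg" backend is assumed so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Five closing prices taken from the table earlier in the article.
close = pd.Series([26.705, 27.185, 26.830, 26.315, 26.340])

fig, axes = plt.subplots(nrows=2, ncols=2)
ax = axes[0, 1]           # one plot area out of the 2x2 grid

ax.plot(close)
ax.set_title("AAPL close")    # illustrative title
ax.set_xlabel("row")          # illustrative axis labels
ax.set_ylabel("price")
ax.grid(True)

fig.tight_layout()            # avoid overlapping labels between subplots
```

This is about as close as matplotlib gets to the object-oriented API I was expecting: the figure and each plot area are objects, and you call methods on them.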

However, this next thing confused me a lot. DataFrame and Series have a great little method, hist(bins=), which separates the data into a user-defined number of “bins” and presents them as a histogram. This is very useful for showing distributions, for example. But when I change the code to:

# NOTE: BUG HERE!
fig, axes = plt.subplots(nrows=2, ncols=2)
axes[0,0] = df["pct"].hist(bins = 50)
axes[0,1].plot(df["close"])
plt.show()

The subplot is placed in [1,1] although I specified [0,0]. Why is that?

And if I add

# MORE BUGGY CODE!
axes[1,0] = (df["ma50"] - df["ma200"]).hist(bins = 50)

to calculate the difference between the MA50 and MA200, and present it as a histogram to see where the most common distance between them is, it is ALSO placed in [1,1]! Very annoying! Now why is that?

Well, it turned out that I had misunderstood how to use the DataFrame methods. Assigning the return value of hist() to axes[0,0] only overwrites an entry in the array. Without an ax argument, the histogram is drawn on the current subplot, which is the last one created: [1,1]. I had also misunderstood what Axes in AxesSubplot means. I (naturally…) assumed that Axes meant axes, as in the X and Y axes. Well, not exactly. An AxesSubplot is a whole plot area.

So basically you can create Figures and AxesSubplots however you like, and then pass them on to DataFrame and Series to plot on, like this (note that this mixes matplotlib methods and pandas methods, and it works fine):

# WORKING VERSION!
# This creates a Figure and 4 AxesSubplots
fig, axes = plt.subplots(nrows=2, ncols=2)
# subplot [0,0]
df["pct"].hist(bins = 50, ax = axes[0,0])
# subplot [0,1]
axes[0,1].plot(df["close"])
# subplot [1,0]
(df["ma50"] - df["ma200"]).hist(bins = 50, ax = axes[1,0])
# subplot [1,1]
axes[1,1].plot(df["ma50"])
axes[1,1].plot(df["ma200"])
plt.show()
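The difference between the buggy and the working version can be checked without looking at the picture. Here is a minimal sketch with placeholder numbers (not the AAPL data), assuming the "Agg" backend. It compares where hist() draws with and without the ax argument.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import matplotlib.pyplot as plt
import pandas as pd

plt.close("all")

# Placeholder values standing in for the "pct" column.
s = pd.Series([0.1, -0.2, 0.05, 0.3, -0.1, 0.0, 0.2])

fig, axes = plt.subplots(nrows=2, ncols=2)

# After plt.subplots(), the last subplot created is the current one.
current_is_last = plt.gca() is axes[1, 1]

# No ax given: pandas draws on the current plot area and returns it.
implicit_target = s.hist(bins=5)

# ax given: pandas draws exactly where it is told.
explicit_target = s.hist(bins=5, ax=axes[0, 0])

print(current_is_last,
      implicit_target is axes[1, 1],
      explicit_target is axes[0, 0])
```

In the buggy code, the assignment `axes[0,0] = ...` simply stored that returned [1,1] object in the array slot. It never moved the drawing.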

So I hope this article clears up a bit of the confusion that I assume is pretty common with the otherwise excellent pandas/matplotlib combination.

Now, it wouldn’t be fair to not mention Seaborn, the glossy top layer to the frameworks mentioned. It works tightly together with pandas and matplotlib, and is built to work with complete datasets.

However a detailed look will have to wait until next time!

Peter Andersson

Experienced programmer for 30+ years. Working as a consultant mainly in finance/trading but also in general data analytics. Likes Java, Python and Linux.