In this example I’m going to use the dataset of workout sessions from a previous article. It looks like this:

Workout Dataset, where day category = 0/1 corresponds to weekday/weekend
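If you don't have workout_log.csv at hand, a small stand-in DataFrame with the same five columns is enough to run the plotting code in this article. The sketch below is only an assumption about the data's shape: the values are randomly generated rather than taken from the real workout log, and the helper name make_fake_workouts is made up for illustration.

```python
import random

import pandas as pd


def make_fake_workouts(n=30, seed=42):
    """Builds a made-up workout log with the column names used in this article."""
    rng = random.Random(seed)
    dates = pd.date_range('2019-01-01', periods=n, freq='D')
    rows = []
    for date in dates:
        distance = round(rng.uniform(2.0, 12.0), 2)
        # assumption: a pace of roughly 5 to 7 minutes per km
        duration = round(distance * rng.uniform(5.0, 7.0), 1)
        rows.append({
            'date': date,
            'distance_km': distance,
            'duration_min': duration,
            'delta_last_workout': 1,  # placeholder: days since the previous session
            'day_category': 1 if date.dayofweek >= 5 else 0,  # 0 = weekday, 1 = weekend
        })
    return pd.DataFrame(rows)


df = make_fake_workouts()
```

With this in place you can skip the pd.read_csv call and pass df straight to the plotting function.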

A bare-bones scatter plot would look like this:

You can replicate it with the following code:

import pandas as pd
import matplotlib.pyplot as plt

# loading dataset
df = pd.read_csv('workout_log.csv')
df.columns = ['date', 'distance_km', 'duration_min', 'delta_last_workout', 'day_category']

def scatterplot(df, x_dim, y_dim):
    x = df[x_dim]
    y = df[y_dim]

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.scatter(x, y)

    plt.show()

scatterplot(df, 'distance_km', 'duration_min')

My usual next step is to label the axes and add a title, so the reader knows what each plot shows.

The code change is minimal, but definitely makes a difference.

import pandas as pd
import matplotlib.pyplot as plt

# loading dataset
df = pd.read_csv('workout_log.csv')
df.columns = ['date', 'distance_km', 'duration_min', 'delta_last_workout', 'day_category']

def scatterplot(df, x_dim, y_dim):
    x = df[x_dim]
    y = df[y_dim]

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.scatter(x, y)

    # adds a title and axes labels
    ax.set_title('Distance vs Workout Duration')
    ax.set_xlabel('Distance (Km)')
    ax.set_ylabel('Workout Duration (min)')

    plt.show()

scatterplot(df, 'distance_km', 'duration_min')

What about removing that box?

To get rid of the default box around the plot, we hide some of the plot’s borders, which matplotlib calls spines.

import pandas as pd
import matplotlib.pyplot as plt

# loading dataset
df = pd.read_csv('workout_log.csv')
df.columns = ['date', 'distance_km', 'duration_min', 'delta_last_workout', 'day_category']

def scatterplot(df, x_dim, y_dim):
    x = df[x_dim]
    y = df[y_dim]

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.scatter(x, y)

    # adds a title and axes labels
    ax.set_title('Distance vs Workout Duration')
    ax.set_xlabel('Distance (Km)')
    ax.set_ylabel('Workout Duration (min)')

    # removing top and right borders
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    plt.show()

scatterplot(df, 'distance_km', 'duration_min')
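The borders live in ax.spines under the keys 'top', 'right', 'left' and 'bottom', so the same idea extends to any of them. Here is a minimal sketch, using a few placeholder points instead of the workout data, that hides two borders in a loop and softens the other two. The grey color and 0.5 line width are just example values.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter([1, 2, 3], [10, 18, 31])  # placeholder points, not the workout data

# hide the borders we don't want
for side in ('top', 'right'):
    ax.spines[side].set_visible(False)

# optionally soften the remaining borders instead of hiding them
for side in ('left', 'bottom'):
    ax.spines[side].set_color('grey')
    ax.spines[side].set_linewidth(0.5)
```

Call plt.show() afterwards to see the result.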

Major Gridlines

Something I usually like to add to my plots is major gridlines. They help with readability by breaking up the white background. You can play around with their width (linewidth) and transparency (alpha).

import pandas as pd
import matplotlib.pyplot as plt

# loading dataset
df = pd.read_csv('workout_log.csv')
df.columns = ['date', 'distance_km', 'duration_min', 'delta_last_workout', 'day_category']

def scatterplot(df, x_dim, y_dim):
    x = df[x_dim]
    y = df[y_dim]

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.scatter(x, y)

    # adds a title and axes labels
    ax.set_title('Distance vs Workout Duration')
    ax.set_xlabel('Distance (Km)')
    ax.set_ylabel('Workout Duration (min)')

    # removing top and right borders
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    # adds major gridlines
    ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)

    plt.show()

scatterplot(df, 'distance_km', 'duration_min')
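Two small variations on the gridlines might be worth trying. This sketch, again with placeholder points, draws only the horizontal gridlines and keeps them behind the dots. Whether that looks better is a matter of taste, so treat it as an option rather than a recommendation.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter([1, 2, 3], [10, 18, 31])  # placeholder points, not the workout data

# draw gridlines only for the y axis (horizontal lines)
ax.grid(axis='y', color='grey', linestyle='-', linewidth=0.25, alpha=0.5)

# keep the gridlines behind the dots
ax.set_axisbelow(True)
```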

Aesthetics

You can see that some of the dots in the plot overlap. To improve readability even more, we can adjust the dots’ transparency with the alpha parameter.

import pandas as pd
import matplotlib.pyplot as plt

# loading dataset
df = pd.read_csv('workout_log.csv')
df.columns = ['date', 'distance_km', 'duration_min', 'delta_last_workout', 'day_category']

def scatterplot(df, x_dim, y_dim):
    x = df[x_dim]
    y = df[y_dim]

    fig, ax = plt.subplots(figsize=(10, 5))

    # customizes alpha for each dot in the scatter plot
    ax.scatter(x, y, alpha=0.70)

    # adds a title and axes labels
    ax.set_title('Distance vs Workout Duration')
    ax.set_xlabel('Distance (Km)')
    ax.set_ylabel('Workout Duration (min)')

    # removing top and right borders
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    # adds major gridlines
    ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)

    plt.show()

scatterplot(df, 'distance_km', 'duration_min')

There is still a bit of overlap, but at least the transparency improved the readability of the majority of the dots.
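A quick bit of arithmetic shows why overlapping dots still look darker. Assuming standard alpha blending, each dot lets a fraction (1 - alpha) of what lies underneath show through, so a stack of dots of the same color has a combined opacity of 1 - (1 - alpha)^n. The helper name stacked_opacity below is made up for this sketch.

```python
def stacked_opacity(alpha, layers):
    """Combined opacity of `layers` same-colored dots drawn on top of each other."""
    return 1 - (1 - alpha) ** layers


# opacity of one, two and three overlapping dots at alpha=0.70
for layers in (1, 2, 3):
    print(layers, round(stacked_opacity(0.70, layers), 3))
```

With alpha=0.70, two overlapping dots are already about 91% opaque, so a lower alpha may be worth trying on denser plots.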

Colors

Since we have the day category, we can also give each dot a color that identifies its category.

For that you can choose from two different approaches:

#1 Defining your own color palette

import pandas as pd
import matplotlib.pyplot as plt

# loading dataset
df = pd.read_csv('workout_log.csv')
df.columns = ['date', 'distance_km', 'duration_min', 'delta_last_workout', 'day_category']

def scatterplot(df, x_dim, y_dim):
    x = df[x_dim]
    y = df[y_dim]

    fig, ax = plt.subplots(figsize=(10, 5))

    # defining an array of colors: index 0 = weekday, index 1 = weekend
    colors = ['#2300A8', '#00A658']

    # assigns a color to each data point based on its day category (0/1);
    # scatter needs one color per point, so the two-color list can't be passed directly
    point_colors = [colors[int(c)] for c in df['day_category']]
    ax.scatter(x, y, alpha=0.70, color=point_colors)

    # adds a title and axes labels
    ax.set_title('Distance vs Workout Duration')
    ax.set_xlabel('Distance (Km)')
    ax.set_ylabel('Workout Duration (min)')

    # removing top and right borders
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    # adds major gridlines
    ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)

    plt.show()

scatterplot(df, 'distance_km', 'duration_min')

#2 Using Python Color Maps

To paint each dot according to its day category, I need to introduce a few new components in the code:

  • Import the color map library
  • Take the day category as a parameter, so the corresponding color can be mapped
  • Use parameter c from the scatter method to assign the color sequence
  • Use parameter cmap to assign the color map to be used. I’m going to use the brg color map

import pandas as pd
import matplotlib.cm as cm
import matplotlib.pyplot as plt

# loading dataset
df = pd.read_csv('workout_log.csv')
df.columns = ['date', 'distance_km', 'duration_min', 'delta_last_workout', 'day_category']

def scatterplot(df, x_dim, y_dim, category):
    x = df[x_dim]
    y = df[y_dim]

    fig, ax = plt.subplots(figsize=(10, 5))

    # applies the color map along with the color sequence (the category column)
    ax.scatter(x, y, alpha=0.70, c=df[category], cmap=cm.brg)

    # adds a title and axes labels
    ax.set_title('Distance vs Workout Duration')
    ax.set_xlabel('Distance (Km)')
    ax.set_ylabel('Workout Duration (min)')

    # removing top and right borders
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    # adds major gridlines
    ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)

    plt.show()

scatterplot(df, 'distance_km', 'duration_min', 'day_category')
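To see which colors the two categories end up with, you can query the color map directly. This is a small sketch that assumes both 0 and 1 appear in the category column, so that scatter's default normalization puts them at the two ends of the map (0.0 and 1.0).

```python
import matplotlib.cm as cm
from matplotlib.colors import to_hex

# the color map is callable: a float in [0, 1] returns an RGBA tuple
weekday_color = to_hex(cm.brg(0.0))  # low end of brg, used for day_category = 0
weekend_color = to_hex(cm.brg(1.0))  # high end of brg, used for day_category = 1

print(weekday_color, weekend_color)
```

For brg these are the blue and green ends of the map, which is why the two groups are easy to tell apart.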

Legends

So far, we’ve been using the native scatter method to plot each data point. In order to add a legend, we’ll have to change the code a little bit.

We’ll have to:

  • Take the day category as a parameter, so we have our labels
  • Convert the numerical (0,1) labels into categorical labels (weekday, weekend)
  • Iterate through the dataset to assign a label to each data point

import pandas as pd
import matplotlib.pyplot as plt

# loading dataset
df = pd.read_csv('workout_log.csv')
df.columns = ['date', 'distance_km', 'duration_min', 'delta_last_workout', 'day_category']

def scatterplot(df, x_dim, y_dim, category):
    x = df[x_dim]
    y = df[y_dim]

    # converting original (numerical) labels into categorical labels
    categories = df[category].apply(lambda c: 'weekday' if c == 0 else 'weekend')

    fig, ax = plt.subplots(figsize=(10, 5))

    # assigns a color to each category
    colors = {'weekday': '#2300A8', 'weekend': '#00A658'}

    # iterates through the dataset plotting each data point with the color and label of its category
    # (.iloc replaces .ix, which was removed in pandas 1.0)
    for i in range(len(df)):
        label = categories.iloc[i]
        ax.scatter(x.iloc[i], y.iloc[i], alpha=0.70, color=colors[label], label=label)

    # adds title and axes labels
    ax.set_title('Distance vs Workout Duration')
    ax.set_xlabel('Distance (Km)')
    ax.set_ylabel('Workout Duration (min)')

    # removing top and right borders
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    # adds major gridlines
    ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)

    # adds legend, keeping one entry per category instead of one per data point
    handles, labels = ax.get_legend_handles_labels()
    unique = dict(zip(labels, handles))
    ax.legend(unique.values(), unique.keys())

    plt.show()

scatterplot(df, 'distance_km', 'duration_min', 'day_category')
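Plotting one point at a time can get slow on larger datasets. As an alternative sketch, and not the approach used above, you can make one scatter call per category and let matplotlib build the legend from the labels. The function name scatterplot_by_group and the four-row sample are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt


def scatterplot_by_group(df, x_dim, y_dim, category):
    # converts the numerical (0, 1) labels into categorical labels
    labels = df[category].map({0: 'weekday', 1: 'weekend'})
    colors = {'weekday': '#2300A8', 'weekend': '#00A658'}

    fig, ax = plt.subplots(figsize=(10, 5))

    # one scatter call per category, so each gets a single legend entry
    for label, group in df.groupby(labels):
        ax.scatter(group[x_dim], group[y_dim], alpha=0.70,
                   color=colors[label], label=label)

    ax.legend()
    return ax


# tiny made-up sample, only to show the call
sample = pd.DataFrame({
    'distance_km': [3.0, 5.5, 8.0, 10.0],
    'duration_min': [18.0, 31.0, 47.0, 62.0],
    'day_category': [0, 0, 1, 1],
})
ax = scatterplot_by_group(sample, 'distance_km', 'duration_min', 'day_category')
```

The function returns the axes instead of calling plt.show(), so you can keep adding the title, labels and gridlines from the earlier steps before showing the plot.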
