Venn Diagrams (and Pokemon) in Python: Matplotlib, VennData & alternatives

The following is an extract, lightly rewritten, from a Python case study and presentation I did for a subject in a Data Science course. I was given the topic of Venn Diagrams and roughly 8-10 minutes to present 8-10 slides' worth of information. I was asked to give some background on the concept, work through a case study and present the findings, and then give some examples of other applications in practice.

A brief history

A Venn Diagram is a type of diagram that is used to express the logical relation between different sets. 

John Venn first presented the concept behind Venn diagrams in a paper published by the London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science titled, On the Diagrammatic and Mechanical Representation of Propositions and Reasonings.

Venn (1880) refers to his diagrams as Eulerian Circles and offers them as an evolution and improvement of Euler Diagrams (c. 1791). He notes the popularity of Euler Diagrams but goes on to write that they have major logical defects:

  “The great bulk of the propositions which we commonly meet with are founded, and rightly founded, on an imperfect knowledge of the actual mutual relations of the implied classes to one another.” (p.8)

Venn argues that Euler Diagrams cannot “forbid the natural expression of such uncertainty, and are therefore only directly applicable to a very small number of such propositions.” (Venn M.A., 1880, p.2)

Evolution and use

Building upon George Boole’s Algebra of Logic (c. 1847) Venn outlined a method for illustrating the following propositions:

Venn goes on to demonstrate the 3 circle Venn Diagram (perhaps the most famous) as well as grappling with models that use 4 and 5 ellipses before admitting that “beyond five terms it hardly seems as if (the) diagrams offered much substantial help”. (Venn M.A., 1880, p.8)

People have continued to push the boundaries of what is possible with Venn Diagrams, with A.W.F. Edwards (2004) writing in his book Cogwheels of the mind: The story of Venn Diagrams that (strangely enough) after being bitten by a dog (p.87) he sat down and produced a seven-set Venn diagram by first arranging congruent lines on a linear diagram.

The use of Venn Diagrams today is prolific, and they can be found in use in many different disciplines, conveying many different relationships from diverse points of information.

Some contemporary examples

Above is a screen grab of a slide I provided around contemporary use of Venn. Clockwise from left: An environmental sciences paper about the relationships between microbiomes, a study of language that shows the intersections between Greek, Latin and Cyrillic alphabets, (infamous) business diagrams – this one is for user experience design and looks at the intersections between people, technology, design and business, and finally (just for fun) the intersect of a beaver playing guitar and a duck playing keyboard is obviously a platypus on a Keytar.

Finding the intersect

Basic formula:

  • n(A∪B) = n(A) + n(B) – n(A∩B)

Example 1: Working out intersect by observation

  • Set A = {m, p, q, r, s, t, u, v}
  • Set B = {m, n, o, p, q, i, j, k, g}

X = A∩B = {m, p, q}.
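Python's built-in set type can do the same check by observation for you. A minimal sketch using the two sets from Example 1 (the variable names are my own):

```python
# Sets A and B from Example 1
set_a = {"m", "p", "q", "r", "s", "t", "u", "v"}
set_b = {"m", "n", "o", "p", "q", "i", "j", "k", "g"}

# The & operator returns the elements common to both sets
intersection = set_a & set_b
print(sorted(intersection))  # ['m', 'p', 'q']

# The basic formula also holds: n(A∪B) = n(A) + n(B) – n(A∩B)
union = set_a | set_b
print(len(union), len(set_a) + len(set_b) - len(intersection))  # 14 14
```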

Example 2: Calculating intersect and set difference

60 people were asked what their favourite movie genres were. 47 liked Comedy movies and 32 liked Horror movies. All liked at least 1 genre. How many people like both genres? How many people only like Horror movies?

  • n(A∪B) = 60
  • X = n(A∩B) = 47 + 32 – 60 = 19
  • Horror fans only = 32 – 19 = 13

X = 19 (both genres); Horror only = 13.
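The arithmetic in Example 2 can be wrapped in a small helper. This is only a sketch: the function name split_two_sets is hypothetical, and it assumes (as the question states) that everyone belongs to at least one of the two sets.

```python
def split_two_sets(total, n_a, n_b):
    """Return (both, only_a, only_b), assuming everyone is in at least one set."""
    # Rearranged basic formula: n(A∩B) = n(A) + n(B) – n(A∪B)
    both = n_a + n_b - total
    return both, n_a - both, n_b - both


# 60 people, 47 like Comedy (A), 32 like Horror (B)
both, comedy_only, horror_only = split_two_sets(total=60, n_a=47, n_b=32)
print(both, comedy_only, horror_only)  # 19 28 13
```

The three values add back up to the 60 people surveyed, which is a quick way to sanity check the result.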

Example 3: Intersect showing probability of two events happening at the same time

  • P(A∩B) = P(A) × P(B) (for independent events A and B)
  • P(A∩B) = 26/52 × 2/52
  • = (26×2)/(52×52)
  • = 1/52

X = 1 in 52 chance.
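To double check the fraction arithmetic in Example 3, Python's fractions module keeps the values exact. A small sketch, assuming the two events are independent so that their probabilities can simply be multiplied:

```python
from fractions import Fraction

# Probabilities of the two events from Example 3
p_a = Fraction(26, 52)
p_b = Fraction(2, 52)

# For independent events, P(A∩B) = P(A) × P(B)
p_both = p_a * p_b
print(p_both)  # 1/52
```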

Case studies

Below are my three Python case studies, where I set out to explore the pros, cons and limitations of existing Python packages for creating Venn diagrams.

I used a Pokemon dataset from Kaggle as the people I was presenting to were familiar with it and I saw an opportunity for some Vennin’ in the Pokemon types (rock, fire, water etc.) as Pokemon usually seem to have 2 types.

matplotlib_venn

The matplotlib_venn package contains a series of ‘functions for plotting area-proportional two and three-way Venn diagrams in matplotlib’. I was up for a challenge so figured I’d start with a 3 circle Venn. I started out by loading in all my required packages and my data, as well as dropping any rows with empty data.

# Case Study: 3 Circle Venn Diagram
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib_venn import venn3, venn3_circles

# Import data and drop any rows with empty values
df_pokemon = pd.read_csv("Pokemon.csv")
# .copy() so that new columns can be added later without pandas warnings
df_pokemon_co = df_pokemon.dropna(how='any', axis=0).copy()

I made an observation early on that there wasn’t necessarily any reciprocity between the types: for instance, there are 6 entries for Type1/Rock with Type2/Water, and only 4 entries for Type1/Water with Type2/Rock. I’m sure any real Pokemon fan would have been all over this.

In order to combat this I assigned dummy data to the types and then consolidated it, so that I would have distinct, binary relationships between the Pokémon to plot in my Venn diagram (in the example above this would effectively mean 10 entries for Water/Rock – kinda cheating a bit).

I then looked over the data in order to pick groups with reasonably sized sets that had the necessary relationships for me to build a 3 circle Venn diagram. I landed with Rock, Ground and Water.

# Get dummy data for all Pokemon Types (dtype=int so the columns can be summed as 0/1)
dummy_types = pd.get_dummies(df_pokemon_co, columns=['Type 1', 'Type 2'], dtype=int)

# Combine dummy data for specified Type 1 and Type 2's so that there is a binary relationship between types
df_pokemon_co['Rock'] = dummy_types['Type 1_Rock'] + dummy_types['Type 2_Rock']
df_pokemon_co['Ground'] = dummy_types['Type 1_Ground'] + dummy_types['Type 2_Ground']
df_pokemon_co['Water'] = dummy_types['Type 1_Water'] + dummy_types['Type 2_Water']
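To see what this consolidation does, here is a small sketch on a made-up four row table rather than the Kaggle file. The toy DataFrame and the has_type helper are my own assumptions for illustration. The helper skips dummy columns that don't exist, because a type that never appears in one of the two columns (Ground as Type 1 in this toy data) gets no dummy column.

```python
import pandas as pd

# Toy data: four dual-type Pokemon
toy = pd.DataFrame({
    "Name": ["Omanyte", "Corsola", "Geodude", "Quagsire"],
    "Type 1": ["Rock", "Water", "Rock", "Water"],
    "Type 2": ["Water", "Rock", "Ground", "Ground"],
})

# One 0/1 column per (column, type) pair, e.g. 'Type 1_Rock'
dummies = pd.get_dummies(toy, columns=["Type 1", "Type 2"], dtype=int)


def has_type(dummies, type_name):
    """Return 1 where the type appears as Type 1 or Type 2, else 0."""
    candidates = [f"Type 1_{type_name}", f"Type 2_{type_name}"]
    present = [col for col in candidates if col in dummies.columns]
    return dummies[present].sum(axis=1)


for type_name in ["Rock", "Water", "Ground"]:
    toy[type_name] = has_type(dummies, type_name)

# Rock/Water and Water/Rock now land in the same intersect
rock_and_water = int(((toy["Rock"] == 1) & (toy["Water"] == 1)).sum())
print(rock_and_water)  # 2
```

Omanyte (Rock/Water) and Corsola (Water/Rock) are both counted once in the Rock and Water intersect, regardless of which type is listed first.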

After choosing the appropriate sets for the Venn diagram I set out to determine the very important question: What combination of rock, water and ground type Pokémon have the strongest median defence?

Using venn3 from the matplotlib_venn package I assigned all of my sets and intersects. In matplotlib_venn you target each area of the chart using the following numbering system.

  • Set A is 100, Set B is 010, Set C is 001.
  • A intersect B is 110, B intersect C is 011, C intersect A is 101.
  • The middle intersect is 111.

Cogwheels of the mind: The story of Venn Diagrams has a good diagram (page 6) which demonstrates this pattern (see below), and it is actually a good way of further demonstrating Venn’s propositions: for example, 100 is x that is not y, and 110 is x that is y (so we get two ‘ones’… if that makes sense).

It is worth noting that matplotlib_venn (as far as I can tell) won’t calculate values like these for you: given Python sets it will count their members, but for anything else, such as a median, it relies on you working out each subset value via other means. In my case I used pandas loc to get the median defence only when the 2 conjoining sets matched (for instance rock=1 and water=1).
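The numbering system can be sketched in plain Python: each digit of the ID simply records whether an item is in Set A, Set B and Set C. The region_ids helper and the handful of Pokémon below are my own illustration, not part of matplotlib_venn.

```python
def region_ids(set_a, set_b, set_c):
    """Group items by their 3-digit ID, e.g. '110' = in A and B but not C."""
    regions = {}
    for item in set_a | set_b | set_c:
        code = "".join("1" if item in s else "0" for s in (set_a, set_b, set_c))
        regions.setdefault(code, set()).add(item)
    return regions


# Illustrative sets: A = Rock, B = Water, C = Ground
rock = {"Geodude", "Onix", "Omanyte", "Corsola"}
water = {"Omanyte", "Corsola", "Quagsire", "Squirtle"}
ground = {"Geodude", "Onix", "Quagsire", "Sandshrew"}

regions = region_ids(rock, water, ground)
print(sorted(regions["110"]))  # ['Corsola', 'Omanyte']

# The same order the subsets are passed to venn3 in this case study
ORDER = ("100", "010", "110", "001", "101", "011", "111")
sizes = tuple(len(regions.get(code, set())) for code in ORDER)
print(sizes)  # (0, 1, 2, 1, 2, 1, 0)
```

Here every Rock type also has a second type from the other two sets, so region 100 is empty.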

# Set size of Venn Diagram
plt.figure(figsize=(12,12))

# Create 3 circle Venn Diagram
v = venn3(subsets = ( # Configure Venn subsets
    # N.b. Subsets have ID representative of their logic. E.g. Rock = 100 (Equal to Rock, but not to Water or Ground).
    # [100] Median Defense: All Rock Types
    np.median((df_pokemon_co[df_pokemon_co['Rock']==1]['Defense'])),
    # [010] Median Defense: All Water Types
    np.median((df_pokemon_co[df_pokemon_co['Water']==1]['Defense'])),                 
    # [110] Median Defense: Rock and Water intersect
    df_pokemon_co.loc[(df_pokemon_co['Rock']==1) & (df_pokemon_co['Water']==1), ('Defense')].median(),                       
    # [001] Median Defense: All Ground Types 
    np.median((df_pokemon_co[df_pokemon_co['Ground']==1]['Defense'])),                 
    # [101] Median Defense: Rock and Ground intersect
    df_pokemon_co.loc[(df_pokemon_co['Rock']==1) & (df_pokemon_co['Ground']==1), 'Defense'].median(),        
    # [011] Median Defense: Ground and Water intersect
    df_pokemon_co.loc[(df_pokemon_co['Ground']==1) & (df_pokemon_co['Water']==1), 'Defense'].median(),                     
    # [111] There are no Pokemon with 3 types, so set to 0
    0), 

Once I had entered in all my set data I set to work styling my Venn diagram giving it rock, water and ground colours, tweaking the transparency, increasing font sizes, adding overlapping line styles, and a title.

    # Set circle colours (100, 010, 001)
    set_colors=('silver', 'skyblue', 'tan'), 
    # Set circle labels  
    set_labels = ('Rock', 'Water', 'Ground'),
    # Control the transparency of circles
    alpha = 0.8)

# Target the middle segment (111) and clear null text
v.get_label_by_id('111').set_text('')

# Adjust fontsize for set labels (e.g. Rock) and subset labels
for text in v.set_labels:
    text.set_fontsize(18)
for text in v.subset_labels:
    text.set_fontsize(16)

# Add overlapping circles with dashed linestyle, we know our subset values now so can enter them in
venn3_circles(subsets = (103, 75, 104, 82, 115, 71.5, 0), 
              linestyle='dashed', linewidth=2, color='black');

# Set the Venn Diagram title
plt.title("Median Defence of Rock, Water & Ground Pokemon", fontsize=20)

# Add an annotation containing the names of the Pokemon with the highest Median Defense
plt.annotate('Geodude, Graveler, Golem,\nOnix, Rhyhorn, Rhydon,\nLarvitar, Pupitar, Rhyperior', xy=v.get_label_by_id('101').get_position() - np.array([0, 0.05]), xytext=(-100,-75),
             ha='center', textcoords='offset points', fontsize=12, bbox=dict(boxstyle='round,pad=0.6', fc='white', alpha=0.9),
             arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0.5',color='black'))

# Print Venn Diagram
plt.show()

The completed Venn shows us that the combination of rock and ground Pokémon in intersect 101 have the highest median defence. I have also added in an annotation with the names of the Pokémon that fall into this intersect using plt.annotate.

VennData

3 circles is cool I guess, but how mental can we go? 3 circles is actually the maximum that matplotlib_venn can accomplish, so I went looking for other packages and came across VennData, which promises “pretty good results”.

What you see here is a 7 set Venn Diagram for rock, ground, water, fire, grass, steel and ghost type Pokémon.

VennData is capable of calculating the intersects from the dummy data. It takes a kind of ‘best fit’ approach and, as you can see, the end result is quite crowded and really reinforces what Venn said about diagrams of this size not offering much substantial help.
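A rough way to see why a ‘best fit’ is needed: a full Venn diagram of n sets needs a region for every non-empty combination of sets, while n circles can only carve out a limited number of regions. The sketch below uses the standard result that n circles divide the plane into at most n² – n + 2 regions (one of which is the outside); the function names are my own.

```python
def venn_regions_needed(n_sets):
    """Regions a full Venn diagram needs: every non-empty combination of sets."""
    return 2 ** n_sets - 1


def max_regions_from_circles(n_circles):
    """Most enclosed regions n circles can form (n² – n + 2, minus the outside)."""
    return n_circles ** 2 - n_circles + 1


for n in (3, 4, 5, 7):
    print(n, venn_regions_needed(n), max_regions_from_circles(n))
# 3 7 7
# 4 15 13
# 5 31 21
# 7 127 43
```

Three circles is the last point where the two numbers match, so any circle-based diagram with more sets can only approximate the full set of intersects.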

# Case Study: 7 Circle Venn Diagram
# Import packages
from venndata import venn

# Combine additional dummy data (n.b. we already have rock, ground and water)
df_pokemon_co['Fire'] = dummy_types['Type 1_Fire'] + dummy_types['Type 2_Fire']
df_pokemon_co['Grass'] = dummy_types['Type 1_Grass'] + dummy_types['Type 2_Grass']
df_pokemon_co['Steel'] = dummy_types['Type 1_Steel'] + dummy_types['Type 2_Steel']
df_pokemon_co['Ghost'] = dummy_types['Type 1_Ghost'] + dummy_types['Type 2_Ghost']
df_pokemon_co['Dragon'] = dummy_types['Type 1_Dragon'] + dummy_types['Type 2_Dragon']
df_pokemon_co['Dark'] = dummy_types['Type 1_Dark'] + dummy_types['Type 2_Dark']
df_pokemon_co['Psychic'] = dummy_types['Type 1_Psychic'] + dummy_types['Type 2_Psychic']
df_pokemon_co['Poison'] = dummy_types['Type 1_Poison'] + dummy_types['Type 2_Poison']
df_pokemon_co['Ice'] = dummy_types['Type 1_Ice'] + dummy_types['Type 2_Ice']

# Define data to use
df = df_pokemon_co[['Rock', 'Ground', 'Water', 'Fire', 'Grass', 'Steel', 'Ghost']]

# Calculate the intersections between the sets
fineTune=True
labels, radii, actualOverlaps, disjointOverlaps = venn.df2areas(df, fineTune=fineTune)

# Radii of the circles representing the sets 
print(radii)
# Dictionary of all two set intersection sizes
print(actualOverlaps)
# Dictionary of all mutually disjoint intersection sizes
print(disjointOverlaps)

# Plot the 7 circle Venn Diagram
plt.rcParams['figure.figsize'] = [15, 15]
fig, ax = venn.venn(radii, actualOverlaps, disjointOverlaps, labels=labels, labelsize='auto', cmap=None, fineTune=fineTune)

UpSetPlot

I was interested to find out whether there were more convenient ways for examining the relationships between sets and I came across a package called UpSetPlot.

What we have here is a comparison of 12 different Pokémon types, including value counts for each type displayed as a bar chart at the top of the diagram. For instance, the first column shows us that there are 10 ground/water Pokémon, and the 2nd column shows us that there are 3 or 4 grass/water Pokémon.

Ultimately this is a far more convenient way to view data for multiple sets at a glance. It’s also heavily customisable and has extensive documentation.

# Case Study: Examining intersections of 12 types of Pokemon using UpSetPlot
# Import packages
from upsetplot import UpSet, from_indicators

# Import upset data (12 x Type columns with Boolean values)
upsetdata = pd.read_csv("upset2.csv")

# Create sets from the Boolean columns, then draw and show the UpSet plot
type_columns = ["Rock", "Ground", "Water", "Grass", "Fire", "Steel", "Dragon", "Ghost", "Dark", "Psychic", "Poison", "Ice"]
upset = from_indicators(type_columns, data=upsetdata)
UpSet(upset).plot()
plt.show()
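As I understand it, each bar in an UpSet plot is just a count of the rows that share one exact combination of types. That idea can be sketched with the standard library alone; the six Pokémon below are a made-up sample for illustration, not the upset2.csv data.

```python
from collections import Counter

# Illustrative sample: each Pokemon mapped to its set of types
pokemon_types = {
    "Quagsire": {"Water", "Ground"},
    "Wooper": {"Water", "Ground"},
    "Swampert": {"Water", "Ground"},
    "Lotad": {"Water", "Grass"},
    "Geodude": {"Rock", "Ground"},
    "Squirtle": {"Water"},
}

# Count how many Pokemon share each exact combination of types
combo_counts = Counter(frozenset(types) for types in pokemon_types.values())

for combo, count in combo_counts.items():
    print("/".join(sorted(combo)), count)

print(combo_counts[frozenset({"Water", "Ground"})])  # 3
```

Using frozenset as the key means Water/Ground and Ground/Water are treated as the same combination, which mirrors the consolidation done with the dummy data earlier.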

Applications in practice

During my research I came across the following interesting examples of other ways Venn diagrams are used.

Paper: A consistent and general modified Venn diagram approach that provides insights into regression analysis (link)

It is often found that the traditional Venn diagram approach is not suitable for a particular study and there are many interesting cases available where people have modified Venn diagrams to suit their purposes.

O’Brien (2018) details one such modification, adapting Venn diagrams in order to examine multiple regressions. “This approach allows the visualization of the components involved in multiple regression coefficients, their standard errors, and the F-test and t-test associated with these coefficients as well as other statistics commonly reported in the output of multiple regression programs.” (p.1)

Paper: BioVenn – a web application for the comparison and visualization of biological lists using area-proportional Venn diagrams (link)

Hulsen et al. (2008) write that Venn diagrams are often used in genomics projects as a way to “quickly observe similarities and differences between the data sets they are analyzing.” (p.2)

The paper notes various packages that are available for making Venn diagrams, along with their pros and cons for visualising genomic data. A PHP solution is derived and documented in the paper which is capable of processing the relationships between 3 sets of Affymetrix probe identifiers.

“…each circle represents one of the ID sets used as input. The size of the circle corresponds with the number of unique IDs in that specific set. The overlap of each two circles also corresponds with the number of IDs belonging to both of the sets represented by these circles.”

References

Cuemath. A∩B Formula. https://www.cuemath.com/probability-a-intersection-b-formula/

Edwards, A. W. F. (2004). Cogwheels of the Mind: The Story of Venn Diagrams. Baltimore: Johns Hopkins University Press.

Hulsen, T., de Vlieg, J. & Alkema, W. BioVenn – a web application for the comparison and visualization of biological lists using area-proportional Venn diagrams. BMC Genomics 9, 488 (2008). https://doi.org/10.1186/1471-2164-9-488

Kho, J. (2020, January 3). How to Create and Customize Venn Diagrams in Python. Towards Data Science. https://towardsdatascience.com/how-to-create-and-customize-venn-diagrams-in-python-263555527305

Mandal, S. (2020). VennData. PyPI. https://github.com/mandalsubhajit/venndata

Nothman, J. (2020). UpSetPlot documentation. Read the Docs. https://upsetplot.readthedocs.io/en/stable/index.html

O’Brien RM. A consistent and general modified Venn diagram approach that provides insights into regression analysis. PLoS One. 2018 May 17;13(5):e0196740. doi: 10.1371/journal.pone.0196740. PMID: 29771948; PMCID: PMC5957393.

Tretyakov, K. (2022, April 6). Functions for plotting area-proportional two- and three-way Venn diagrams in matplotlib. PyPI. https://pypi.org/project/matplotlib-venn/

Venn M.A. (1880). On the diagrammatic and mechanical representation of propositions and reasonings, Philosophical Magazine Series 5, 10:59, 1-18. http://dx.doi.org/10.1080/14786448008626877

