Johns Hopkins COVID-19 Dataset in Pandas

COVID-19 is ravaging the globe. Let’s look at the excellent Johns Hopkins dataset using Pandas. This will serve both as a guideline for getting the data and exploring it on your own, and as an example of Pandas multi-indexing in an easy-to-understand situation. I am currently involved in science-responds.

My favorite links: worldometer, arcGIS, Projections

A few more: COVID-19 dash, nCoV2019, 91-DIVOC

# Python 3.6+ required
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from urllib.error import HTTPError
plt.style.use('ggplot')

Anyway, now that we’ve made some basic imports, let’s write a function that can read in a datafile from GitHub:

def get_day(day: pd.Timestamp) -> pd.DataFrame:

    # Read in a datafile from GitHub
    try:
        table = pd.read_csv(
            "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/"
            "master/csse_covid_19_data/csse_covid_19_daily_reports/"
            f"{day:%m-%d-%Y}.csv",
        )
    except HTTPError:
        return pd.DataFrame()
    
    # Cleanup - sadly, the format has changed a bit over time - we can normalize that here
    table.columns = [
        f.replace("/", "_")
        .replace(" ", "_")
        .replace("Latitude", "Lat")
        .replace("Longitude", "Long_")
        for f in table.columns
    ]

    # This column is new in recent datasets
    if "Admin2" not in table.columns:
        table["Admin2"] = None

    # New datasets have these, but they are not very useful for now
    table.drop(
        columns=["FIPS", "Combined_Key", "Lat", "Long_"], errors="ignore", inplace=True
    )
    
    # If the last update time was useful, we would make this day only, rather than day + time
    #   table["Last_Update"] = pd.to_datetime(table["Last_Update"]).dt.normalize()
    #
    # However, last update is odd, let's just make this the current day
    table["Last_Update"] = day
    
    # Make sure indexes are not NaN, which causes later bits to not work. 0 isn't
    # perfect, but good enough.
    # Return with a multi-index
    return table.fillna(0).set_index(
        ["Last_Update", "Country_Region", "Province_State", "Admin2"], drop=True
    )
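The filename pattern above relies on a `pd.Timestamp` formatting directly inside an f-string. A minimal sketch of just that piece:

```python
import pandas as pd

# The daily report files are named MM-DD-YYYY.csv; a Timestamp
# supports strftime-style format specs directly in an f-string
day = pd.Timestamp("2020-03-15")
filename = f"{day:%m-%d-%Y}.csv"  # → "03-15-2020.csv"
```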

Now let’s loop over all days and build a multi-index DataFrame with the whole dataset. We’ll be doing quite a bit of cleanup here as well. If you do this outside of a function, never modify an object across multiple cells; ideally, create an object like df and make any modifications and replacements in the same cell. That way, re-running any cell, or running a cell multiple times, will not cause unusual errors and problems to show up.
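The concat-then-deduplicate step used below can be seen on two toy "daily" frames (hypothetical values, same index levels trimmed to three for brevity):

```python
import pandas as pd

# Two tiny frames shaped like get_day's output
day1 = pd.DataFrame(
    {"Confirmed": [1, 2]},
    index=pd.MultiIndex.from_tuples(
        [("2020-01-22", "A", "X"), ("2020-01-22", "B", "Y")],
        names=["Last_Update", "Country_Region", "Province_State"],
    ),
)
day2 = pd.DataFrame(
    {"Confirmed": [3, 4]},
    index=pd.MultiIndex.from_tuples(
        # Deliberate duplicate key, as sometimes appears in the raw files
        [("2020-01-23", "A", "X"), ("2020-01-23", "A", "X")],
        names=["Last_Update", "Country_Region", "Province_State"],
    ),
)

# Stack the days, then collapse duplicate index entries by summing,
# exactly the pattern get_all_days uses
df = pd.concat([day1, day2])
df = df.groupby(level=df.index.names).sum()
```

After the groupby, the duplicated 2020-01-23 row is a single entry holding the sum.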

def get_all_days(end_day=None):

    # Use the current day as the end if none is given; days without a
    # published file (such as today) simply return empty frames
    if end_day is None:
        end_day = pd.Timestamp.now().normalize()

    # Make a list of all dates
    date_range = pd.date_range("2020-01-22", end_day)
    
    # Create a generator that returns each day's dataframe
    day_gen = (get_day(day) for day in date_range)
    
    # Make a big dataframe, NaN is 0
    df = pd.concat(day_gen).fillna(0).astype(int)
    
    # Remove a few duplicate keys
    df = df.groupby(level=df.index.names).sum()
    
    # Sometimes active is not filled in; we can compute easily
    df["Active"] = np.clip(
        df["Confirmed"] - df["Deaths"] - df["Recovered"], 0, None
    )
    
    # Change in confirmed cases (placed in a pleasing location in the table)
    df.insert(
        1,
        "ΔConfirmed",
        df.groupby(level=("Country_Region", "Province_State", "Admin2"))["Confirmed"]
        .diff()
        .fillna(0)
        .astype(int),
    )
    
    # Change in deaths
    df.insert(
        3,
        "ΔDeaths",
        df.groupby(level=("Country_Region", "Province_State", "Admin2"))["Deaths"]
        .diff()
        .fillna(0)
        .astype(int),
    )
    
    return df
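The ΔConfirmed/ΔDeaths columns rely on a per-group diff over the date level; a toy sketch with hypothetical counts for two countries:

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"]), ["A", "B"]],
    names=["Last_Update", "Country_Region"],
)
s = pd.Series([1, 10, 3, 30, 6, 60], index=idx, name="Confirmed")

# Difference within each country across days; the first day has no
# predecessor, so it becomes NaN and is filled with 0
delta = s.groupby(level="Country_Region").diff().fillna(0).astype(int)
```

Country A goes 1 → 3 → 6, so its deltas are 0, 2, 3; country B goes 10 → 30 → 60, giving 0, 20, 30.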

If this were a larger/real project, it would be time to bundle up the functions above and put them into a .py file - notebooks are for experimentation, teaching, and high level manipulation. Functions and classes should normally move to normal Python files when ready.

Let’s look at a few lines of this DataFrame to see what we have:

df = get_all_days()
df
Confirmed ΔConfirmed Deaths ΔDeaths Recovered Active Incidence_Rate Case-Fatality_Ratio Incident_Rate Case_Fatality_Ratio
Last_Update Country_Region Province_State Admin2
2020-01-22 Hong Kong Hong Kong 0 0 0 0 0 0 0 0 0 0 0
Japan 0 0 2 0 0 0 0 2 0 0 0 0
Macau Macau 0 1 0 0 0 0 1 0 0 0 0
Mainland China Anhui 0 1 0 0 0 0 1 0 0 0 0
Beijing 0 14 0 0 0 0 14 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2020-12-01 West Bank and Gaza 0 0 88004 2357 747 15 65938 21319 0 0 1725 0
Western Sahara 0 0 10 0 1 0 8 1 0 0 1 10
Yemen 0 0 2197 6 619 0 1514 64 0 0 7 28
Zambia 0 0 17665 18 357 0 17059 249 0 0 96 2
Zimbabwe 0 0 10129 179 277 1 8643 1209 0 0 68 2

951596 rows × 10 columns

The benefit of doing this all at once, in one DataFrame, should quickly become apparent. We can now use simple selection and grouping to “ask” almost anything about our dataset.

As an example, let’s look at just the US portion of the dataset. We’ll use the pandas selection .xs:

us = df.xs("US", level="Country_Region")
us
Confirmed ΔConfirmed Deaths ΔDeaths Recovered Active Incidence_Rate Case-Fatality_Ratio Incident_Rate Case_Fatality_Ratio
Last_Update Province_State Admin2
2020-01-22 Washington 0 1 0 0 0 0 1 0 0 0 0
2020-01-23 Washington 0 1 0 0 0 0 1 0 0 0 0
2020-01-24 Chicago 0 1 0 0 0 0 1 0 0 0 0
Washington 0 1 0 0 0 0 1 0 0 0 0
2020-01-25 Illinois 0 1 0 0 0 0 1 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
2020-12-01 Wyoming Teton 1655 11 2 0 0 1653 0 0 7053 0
Uinta 1089 19 4 0 0 1085 0 0 5384 0
Unassigned 0 0 9 9 0 0 0 0 0 0
Washakie 468 27 8 0 0 460 0 0 5996 1
Weston 409 12 2 1 0 407 0 0 5904 0

799274 rows × 10 columns
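The `.xs` selection used above can be seen in isolation on a tiny toy MultiIndex (hypothetical values):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("2020-03-01", "US", "Ohio"), ("2020-03-01", "Italy", "Lombardy")],
    names=["Last_Update", "Country_Region", "Province_State"],
)
toy = pd.DataFrame({"Confirmed": [5, 100]}, index=idx)

# .xs keeps all rows matching one value of a named level
# and drops that level from the result's index
us_toy = toy.xs("US", level="Country_Region")
```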

Notice we have counties (early datasets just have one “county”, called "0"). If we are only interested in states, we can group by the remaining levels and sum out the "Admin2" (county and similar) dimension:

by_state = us.groupby(level=("Last_Update", "Province_State")).sum()
by_state
Confirmed ΔConfirmed Deaths ΔDeaths Recovered Active Incidence_Rate Case-Fatality_Ratio Incident_Rate Case_Fatality_Ratio
Last_Update Province_State
2020-01-22 Washington 1 0 0 0 0 1 0 0 0 0
2020-01-23 Washington 1 0 0 0 0 1 0 0 0 0
2020-01-24 Chicago 1 0 0 0 0 1 0 0 0 0
Washington 1 0 0 0 0 1 0 0 0 0
2020-01-25 Illinois 1 0 0 0 0 1 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ...
2020-12-01 Virginia 240063 2228 4093 31 0 235970 0 0 386545 205
Washington 167216 2197 2805 31 0 164411 0 0 86756 37
West Virginia 48818 976 758 23 0 48060 0 0 134450 52
Wisconsin 416365 4635 3611 117 0 412754 0 0 504732 30
Wyoming 33805 500 239 24 0 33575 0 0 122846 11

16358 rows × 10 columns
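The level-based groupby-sum works the same way on a toy frame (hypothetical counts):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [
        ("2020-03-01", "Ohio", "Franklin"),
        ("2020-03-01", "Ohio", "Cuyahoga"),
        ("2020-03-01", "Texas", "Harris"),
    ],
    names=["Last_Update", "Province_State", "Admin2"],
)
toy = pd.DataFrame({"Confirmed": [2, 3, 7]}, index=idx)

# Grouping by the levels we keep sums out the Admin2 level
by_state_toy = toy.groupby(level=["Last_Update", "Province_State"]).sum()
```

Ohio's two counties collapse into a single row with Confirmed = 5.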

Using the same selector as before, we can pick out North Carolina:

by_state.xs("North Carolina", level="Province_State")
Confirmed ΔConfirmed Deaths ΔDeaths Recovered Active Incidence_Rate Case-Fatality_Ratio Incident_Rate Case_Fatality_Ratio
Last_Update
2020-03-10 7 0 0 0 0 7 0 0 0 0
2020-03-11 7 0 0 0 0 7 0 0 0 0
2020-03-12 15 8 0 0 0 15 0 0 0 0
2020-03-13 17 2 0 0 0 17 0 0 0 0
2020-03-14 24 7 0 0 0 24 0 0 0 0
... ... ... ... ... ... ... ... ... ... ...
2020-11-27 354514 8008 5210 72 0 349304 0 0 347832 127
2020-11-28 357958 3444 5219 9 0 352739 0 0 350853 126
2020-11-29 361778 3820 5240 21 0 356538 0 0 354434 126
2020-11-30 364512 2734 5261 21 0 359251 0 0 357016 125
2020-12-01 367395 2883 5284 23 0 362111 0 0 359678 125

267 rows × 10 columns

We can look at all of US, as well:

all_states = by_state.groupby(level="Last_Update").sum()
all_states
Confirmed ΔConfirmed Deaths ΔDeaths Recovered Active Incidence_Rate Case-Fatality_Ratio Incident_Rate Case_Fatality_Ratio
Last_Update
2020-01-22 1 0 0 0 0 1 0 0 0 0
2020-01-23 1 0 0 0 0 1 0 0 0 0
2020-01-24 2 0 0 0 0 2 0 0 0 0
2020-01-25 2 0 0 0 0 2 0 0 0 0
2020-01-26 5 0 0 0 0 5 0 0 0 0
... ... ... ... ... ... ... ... ... ... ...
2020-11-27 13088821 205557 264858 1404 4947446 12824526 0 0 14094193 4266
2020-11-28 13244687 155866 266047 1189 5023063 12979209 0 0 14272210 4262
2020-11-29 13383320 138633 266873 826 5065030 13117021 0 0 14414517 4235
2020-11-30 13541221 157901 268045 1172 5146319 13273736 0 0 14584892 4210
2020-12-01 13721304 180083 270642 2597 5226581 13451211 0 0 14770805 4235

315 rows × 10 columns

US total cases

Let’s try a simple plot first; this is the one you see quite often.

plt.figure(figsize=(10,5))
all_states.Confirmed.plot(logy=True, style='o');

(figure: US total confirmed cases over time, log scale)

Italy, new cases per day

As another example, let’s view the new cases per day for Italy. We will add a rolling mean, just to help guide the eye through the fluctuations - it is not a fit or anything fancy.
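A centered rolling mean averages a window spanning both sides of each point; a minimal sketch on made-up numbers:

```python
import pandas as pd
import numpy as np

s = pd.Series([0, 10, 20, 30, 40])

# A 5-point centered window needs two points on each side;
# positions without a full window come out NaN
smooth = s.rolling(5, center=True).mean()
```

Only the middle point has a complete window here, so it is the mean of all five values and the edges are NaN.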

interesting = df.xs("Italy", level="Country_Region").groupby(level="Last_Update").sum()

plt.figure(figsize=(10,5))
interesting.ΔConfirmed.rolling(5, center=True).mean().plot(style='-', label='Rolling mean')
interesting.ΔConfirmed.plot(style='o', label="Data")
plt.ylabel("New cases per day")
plt.legend();

(figure: Italy, new cases per day with a 5-day centered rolling mean)

Italy, transmission rate

It’s more interesting to instead look at the transmission rate per day, which is new cases / active cases. The colors in the plot start changing when Italy implemented a full lockdown on the 11th, and shift over 14 days, which is roughly 1x the time to first symptoms. The lockdown may take longer than that to take full effect, and several partial steps were taken earlier, on the 4th and 9th. Notice the transmission is slowing noticeably!

interesting = df.xs("Italy", level="Country_Region").groupby(level="Last_Update").sum()
growth = interesting.ΔConfirmed / interesting.Active
growth = growth['2020-02-24':]

# Color based on lockdown (which happened in 3 stages, 4th, 9th, and 11th)
lockdown = growth.index - pd.Timestamp('2020-03-11')
lockdown = np.clip(lockdown.days, 0, 14) / 14

fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(growth.index, growth, cmap='cool', c=lockdown)

ax.set_ylabel("new cases / active cases")

# Set major ticks every week
ax.xaxis.set_major_locator(mdates.WeekdayLocator())
# Format major ticks as month + day
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))
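The lockdown coloring above is just days-since-lockdown clipped into [0, 14] and scaled to [0, 1] for the colormap; it can be checked on a small date range:

```python
import pandas as pd
import numpy as np

dates = pd.date_range("2020-03-09", "2020-03-13")

# Days relative to the lockdown date: negative values (before) clip to 0,
# then scale so that 14 days after maps to 1.0
lockdown = np.clip((dates - pd.Timestamp("2020-03-11")).days, 0, 14) / 14
```

Dates on or before the 11th map to 0; two days after maps to 2/14.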

(figure: Italy transmission rate, colored by days since the lockdown)

US, transmission rate

Same plot for the US. The colors in the plot start changing when the US started the “15 Days to Slow the Spread” guidelines, and shift over 14 days, which is roughly 1x the time to first symptoms. Each state has implemented different guidelines, so the effect will be spread out even further. Again, we see the effect of the lockdown!

interesting = df.xs("US", level="Country_Region").groupby(level="Last_Update").sum()
growth = interesting.ΔConfirmed / interesting.Active
growth = growth['2020-03-01':]

# Not really a full lockdown, just a distancing guideline + local lockdowns later
lockdown = growth.index - pd.Timestamp('2020-03-15')
lockdown = np.clip(lockdown.days, 0, 14) / 14

fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(growth.index, growth, cmap='cool', c=lockdown)

ax.set_ylabel("new cases / active cases")

# Set major ticks every week
ax.xaxis.set_major_locator(mdates.WeekdayLocator())
# Format major ticks as month + day
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))

(figure: US transmission rate, colored by days since the distancing guidelines)
