First Data Analysis Project: Covid-19 Data

Recently, I have enrolled in the “Python for Everybody” specialisation course on Coursera. My aim was to learn the basics of Python and be able to use it in a data analysis environment. Ideally, I wanted to be able to conduct a simple analysis using the tools learned in this course.

And I did that, and much more! I recommend this course for anyone willing to learn more about Python, SQL, XML and JSON, in a simple and accessible way. Having learned and used C during my thesis, I found Python much easier and very easy to understand.

The last module of the course was a “Capstone” module that involved working on a data analysis project of our choice. We had to choose the data source, build a database, extract information from it and visualise our results. We were free to use whatever tools and data source we wanted.

As you can see from the title, I decided to work on data about Covid-19. This pandemic has been weighing on my mind heavily, and I thought that understanding the numbers and what they mean would help me cope with this situation.

So, I aimed to answer one simple question:

What countries are most affected by Covid-19?

The Data

The data I used was sourced from the European Centre for Disease Prevention and Control. I chose this source because they had data available in several formats, which would allow me to play with what I’ve learned from the course.

I decided to use the JSON format, which seemed the most accessible for me.

The formatting of the file is very clean:

Covid-19 JSON data file

For each record, there’s a date, a number of cases, a number of deaths, the name of the country, a “geoID” and a “countryterritoryCode” which identify the country, and a population number (based on data from 2018). This organisation will allow me to parse this file very easily using a Python code.

Building the Database

I decided to organise the database into 3 tables:

  • Countries: it contains the name of the countries and their populations
  • Dates: it contains the date strings, and separate columns for year, month and day as integers (I didn’t use it here, but it would allow me to order data by date to get the situation for one single country over the weeks)
  • Info: it contains the number of cases and the number of deaths, per country per date.

I used the code “databasebuilder.py” to build this database in SQLite named “covid.sqlite”.

When done, the records table looks like this:

Covid-19 database in SQLite

Extracting Information

Now, the fun part: the analysis and visualisation. I separated the analysis between deaths and cases (deaths.py and cases.py), as observing the data made me realise that countries with most cases don’t necessarily have the most deaths (and vice versa).

As the process is the same for both codes, I’m going to talk about the cases.py code. This code extracts the cases information from the database, and calculate the total number of cases per country, which is then stored in a list that I can order by number of cases:

Covid-19 terminal output

Visualisation

Once this information is extracted, it is written in a JS file “casestotal.js”:

Covid-19 js file to be used for visualisation.

This code is then read by an htm code for the visualisation. I used the Google Charts packages, in particular their column charts tool. It gave me the following chart as a result:

The same process was applied for the number of deaths, which resulted in this chart:

The Analysis

From these two charts, I can conclude that, currently, the country most affected by Covid-19 in absolute numbers is the United States of America.

However, if I look at the second country in terms of deaths and third in terms of cases, Italy, I can immediately see the problem with these charts: these two countries have a very different population! Therefore, I need to modify my code to take in account relative numbers to the population in each country. I’ve used the same code, and filled the information into a different js file: “casesrelative.js” and “deathsrelative.js”, which I visualised in htm using the same package.

From these charts, I can conclude that as of the 12th of April 2020, Spain is the country most affected by Covid-19, with more cases and deaths per capita than any other country.

Further Thoughts

This is a very simple analysis from a simple database. To be more complete, I would need to take in account more parameters, including:

  • The rate of Covid-19 tests performed in each countries
  • Infrastructures available, such as the number of doctors per capita
  • Response from the country to the pandemic, i.e. if the country went on lockdown, what kind of restrictions were applied, etc.

It is important to bear in mind that the situation evolves rapidly in many countries, which means that my findings will likely change over time.

Links

Leave a Reply

Your email address will not be published. Required fields are marked *