Baby steps with Covid-19 data for (Clojure) programmers
March 18, 2020
The Corona pandemic is on everyone's mind. If your country has not been locked down yet, it will be soon. The world and human society are not going to be the same as before. Let's not worry too much about how the economy is going to be hit, but hope that we solve it health-wise.
Anyway, soon, most of us will have to spend most of our time inside, and when we solve basic needs, and make sure our loved ones are safe, we'll have some time to pass. Most people will re-watch their favorite TV shows and re-play their games. I'm sure that many will acquire various skills of America's Got Talent quality. Some programmers, though, will itch to try their skills on Covid-19 data.
Maybe you wanted to add some machine learning skills to your programming-fu, but the day job just made that impossible. Why not start now? If this thing is going to be on our minds for months, if not years, we might throw some programming magic at it.
By now, we've seen many shiny visualizations and analyses of the pandemic, published by experts and amateurs. Maybe you itch to throw it into a magic machine learning framework and get some super-x-ray insight from artificial intelligence.
I won't do that here. First, because I know nothing about epidemiology. Please do not take any conclusions you hear from non-experts for granted. They don't know what they're talking about. Second, because the data that we can publicly access is too scarce to be thrown to any machine learning beast. You might get some numbers out, but these numbers will tell you only what's obvious from the visualizations anyway, at best, or spit out complete garbage at worst.
Once the data becomes more reliable, and abundant, we might be able to use it for some insight, provided that we learn some basics of epidemiology by then. Until then, I propose that we brush up our basic data skills through pure play.
So, there is no need to feel sorry that you haven't learned some Big Machine Learning Framework yet. With basic programming skills, you can dissect the data that is currently available just fine, if not more easily! Just pure, plain Clojure, without specialized libraries!
Let's take some basic steps with the Covid-19 data published by Johns Hopkins University. I'll use a copy provided by Oscar Wahltinez in this Git repository.
Loading the data
The data is in a CSV file, so we require some useful namespaces for working with files.
(ns dragan.rocks.covid-19.world (:require [clojure.java.io :as io] [clojure.data.csv :as csv]))
CSV files are text files in which values are separated by commas and newlines. Typically, each line represents an observation, while the values of the different variables for that observation are separated by commas. The first task is to translate that into convenient data structures in memory.
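To make the idea of "lines of comma-separated values become data structures" concrete, here is a minimal sketch that uses only clojure.string. The names sample-csv, naive-parse-csv and rows->maps are made up for this illustration, and the naive split ignores quoting, which is why the article relies on clojure.data.csv for the real file.

```clojure
(require '[clojure.string :as str])

;; Hypothetical three-line sample in the same shape as world.csv.
(def sample-csv
  "Date,CountryCode,Confirmed\n2020-03-17,RS,55\n2020-03-18,RS,57")

(defn naive-parse-csv
  "Splits text into rows of fields. Ignores quoting, so it is only
   suitable for very simple files."
  [text]
  (map #(str/split % #",") (str/split-lines text)))

(defn rows->maps
  "Pairs every data row with the column names from the header row."
  [[header & rows]]
  (map #(zipmap header %) rows))

(rows->maps (naive-parse-csv sample-csv))
;; => ({"Date" "2020-03-17", "CountryCode" "RS", "Confirmed" "55"}
;;     {"Date" "2020-03-18", "CountryCode" "RS", "Confirmed" "57"})
```

Maps keyed by column name are handy for exploration, but the rest of this text sticks to plain vectors, which is what the CSV parser returns.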
We load the file as a resource, slurp its contents into a string, and parse it into a lazy sequence.
(csv/read-csv (slurp (io/resource "open-covid-19/output/world.csv")))
(["Date" "CountryCode" "CountryName" "Confirmed" "Deaths" "Latitude" "Longitude"] ["2019-12-31" "AE" "United Arab Emirates" "0" "0" "23.424076" "53.847818"] ["2019-12-31" "AF" "Afghanistan" "0" "0" "33.93911" "67.709953"] ["2019-12-31" "AM" "Armenia" "0" "0" "40.069099" "45.038189"] ["2019-12-31" "AT" "Austria" "0" "0" "47.516231" "14.550072"] ["2019-12-31" "AU" "Australia" "0" "0" "-25.274398" "133.775136"] ["2019-12-31" "AZ" "Azerbaijan" "0" "0" "40.143105" "47.576927"] ["2019-12-31" "BE" "Belgium" "0" "0" "50.503887" "4.469936"] ["2019-12-31" "BH" "Bahrain" "0" "0" "25.930414" "50.637772"] ["2019-12-31" "BR" "Brazil" "0" "0" "-14.235004" "-51.92528"] ["2019-12-31" "BY" "Belarus" "0" "0" "53.709807" "27.953389"] ["2019-12-31" "CA" "Canada" "0" "0" "56.130366" "-106.346771"] ["2019-12-31" "CH" "Switzerland" "0" "0" "46.818188" "8.227512"] ["2019-12-31" "CN" "China" "27" "0" "35.86166" "104.195397"] ["2019-12-31" "CZ" "Czech Republic" "0" "0" "49.817492" "15.472962"] ["2019-12-31" "DE" "Germany" "0" "0" "51.165691" "10.451526"] ...)
As you can see, this sequence contains vectors such as ["2019-12-31" "AT" "Austria" "0" "0" "47.516231" "14.550072"].
This is a sign that we successfully loaded the data. This is the data from the world.csv file, while there are a few more similar datasets: usa.csv, china.csv. We can create a convenience function for loading these files.
(defn read-open-covid [csv-name] (csv/read-csv (slurp (io/resource (format "open-covid-19/output/%s.csv" csv-name)))))
And now use it to load the world data and stash it into a global variable (normally a bad, bad, programming practice, but acceptable if we are only playing in the REPL, notebook-style). BTW, I run this code in emacs+CIDER, and automatically generate this post from org-mode. If you copy and paste the code, it should run in any Clojure REPL setup.
(def covid-world (read-open-covid "world"))
Feeling the basic structure
Now, the most basic info I can get is "What variables does this data set have?". CSV files typically list that in the first line, and we access it as the first element in our sequence.
(first covid-world)
["Date" "CountryCode" "CountryName" "Confirmed" "Deaths" "Latitude" "Longitude"]
To see what example data looks like, let's take the second row.
(second covid-world)
["2019-12-31" "AE" "United Arab Emirates" "0" "0" "23.424076" "53.847818"]
So, the date is in the YYYY-MM-DD format, which could be convenient for sorting. There is hope that Clojure can handle the comparison and sorting of these strings as-is, without conversion to proper date objects (spoiler: it does). Next is the country code of the observation, which is obviously a useful identifier. CountryName is redundant, but can be a fine time saver for all of us who do not remember all country codes. Next are the official number of confirmed cases of infection by Covid-19 and the official death toll. Latitude and longitude refer to the position of the country, and are included because this data set is used as a source for the visualization of the pandemic on the interactive world map that you can access here.
Unsurprisingly, on New Year's Eve, there were no (discovered!) cases of infection in the UAE.
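Why does the spoiler hold? In the YYYY-MM-DD format the most significant part comes first and every part is zero-padded, so plain string comparison agrees with chronological order. A quick check in the REPL, with a handful of made-up dates (shuffled-dates is a hypothetical name):

```clojure
;; Hypothetical, deliberately shuffled dates in the same format as the data set.
(def shuffled-dates ["2020-03-11" "2019-12-31" "2020-01-15" "2020-03-02"])

(sort shuffled-dates)
;; => ("2019-12-31" "2020-01-15" "2020-03-02" "2020-03-11")

;; compare returns a negative number when the first string sorts earlier.
(neg? (compare "2019-12-31" "2020-01-01"))
;; => true
```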
How many observations do we have
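The first row holds the column names rather than an observation, so we drop it and keep the rest as world-data. After that, the answer to this question is so easy to get that I'm out of inspiration for this paragraph.
(def world-data (rest covid-world))
(count world-data)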
5322
We have a little more than 5000 observations.
How many countries do we have this data for? To answer this question, we should access the country code of each observation and then see how many distinct codes we have. It may require more fiddling in some other programming languages, but in Clojure it's the bee's knees.
(count (distinct (map second world-data)))
143
So, is our data complete?
(rem (count world-data) (count (distinct (map second world-data))))
31
Apparently not, since there is a remainder in this division. Some dates are certainly missing for some countries.
This means that we can't blindly treat the data for all countries uniformly; whatever analysis we plan to do, we will have to account for these gaps.
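A non-zero remainder proves that the data is incomplete, but a zero remainder would not prove the opposite. A more direct check is to count the rows per country and see whether all the counts agree. The following is only a sketch on made-up rows; tiny-rows, rows-per-country and balanced? are names invented for this example.

```clojure
;; Hypothetical rows shaped like the data set: [date country-code ...].
(def tiny-rows
  [["2020-03-16" "IT"] ["2020-03-17" "IT"] ["2020-03-18" "IT"]
   ["2020-03-17" "RS"] ["2020-03-18" "RS"]])

(defn rows-per-country
  "Maps each country code to its number of rows."
  [rows]
  (frequencies (map second rows)))

(defn balanced?
  "True when every country has the same number of rows
   (trivially true when there are no rows at all)."
  [rows]
  (let [counts (vals (rows-per-country rows))]
    (or (empty? counts) (apply = counts))))

(rows-per-country tiny-rows) ;; => {"IT" 3, "RS" 2}
(balanced? tiny-rows)        ;; => false
```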
How many observations are missing
First, let's see how many distinct dates there are. Today is the 18th March 2020, and I can count that by hand on the calendar, but the point here is to do that using code.
(count (distinct (map first world-data)))
79
Since there are 143 countries and 79 dates, ideally there would be this many observations:
(* (count (distinct (map first world-data))) (count (distinct (map second world-data))))
11297
Which means that we are missing a little more than half of the data: we have only 5322 of the 11297 possible observations.
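If you want to know exactly which observations are missing, and not only how many, you can build the full grid of dates and countries and subtract what is present. This is a sketch on made-up rows, not something the original analysis does; sparse-rows and missing-observations are hypothetical names.

```clojure
(require '[clojure.set :as set])

;; Hypothetical rows shaped like the data set: [date country-code ...].
(def sparse-rows
  [["2020-03-17" "IT"] ["2020-03-18" "IT"] ["2020-03-18" "RS"]])

(defn missing-observations
  "Returns the set of [date country-code] pairs that do not occur in rows."
  [rows]
  (let [present   (set (map (fn [[date code]] [date code]) rows))
        dates     (distinct (map first rows))
        countries (distinct (map second rows))
        expected  (set (for [d dates, c countries] [d c]))]
    (set/difference expected present)))

(missing-observations sparse-rows)
;; => #{["2020-03-17" "RS"]}
```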
But that's not all. How many observations of the Confirmed variable are 0?
(count (filter zero? (map #(nth % 3) world-data)))
Execution error (ClassCastException) at dragan.rocks.covid-19.world/eval14376 (form-init3222897912165483565.clj:1). java.lang.String cannot be cast to java.lang.Number
We get the exception because "0" and "4" are not numbers, but strings of characters.
Let's convert these columns to proper types:
(def world-data2 (map (fn [[d cc cn conf death]] [d cc cn (Long/parseLong conf) (Long/parseLong death)]) world-data))
#'dragan.rocks.covid-19.world/world-data2
(count (filter zero? (map #(nth % 3) world-data2)))
2976
In more than half of the observations (2976 of 5322), there were no confirmed cases. But not even all zeros are equal. Some zeros are here because the pandemic hadn't reached a country on that particular date. Some other zeros might be there because no new cases were discovered in a country that had previous cases. But even that does not mean there were no new cases. In my country, Serbia, on some dates no tests were done (or, perhaps, were done but haven't been published, who knows).
The point is that this data is so early, that it is very scattered and very rough.
Anyway, let's see how many observations are recorded at all for each day (0 or otherwise).
(def date-freqs (sort-by first (frequencies (map first world-data2))))
(["2019-12-31" 66] ["2020-01-01" 66] ["2020-01-02" 66] ["2020-01-03" 66] ["2020-01-04" 66] ["2020-01-05" 66] ["2020-01-06" 66] ["2020-01-07" 66] ["2020-01-08" 66] ["2020-01-09" 66] ["2020-01-10" 66] ["2020-01-11" 66] ["2020-01-12" 66] ["2020-01-13" 66] ["2020-01-14" 66] ["2020-01-15" 66] ...)
At the beginning, most data is available for (probably) the same 66 countries.
Let's discover (by code) what's the first date with a different number of observations.
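The code for this step is not shown, so the following is only a sketch of one way to do it, assuming a sequence of [date count] pairs that is already sorted by date, like the one computed above. The sample numbers and the names sample-date-freqs and first-change are made up for illustration.

```clojure
;; Hypothetical [date count] pairs, already sorted by date.
(def sample-date-freqs
  [["2020-03-01" 66] ["2020-03-02" 66] ["2020-03-03" 66]
   ["2020-03-04" 64] ["2020-03-05" 61]])

(defn first-change
  "Returns the first [date count] pair whose count differs from the
   count on the first date, or nil when the count never changes."
  [freqs]
  (let [baseline (second (first freqs))]
    (first (drop-while (fn [[_ n]] (= n baseline)) freqs))))

(first-change sample-date-freqs)
;; => ["2020-03-04" 64]
```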
It seems that this 66 runs right until two weeks ago. And then?
All dates after the 3rd of March first see fewer observations, and then, starting with March 11th, the number of observations suddenly jumps. My hunch is that at first, most countries just submitted the default 0 to whoever collected this data (the World Health Organization, I suppose?), simply ignoring the problem. Then, as they started to realize the immediate danger, they were reluctant to send invented data (or the WHO stopped collecting the default zeros?), and then, on 15th March, the data becomes more complete. My hunch is that the global pandemic was officially announced sometime before that. Since this was in the past, I can simply check on the Internet (…typing away in the browser…): the pandemic was announced on March 11th 2020.
How much data do we have for each particular country
Analogously to the frequencies of observations on a particular date, we can count the frequencies related to countries; instead of the first column, we will use the second.
(def country-freqs (sort-by first (frequencies (map second world-data2))))
(["AD" 4] ["AE" 72] ["AF" 68] ["AG" 1] ["AL" 9] ["AM" 69] ["AR" 11] ["AT" 78] ["AU" 78] ["AZ" 71] ["BA" 5] ["BD" 3] ["BE" 78] ["BF" 5] ["BG" 8] ["BH" 77] ...)
Selecting your country
The human eye quickly gets lost in this bunch of numbers. Let's create a function that selects only the data available for the country, or a set of countries, that we are interested in.
For example, for this set of countries: #{"IT" "FR" "ES" "CN"}
(filter (fn [[_ code]] (#{"IT" "FR" "ES" "CN"} code)) world-data2)
(["2019-12-31" "CN" "China" 27 0] ["2019-12-31" "ES" "Spain" 0 0] ["2019-12-31" "FR" "France" 0 0] ["2019-12-31" "IT" "Italy" 0 0] ["2020-01-01" "CN" "China" 27 0] ["2020-01-01" "ES" "Spain" 0 0] ["2020-01-01" "FR" "France" 0 0] ["2020-01-01" "IT" "Italy" 0 0] ["2020-01-02" "CN" "China" 27 0] ["2020-01-02" "ES" "Spain" 0 0] ["2020-01-02" "FR" "France" 0 0] ["2020-01-02" "IT" "Italy" 0 0] ["2020-01-03" "CN" "China" 44 0] ["2020-01-03" "ES" "Spain" 0 0] ["2020-01-03" "FR" "France" 0 0] ["2020-01-03" "IT" "Italy" 0 0] ...)
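The filter above works because a Clojure set can be called as a function: it returns the element when it is a member, and nil otherwise. A small sketch on two made-up rows (wanted and two-rows are hypothetical names):

```clojure
(def wanted #{"IT" "FR" "ES" "CN"})

(wanted "IT") ;; => "IT" (truthy, so filter keeps the row)
(wanted "RS") ;; => nil  (falsy, so filter drops the row)

;; Hypothetical rows; only the first two columns matter here.
(def two-rows
  [["2020-03-18" "IT" "Italy"]
   ["2020-03-18" "RS" "Serbia"]])

(filter (fn [[_ code]] (wanted code)) two-rows)
;; => (["2020-03-18" "IT" "Italy"])
```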
We'll write some convenient functions for computing the previously discussed values.
(defn take-countries [data country-set] (filter (fn [[_ code]] (country-set code)) data))
(defn date-freqs [data] (sort-by first (frequencies (map first data))))
(defn country-freq [data] (sort-by first (frequencies (map second data))))
(def my-countries (country-freq (take-countries world-data #{"IT" "FR" "ES" "CN" "US" "RS" "DE"})))
Now we can see that most of these countries have pretty complete (if not overly reliable) data, while Serbia only recently started doing tests and reporting some numbers.
CN | 78 |
DE | 78 |
ES | 78 |
FR | 78 |
IT | 79 |
RS | 8 |
US | 78 |
Draw some plots
Instead of flashy plotting libraries, I'll draw some ASCII art. The first reason is that the data, although coarse, is so obvious that I don't want to give the false impression that you'll learn anything new that you haven't already seen in the news and on the Internet.
The second is: we are programmers, we present data in any silly way that we please!
I selected a pretty basic Java ASCII plotting library after a quick search on GitHub. Great thanks to Mitch Talmadge for ASCII-Data :)
(import 'com.mitchtalmadge.asciidata.graph.ASCIIGraph)
First I'll just take the number of confirmed cases from Serbia, and remove whatever zeros there are before the first case (we are not interested in plotting a flat line).
(drop-while zero? (map #(nth % 3) (take-countries world-data2 #{"RS"})))
1 | 5 | 18 | 24 | 41 | 46 | 55 | 57 |
(def rs-data (drop-while zero? (map #(nth % 3) (take-countries world-data2 #{"RS"}))))
Let's plot this.
(println (.plot (ASCIIGraph/fromSeries (double-array rs-data))))
57.00 ┤      ╭
56.00 ┤      │
55.00 ┤     ╭╯
54.00 ┤     │
53.00 ┤     │
52.00 ┤     │
51.00 ┤     │
50.00 ┤     │
49.00 ┤     │
48.00 ┤     │
47.00 ┤     │
46.00 ┤    ╭╯
45.00 ┤    │
44.00 ┤    │
43.00 ┤    │
42.00 ┤    │
41.00 ┤   ╭╯
40.00 ┤   │
39.00 ┤   │
38.00 ┤   │
37.00 ┤   │
36.00 ┤   │
35.00 ┤   │
34.00 ┤   │
33.00 ┤   │
32.00 ┤   │
31.00 ┤   │
30.00 ┤   │
29.00 ┤   │
28.00 ┤   │
27.00 ┤   │
26.00 ┤   │
25.00 ┤   │
24.00 ┤  ╭╯
23.00 ┤  │
22.00 ┤  │
21.00 ┤  │
20.00 ┤  │
19.00 ┤  │
18.00 ┤ ╭╯
17.00 ┤ │
16.00 ┤ │
15.00 ┤ │
14.00 ┤ │
13.00 ┤ │
12.00 ┤ │
11.00 ┤ │
10.00 ┤ │
 9.00 ┤ │
 8.00 ┤ │
 7.00 ┤ │
 6.00 ┤ │
 5.00 ┤╭╯
 4.00 ┤│
 3.00 ┤│
 2.00 ┤│
 1.00 ┼╯
Whoaaa. Although the numbers looked pretty tame, the graph shoots up to the skies. This is because the growth is exponential.
Since the exponential function grows really fast, the lower numbers quickly become minuscule. However, we are not interested in absolute numbers, but in growth. Therefore, it is more appropriate to take the logarithm of this function, and see whether the logarithm starts to drop off, if only for a tiny bit.
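To see why the logarithm helps, consider a made-up series that doubles every day. Its absolute steps keep growing, but the steps of its logarithm are all the same, so pure exponential growth shows up as a straight line on the log scale, and any slowdown shows up as a flattening of that line. The names doubling and steps are hypothetical and used only for this sketch.

```clojure
;; Hypothetical series that doubles every day: pure exponential growth.
(def doubling [1 2 4 8 16 32])

(defn steps
  "Differences between consecutive elements of xs."
  [xs]
  (map - (rest xs) xs))

(steps doubling)
;; => (1 2 4 8 16), the absolute steps keep growing

(steps (map #(Math/log (double %)) doubling))
;; every step is approximately (Math/log 2.0), about 0.693
```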
We need a convenience log function. I could have imported one from Neanderthal, but a fast CPU and GPU library is clearly overkill for such a task. Hopefully soon there will be an abundance of data, and we'll be able to put these nuclear options to use. For now, let's use sticks and stones.
(defn log ^double [^double x] (Math/log x))
#'dragan.rocks.covid-19.world/log
Now, plot the logarithm of the function of interest.
(println (.plot (ASCIIGraph/fromSeries (double-array (map log rs-data)))))
4.04 ┤   ╭───
3.03 ┤ ╭─╯
2.02 ┤╭╯
1.01 ┤│
0.00 ┼╯
It grows quickly, and it is only at the beginning.
Italy is overwhelmed
Now, let's see how Italy is holding up. For a few weeks we've been hearing really bad news.
(defn extract-data [country-code] (drop-while zero? (map #(nth % 3) (take-countries world-data2 #{country-code}))))
(take 5 (reverse (map log (extract-data "IT"))))
10.357933282865915 | 10.239245248219472 | 10.12414802355653 | 9.959726098983317 | 9.77905747415795 |
;; log-plot is assumed to mirror the log plot drawn for Serbia above
(defn log-plot [series-data] (.plot (ASCIIGraph/fromSeries (double-array (map log series-data)))))
(println (log-plot (extract-data "IT")))
10.36 ┤                                           ╭───
 9.39 ┤                                      ╭────╯
 8.42 ┤                                 ╭────╯
 7.46 ┤                             ╭───╯
 6.49 ┤                          ╭──╯
 5.53 ┤                       ╭──╯
 4.56 ┤                      ╭╯
 3.59 ┤                      │
 2.63 ┤                     ╭╯
 1.66 ┤                     │
 0.69 ┼─────────────────────╯
It hasn't started to slow down yet, although it looks like it is about to.
China is slowing down
China has already won this battle and, I hope, the war itself.
(take 5 (reverse (map log (extract-data "CN"))))
11.303808085389111 | 11.302451316756681 | 11.302142703354239 | 11.301871044753339 | 11.301636371103024 |
See how the numbers are rising slowly on the log scale. The absolute numbers are still bad, but each day they are less bad.
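The difference between two consecutive logarithms is the logarithm of the ratio of the underlying counts, so exponentiating it gives the day-over-day growth factor. Here is a sketch that applies this to the log values printed above for China and Italy; cn-logs, it-logs and daily-growth-factors are names made up for this example.

```clojure
;; Logs of the last five confirmed counts, newest first, as printed above.
(def cn-logs [11.303808085389111 11.302451316756681 11.302142703354239
              11.301871044753339 11.301636371103024])
(def it-logs [10.357933282865915 10.239245248219472 10.12414802355653
              9.959726098983317 9.77905747415795])

(defn daily-growth-factors
  "Given logs ordered newest first, returns the ratio of each day's
   count to the previous day's count."
  [logs]
  (map #(Math/exp (- %1 %2)) logs (rest logs)))

(daily-growth-factors cn-logs) ;; every factor is below 1.002
(daily-growth-factors it-logs) ;; every factor is between 1.12 and 1.20
```

In other words, over these days the confirmed count in China grew by a fraction of a percent per day, while in Italy it grew by roughly 12 to 20 percent per day.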
(println (log-plot (extract-data "CN")))
11.30 ┤                                           ╭─────────────────────────────────
10.30 ┤                                  ╭────────╯
 9.30 ┤                             ╭────╯
 8.30 ┤                          ╭──╯
 7.30 ┤                        ╭─╯
 6.30 ┤                    ╭───╯
 5.30 ┤                  ╭─╯
 4.30 ┤    ╭─────────────╯
 3.30 ┼────╯
We can calculate each change explicitly, and see this directly.
(defn absolute-plot [series-data] (.plot (ASCIIGraph/fromSeries (double-array series-data))))
;; each daily change is an observation minus the previous observation, shown in thousands
(let [cn (extract-data "CN")] (println (absolute-plot (map #(/ (double %) 1000.0) (map - (rest cn) cn)))))
(Plot of the daily change in confirmed cases in China, in thousands; run the code above to draw it.)
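One way to compute day-over-day changes is to subtract each value from its successor. Here is a minimal sketch on the short Serbian series listed earlier, which is small enough to check by hand (daily-changes and rs-confirmed are names made up for this example):

```clojure
;; Confirmed cases in Serbia from the first case on, as listed earlier.
(def rs-confirmed [1 5 18 24 41 46 55 57])

(defn daily-changes
  "Difference between each observation and the one before it."
  [xs]
  (map - (rest xs) xs))

(daily-changes rs-confirmed)
;; => (4 13 6 17 5 9 2)
```

The result has one element fewer than the input, since the first observation has no predecessor to compare with.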
Programmers, learn Machine Learning!
I hope this was easy and interesting, and it occupied your attention away from the news for at least some time.
Simple tools really make you think about the problem, so the flashy new tools are not necessary when you're just starting.
Although Machine Learning may look like a high mountain to climb, I hope this post proved to you that you made that first step long ago, with your first steps in programming!