In this course, we will use Covid-19 data to become familiar with Julia’s syntax and functioning.
In this lesson, we will write our first Julia script to load and transform data.
As we do so, we will discuss variables and types in Julia.
Julia scripts are text files and, by convention, have the .jl extension.
First, we need to load some packages.
This will take some time as Julia will pre-compile newly installed or updated packages. Next time you load these packages, it will go a lot faster.
using CSV
using DataFrames
using Dates
using JLD
Dates is a package from the Julia standard library (it was installed when you installed Julia).
The other packages are ones that you should have installed.
In Julia, variables are names bound to values.
These names are extremely flexible and can use any Unicode character.
The only rules are:
The Julia Style Guide recommends the following conventions:
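As a rough illustration, based on the Julia manual rather than on anything specific to this course: a name cannot begin with a digit or be a reserved keyword, and the stylistic conventions favour lowercase names, with underscores only where they help readability. All the names below are made up for the example.

```julia
# Unicode characters are valid in names
δ = 0.5
café = "open"

# Lowercase names; underscores only when they help readability
ncases = 10
confirmed_cases_by_country = [1, 2, 3]

# Names are just labels: the same name can be rebound to another value
ncases = "ten"

# These would be errors (left commented out so the block runs):
# 2cases = 10    # a name cannot begin with a digit
# else = 10      # a name cannot be a reserved keyword
```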
The first variable we are creating is bound to a string.
That string is the path (on your system) of the file time_series_covid19_confirmed_global.csv (part of the data that you should have installed). It can be either relative to the directory in which your Julia session is running or absolute.
This is what it looks like on my system (replace it with the proper path on your machine!):
file = "../../data/covid/csse_covid_19_data/csse_covid_19_time_series/" *
"time_series_covid19_confirmed_global.csv"
Note: * in Julia allows string concatenation, so I used it to break the very long path name. You don’t have to do that.
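Here is a small sketch of string concatenation on its own. The directory and file names are hypothetical; string and joinpath are shown only as alternatives from Base, not as what this lesson uses.

```julia
dir = "data/covid/"       # hypothetical directory
name = "confirmed.csv"    # hypothetical file name

path1 = dir * name            # concatenation with *
path2 = string(dir, name)     # string() also concatenates its arguments

# joinpath inserts the separator appropriate for the operating system
path3 = joinpath("data", "covid", name)
```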
You might have noticed that Julia returns the value, even when you assign it to a variable (this is different from the behaviour of R and Python).
To avoid this, add a semicolon (;) at the end:
file = "../../data/covid/csse_covid_19_data/csse_covid_19_time_series/" *
"time_series_covid19_confirmed_global.csv";
I mentioned that our first variable was a string.
So let’s talk about types in Julia.
Note that variables don’t have types since they are simply names bound to values. Values have types.
In static type systems, type safety (catching errors of inadequate type) is performed at compile time.
In dynamic type systems, type safety is done at runtime.
Julia’s type system is dynamic (types are unknown until runtime), but types can be declared, optionally bringing the advantages of static type systems.
This gives users the freedom to choose between an easy and convenient language, or a clearer, faster, and more robust one (or a combination of the two).
To know the type of an object, use the function typeof.
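A minimal sketch of typeof, which also shows that a type belongs to the value and not to the name (the variable names are arbitrary):

```julia
x = 2
t1 = typeof(x)        # an integer type (Int64 on a 64-bit system)

x = "two"             # same name, now bound to a string
t2 = typeof(x)        # String

t3 = typeof(2.0)      # Float64
t4 = typeof([1, 2])   # a vector of integers
```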
Types are declared (or asserted) with the :: operator:
<value>::<type>
Example:
2::Int
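To make the behaviour of :: a little more concrete, here is a sketch of what happens when the assertion holds, when it fails, and when :: is used on a local variable inside a function. The function asfloat is a made-up helper for illustration.

```julia
ok = 2::Int            # the assertion holds, so the value 2 is returned

# A failing assertion throws a TypeError; it is caught here so the block runs
err = try
    2::String
catch e
    e
end

# On a local variable, the declaration converts assigned values to that type
function asfloat(n)
    result::Float64 = n
    return result
end

f = asfloat(2)         # the integer 2 is converted to the float 2.0
```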
Now that we have the path of our file, we can create a new variable.
This one is a dataframe and holds the confirmed Covid-19 data:
dat = DataFrame(CSV.File(file))
Some useful functions:
typeof(dat)
names(dat)
size(dat)
nrow(dat)
ncol(dat)
To access a column without copying it (changes made to it will change dat):
dat[!, 1]
dat[!, "Province/State"]
dat[!, :"Province/State"]
dat."Province/State"
To make a copy of a column (changes made to it will not change dat):
dat[:, 1]
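The difference between ! and : can be checked on a toy data frame (df and its column a are invented for this sketch):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])

same = df[!, :a]     # the column itself, no copy
copied = df[:, :a]   # a new vector with the same values

same[1] = 100        # modifies df
copied[2] = 200      # leaves df untouched

result = df.a        # [100, 2, 3]
```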
typeof(dat."Province/State")
This weird type is not what we want. We want a String.
The function string converts a value to a String.
Before applying it to our vector, let’s play with this function a little:
string([1, 2, 3])
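A few more calls, as a sketch, show the difference between converting a whole vector and converting each of its elements with the dot (broadcasting) syntax:

```julia
s1 = string(42)            # "42"
s2 = string([1, 2, 3])     # the whole vector becomes one string: "[1, 2, 3]"
s3 = string.([1, 2, 3])    # broadcasting converts each element: ["1", "2", "3"]
s4 = string("a", 1, :b)    # several arguments are concatenated: "a1b"
```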
Let’s get rid of columns we won’t use and bring the country column to the left:
select!(dat, vcat(2, 1, collect(5:ncol(dat))))
The function rename! can take a dictionary that maps the old column names (or positions) to the new names:
rename!(dat, Dict(1 => :country, 2 => :province))
or
rename!(dat, Dict([(1, :country), (2, :province)]))
Let’s transform our data frame into long format:
datlong = stack(dat, Not([:country, :province]),
                variable_name = :date,
                value_name = :confirmed)
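To see what stack does, here is a sketch on a tiny made-up data frame with the same shape as ours (one identifier column and one column per date):

```julia
using DataFrames

# Hypothetical wide data: one row per country, one column per date
wide = DataFrame("country" => ["A", "B"],
                 "1/22/20" => [0, 1],
                 "1/23/20" => [2, 3])

# Every column except country is stacked into date/confirmed pairs
long = stack(wide, Not(:country),
             variable_name = :date,
             value_name = :confirmed)

dims = size(long)       # 4 rows (2 countries × 2 dates), 3 columns
cols = names(long)      # ["country", "date", "confirmed"]
```

Each row of the long data frame holds one country, one date, and the corresponding count.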
This does not look good:
datlong.date
We want the date to look like YYYY-MM-DD and to be of type Date.
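Before transforming a whole column, the idea can be tried on a single value. This sketch assumes the dates are written like "1/22/20" (month/day/two-digit year); the variable names are arbitrary.

```julia
using Dates

raw = "1/22/20"   # example value in m/d/yy style

# Insert "20" before the last two characters: "1/22/20" becomes "1/22/2020"
fixed = replace(raw, r"(.*)(..)$" => s"\g<1>20\2")

# Parse the string into a Date using a format string
d = Date(fixed, "m/dd/yy")
```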
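# Insert "20" before the two-digit year ("1/22/20" becomes "1/22/2020"),
# then parse each string as a Date
datlong.date = Date.(replace.(string.(datlong.date),
                              r"(.*)(..)$" => s"\g<1>20\2"),
                     "m/dd/yy")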
Let’s have a look at our final cleaned data frame:
datlong
In the next session, you will start from here.
There are various approaches to do this.
Here, we will save our cleaned data frame to a file that can be loaded in the next session.
For this, we are using the package JLD, which allows saving and loading Julia data in .jld files.
Note that a single .jld file can contain several objects.
save("covid.jld", "confirmed", datlong)
This will save the file covid.jld in the working directory of the REPL.
You can save it elsewhere by giving an absolute or relative path instead of just a file name. For instance, on my machine, this is where I am saving it:
save("../../data/covid.jld", "confirmed", datlong)
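As a sketch of how several objects can share one .jld file and be read back, here is an example with two small made-up vectors written to a temporary file (the path and the names counts and labels are purely illustrative):

```julia
using JLD

path = joinpath(mktempdir(), "example.jld")   # hypothetical temporary file

counts = [1, 2, 3]
labels = ["a", "b", "c"]

# Several name/value pairs can go into one file
save(path, "counts", counts, "labels", labels)

# Load a single object by name...
counts2 = load(path, "counts")

# ...or everything at once, as a dictionary keyed by name
everything = load(path)
```

The same load call, with the path and name used when saving, is how a data frame stored this way would be read back in a later session.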