0.4099583491000567Lab 2: Julia Quickstart
Functions, Logic, and Packages
Overview
In this lab we will learn how to work with tabular data in Julia. Specifically, you will get some experience using:
- DataFrames.jlto store tabular data as a DataFrame
- CSV.jlto read CSV files and convert them to DataFrames
- DataFramesMeta.jlto manipulate DataFrames
- Plots.jland- StatsPlots.jlto create visualizations
For those of you who took CEVE 543, you’ll find most of this familiar! If you find this challenging (e.g., if you’re new to programming) please look at the Resources page which has links to tutorials that you can use to get up to speed. Some of you will work extra hard this week, and some of you will have an easier week.
This repository
In this repository you will find two files.
- index.qmdis the source code for the rendered document on the course website that you may be looking at now. That’s this file!
- template.qmdis the template for your lab submission. You should edit it directly.
Submission
As with Lab 01, you should:
- push your final code to GitHub (I’ll be able to see it via GitHub classroom)
- submit your rendered PDF or DOCX file to Canvas
Setup
Here are some instructions for getting this lab working.
Clone the repository
First, you’ll need to clone this repository to your computer. As with Lab 01, I recommend to use GitHub Desktop or the built-in Git support in VS Code. Remember to use the link from Canvas (classroom.github.com/...).
Next, open the repository in VS Code (you can do this directly from GitHub desktop if you’d like). All instructions from here assume you’re in the root directory of the repository.
Install required packages
As we saw in Lab 01, Julia is a modular language with code in packages. Compared to a language like Python, the packages in Julia typically have a narrower scope (for example, instead of a single Pandas package that does everything, there are separate packages for reading CSV files, defining dataframes, using clear syntax for data manipulation, etc.). When we’re working with a new lab, we’ll need to first install the packages we need.
- Open the command palette and select Julia: Start REPL
- In the Julia REPL, type ]to enter package manager mode
- Type activate .to activate the project environment
- Type instantiateto install the packages listed in theProject.tomlfile. This may take a few minutes.1
Rendering and previewing
As we saw in Lab 01, Quarto lets you convert a text file into an output. We’ve seen PDF, DOCX, and HTML (website) outputs, but there are other options as well.
There are many valid workflows, but my favorite is often to preview the document in VS Code while I’m working on it, and then render it when I’m done. To preview the document, run the following in your terminal (use the command palette in VS and then select “Terminal: Open a new terminal” or learn the shortcut keys):
You should see that a web browser opens up with the rendered document. As you make changes, they should appear in the browser automatically. You’ll see localhost:XXXX in your URL bar.
To render this document to PDF or DOCX, you have a few options. My favorite is to use the terminal again. For example, to convert to PDF:
If you run into issues, try the following two tips
- Make sure you’re typing these commands into the terminal and not into the Julia REPL
- In the Julia REPL, type ]to enter package manager mode and then typebuild IJuliato rebuild your IJulia kernel (we won’t go into details)
Getting help
If you’re getting stuck, please:
- Come up and ask me questions if we’re in lab
- Post on Canvas discussions
- If I can’t resolve your comment on Canvas, please email me to schedule a 1:1
Looking ahead
In the future, you’ll repeat these steps for every lab:
- clonethe repository to your computer
- activatethe project environment
- instantiatethe packages
- make your changes, saving and commiting regularly as you go
- pushyour changes to GitHub (you don’t have to wait until the end for this – you can- pushmultiple times)
Refresher: Quarto basics
Before diving in, let’s quickly review some Quarto basics. As we saw in the last lab, Quarto is a program that lets you combine text, code, and output in a single document. Quarto files are just text files, typically with the file extension .qmd.
By default, all the text in a Quarto file is interpreted as Markdown, a simple markup language for formatting text. You’ve probably seen Markdown before. You can create headers with ## (for a section), ### (for a subsection), and so on. You can make text italic with *italic* and bold with **bold**. For more, you can learn more about it here.
When you’re authoring your labs, you should take advantage of Markdown features!
Document metadata
If you open a Quarto file in your text editor (e.g., VS Code) or look at it on GitHub, you’ll see that the file starts with some metadata. The metadata is a set of key-value pairs that tell Quarto how to render the document. In Lab 01, you edited the author field to include your name.
LaTeX Math
As in standard Pandoc markdown, you can use LaTeX math in Quarto. For example, $\alpha$ yields \alpha. You can also use $$ to create a block equation:
renders as
P(E) = { n \choose k } p ^k (2-p) ^ {n-k}
For more, see the “Typesetting Math” section of the resources page.
Source code
Sometimes we want to provide example code in our documents. This is code that is not meant to be run, but is just there to illustrate a point. We do that by wrapping the code in ```. For example:
yields
f(x) = 1.25 * sin(2π * x / 1.5 + 0.5) + 0.25
f(2.1)You will typically want to specify the language of the code block, which will tell Quarto how to syntax highlight it. For example, see how the highlighting changes when we specify julia:
Code blocks
Often, we don’t just want to show code, but we want to run it and show the output.
which yields
You can run these blocks in Julia by clicking the “Run Cell” button, or by pressing the keyboard shortcut (to see it, open the command palette and search for “Run Cell”). For more on Julia, see here.
Citations
You can add citations in Quarto. The easiest way is to export a bibliography from Zotero, and then add it to your Quarto document. You can use the Zotero Better BibTeX plugin to export a .bib file.
See here for instructions on using references with Quarto or see the website code for an example. I’ll provide a template for your final project.
Julia Quickstart
Loading packages
In Julia we say using to import a package. By convention we’ll put these at the top of our script or notebook in alphabetical order. When you run this cell, you’ll see a bunch of activity in your REPL as Julia goes through the following steps:
- Download a file from the internet that specifies which packages depend on which other packages
- Solve an optimization problem to identify which versions of which packages (including dependencies, and their dependencies, and so on) are compatible with each other
- Download the packages and compile them (this may take a few minutes)
Read in data
We will use the CSV.jl package to read in our data.
Hover over the numbers on the right of this code for explanations.
fname = "data/tidesandcurrents-8638610-1928-NAVD-GMT-metric.csv"
df = CSV.read(fname, DataFrame)
first(df, 5)- 1
- 
We define a variable called fnamethat stores the path to our data file. Thedatafolder is in the same directory as this notebook.
- 2
- 
We use the CSV.readfunction to read in the data. The first argument is the filename, and the second argument tells Julia to convert the data to aDataFrame. We store it as a variable calleddf.
- 3
- 
We use the firstfunction to show the first 5 rows of the DataFrame.
| Row | Date Time | Water Level | Sigma | I | L | 
|---|---|---|---|---|---|
| String31 | Float64 | Float64 | Int64 | Int64 | |
| 1 | 1928-01-01 00:00 | -0.547 | 0.0 | 0 | 0 | 
| 2 | 1928-01-01 01:00 | -0.699 | 0.0 | 0 | 0 | 
| 3 | 1928-01-01 02:00 | -0.73 | 0.0 | 0 | 0 | 
| 4 | 1928-01-01 03:00 | -0.669 | 0.0 | 0 | 0 | 
| 5 | 1928-01-01 04:00 | -0.516 | 0.0 | 0 | 0 | 
This data comes from the NOAA Tides and Currents website, specifically for a station at Sewells Point, VA for the year 1928. NAVD refers to the North American Vertical Datum, which is a reference point for measuring sea level, and GMT refers to Greenwich Mean Time, which is the time zone used in the data (rather than local time).
We can see that our DataFrame has five columns, the first of which is “Date Time”. However, the “Date Time” column is being parsed as a string. We want it to be a DateTime object from the Dates package. To do that, we need to tell Julia how the dates are formatted. We could then manually convert, but CSV.read has a kewyord argument that we can use
date_format = "yyyy-mm-dd HH:MM"
df = CSV.read(fname, DataFrame; dateformat=date_format)
first(df, 3)- 1
- 
This is a string that tells Julia how the dates are formatted. For example, 1928-01-01 00:00. See the documentation for more information.
- 2
- 
dateformatis a keyword argument whiledate_formatis a variable whose value is"yyyy-mm-dd HH:MM". We could equivalently writedateformat="yyyy-mm-dd HH:MM".
| Row | Date Time | Water Level | Sigma | I | L | 
|---|---|---|---|---|---|
| DateTime | Float64 | Float64 | Int64 | Int64 | |
| 1 | 1928-01-01T00:00:00 | -0.547 | 0.0 | 0 | 0 | 
| 2 | 1928-01-01T01:00:00 | -0.699 | 0.0 | 0 | 0 | 
| 3 | 1928-01-01T02:00:00 | -0.73 | 0.0 | 0 | 0 | 
The next column is “Water Level”, which is the height of the water above the reference point (NAVD) in meters. We can see that this is being parsed as a float, which is what we want 👍. However, you have to know that the data is in meters rather than inches or feet or something else. To explicitly add information about the units, we can use the Unitful package.
- 1
- 
We select the column with water levels using its name. The !means “all rows”. Thus,df[!, " Water Level"]is a vector of all the water levels stored.*=means to multiply in place. For example, ifx=2thenx *= 2is equivalent tox = x * 2..*=is a vector syntax, meaning do the multiplication to each element of the vector individually.1u"m"is aUnitfulobject that represents 1 meter. We multiply the water levels by this to convert them to meters.
| Row | Date Time | Water Level | Sigma | I | L | 
|---|---|---|---|---|---|
| DateTime | Quantity… | Float64 | Int64 | Int64 | |
| 1 | 1928-01-01T00:00:00 | -0.547 m | 0.0 | 0 | 0 | 
| 2 | 1928-01-01T01:00:00 | -0.699 m | 0.0 | 0 | 0 | 
| 3 | 1928-01-01T02:00:00 | -0.73 m | 0.0 | 0 | 0 | 
Subsetting and renaming
We want to only keep the first two (for more on the other three, see here). We can also rename the columns to make them easier to work with (spaces in variable names are annoying). To do this, we use the @rename function:
- 1
- 
The $is needed here because the right hand side is a string, not aSymbol.
Then, we can use the @select function to select the columns we want. Notice how the first argument to select is the DataFrame and the subsequent arguments are column names. Notice also that our column names were strings ("Date Time"), but we can also use symbols (:datetime).
| Row | datetime | lsl | 
|---|---|---|
| DateTime | Quantity… | |
| 1 | 1928-01-01T00:00:00 | -0.547 m | 
| 2 | 1928-01-01T01:00:00 | -0.699 m | 
| 3 | 1928-01-01T02:00:00 | -0.73 m | 
For more on what DataFramesMeta can do, see this Tweet.
Time series plot
Now we’re ready to make some plots of our data. Let’s start with a simple time series plot of the water levels. Our data is collected hourly, so we have a lot of data points! Still, we can plot them all.
plot(
    df.datetime,
    df.lsl;
    title="Hourly Water levels at Sewells Point, VA",
    ylabel="Water level",
    label=false,
)Focusing on the entire time series means we can’t dig into the details. Let’s zoom in on a single month (October 1928) using the @subset function.
t_start = Dates.DateTime(1928, 10, 1, 0)
t_end = Dates.DateTime(1928, 10, 31, 23)
df_month = @subset(df, t_start .<= :datetime .<= t_end)
first(df_month, 3)- 1
- 
This creates a DateTimeobject for the start of October 1928 at 0 hours. Defining it clearly here aids readability.
- 2
- 
This selects all the rows where the :datetimecolumn is betweent_startandt_end. The.syntax is called dot broadcasting and is a way to apply a function to each element of a vector.
| Row | datetime | lsl | 
|---|---|---|
| DateTime | Quantity… | |
| 1 | 1928-10-01T00:00:00 | 0.215 m | 
| 2 | 1928-10-01T01:00:00 | 0.429 m | 
| 3 | 1928-10-01T02:00:00 | 0.581 m | 
Now we can plot it as above:
Climatology
An essential idea in working with tabular data (and other data formats) is “split-apply-combine”. Essentially: split the data into groups, apply some function to each group, and then combine the results.
We can use this workflow to answer an interesting question: what is the average water level for each month?2 Of course, we’re only looking at one year of data here – we should ideally look at a long record!
df[!, :month] = Dates.month.(df.datetime)
dropmissing!(df, :lsl)
df_bymonth = groupby(df, :month)
df_climatology = combine(df_bymonth, :lsl => mean => :lsl_avg);- 1
- 
This creates a new column called :monththat is the month of each observation.
- 2
- 
This will discard any rows in dfthat have a missing value of:lsl. This is necessary because themeanfunction will returnmissingif any of the values are missing.
- 3
- 
This creates a GroupedDataFrameobject that contains all the data grouped by month.
- 4
- 
This takes the grouped data and calculates the mean of the :lslcolumn for each month. The general syntax iscombine(grouped_df, :column => function).
We can now plot the climatology.
plot(
    df_climatology.month,
    df_climatology.lsl_avg;
    xticks=1:12,
    xlabel="Month",
    ylabel="Average Water level",
    linewidth=3,
    label=false,
)- 1
- 
We can use df.colnameinstead ofdf[!, :colname]. The latter is more robust but the former is easier to type.
- 2
- 
Setting xtickswill set the x-axis ticks to the values in the vector. We can use this to make sure the x-axis ticks are labeled with the months.
- 3
- We can set the line width to make the plot easier to read.
All done!
Now that you’ve read through the instructions, you should:
- Ask questions about anything that doesn’t make sense to you!
- Head over to template.qmdand start working on your lab!
Footnotes
- Julia precompiles packages when they are installed, and (to a lesser extent) when they are first used. The first time you use a package it may take a moment to load. This is normal, nothing to worry about, and rapidly improving.↩︎ 
- To do a better job, we should separate out the long-term trend from the seasonal cycle. This is called de-trending and is a common technique in climate science. We can worry more about this later.↩︎