3 Madison Lakes

3.1 Lake Mendota Freezing and Thawing

Lake Mendota is the largest of the four lakes in Madison, Wisconsin. The University of Wisconsin sits on part of its southern shore. The lake is over five miles long from east to west and about four miles wide from north to south at its widest point. The surface area of the lake is about 4000 hectares (15.5 square miles, about 10,000 acres).

Each winter, Lake Mendota freezes. In some winters, the lake freezes, thaws, and then freezes again. Because of its proximity to the University of Wisconsin, the lake has been heavily studied. Since the 1850s, scientists have recorded the dates each winter when Lake Mendota (and the other Madison lakes) freezes and thaws. The Wisconsin State Climatology Office maintains the records, in recent decades with assistance from the UW-Madison Department of Atmospheric and Oceanic Sciences.

3.1.1 Criteria for freezing/thawing

Officially, Lake Mendota is considered to be closed if more than half the surface is covered by ice and open otherwise. The determination of the dates when the lake first closes and subsequently opens attempts to follow protocols that originated in the middle of the 1800s, when the determination was based on observations from several vantage points. For a change in status (open to closed or closed to open) to be deemed official, it needs to persist until the next day. There is admittedly some subjectivity in determining the dates, but this subjectivity rarely shifts a date by more than one day. On Lake Mendota, there is an additional criterion (see http://www.aos.wisc.edu/~sco/lakes/msn-lakes_instruc.html).

Determining the opening and closing dates for Lake Mendota is more of a challenge because the length and shape of the lake would require a sufficiently high vantage point that was not readily available to 19th century observers. Partly because Lake Mendota has a more irregular shoreline, an important secondary criterion applies for that lake: whether one can row a boat between Picnic Point and Maple Bluff. This rule arose from the era of E. A. Birge and Chancey Juday (according to Reid Bryson, founder of the UW Meteorology Dept., now known as the Dept. of Atmospheric and Oceanic Sciences), because they frequently were out on the lake in a rowboat, and the ice along that line determined if they could transport a case of beer over to their friends in Maple Bluff.

3.1.2 Map

The University of Wisconsin—Madison campus sits on the south shore of Lake Mendota. The red dashed line in the map below connects Picnic Point and Maple Bluff. Most photographs below are taken from Picnic Point in the direction of this line.

3.1.3 Winter of 2020-2021

During the 2020-2021 winter, Lake Mendota was officially declared closed by ice on January 3, 2021, and it reopened on March 20, 2021. The following images show the view from Picnic Point toward Maple Bluff before and after these dates. Fortuitously, snow fell on January 4; it gathered on the ice surface but melted in the open water, making it easy to observe the boundaries.

Early and Mid December

There was no ice on the surface of Lake Mendota near Picnic Point.

Maple Bluff from Picnic Point on December 2, 2020

Maple Bluff from Picnic Point on December 17, 2020

Early January

On January 2, the last day that Lake Mendota was observed as open, thin sheets of ice were beginning to form on the line from Picnic Point toward Maple Bluff. There was still much open water between Picnic Point and the capitol and UW campus, but ice extended several hundred feet from the southern shores of the lake and over shallow bays.

Maple Bluff from Picnic Point on January 2, 2021

By January 5, almost the entire lake was covered with ice. A small region near Picnic Point, with an area less than 1% of the total lake surface, was still open, as can be seen in this photo. However, much of the path between Picnic Point and Maple Bluff was ice covered.

Maple Bluff from Picnic Point on January 5, 2021

A photo from higher up near the end of Picnic Point shows that most of the northern body of the lake was ice covered. Although not pictured, this part of the lake was nearly all open water on January 2.

North from above Picnic Point on January 5, 2021

A second photo shows the northern part of Lake Mendota from a vantage point about a quarter mile west of Picnic Point on January 5. On January 2, the ice extended only about 100-200 feet from the shore, but by January 5, the entire surface of the lake visible from this vantage point was covered with ice for miles to the far shore.

North from the Path to Picnic Point on January 5, 2021

Mid March

The day before Lake Mendota was declared open, there was much open water, but there remained a sheet of ice along the shore by Maple Bluff, barely visible in the background of the photo. In the foreground, a large pile of ice which had been blown into shore the previous evening is visible.

Maple Bluff from Picnic Point on March 19, 2021

The day after Lake Mendota was declared open, the path from Picnic Point to Maple Bluff was completely free of ice.

Maple Bluff from Picnic Point on March 21, 2021

3.2 Lake Mendota Questions

As we analyze the historical Lake Mendota data, we will look for patterns in how various aspects of freezing and thawing have changed over time. Some potential motivating questions are the following:

  1. Has the total duration of the lake being closed with ice over a single winter changed, and if so, how?
  2. How have the typical dates that the lake first freezes and last thaws changed over time?
  3. Are trends in total durations of being closed with ice by year adequately explained by modeling with a straight line, or is a curve discernibly better?
  4. By how much do individual observations in a given winter tend to vary from the overall trend?
  5. What do we predict for future years in terms of the total duration of being closed by ice, the first closed date, or the last open date?
  6. Is there evidence of a changing climate apparent in this data?

In the rest of this chapter, we explore the first of these questions.

3.2.1 Lake Mendota Data

The Lake Mendota data is shared on a Department of Atmospheric and Oceanic Sciences website http://www.aos.wisc.edu/~sco/lakes/Mendota-ice.html. The data on this page cannot be read directly by software into a convenient format. The file lake-mendota-raw-2021.csv contains the data from this web page through the 2020–2021 winter, formatted by hand to be machine readable with slight differences in how the information is stored. This data format is called a comma-separated values, or CSV, file. The first row contains the column headers and the data is in subsequent rows. Within a row, each value is separated from the others with a comma. Typically, all rows contain the same variables, and so will have the same number of commas. The first few lines of the file look like this.

winter,closed,open,days
1852-53,NA,5 Apr,NA
1853-54,27 Dec,NA,NA
1854-55,NA,NA,NA
1855-56,18 Dec,14 Apr,118
1856-57,6 Dec,6 May,151

The symbol NA represents missing data. There are some winters where the data is recorded on multiple rows. For example, in the winter of 1936–1937, the lake was closed with ice on December 7, then reopened about three weeks later on December 30 before closing again on January 5. This second interval lasted much longer, and the lake reopened on April 13. Lake Mendota was closed with ice for a total of 121 days during this winter.

1936-37,7 Dec,30 Dec,NA
1936-37,5 Jan,13 Apr,121
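As a minimal sketch of how a file in this format might be read with the readr package, the lines shown above can be written to a temporary file and read back. The temporary file and the object name mendota_raw are assumptions made so the example is self-contained; in practice one would pass the path to lake-mendota-raw-2021.csv instead.

```r
library(readr)

# The sample rows shown above, written to a temporary file so that the
# example runs without the course data file.
sample_lines <- c(
  "winter,closed,open,days",
  "1852-53,NA,5 Apr,NA",
  "1853-54,27 Dec,NA,NA",
  "1854-55,NA,NA,NA",
  "1855-56,18 Dec,14 Apr,118",
  "1856-57,6 Dec,6 May,151",
  "1936-37,7 Dec,30 Dec,NA",
  "1936-37,5 Jan,13 Apr,121"
)
tmp <- tempfile(fileext = ".csv")
writeLines(sample_lines, tmp)

# "cccd": read winter, closed, and open as character and days as double.
# By default, read_csv() treats the string "NA" as a missing value.
mendota_raw <- read_csv(tmp, col_types = "cccd")
mendota_raw
```

Reading the date columns as character strings keeps values such as "7 Dec" intact until they can be combined with the correct year.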

3.2.2 Wrangling the Lake Mendota Data

To address our first question, we first need to wrangle the data into a more useful form. We want a single row for each winter, a variable we can plot on the x axis, such as the first year of the winter, and the total duration that the lake is closed with ice each winter. Rather than trusting the values given, we can recalculate the length of time in days between the two dates on each line and then sum the totals for each winter.

To do all of these tasks, we will use tools from a variety of the tidyverse packages. The readr package has functions to read in data from various formats, including CSV files. The dplyr package has many functions to wrangle data, modifying variables and doing various summaries and recalculations. To get the numerical value of the first year of the winter, we may use tools from the stringr package which manipulates strings. A string, in computer science, is an array of characters, such as “1936-37”. To calculate the duration that the lake is closed during each interval, we need to interpret strings such as “7 Dec” and “5 Jan” as dates from the appropriate year (the first year for December and the second year for January). The lubridate package has many functions to work with dates. Each of these packages is described in the R for Data Science textbook and in a later chapter of these course notes.

In summary, the first task is to create a new file where each row contains a single interval when the lake is closed by ice, as in the raw data, but with some transformed variables and some additional ones, discarding rows with missing data. Second, we can summarize the interval-based data set to one with a row for each winter, adding up the durations of the intervals in winters that have more than one.
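The first task might be sketched as follows. This is an illustration under stated assumptions, not necessarily the code used to produce the course files: the helper to_date() is hypothetical, it assigns the months July through December to the first year of the winter and January through June to the second, and it relies on an English locale so that lubridate can parse abbreviations such as "Dec". The rows are copied from the raw data shown earlier.

```r
library(dplyr)
library(tibble)
library(stringr)
library(lubridate)

# A few rows copied from the raw data (the days column is recalculated below).
raw <- tibble(
  winter = c("1852-53", "1855-56", "1856-57", "1936-37", "1936-37"),
  closed = c(NA, "18 Dec", "6 Dec", "7 Dec", "5 Jan"),
  open   = c("5 Apr", "14 Apr", "6 May", "30 Dec", "13 Apr")
)

# Hypothetical helper: attach the appropriate year to a string such as "7 Dec"
# and convert the result to a date.
to_date <- function(day_month, year1) {
  late_in_year <- str_detect(day_month, "Jul|Aug|Sep|Oct|Nov|Dec")
  dmy(str_c(day_month, " ", if_else(late_in_year, year1, year1 + 1)))
}

intervals <- raw %>%
  filter(!is.na(closed), !is.na(open)) %>%     # discard rows with missing dates
  mutate(
    year1 = as.numeric(str_sub(winter, 1, 4)), # "1936-37" becomes 1936
    year2 = year1 + 1,
    closed = to_date(closed, year1),
    open = to_date(open, year1),
    duration = as.numeric(open - closed)       # days between the two dates
  )
intervals
```

For these rows the recalculated durations are 118, 151, 23, and 98 days. The first two agree with the values recorded in the raw file, and the two intervals of 1936–37 sum to the recorded total of 121 days.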

3.2.3 Loading the Libraries

When you load the tidyverse package in a new R session, you will usually see a message similar to this.

── Attaching packages ───────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.2     ✓ purrr   0.3.4
✓ tibble  3.0.4     ✓ dplyr   1.0.2
✓ tidyr   1.1.2     ✓ stringr 1.4.0
✓ readr   1.4.0     ✓ forcats 0.5.0
── Conflicts ──────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

This is normal and does not indicate an error. The first part of the message names the packages which were loaded as the core portions of tidyverse with version numbers. (Your version numbers might be higher.)
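A message like the one above is produced by a call such as the following. The separate call for lubridate is included on the assumption that a tidyverse version like the 1.3.0 shown in the message is installed, since lubridate does not appear among the packages attached there.

```r
# Attach the core tidyverse packages (ggplot2, tibble, tidyr, readr, purrr,
# dplyr, stringr, forcats).
library(tidyverse)

# lubridate is not listed in the message above, so attach it separately.
# Attaching a package that is already attached is harmless.
library(lubridate)
```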

3.3 Transformed Lake Mendota Data

Following the transformation strategy described above results in two separate data sets, each of which may be written to its own CSV file. The resulting files are lake-mendota-interval-2021.csv with one row per interval when the lake is closed with ice, and lake-mendota-winters-2021.csv with one row per winter.

3.3.1 Lake Mendota Variables

The file lake-mendota-intervals-2021.csv has the following variables.

Variable   Description
winter     two-year range of the start and end of the winter
year1      the first year of the winter
year2      the second year of the winter
closed     the date of the time interval when the lake first closed with ice
open       the subsequent date of the time interval when the lake again opened
duration   the duration of time in days the lake is closed in the interval

The variables in lake-mendota-winters-2021.csv include winter, year1, and duration, defined as in the previous file, except that duration is the sum of the durations of all intervals when the lake was closed in the winter. Additional variables are:

Variable       Description
intervals      the number of time intervals the lake is closed
first_freeze   the date the lake first closes with ice
last_thaw      the date the lake opens after the final interval of being closed
decade         the decade (1850, 1860, …) of the winter
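The step from one row per interval to one row per winter might be sketched with dplyr as follows. The small interval table is built by hand from dates given earlier in the chapter. Storing the winter total in a column named duration, and computing the decade by rounding year1 down to a multiple of ten, are assumptions made for illustration.

```r
library(dplyr)
library(tibble)

# Three intervals taken from the raw data: one winter with a single interval
# and one winter with two.
intervals <- tibble(
  winter = c("1855-56", "1936-37", "1936-37"),
  year1 = c(1855, 1936, 1936),
  closed = as.Date(c("1855-12-18", "1936-12-07", "1937-01-05")),
  open = as.Date(c("1856-04-14", "1936-12-30", "1937-04-13")),
  duration = c(118, 23, 98)
)

winters <- intervals %>%
  group_by(winter, year1) %>%
  summarize(
    intervals = n(),             # number of closed intervals in the winter
    duration = sum(duration),    # total days closed over all intervals
    first_freeze = min(closed),  # first date the lake closed
    last_thaw = max(open),       # date the lake opened after the final interval
    .groups = "drop"
  ) %>%
  mutate(decade = floor(year1 / 10) * 10)
winters
```

The winter of 1936–37 collapses from two rows to one, with two intervals and a total duration of 121 days.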

3.3.2 Plotting Duration Closed versus Time

The second data set contains the variables we wish to explore to examine how the annual total duration that Lake Mendota is closed with ice changes over time. We plot the data with points, add lines to make it easier to follow the path, and add a smooth curve which helps to see the overall pattern, lessening the visual impact of the year-to-year variability.
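A plot of this kind might be built with ggplot2 roughly as follows. The helper plot_duration() is hypothetical, and the loess smoother is an assumption, since the text does not say which smoothing method produced the curve in the figure. To keep the example self-contained it is applied to the durations of six early winters reported in this chapter. The figure discussed in the text is drawn from the full winters file.

```r
library(ggplot2)
library(tibble)

# Hypothetical helper: points, connecting lines, and a smooth curve.
plot_duration <- function(winters) {
  ggplot(winters, aes(x = year1, y = duration)) +
    geom_point() +
    geom_line() +
    # Assumed smoother; the method behind the chapter's figure is not stated.
    geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
    xlab("First year of the winter") +
    ylab("Total days closed with ice")
}

# Durations of six early winters as reported in this chapter.
early <- tibble(
  year1 = 1855:1860,
  duration = c(118, 151, 121, 96, 110, 117)
)
p <- plot_duration(early)
```

With only six winters the smooth curve carries little meaning. Passing the full data set read from lake-mendota-winters-2021.csv to the same function would give a plot comparable to the one described here.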

Effective graphs are a highly efficient way to capture and convey important data summaries. In this graph, the smooth line represents an estimate of the mean duration that Lake Mendota is closed by ice due to the climate, and changes in this curve estimate how the climate has changed with respect to this variable. The scatter of points around the smooth curve represents the effects of weather, the noisy deviations of the actual annual data around the more slowly changing trend.

From this graph we can note the following:

  • In the mid 1850s, it was typical for Lake Mendota to be at least 50% covered by ice for over 120 days, or about four months each year.
  • In more recent times, Lake Mendota is closed by ice for only about 80 days in a typical winter, about a 33 percent decrease from the typical amount in the 1850s, or nearly six weeks less.
  • The typical total duration of being closed with ice appears to have decreased rapidly from the start of data collection in the 1850s until about 1900, stabilized until just after 1950, and then begun a steady decrease that persists today.
  • While there is a clear decline in the total duration of being closed by ice over time, there has always been considerable year-to-year variation. It is not unusual in any particular year for the actual realized duration of being closed by ice to vary from the mean trend by as much as two to three weeks in either direction.
  • A model of this data might include a smooth curve over time, representing the long-term climate behavior, with a random year-by-year process that captures the variability due to the weather in a given year.
  • In such a model, the curve that shows a decrease in typical values is the signal while the annual fluctuations are noise.

3.3.3 Modeling Lake Mendota Data

A statistical model for the duration of time in days that Lake Mendota is at least half covered by ice as a function of time (represented by the first year of the corresponding winter) takes the form \[ y_i = f(x_i) + \varepsilon_i \] where \(i\) is an index for the winter, \(y_i\) is the duration for the \(i\)th winter in days, \(x_i\) is the first year of the \(i\)th winter, \(f(x)\) is the function which represents the expected duration in a given year as a characteristic of the climate, and \(\varepsilon_i\) is a random annual deviation from the trend in the \(i\)th year.

A simple model for \(f\) would be a straight line, but the smooth curve shown in the graph suggests that a straight line may not adequately capture the actual changes in the trend over this time period. The data may be summarized as an estimate \(\hat{f}\) of the mean value of \(y\) due to climate, together with deviations from this mean. Each data point \(y_i\) has a corresponding fitted value, \(\hat{y}_i = \hat{f}(x_i)\), and residual, \(y_i - \hat{y}_i\), the difference between the observed and fitted values. The following table shows data from the first few winters and associated fitted values and residuals.

## # A tibble: 6 × 5
##   winter  year1 duration fitted residuals
##   <chr>   <dbl>    <dbl>  <dbl>     <dbl>
## 1 1855-56  1855      118   125.     -7.00
## 2 1856-57  1856      151   124.     26.6 
## 3 1857-58  1857      121   124.     -2.77
## 4 1858-59  1858       96   123.    -27.2 
## 5 1859-60  1859      110   123.    -12.6 
## 6 1860-61  1860      117   122.     -4.99

Note that most of the first few observed durations are below the trend line, but the second is nearly four weeks longer than the trend predicts: its residual is \(+26.6\), whereas the residuals of the other early observations are negative.
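The relationship between observed values, fitted values, and residuals can be checked directly on the rows of the table. The sketch below copies the durations and residuals as printed (rounded to three significant digits) and recovers the fitted values. It does not refit the trend, and the column name weeks_from_trend is introduced only for illustration.

```r
library(dplyr)
library(tibble)

# Durations and residuals copied from the table above.
early <- tibble(
  winter = c("1855-56", "1856-57", "1857-58", "1858-59", "1859-60", "1860-61"),
  duration = c(118, 151, 121, 96, 110, 117),
  residuals = c(-7.00, 26.6, -2.77, -27.2, -12.6, -4.99)
)

checked <- early %>%
  mutate(
    fitted = duration - residuals,    # because residual = observed - fitted
    weeks_from_trend = residuals / 7  # residual expressed in weeks
  )
checked
```

Only the second winter has a positive residual, and dividing 26.6 days by 7 gives about 3.8 weeks above the trend.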

3.3.4 Residuals

Whereas the smooth curve describes the estimated trend due to the climate, the residuals reflect the observed deviations from the trend associated with the weather. We can summarize these residuals both graphically and numerically. Understanding the nature of these deviations is very important when fitting and interpreting statistical models, as well as in the more formal analysis that leads to statistical inference. The last portion of the course will delve into these ideas.

A density plot is an effective way to show the distribution of a collection of numerical values.

The graph shows an approximately bell-shaped distribution, though perhaps with some asymmetry: extreme low (negative) residuals appear to be more common than extreme positive residuals.
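A density plot of residuals might be drawn with ggplot2 as sketched below. The helper plot_residual_density() is hypothetical, and the six residuals are those printed in the table of early winters, used only so that the example runs on its own. The graph described in the text is based on the residuals from all winters.

```r
library(ggplot2)
library(tibble)

# Hypothetical helper: density of the residuals with a reference line at zero.
plot_residual_density <- function(df) {
  ggplot(df, aes(x = residuals)) +
    geom_density() +
    geom_vline(xintercept = 0, linetype = "dashed") +
    xlab("Residual (days)") +
    ylab("Density")
}

# Residuals of the first six winters, copied from the earlier table.
early <- tibble(residuals = c(-7.00, 26.6, -2.77, -27.2, -12.6, -4.99))
p <- plot_residual_density(early)
```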

3.3.4.1 Standard Deviation

In the statistical model, the mean residual is zero with the distribution of observed residuals centered at or near this value. The mean is the sum divided by the number of observations,

\[ \bar{x} = \frac{ \sum_{i=1}^n x_i }{n} \]

A statistic called the sample standard deviation is a summary of the amount of variation in these sampled residual values. The sample standard deviation is (almost) the square root of the average squared deviation of the observations from their sample mean, the “almost” arising by dividing by \(n-1\) instead of \(n\) for reasons explained in a mathematical statistics course, where \(n=166\) is the sample size in the current example.

The formula for the sample standard deviation of a sample \(x_1,\ldots,x_n\) is \[ s = \sqrt{ \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}} \] where \(\bar{x}\) represents the sample mean.

The sample mean is near zero, as expected in the model description. The standard deviation value of 16.7 indicates that it is not at all unusual for the actual duration that Lake Mendota is closed with ice to vary from the norm by about two and a half weeks. One use of this measure of variability is in formal statistical methods to quantify uncertainty in estimates such as the rate at which the duration of being closed by ice is changing each year. Another use is quantifying the uncertainty in predictions, such as the total duration that Lake Mendota will be closed with ice in a future year or the dates when the lake will first freeze or last thaw.
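The two formulas can be applied by hand and compared with the built-in R functions mean() and sd(). The sketch below uses only the six residuals printed in the earlier table, so its results differ from the values for the full set of 166 residuals quoted above.

```r
# Residuals of the first six winters, copied from the earlier table.
x <- c(-7.00, 26.6, -2.77, -27.2, -12.6, -4.99)
n <- length(x)

# Sample mean: the sum divided by the number of observations.
x_bar <- sum(x) / n

# Sample standard deviation: square root of the sum of squared deviations
# from the mean, divided by n - 1.
s <- sqrt(sum((x - x_bar)^2) / (n - 1))

# Both comparisons with the built-in functions should return TRUE.
all.equal(x_bar, mean(x))
all.equal(s, sd(x))
```

The built-in sd() also divides by \(n-1\), which is why the hand calculation agrees with it.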

3.4 The Journey Ahead

The remainder of these course notes will provide detailed descriptions of the principles, methods, and code needed to conduct this analysis and similar ones with several different case studies. Some chapters will describe in detail a single tidyverse package for accomplishing a collection of tasks, such as data visualization or data transformation. Other chapters will introduce new case studies. Later chapters will describe statistical concepts and methods of formal statistical inference.

The student who works through all of the examples in these course notes and masters the concepts and methods will have taken an important step forward in achieving the knowledge and skills of data science.