---
title: "Is 2016 the Hottest Year on Record?"
author: |
  | Catharine M. Wyss
date: "February 17, 2017"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(ggplot2)
library(gridExtra)
```
## Abstract
That our climate is changing seems unequivocal. Data sets such as NOAA's GHCN [[1]] comprise up-to-date records of temperatures from stations all over the world. Using a recent version of the GHCN data, I reproduced a target graph from NOAA's Climate at a Glance webpage showing global land temperature anomalies for the period 1901-2016. To obtain the NOAA graph, I had to use the quality control adjusted (QCA) data and apply two non-trivial transformations. First, stations were grouped into 5x5 grid boxes and an average of the stations in each box was taken to represent that box (gridding). Second, station observations were recast with respect to a baseline obtained from the 1961-1990 period (baselining). Both gridding and baselining employed seemingly arbitrary parameterization. In contrast, I modify NOAA's techniques and employ an adaptive climate-based gridding and a station-relative baselining, approaches I believe to be more defensible as "natural" transformations. The final result respects the original QCA data while retaining key aspects of the NOAA analysis. A warming trend is clearly indicated, in spite of a late 20th/early 21st century slowdown. 2016 is the warmest year after these transformations, reflecting the larger change in climate.
## Previously...
I set out to reproduce a graph (next page) from the National Oceanic and Atmospheric Administration's (NOAA's) website [(climate at a glance)][1].
Fortunately, NOAA makes their data available [(ftp to get GHCN data)][3] [[1]]. My first foray into climatology used the "quality control unadjusted" (QCU, or raw) data. All I did was remove the observations that had been marked as bogus in the data. I thought that a minimum of processing would better reflect what the actual data had to say. Note: all temperatures are reported in degrees Celsius ($^{\circ}$C).
Each observation in the data is a list of monthly average temperatures for a given year and weather station. I averaged over these to come up with yearly averages for each station. Then, I averaged over these for each year to come up with overall yearly averages. So, the temperatures reported are actually averages, of averages, of averages. Since the NOAA figure plots not temperatures, but "anomalies" (which are differences from a pre-determined baseline), I did the same. [On their site][4], they seem to indicate their baseline is the 1901-2000 average temperature, which I calculated myself from the QCU data (I was unable to match their value). To reproduce NOAA's chart, I came up with my own graph (next page).
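For reference, here is a minimal sketch of that earlier pipeline, left unevaluated; the QCU file name is an assumption (chosen to mirror the QCA file read later in this document), and the monthly column layout matches the one used in the QCA chunk below.
```{r, eval=FALSE}
# sketch of the previous QCU analysis (file name assumed; columns as in the QCA file below)
qcu <- read.csv('tavg_qcu_final.csv')
# yearly average for each station, from the monthly averages
qcu <- mutate(qcu,
              MEAN = (JAN + FEB + MAR + APR + MAY + JUN + JUL + AUG + SEP + OCT + NOV + DEC)/12)
# overall yearly averages, then differences from the 1901-2000 average
yrly_qcu <- qcu %>% group_by(YEAR) %>% summarize(AVG = mean(MEAN))
baseline_qcu <- mean(filter(yrly_qcu, (YEAR >= 1901) & (YEAR <= 2000))$AVG)
yrly_qcu <- mutate(yrly_qcu, DIFF = AVG - baseline_qcu)
```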
Things to note:
* In my graph, the "pause" in global warming that was said to occur at the beginning of the 21st century is clear. In NOAA's, it is gone.
* In my graph, 1998 is the hottest year on record. In NOAA's, it is 2016.
I was confused by this. Why was my graph so different? I read that NOAA applied some transformations to the data, but surely their transformations would preserve the signals in the original data, right? I sent a preliminary document and an email to NOAA and asked what went wrong.
Let me be clear from the outset: the warming trend is clear in all my investigations; I am not debating this. What concerns me is that the pause disappeared, and that the magnitude of recent temperature increases seems to have been exaggerated with respect to the original data.
![Global Land Temperature Anomalies According to NOAA, Created January 2017.][2]
![Global Land Temperature Anomalies from the QCU Data (my previous analysis)][5]
## NOAA's Response
One of the scientists at NOAA forwarded along my email and eventually it arrived at the desk of the Chief of their Monitoring Branch, who responded with a detailed critique. The tl;dr is that they performed four transformations on the data whereas I didn't (by design).
* **The Grid Transformation.** Since stations are not uniformly distributed over the globe, a 5x5 grid was created and station observations within each box were averaged to come up with a single value representing that box. The justification given for this was that it mitigates the effect of oversampling/overweighting the European and US (i.e., denser) data.
* **The Anomaly Transformation.** "When using the anomaly method, we first subtract a given month's observed value at a station from a baseline average (sometimes called 'normal'). We use the period 1961-90 to establish that baseline, in order to maximize the number of these 'normals' available... Using anomalies is more robust because it is not as prone to wild month to month swings... Temperature anomalies are also more strongly spatially correlated..."
* **The QCA Transformation.**
"The 'QCA' files have been adjusted to account for known inhomogeneities in the station records." These include pairwise-homogenity adjustments to account for changes in time of observation, instrumentation, location, or environment.
* **The Merged Dataset.**
NOAA uses the GHCN data merged with an ocean temperature data set, "which has methods to help determine temperature anomalies in some data sparse regions near the oceans."
I decided to look into the data transformations he was referring to and see what I could reproduce. I didn't attempt to merge the GHCN data with another data set giving ocean temperatures, restricting my attention to their land results.
## What's the Difference?
What, exactly, is the difference between what I did previously and what NOAA did? Consider the three transformations.
* **No Gridding.** I included all the stations in the QCU data when averaging to find yearly temperatures. I did not account for density of measurements with respect to spatial extent.
* **A Difference Transformation.** NOAA uses what I'll call the "anomaly transformation method." In this method, a baseline is calculated from each station's own data and subtracted from that station's measurements. NOAA chose the period 1961-1990 for their baseline, and there are consequences to this choice. Instead, I used what I'll call the "difference transformation method." In my method, the overall yearly temperatures were differenced against a baseline calculated from the cumulative data. I used the period 1901-2000 (the 20th century) as my baseline. The main change is whether you baseline first, then average (NOAA), or average first, then baseline (me); see the sketch after this list.
* **The QCU Data.** NOAA has painstakingly quality-adjusted the data to produce the QCA files, accounting for things like a station changing instrumentation, location, or environment. I had used the quality control unadjusted (QCU) data.
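To make the distinction between the two baselining orders concrete, here is a toy sketch; the long-format data frame `obs` and its columns (`stnid`, `year`, `temp`) are hypothetical, standing in for per-station yearly means.
```{r, eval=FALSE}
# difference method: average all stations first, then subtract one pooled 1901-2000 baseline
yearly <- obs %>% group_by(year) %>% summarize(avg = mean(temp))
baseline_all <- mean(filter(yearly, year >= 1901 & year <= 2000)$avg)
diff_method <- mutate(yearly, anom = avg - baseline_all)

# anomaly method: subtract each station's own 1961-1990 baseline first, then average
anom_method <- obs %>%
  group_by(stnid) %>%
  mutate(baseline = mean(temp[year >= 1961 & year <= 1990])) %>%  # NaN if the station has no 1961-1990 data
  ungroup() %>%
  filter(!is.nan(baseline)) %>%                                   # such stations drop out of the analysis
  group_by(year) %>%
  summarize(anom = mean(temp - baseline))
```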
Now, I'll take each of these one by one and incorporate them into a more refined analysis I can really get behind.
## The QCA Transformation
Other than removing observations known to be faulty, the quality control adjustment appears to be a process of automatically detecting and eliminating "changepoints" in the temperature data. One big question is what method was actually used (I was directed to [a general paper][6] [[2]] describing the process, not a detailed description of NOAA's actual implementation). The other big question concerns the "reference" series that changepoint-detection methods rely on: it would be interesting to know what was used for the NOAA data. I can imagine it might have been a process of homogenizing a station's observations with respect to near neighbors, but this would only work for densely situated stations. Without knowing what was actually done, I can't speculate on the impact.
I decided to play the game and use the QCA data as instructed. I calculated the averages and plotted the same graph I had done for the QCU (unadjusted) data. The plot uses my difference method to represent variation from the baseline. My baseline period is the 20th century (1901 through 2000).
The pause is still there, only it is less obvious: there are now temperature increases for 2000 through 2011, whereas before these were smaller (or negative). What caused this? I suspect it is the choice of the baseline period. (More on that in the anomaly transformation section, below.)
1998 is clearly still the hottest year, by a fair margin. In other words, the quality control adjustment alone isn't responsible for NOAA's results.
\vspace*{1cm}
```{r, cache=TRUE}
# read in QCA data
data <- read.csv('tavg_qca_final.csv')
# find yearly averages for each station
data <- mutate(data,
               MEAN = (JAN + FEB + MAR + APR + MAY + JUN + JUL + AUG + SEP + OCT + NOV + DEC)/12)
# calculate baseline
baseline <- mean(filter(data, (YEAR >= 1901) & (YEAR <= 2000))$MEAN)
# calculate average temperature by year, then plot differences with baseline
yrly_avgs <- data %>% group_by(YEAR) %>% summarize(AVG = mean(MEAN))
yrly_avgs %>% mutate(DIFF = AVG - baseline) %>%
  ggplot(aes(x = YEAR, y = DIFF)) +
  geom_bar(stat = "identity", color = "#377eb8") +
  geom_smooth(method = "lm", se = TRUE, color = "#e41a1c") +
  labs(title = "Yearly Global Land Temperatures, QCA Data",
       x = "Year",
       y = "Difference from 1901-2000 Avg, °C")
```
## The Grid Transformation
How much of a problem is the heterogeneous distribution of stations over the globe? It is obvious from the map (next page) that stations are heavily weighted toward the US, Western Europe, Australia, and (to a lesser extent) China. The map also shows the 5x5 grid which probably formed the basis of NOAA's averaging.
![Worldwide Weather Stations with 5x5 Grid][14]
To reproduce the effect of NOAA's grid transformation, I first created a global 5x5 grid in [QGis][7]. I exported it as a shapefile and used Python's [Basemap package][8] to translate the polygons into [WKT][9] (Well-Known Text) format so I could easily import them into [PostgreSQL][10]. I then loaded the station locations (in latitude and longitude) and matched them up with grid boxes. I do all my GIS processing in PostgreSQL because it is fast and easy with the database extension [PostGIS][11].
\vspace*{1cm}
```{sql, eval=FALSE}
-- find grid box for each station
update stations as s set gridid = g.gridid
from grid as g
where ST_Contains(g.poly_geom, s.point_geom); -- PostGIS geometric containment function
```
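Without PostGIS, the box assignment can also be sketched directly in R, since a regular latitude/longitude grid is just integer division of the coordinates; the `stations` data frame and its `latitude`/`longitude` column names are assumptions here. Making the cell size a parameter also makes it easy to try the 2x2 and 10x10 variants mentioned later.
```{r, eval=FALSE}
# assign each station to a cell of a regular lat/lon grid
# (cell_deg = 5 corresponds to the 5x5 boxes; 2 or 10 give the alternative grids)
assign_grid <- function(stations, cell_deg = 5) {
  stations %>%
    mutate(gridid = paste(floor(latitude / cell_deg),
                          floor(longitude / cell_deg),
                          sep = "_"))
}
```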
\vspace*{1cm}
After determining which grid box each station was in, I averaged temperatures over grid boxes, imported the result into R, and graphed the difference with the 1901-2000 period baseline (note I recalculated the baseline according to the new grid-based temperatures).
\vspace*{1cm}
```{sql, eval=FALSE}
-- average temperatures over grid boxes
insert into yrly_grid_avgs
  (select tmp.year, avg(tmp.mean)
   from (select t.year as year, s.gridid as box, avg(t.mean) as mean
         from tavg as t, stations as s
         where t.stnid = s.stnid
         group by t.year, s.gridid) as tmp
   group by tmp.year);
```
\vspace*{0.5cm}
```{r}
# read yearly gridded averages
yrly_grid_avgs <- read.csv('yrly_grid_avgs.csv')
# calculate baseline
baseline_grid <- mean(filter(yrly_grid_avgs, (year >= 1901) & (year <= 2000))$mean)
# graph difference with baseline
yrly_grid_avgs %>% mutate(diff = mean - baseline_grid) %>%
  ggplot(aes(x = year, y = diff)) +
  geom_bar(stat = "identity", color = "#377eb8") +
  geom_smooth(method = "lm", se = TRUE, color = "#e41a1c") +
  labs(title = "Yearly Global Land Temperatures, 5x5 Global Grid",
       x = "Year",
       y = "Difference from 1901-2000 Avg, °C")
```
Interesting. The magnitudes of temperature increases have risen. Some years have flipped from increasing to decreasing or vice-versa. Between 1900 and 1950 there were previously decreases, whereas these are now largely increases. The pause is further disturbed, as magnitudes of increases have dramatically risen in the early 21st century period (save for stubborn decreases in 2001, 2003, 2010, and 2011). More startlingly, the warming trend has diminished greatly, from just over 1$^{\circ}$C throughout the 20th century to less than 0.5$^{\circ}$C.
What happened? The grid transformation had the effect not only of mitigating oversampling, but of amplifying undersampling. In other words, the contribution of one lone station in the middle of Antarctica is now worth as much as all of the USA stations in one 5x5 box. Any small increase (or decrease) in the observed temperatures at such isolated stations is now strongly reflected. Is this reasonable? Is it truer to actuality? The "truth" is probably somewhere in between. The choice of 5x5 grid boxes seems arbitrary. Why not 3x3 or 10x10? Or 2x2? The smaller the grid box, the less overweighting of undersamples and the more overweighting of oversamples. The larger the grid box, the less overweighting of oversamples but the more overweighting of undersamples.
At the equator, there are roughly 70 miles per degree (latitude or longitude). That means a grid box 5 degrees wide by 5 degrees long encompasses 122,500 square miles! It seems like that is too large an area for one measurement. Even 70 miles seems too large. 70 miles is the distance from Reno (semi-arid climate, average yearly low 4$^{\circ}$C, average yearly high 20$^{\circ}$C) through the Donner Pass (7,056 ft above sea level, continental climate, average yearly low -2$^{\circ}$C, average yearly high 15$^{\circ}$C) to Dutch Flat (Mediterranean climate, average yearly low 10$^{\circ}$C, average yearly high 23$^{\circ}$C).
I get what NOAA is saying. We don't have temperature sensors laid out across the globe at 10 mile intervals, and the situation was worse historically. They don't want to unduly preference high-density areas. But high-density, modernized sensors are going to be more accurate, so surely if you preference any measurements, you would want to preference these as opposed to isolated, uncertain observations? I tried a 2x2 grid and the resulting graph looked much more similar to the base QCA data. I tried a 10x10 grid and the changes in the 5x5 grid became much more pronounced.
In general, it would be better to use an adaptive approach. In places like Rocky Mountain passes, where there is a wide variety of climate in a small area, a smaller gridding scheme should be used. In vast deserts like Antarctica, where the climate is relatively uniform, the grid boxes could be larger. The climates of Earth are relatively well-studied (and delineated) and could provide a reference for an adaptive gridding approach. I decided to go ahead and do this. I obtained a shapefile of the Koeppen-Geiger climate zones from [scientists at the University of Vienna][15]. Each climate zone is a collection of polygons. Note that where climates are close together, there are many small polygons, whereas areas dominated by a single climate are covered by a single large polygon. This is effectively "adaptive gridding."
![World Climate Zones, Koeppen-Geiger Demarcation][16]
```{sql, eval=FALSE}
-- first load climate zones and create geometric objects
create table climates (gridid integer, poly varchar(10000), zone_numcode integer,
                       zone_code varchar(5), climate varchar(25), precip varchar(25),
                       temp varchar(25), poly_geom geometry);
-- note: \copy is a psql meta-command and must appear on a single line
\copy climates(gridid, poly, zone_numcode, zone_code, climate, precip, temp) from 'climate_zones.csv' csv header;
update climates set poly_geom = ST_GeomFromText(poly);
-- next, find zone containing stations
alter table stations add column climate_gridid integer;
update stations set climate_gridid = c.gridid
from climates as c
where ST_Contains(c.poly_geom, stations.point_geom);
-- average temperatures over climate grid and export for graphing
create table yrly_climate_avgs (year integer, mean float8);
insert into yrly_climate_avgs
  (select tmp.year, avg(tmp.mean)
   from (select t.year as year, s.climate_gridid as box, avg(t.mean) as mean
         from tavg as t, stations as s
         where t.stnid = s.stnid
         group by t.year, s.climate_gridid) as tmp
   group by tmp.year);
\copy yrly_climate_avgs to 'yrly_climate_avgs.csv' csv header;
```
```{r, message=FALSE}
yrly_climate_avgs <- read.csv('yrly_climate_avgs.csv')
baseline_climate <- mean(filter(yrly_climate_avgs, (year >= 1901) & (year <= 2000))$mean)
yrly_climate_avgs %>% mutate(diff = mean - baseline_climate) %>%
  ggplot(aes(x = year, y = diff)) +
  geom_bar(stat = "identity", color = "#377eb8") +
  geom_smooth(method = "lm", se = TRUE, color = "#e41a1c") +
  labs(title = "Yearly Global Temperatures,\nAdaptive Climate-Based Gridding",
       x = "Year",
       y = "Difference from 1901-2000 Avg, °C")
```
I believe this gives the best of both worlds. Gentle climates are more heavily populated (thus more densely monitored) but also tend to be more interspersed, so the grid boxes in those areas are small. Extreme climates are less populated (so more sparsely monitored) but also tend to be fairly uniform over wide areas, so the grid boxes in those areas are large. The resulting graph seems to be a good compromise between preserving the signals in the original QCA data yet averaging over densely packed stations, which was NOAA's stated goal.
## The Anomaly Transformation
To reproduce the effect of NOAA's anomaly transformation, I calculated a yearly baseline from the years 1961-1990 for each station. The fact that the baseline is computed on a station-by-station basis is what distinguishes this approach from the difference method I have applied in the figures up to this point. (NOAA seems to have done this on a monthly basis whereas I do it on a yearly basis; it amounts to roughly the same thing.)
Note that any stations which do not appear during the 1961-1990 period are ignored. This means, for example, that stations added after 1990 are not included in the analysis. This seems incredible, but I do not see how NOAA could be including these while using a 1961-1990 baseline.
```{sql, eval=FALSE}
-- first, add column baseline to tavg that will contain the baseline for each station
alter table tavg add column baseline float8;
-- then create tmp relation and fill it with the baselines for each station in tavg
-- note that this will ignore stations not appearing in 1961-1990
create table tmp (stnid float8, baseline float8, cnt integer);
insert into tmp (select stnid, avg(mean), count(*)
                 from tavg
                 where year >= 1961 and
                       year <= 1990
                 group by stnid);
-- out of curiosity, I explored the number of stations the baseline is based on
-- mean: 20
select avg(cnt) from tmp;
-- std dev: 8
select stddev(cnt) from tmp;
-- min: 1
select min(cnt) from tmp;
-- max: 52
select max(cnt) from tmp;
-- also, how many stations will now contribute to the yearly averages?
-- count: 6719/7280
select count(*) from tmp;
-- now transfer those baselines back to tavg
update tavg set baseline = tmp.baseline from tmp where tmp.stnid = tavg.stnid;
-- now incorporate baseline into column "anom"
-- note that after this anom has 9,206 null values (corresponding to null baselines)
-- so that's how many observations we're ignoring by doing this transformation
-- not a lot compared to the original 328,182 observations
alter table tavg add column anom float8;
update tavg set anom = mean - baseline;
```
I exported the yearly averages calculated using the new "anom" figures, then loaded this file into R and graphed the results. Note that my yearly averages in this figure are _not_ averaged over grid boxes; rather this figure reflects only the effect of the anomaly transformation on the QCA data.
```{r}
yrly_anoms <- read.csv('yrly_anoms.csv')
ggplot(yrly_anoms, aes(x = year, y = anom)) +
  geom_bar(stat = "identity", color = "#377eb8") +
  geom_smooth(method = "lm", se = TRUE, color = "#e41a1c") +
  labs(title = "Yearly Global Land Temperature Anomalies, 1961-1990 baseline",
       x = "Year",
       y = "Anomaly")
```
Interesting. The warming trend is back, and a bit greater in magnitude than before (about 1.5$^{\circ}$C versus 1$^{\circ}$C for the QCA data alone). The pause is stubbornly hanging on, but has been greatly reduced by increases in positive values. In fact, if a person saw only this graph, they might not think there was a pause at all, only a couple of years of reduced warming. Also, the ranking has somehow flipped: 1998 is no longer the warmest year, 2016 is. How did this happen?
It turns out the result we get from the anomaly transformation greatly (obviously) depends on the range chosen as the baseline. NOAA chose 1961-1990. Why? Ostensibly, "in order to maximize the number of these 'normals' available." Is that really the case? Here is a graph of the number of stations reporting for each year from 1901 through 2016.
```{r}
data %>% group_by(YEAR) %>% summarize(numstns = n()) %>%
  ggplot(aes(x = YEAR, y = numstns)) +
  geom_bar(stat = "identity", color = "#377eb8") +
  geom_hline(yintercept = 4000, color = "#e41a1c") +
  labs(title = "Number of Stations in GHCN QCA Data, by Year", x = "Year", y = "Number of Stations")
```
The average number of stations reporting is 2928 with a standard deviation of 1144. A "natural" cutoff point seems to be Number of Stations > 4000. This does indeed delineate a 30-year period, but it is _not_ 1961-1990. It is 1954-1983. What happens when this period is chosen as the baseline period instead? The result is very similar (nearly identical) to the previous graph. So why did NOAA choose 1961-1990? I can't answer that, but I can show what happens if other baseline periods are chosen. One would think that choosing more recent baselines would be a good idea, considering that more recent observations are more likely to be correct and precise. But this can give dramatically different results.
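As a quick check of that window, the per-year counts computed above can be filtered directly (left unevaluated here; per the plot, the run of years with more than 4000 stations should come out as 1954-1983).
```{r, eval=FALSE}
# years with more than 4000 reporting stations, and the range they span
stn_counts <- data %>% group_by(YEAR) %>% summarize(numstns = n())
range(filter(stn_counts, numstns > 4000)$YEAR)
```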
```{r}
yrly_anom_baselines <- read.csv('yrly_anom_baselines4.csv')
p1 <- ggplot(yrly_anom_baselines, aes(x = year, y = anoma)) +
  geom_bar(stat = "identity", color = "#377eb8") +
  geom_smooth(method = "lm", se = TRUE, color = "#e41a1c") +
  labs(title = "Baseline 1901-2000",
       x = "Year",
       y = "Anomaly") + ylim(-2, 2)
p2 <- ggplot(yrly_anom_baselines, aes(x = year, y = anomb)) +
  geom_bar(stat = "identity", color = "#377eb8") +
  geom_smooth(method = "lm", se = TRUE, color = "#e41a1c") +
  labs(title = "Baseline 1991-2000",
       x = "Year",
       y = "Anomaly") + ylim(-2, 2)
p3 <- ggplot(yrly_anom_baselines, aes(x = year, y = anomc)) +
  geom_bar(stat = "identity", color = "#377eb8") +
  geom_smooth(method = "lm", se = TRUE, color = "#e41a1c") +
  labs(title = "Baseline 1981-2010",
       x = "Year",
       y = "Anomaly") + ylim(-2, 2)
p4 <- ggplot(yrly_anom_baselines, aes(x = year, y = anomd)) +
  geom_bar(stat = "identity", color = "#377eb8") +
  geom_smooth(method = "lm", se = TRUE, color = "#e41a1c") +
  labs(title = "Baseline 2001-2010",
       x = "Year",
       y = "Anomaly") + ylim(-2, 2)
grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)
```
Generally these figures seem increasingly less dire than NOAA's; the rise in anomaly almost seems "natural" in the last graph, since most of the 20th century was colder than the baseline. The fact I keep returning to, which is at the heart of my critique, is that the shape of the resulting graph greatly depends on the choice of baseline period. Thus, the baseline period needs to be carefully justified, beyond "more stations were involved." Leaving out stations not appearing during the baseline period also seems unsatisfying.
If I had to choose a baseline period, I would choose an adaptive strategy whereby each station's baseline is the average of its observations over the time period in which it's defined. The assumption is that a station's measurements are simply varying around a mean which evolves as time goes by. This approach would include all stations while at the same time dampening volatility and maintaining spatial correlation.
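A minimal dplyr sketch of this station-relative baselining, assuming an R copy of the SQL table `tavg` with columns `stnid`, `year`, and `mean` (the yearly station means):
```{r, eval=FALSE}
# each station's baseline is its own lifetime average; anomalies are deviations from it
# (base::mean is used explicitly because the temperature column is itself named "mean")
yrly_anom_varying_sketch <- tavg %>%
  group_by(stnid) %>%
  mutate(baseline = base::mean(mean),
         anom = mean - baseline) %>%
  ungroup() %>%
  group_by(year) %>%
  summarize(anom = base::mean(anom))
```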
```{r}
yrly_anom_varying <- read.csv('yrly_anom_varying.csv')
ggplot(yrly_anom_varying, aes(x = year, y = anom)) +
  geom_bar(stat = "identity", color = "#377eb8") +
  geom_smooth(method = "lm", se = TRUE, color = "#e41a1c") +
  labs(title = "Yearly Global Land Temperature Anomalies,\nBaseline Varies with Station",
       x = "Year",
       y = "Anomaly")
```
It is clearer to me what is going on in this figure than when using the fixed baseline approach. 1998 is no longer the hottest year because the stations that recorded higher temperatures tended to run hot anyway. In contrast, the 2016 value is more reflective of an actual anomaly in the linguistic sense of the term: numerous stations recorded temperatures genuinely higher than their usual recordings (i.e. anomalous temperatures). The warming pause looks more like a "slowdown" than a pause, because the temperature recordings behind it, while higher, were closer to the stations' usual values. Using station-by-station baselines seems more natural than fixing an artificial period such as 1961-1990 or 1954-1983, which may itself be warmer or colder, thus confounding the interpretation.
## Reproducing NOAA's Figure
By combining the 1961-1990 anomaly transformation and the 5x5 grid transformation, I obtained the following plot. (NOAA's plot is repeated for comparison.)
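The combination itself was done in PostgreSQL; a dplyr sketch of the same steps, assuming R copies of the SQL tables `tavg` (with `stnid`, `year`, and the 1961-1990 `anom`) and `stations` (with `stnid`, `gridid`), would look roughly like this:
```{r, eval=FALSE}
yrly_grid_anoms_sketch <- tavg %>%
  inner_join(stations, by = "stnid") %>%
  filter(!is.na(anom)) %>%            # stations with no 1961-1990 baseline drop out
  group_by(year, gridid) %>%
  summarize(anom = mean(anom)) %>%    # first average the anomalies within each grid box
  group_by(year) %>%
  summarize(anom = mean(anom))        # then average the boxes for each year
```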
```{r}
yrly_grid_anoms <- read.csv('yrly_grid_anoms.csv')
ggplot(yrly_grid_anoms, aes(x = year, y = anom)) +
  geom_bar(stat = "identity", color = "#377eb8") +
  geom_smooth(method = "lm", se = TRUE, color = "#e41a1c") +
  labs(title = "Average Yearly Temperatures, 5x5 grid and 1961-1990 baseline",
       x = "Year",
       y = "Anomaly")
```
![Graph from NOAA's Climate at a Glance Page][2]
\newpage
The two plots look almost identical. At least I was able to reproduce what NOAA did. Is it reasonable? I personally feel like neither the 5x5 grid transformation nor the 1961-1990 anomaly transformation is what I would have chosen. The end result speaks for itself. The warming pause has completely disappeared, and the relative warmth of the late 90s has also disappeared. The plot shows a uniform rise in the anomaly from at least 2000 onward. The warming trend shown is anywhere from 1.1$^{\circ}$C to 1.3$^{\circ}$C per century. Important signals from the original data have been all but lost.
## So is 2016 the Hottest Year on Record? Was there a pause? How much do we panic?
I have argued for an approach that uses adaptive climate gridding and varying baselines for anomalies. What does the composite graph look like?
```{r}
yrly_climate_anoms <- read.csv('yrly_climate_anoms.csv')
ggplot(yrly_climate_anoms, aes(x = year, y = anom)) +
  geom_bar(stat = "identity", color = "#377eb8") +
  geom_smooth(method = "lm", se = TRUE, color = "#e41a1c") +
  labs(title = "Yearly Global Land Temperatures,\nClimate Gridding and Variable Baselines",
       x = "Year",
       y = "Anomaly")
```
The graph is not as unequivocal as NOAA's, yet has similar features. 2016 is indeed the hottest year according to the transformed data. The pause is evident during the late 20th/early 21st centuries, terminating in 3 years of resumed warming. The line indicated is a linear model for the data. A summary of that model contains more precise information.
```{r}
anom_model <- lm(anom ~ year, data = yrly_climate_anoms)
summary(anom_model)
```
The model has high significance (very low p-value) and explains about 63% of the variability in temperature in the data (R-squared = 0.6291). From the slope coefficient and standard error we can compute a confidence interval for the warming. I'll assume the standard 95% confidence interval, for which the critical t-score is 1.98 for our 114 degrees of freedom.
$$
\begin{split}
\text{CI} &= \text{slope} \pm t_{df}^* \times \text{SE}\\
&= 0.0093 \pm 1.98 \times 0.0006688\\
&= 0.0093 \pm 0.001324\\
&= (0.007976, 0.01062)
\end{split}
$$
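The same interval can also be read off directly with `confint()`, which is a convenient cross-check of the hand calculation.
```{r}
# 95% confidence interval for the intercept and slope; the "year" row is the warming rate per year
confint(anom_model, level = 0.95)
```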
All this simply says that, based on the data, I am 95% confident the warming rate is between roughly $0.8^{\circ}$C and $1.1^{\circ}$C per century. If things continue in this fashion (always a big "if"), we will see around $1^{\circ}$C of global warming in the next 100 years.
How bad is that $1^{\circ}$ going to be? It depends on whom you ask. There are plenty of [scaremongers][17] out there, predicting a mere $1^{\circ}$ will topple governments, raise ocean levels, and generally usher in the apocalypse. Keep in mind, though, that there is research suggesting our climate has warmed over $4^{\circ}$ in the past 400 years, and is still not at the peak of its B.C. warmth. The world looks different than it did in 1100 B.C., but the apocalypse hasn't come (or if it has, factors other than temperature were much more at fault). True, deserts are shifting, oceans are rising, plates are moving, volcanoes are erupting. How worried should we be?
I'm convinced (and this hasn't changed during my study of NOAA's data) that we should be exactly as panicked as we need to be to ensure continued innovation and advancement in environment-mitigating technologies. We need earlier warning systems, stronger sea walls, better floodplain habitation policies, and so on. Individuals should be worried enough to conserve and lobby for sensible environmental protections. It simply doesn't make sense to fill the atmosphere with carbon emissions, any more than it made sense to fill the air over our cities with particulate matter. Policy does make a difference to quality of life.
On the other hand, blindly accepting an established line of thinking is unacceptable (my opinions on this are even stronger after analyzing NOAA's data). Sweeping government intervention is both unnecessary and undesirable. Imagine if, during the cooling of the 70s, we had reacted by instituting policies of increased greenhouse gases to counteract it. No, innovation and conservation should be largely decentralized and voluntary, based on education and ability. It is the government's job to facilitate such efforts, not to enshrine a (by nature) uncertain doctrine.
Thanks for reading.
**Note:** All code and data employed in this writeup are available at [https://github.com/datacathy/climate][18].
## References
[[1]] Jay H. Lawrimore, Matthew J. Menne, Byron E. Gleason, Claude N. Williams, David B. Wuertz, Russell S. Vose, and Jared Rennie (2011): Global Historical Climatology Network - Monthly (GHCN-M), Version 3. Average Temperatures. NOAA National Centers for Environmental Information. doi:10.7289/V5X34VDR 01/22/2017.
[[2]] Karl, T.R., C.N. Williams, P.J. Young, and W.M. Wendland (1986), A Model to Estimate the Time of Observation Bias Associated with Monthly Mean Maximum, Minimum and Mean Temperatures for the United States. Journal of Climate and Applied Meteorology, 25 (2), 145-160.
[[3]] Smith, Thomas M. and Richard W. Reynolds (2005), A Global Merged Land-Air-Sea Surface Temperature Reconstruction Based on Historical Observations (1880-1997). Journal of Climate, 15 June 2005.
[1]: https://www.ncdc.noaa.gov/cag/time-series/global/globe/land/ytd/12/1880-2016?trend=true&trend_base=10&firsttrendyear=1901&lasttrendyear=2016
[2]: multigraph_NOAAland.png
[3]: ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/v3/
[4]: https://www.ncdc.noaa.gov/sotc/global/201613
[5]: qcu_analysis.png
[6]: http://journals.ametsoc.org/doi/pdf/10.1175/JCLI3524.1
[7]: http://www.qgis.org/
[8]: http://matplotlib.org/basemap/
[9]: https://en.wikipedia.org/wiki/Well-known_text
[10]: https://www.postgresql.org/
[11]: http://www.postgis.net/
[12]: yrly_anoms.csv
[13]: ClimateMap_World.png
[14]: station_locations_kav7.png
[15]: http://koeppen-geiger.vu-wien.ac.at/shifts.htm
[16]: climate_grid_vibrant.png
[17]: http://globalwarming.berrens.nl/globalwarming.htm
[18]: https://github.com/datacathy/climate