Classify BC BGC zones by Koppen-Geiger

We're using a Jupyter Notebook with Python and Pandas (a data analysis library) to classify British Columbia's BioGeoClimatic (BGC) zones according to the Koppen-Geiger global climate classification system.

This should help us understand which species from elsewhere in the world are climatically suited to invading specific areas of BC.

We'll start by importing Pandas. To be honest, we're barely using it; the Python csv module would do just as well for everything here, but some of the power of Pandas might come in handy in the future.

Pull in the data

Let's open the BioGeoClimatic zone data as a Pandas DataFrame object. The data has been rescued from Microsoft Access via a table export and is in the working directory as a comma-delimited text file called BGC_Units.txt.
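As a minimal sketch of this loading step: the three-row sample below is a hypothetical stand-in for BGC_Units.txt, and its column names (ID, BGC, period, Var) are assumptions rather than the real export's headers.

```python
import io
import pandas as pd

# Hypothetical stand-in for BGC_Units.txt; real column names and values may differ.
sample = io.StringIO(
    "ID,BGC,period,Var,Tave01,Tave07\n"
    "1,CWHvm1,1961 - 1990,Mean,1.2,16.4\n"
    "2,CWHvm1,1961 - 1990,Std,0.4,0.6\n"
    "3,IDFdk3,1961 - 1990,Mean,-8.1,15.0\n"
)

# With the real export this would be: bgc = pd.read_csv("BGC_Units.txt")
bgc = pd.read_csv(sample)
print(bgc.shape)
```

read_csv treats the first line as the header row and splits on commas by default, which is all a comma-delimited table export needs.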

The explanations (helpstrings) of the cryptic column names are in a CSV file called tblTasksFields.csv (which needed a bit of extra massaging after rescuing from MSAccess to remove non-compliant Unicode degree symbols).

Let's make a dictionary with the column names as the keys and the description helpstrings as the values. We've manually identified column 2 as containing the column name and column 5 as containing the explanatory helpstring.
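A rough sketch of building that lookup with the standard csv module follows. The sample rows are invented, and it assumes that "column 2" and "column 5" are zero-based indexes and that the file has no header row; adjust both if the real tblTasksFields.csv differs.

```python
import csv
import io

# Hypothetical stand-in for tblTasksFields.csv (six columns, no header row).
sample = io.StringIO(
    '1,tblBGC,Tave01,Double,8,"January mean temperature (deg C)"\n'
    '2,tblBGC,Tave02,Double,8,"February mean temperature (deg C)"\n'
    '3,tblBGC,MAP,Double,8,"Mean annual precipitation (mm)"\n'
)

NAME_COL = 2  # assumed zero-based position of the dictionary key
HELP_COL = 5  # assumed zero-based position of the helpstring

# With the real file this would be wrapped in: with open("tblTasksFields.csv", newline="") as f:
helpstrings = {row[NAME_COL]: row[HELP_COL] for row in csv.reader(sample)}
print(helpstrings["MAP"])
```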

Let's look at all of the column headers, the type of each column (Pandas infers a datatype for an entire column based on the values it finds in it), and the explanatory helpstrings.
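One way to line up those three pieces of information is sketched below. The tiny DataFrame and the helpstrings dictionary are made-up stand-ins for the real data.

```python
import pandas as pd

# Hypothetical miniature versions of the DataFrame and the helpstring dictionary.
bgc = pd.DataFrame({"ID": [1, 2], "BGC": ["CWHvm1", "IDFdk3"], "Tave01": [1.2, -8.1]})
helpstrings = {"BGC": "Zone/subzone/variant", "Tave01": "January mean temperature"}

# DataFrame.dtypes is a Series indexed by column name, so items() yields (name, dtype) pairs.
summary = [
    (name, str(dtype), helpstrings.get(name, ""))
    for name, dtype in bgc.dtypes.items()
]
for name, dtype, text in summary:
    print(f"{name:10} {dtype:10} {text}")
```

Using helpstrings.get(name, "") keeps the loop from failing on columns that have no description.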

Ok, the first column is an incrementing ID, the following three columns are objects (well, strings). The remaining columns are all numbers, mostly floating-point decimals with a few integers here and there. Note that many seem to consist of a string like Tmax followed by a number between 01 and 12; these represent months.

Let's grab the three string columns and see how many unique values they contain, and what they are. From the previous cell's output, we know that the first is a zone/subzone/variant ID string, the second a range from a start to end year, and the third describes the statistical operation that's generated the actual values in the row.

A Python set will allow us to look at only the unique values, so we can see all of the different zones, date ranges, and statistical operations that the values come from.

Let's look at all of the unique date ranges, statistical outputs, and zone/subzone/variant identifiers in the data:
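A small sketch of the set-based approach, using invented rows and assumed column names (BGC, period, Var) for the three string columns:

```python
import pandas as pd

# Hypothetical rows; the real column names and values may differ.
bgc = pd.DataFrame({
    "BGC": ["CWHvm1", "CWHvm1", "IDFdk3", "IDFdk3"],
    "period": ["1961 - 1990", "1961 - 1990", "1961 - 1990", "1961 - 1990"],
    "Var": ["Mean", "Std", "Mean", "Std"],
})

# A set keeps one copy of each value, so its length is the number of unique values.
unique_values = {col: set(bgc[col]) for col in ["BGC", "period", "Var"]}
for col, values in unique_values.items():
    print(col, len(values), sorted(values))
```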

Right! For the moment, we only want to consider the date range from 1961 to 1990, since we're comparing our results to previous work that used that range.

Furthermore, we only want to use the mean values to key out the Koppen-Geiger classes of the zone/subzone/variants. Someone smarter than us may later wish to do a more sophisticated classification, but not today.

So we're only going to use a small subset of the data to classify the zone/subzone/variants. Let's create that subset now with a filter.
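The filter can be sketched as a boolean mask. The column names and the exact spellings "1961 - 1990" and "Mean" are assumptions; the real values are whatever the unique-value listing shows.

```python
import pandas as pd

# Hypothetical rows standing in for the full dataset.
bgc = pd.DataFrame({
    "BGC": ["CWHvm1", "CWHvm1", "IDFdk3", "IDFdk3", "IDFdk3"],
    "period": ["1961 - 1990", "1961 - 1990", "1961 - 1990", "1961 - 1990", "1990 - 2014"],
    "Var": ["Mean", "Std", "Mean", "Std", "Mean"],
    "Tave01": [1.2, 0.4, -8.1, 0.9, -7.2],
})

# Combine the two conditions with &; each comparison needs its own parentheses.
subset = bgc[(bgc["period"] == "1961 - 1990") & (bgc["Var"] == "Mean")]
print(len(subset), bgc["BGC"].nunique())
```

Comparing the number of rows in the subset with the number of unique identifiers gives a quick check that the filter leaves exactly one row per zone/subzone/variant.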

Nice! Not only is this a much smaller dataset, we can see that it has the same number of rows as there are unique zone/subzone/variants. That's a good sanity check on the filter as well as a bit of reassurance regarding the quality of the dataset.

Using the Koppen-Geiger key to build a classifier

Let's build a function that accepts a row from the dataset as input, and returns a Koppen-Geiger class string if it matches the parameters, or None if not.

First we'll need some helper functions!

Some helper functions

Most of the Koppen-Geiger criteria involve finding minimum and maximum temperature and/or precipitation values across multiple months. For example, the average temperature is found in the dataset in columns with headers labelled Tave01 Tave02...Tave12 for January through December. So we'll create a helper function (twelvemonther) to generate those column names so we can use them as keys to extract the appropriate series.
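A helper along these lines could look like the following. Only the name twelvemonther comes from the text; the signature is a guess.

```python
def twelvemonther(prefix):
    """Return the twelve monthly column names for a prefix, e.g. Tave01 ... Tave12."""
    # :02d zero-pads the month number to two digits.
    return [f"{prefix}{month:02d}" for month in range(1, 13)]

print(twelvemonther("Tave"))
```

The resulting list can be used directly as a key, for example row[twelvemonther("Tave")] on a Pandas row.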

Additionally, the algorithm often wants to compare winter and/or summer maxima and minima (with summer in the Northern Hemisphere defined as April-September and winter as October-March). So we'll create two more helper functions (summer and winter) to generate those column names.
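The two seasonal helpers might be sketched like this. The function names come from the text, the signatures are assumed, and the month ranges follow the Northern Hemisphere definition given above.

```python
def summer(prefix):
    """Column names for April through September."""
    return [f"{prefix}{month:02d}" for month in range(4, 10)]

def winter(prefix):
    """Column names for October through March."""
    return [f"{prefix}{month:02d}" for month in (10, 11, 12, 1, 2, 3)]

print(summer("PPT"))
print(winter("PPT"))
```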

First, second, and third letter functions

Now we'll create some functions for each of the three letters in the Koppen-Geiger system. The rug that ties the room together will come after we sort out the individual letters based on the rules in the key algorithm (which we've copied into comments in the code).

The first letter
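The sketch below shows one possible first-letter function. It is not the notebook's own code: the thresholds follow the commonly cited Peel et al. (2007) formulation of the key, the signature is hypothetical, and testing the arid (B) group before the temperature-based groups is itself an assumption that other versions of the key handle differently.

```python
def first_letter(temps, precips, mat, map_):
    """Return the main Koppen-Geiger group: A, B, C, D or E.

    temps, precips: 12 monthly means, January first (deg C, mm).
    mat, map_: mean annual temperature (deg C) and annual precipitation (mm).
    """
    summer_p = sum(precips[3:9])         # April-September
    winter_p = sum(precips) - summer_p   # October-March

    # Aridity threshold depends on when most of the precipitation falls.
    if winter_p >= 0.7 * map_:
        threshold = 2 * mat
    elif summer_p >= 0.7 * map_:
        threshold = 2 * mat + 28
    else:
        threshold = 2 * mat + 14

    t_hot, t_cold = max(temps), min(temps)
    if map_ < 10 * threshold:
        return "B"   # arid
    if t_cold >= 18:
        return "A"   # tropical
    if t_hot < 10:
        return "E"   # polar
    # Remaining cases have a warmest month of at least 10 deg C.
    return "C" if t_cold > 0 else "D"

coastal_t = [4, 5, 7, 10, 13, 16, 18, 18, 15, 10, 6, 4]   # invented values
coastal_p = [170, 110, 110, 85, 65, 55, 40, 40, 55, 120, 185, 165]
print(first_letter(coastal_t, coastal_p, sum(coastal_t) / 12, sum(coastal_p)))
```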

The second letter

The third letter
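For the C and D groups, the third letter describes summer warmth and winter cold. The sketch below again assumes the Peel et al. (2007) thresholds and a hypothetical signature. It deliberately leaves out the B group (where the third letter depends on mean annual temperature) and returns None for anything other than C or D.

```python
def third_letter(first, temps):
    """Return a, b, c or d for C and D climates, or None for other groups.

    first: the first letter already assigned to this row.
    temps: 12 monthly mean temperatures in deg C.
    """
    if first not in ("C", "D"):
        return None

    t_hot, t_cold = max(temps), min(temps)
    warm_months = sum(1 for t in temps if t > 10)  # months above 10 deg C

    if t_hot >= 22:
        return "a"   # hot summer
    if warm_months >= 4:
        return "b"   # warm summer
    if first == "D" and t_cold < -38:
        return "d"   # very cold winter
    return "c"       # cold summer

print(third_letter("C", [4, 5, 7, 10, 13, 16, 18, 18, 15, 10, 6, 4]))
```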

At long last: the actual classification function!

Now that we have the helpers and the functions to return the three different letters, let's put it together.

We start by using our helper functions to grab 12-month series of temperature and precipitation data, as well as the yearly average temp and precip. Then we call the letter functions using the "monthlies" and "yearlies" as input.
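The data-gathering half of that function might look like the sketch below. The column names PPT01...PPT12, MAT and MAP are assumptions about the dataset, climate_inputs is a hypothetical name, and the row values are invented.

```python
import pandas as pd

def twelvemonther(prefix):
    return [f"{prefix}{month:02d}" for month in range(1, 13)]

def climate_inputs(row):
    """Collect the "monthlies" and "yearlies" the letter functions need."""
    return {
        "temps": [row[col] for col in twelvemonther("Tave")],
        "precips": [row[col] for col in twelvemonther("PPT")],  # assumed column prefix
        "mat": row["MAT"],   # assumed column name
        "map": row["MAP"],   # assumed column name
    }

# Invented row: temperatures 1..12 deg C, precipitation 10..120 mm.
values = {f"Tave{m:02d}": float(m) for m in range(1, 13)}
values.update({f"PPT{m:02d}": 10.0 * m for m in range(1, 13)})
values.update({"MAT": 6.5, "MAP": 780.0})
row = pd.Series(values)

inputs = climate_inputs(row)
print(inputs["mat"], inputs["map"], inputs["temps"][:3])
```

Each letter function can then be called with these values, and the three letters joined into a single class string.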

Ok, let's iterate through all of the filtered zone/subzone/variants and see what our function spits out. We'll call it c1990 for "classification 1990", but we'll also keep a list of the entire output of the classification function, which includes temperature and precipitation data, in another variable called allout1190. We'll print out some of the classification, but only the first 15 lines to avoid many pages of output.

Map it and see if it matches other maps

Now we can add the Koppen-Geiger classification strings as attributes to a GIS file and create a categorized map.

We can compare that map to others generated on a global scale using the K-G categories. The two should roughly match, but the categorization generated from the BGC data should have much higher resolution and be more accurate, since it's derived from far more detailed data than a global dataset could possibly be.

First let's export a Comma Separated Values file (.csv) with the BGC labels in one column and the K-G classifications in the other.
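A sketch of the export, assuming the classifications are held in a dictionary keyed by BGC label. The labels, classes, and the column headers BGC and KG are all illustrative.

```python
import io
import pandas as pd

# Hypothetical classification results keyed by zone/subzone/variant label.
c1990 = {"CWHvm1": "Cfb", "IDFdk3": "Dfb", "BWBSmw": "Dfc"}

table = pd.DataFrame(sorted(c1990.items()), columns=["BGC", "KG"])

# Writing to an in-memory buffer here; pass a filename instead to create the .csv file.
buffer = io.StringIO()
table.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Leaving out the index keeps the file to exactly two columns, which makes the later join on the BGC label in a GIS tool simpler.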

Now we cheat, and load that CSV file into QGIS along with the BEC_BIOGEOCLIMATIC_POLY file from the BC government GIS portal (it would be better to do the mapping right here in the Jupyter notebook to make the whole project more self-contained, but it's faster to start using QGIS).

The next two pages are maps exported from QGIS with the Koppen-Geiger classifications color-coded (randomly; we should probably try to match the colors to some global K-G maps, but for now we just want to see the results to check if they make sense at first blush).

first map

More recent climate data

Now that we've built all of the scaffolding to classify BGC zone/subzone/variants by Koppen-Geiger, we can experiment easily. What would happen if we used the most recent data from ClimateBC, the series from 1990-2014? Let's find out.

Same number of rows. Good news there; we don't want to be classifying zones by an incomplete set of climate numbers.

Ok, let's run this series through our classification function. Any differences? Let's create a comparison and look at them side-by-side (again only 15 lines).

Yes, there are a few differences! Let's isolate and count them.
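Isolating the changes can be sketched as a dictionary comparison. It assumes both classifications are dictionaries keyed by the same labels; c2014 is a hypothetical name and all values are invented.

```python
# Hypothetical results for the two periods.
c1990 = {"CWHvm1": "Cfb", "IDFdk3": "Dfb", "BWBSmw": "Dfc"}
c2014 = {"CWHvm1": "Cfb", "IDFdk3": "Dfb", "BWBSmw": "Dfb"}

# Keep only the units whose class string changed, as (old, new) pairs.
changed = {
    unit: (c1990[unit], c2014[unit])
    for unit in c1990
    if c1990[unit] != c2014[unit]
}

print(len(changed), "of", len(c1990), "units changed")
for unit, (old, new) in sorted(changed.items()):
    print(unit, old, "->", new)
```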

And map the 1990-2014 data, as well as the differences.

second map

Detailed outputs for quality checking

More maps

Now let's map those!

comparison map

detailed map