Boxplots based on raster data in GRASS GIS
Introducing three GRASS GIS addons
Introduction
Boxplots are a great way to get a quick overview of the distribution of your data, to compare the distributions of different data sets, and to identify outliers in your dataset. They are easily created using tabular data as an input. In GRASS GIS for example, you can use the d.vect.colbp addon for that. But what about raster data as input? Wouldn’t it be nice to be able to create raster based boxplots directly in GRASS?
Working on a project1, I needed a quick way to compare the environmental conditions within and between protected areas of the Terai Arc Landscape in Nepal and India (TAL; Figure 1). In addition, I wanted to visualize how environmental conditions varied in time and space within the whole region and within each of the protected areas.
Since environmental data come in raster format, I thought I might as well create an addon that would allow me to create raster based boxplots in GRASS GIS. I actually ended up creating three addons; r.boxplot, r.series.boxplot and t.rast.boxplot. With the first one, you can draw the boxplot of raster values, and optionally pass a zonal raster layer to get per class boxplots. The latter two allow you to draw boxplots of a series of raster layers. The r.series.boxplot addon expects a user-defined list of input rasters. The t.rast.boxplot is meant to visualize time series and requires a space-time raster dataset as input.
r.boxplot
In this post, we will be looking at how the leaf area index (LAI) varies in and between the protected areas of the Terai Arc Landscape (TAL) in Nepal and India. We will also look how the LAI varies both in space and time within the whole TAL region, and within Bardia National Park (the focus of the aforementioned project).
The examples below use the 300 m resolution 10-day averaged LAI raster layers for the period 2014-2021. This data is available on the Copernicus Global Land Service2. The vector layer TALregion, which contains the boundary of the TAL region, and the vector layer TAL_PAS3 with the boundaries of 15 protected areas in the region.
For all examples, I include the code for the command line () as well as the corresponding Python () and R () code. Note that to run the R code, you need to start R from de GRASS GIS console, and the rgrass package needs to be installed.
Example 1
To see how the LAI in the period 1-10 January 2014 was distributed across the different protected areas in the TAL region, we can use the r.boxplot
addon. The addon respects the region settings and, if set, the mask. So first, we make sure the region matches the extent of the TAL region and the resolution of the LAI raster layers. And we create a MASK to ensure only the raster cells that fall within the boundaries of the TAL region are used to create the boxplots4.
Set the region and mask to match the extent and boundary of respectively the TAL region. Set the resolution to match the LAI layers.
To create a figure with a boxplots for each of the protected areas, we first create a zonal map based on the vector layer TAL_PAS, using the values in the column cat as raster ID and the values in the column name as label. Note that the extent and resolution will follow the region settings we defined in the previous step.
Convert the vector layer to raster layer, using the name columns to define the category labels, and the cat columns for the category IDs.
Now, we use r.boxplot
to create the figure with the boxplots. We include the outliers using the -o flag and the median and interquartile range (IQR) using raster_statistics=median,IQR. By using bx_sort=ascending, the boxplots will be sorted in ascending order based on their medians.
Create the boxplots, sorting the protected areas in ascending order, based on their median values. Include the interquartile range and median of the whole region.
The resulting boxplot shows the distribution of the average LAI values for the first 10 days of January 2014 in each of the protected areas. The distribution of LAI values clearly vary between parks, which may be related to climatic or altitude gradients, but also to the ratio of grassland and forest. To only compare grasslands, we would need to mask out the other land cover types. I’ll leave that up as an exercise for later.
The figure suggests that overall, the LAI in the protected areas is higher than outside the parks. This could be further tested by comparing the LAI in and around a specific park. But I’ll leave that to you to try.
Note, to learn more about the format and layout options of r.boxplot, check the manual page
Example 2
What if we are not interested in comparing the LAI values inside and outside the parks, but we are interested in comparing the conditions in each individual park with conditions across all the parks? In that case, we can replace the mask based on the whole region by a MASK based on the vector layer TAL_PAL. This will mask out all areas outside the protected areas.
Mask out all areas outside the protected areas by setting mask using the TAL_PAS vector layer as input.
Create the boxplots, sorting the protected areas in ascending order, based on their median values. Include the interquartile range and median of the whole region.
Plotting the IQR of all the parks together makes it easier to single out the parks that have relatively high or low LAI values. Comparing Figure 3 with Figure 2 shows that the LAI in the parks is higher than outside the parks (2 and 1,5 respectively). This reflects, I presume, the higher forest/tree cover within the parks (I would have expected the difference to be larger).
Example 3
To emphasize how, e.g., Bardia National Park compares to the other parks, we can give the boxplot of Bardia a different colour. To do so, we use the option to colour the boxplots according to the corresponding categories of the zonal raster.
First step is to create the colour rules. Next, we use this as input in the r.colors function to define the colours of the protected areas in the ProtectedAreas raster layer. We set the colour of Bardia (ID=5) to green, and the colour of the other parks to light green5.
Define the colour rules (dark green to Bardia National Park, and light green to the other parks) and implement these colour rules using r.colors.
# Create text file with colour rules
cat > rules.txt << EOF
1 184:220:184
2 184:220:184
3 184:220:184
4 184:220:184
5 23:106:00
6 184:220:184
7 184:220:184
8 184:220:184
9 184:220:184
10 184:220:184
11 184:220:184
12 184:220:184
13 184:220:184
14 184:220:184
15 184:220:184
EOF
# Define the colours of the protected areas
# using the colour rule text file as input
r.colors map=ProtectedAreas rules=rules.txt
# Remove the text file with color rules
rm rules.txt
# Create color rules
color_rules = """ \
1 184:220:184
2 184:220:184
3 184:220:184
4 184:220:184
5 23:106:00
6 184:220:184
7 184:220:184
8 184:220:184
9 184:220:184
10 184:220:184
11 184:220:184
12 184:220:184
13 184:220:184
14 184:220:184
15 184:220:184
nv 255:255:255
default 255:255:255
"""
# Define the colours of the protected areas
Module("r.colors", map="ProtectedAreas",
rules="-", stdin_=color_rules)
# Create color rules
color_rules = c("1 184:220:184",
"2 184:220:184", "3 184:220:184",
"4 184:220:184", "5 23:106:00",
"6 184:220:184", "7 184:220:184",
"8 184:220:184", "9 184:220:184",
"10 184:220:184", "11 184:220:184",
"12 184:220:184", "13 184:220:184",
"14 184:220:184", "15 184:220:184",
"nv 255:255:255", "default 255:255:255")
write(paste(color_rules, collapse="\n"), file="tmp.txt")
# Define the colours of the protected areas
execGRASS("r.colors", map="ProtectedAreas",
rules="tmp.txt")
unlink("tmp.txt")
Create the boxplots, sorting the protected areas in ascending order, based on their median values. Include the interquartile range and median of the whole region. Set the -c flag so the boxplots will be plotted with the same colour as the corresponding parks on the map.
The boxplots in Figure 4 are the same as in Figure 3, except for the colours of the boxplot. The deviating colour of Bardia helps it to stand out. It furthermore allows one to visually link the boxplots and the corresponding areas on the map.
Example 4
In the previous example, we saw how we can include outliers. But that doesn’t tell us where to find those outliers. In this example, we create a map of the outliers We’ll do this for Bardia and Banke national parks.
First, set the region to match the extent of Bardia national park, and mask out all areas outside the protected area.
# Extract/create a vector layer of Bardia
Module("v.extract",
input="TAL_PAS@ProtectedAreas",
where=("name = 'Banke NP'"
"OR name = 'Bardia NP'"),
output="BardiaBanke")
# Set region and mask to Bardia,
Module("g.region", flags="a",
vector="BardiaBanke",
raster="LAI300_20140920")
Module("r.mask", vector="BardiaBanke",
overwrite=True)
# Extract/create a vector layer of Bardia
wh <- "name = 'Banke NP' OR name = 'Bardia NP'"
execGRASSA("v.extract",
input="TAL_PAS@ProtectedAreas",
where= wh, output="BardiaBanke")
# Set region and mask to Bardia,
execGRASS("g.region", flags="a",
vector="BardiaBanke",
raster="LAI300_20140920")
execGRASS("r.mask", vector="BardiaBanke",
flags="overwrite")
Create the boxplots, using the same settings as in example 2. However, this time, create a point layer with the location of the outliers using the map_outliers parameter.
The locations of the outliers are saved as a point layer. If you like, you can convert this to a polygon grid layer. To do so, first convert the point layer to a grid layer, and then convert the raster layer to a polygon layer. This is what I did for Figure 5.
The locations of the outliers are saved as a point layer. If you like, you can convert this to a polygon grid layer. To do so, first convert the point layer to a grid layer, and then convert the raster layer to a polygon layer. This is what I did for Figure 5.
The locations of the outliers are saved as a point layer. If you like, you can convert this to a polygon grid layer. To do so, first convert the point layer to a grid layer, and then convert the raster layer to a polygon layer. This is what I did for Figure 5.
In both Banke and Bardia National Park, we find outliers, representing areas with a very low LAI. As you might have expected, these are the larger rivers and river floodplains.
Not in all parts of the floodplains the LAI values are identified as outliers, though. You can, if you like, change the threshold value below or above which values are considered outliers. You can do this using the range parameter. This parameter determines how far the plot whiskers extend out from the box (default is 1.5 \(\times\) IQR). A smaller range will result in more LAI values being marked as outliers.
r.series.boxplot
So, what are the seasonal changes in the LAI within the TAL region? And how do the differences across seasons compare to the variation in space? We can use the r.series.rast
to draw the boxplot of raster values of the 10-daily LAI values.
Example 5
For example, we can plot the LAI distribution for each 10-day period in a given year, like in the examples below. The boxplot for 2015 were created using the bash command line code below. The 2016 boxplots were created using the Python code.
Set the region to match the extent of the whole TAL region and mask to mask out all areas outside the protected areas.
Get a list with the layer names, and use this as input for r.series.boxplot. Colour the boxplots green, and print the labels vertically, with a font-size of 8.
With the g.list
command, we get a comma separated list with all LAI layers for 2015. We assign this list to the variable A, which we use as input for r.series.boxplot
.
By default, the raster names are used as labels on the x-axis (or y-axis if the plot is rotated). In this case, all raster names start with LAI300_, which is redundant. We can remove that part using the sed
command, and assign the result to the variable B. Now, we can use the text_labels parameters to change the default names, using B as input.
We furthermore set the font size to 8, we change the boxplot colours to green, and we plot the x-axis labels vertically.
With the g.list
command, we get a list with all LAI layers for 2016. Note that the output of the g.list
function, with separator=“,” is a string with the layer names separated by a comma. The string ends with the new line symbol (*). We use string slicing to remove this.
By default, the raster names are used as x-axis labels (or y-axis if the plot is rotated). Because the raster names are rather long, we remove the LAI300_ part, leaving us with a string representing the dates. We use this with the text_labels parameter to define the x-axis labels.
Now, we can run r.series.boxplot
to draw the boxplots. We define the text labels on the x-axis with the text_labels. The other options should speak for themselves, but if not, check the manual page.
With the g.list
command, we get a list with all LAI layers for 2016. Note that the output of the g.list
function, with separator=“,” is a string with the layer names separated by a comma. The string ends with the new line symbol (*). We use string slicing to remove this.
By default, the raster names are used as x-axis labels (or y-axis if the plot is rotated). Because the raster names are rather long, we remove the LAI300_ part, leaving us with a string representing the dates. We use this with the text_labels parameter to define the x-axis labels.
Now, we can run r.series.boxplot
to draw the boxplots. We define the text labels on the x-axis with the text_labels. The other options should speak for themselves, but if not, check the manual page.
The results in Figure 6 show how the distribution of LAI values in 2015 and 2016 varies considerably in space and time. There is a pronounced seasonality, but there are also large differences within the study area, especially in the season with high LAI values (determined by the monsoon season) between the end of June and the end of October. Note the curious dip midway to the peak season in 2015. I wonder where that comes from.
Example 6
Let’s plot the LAI distribution for each 10-day period in 2020. Like in the previous example, we’ll do this for the protected areas, i.e., excluding all areas outside the protected areas. This time, we’ll plot the boxplots horizontally.
Get a list with the layer names, and use this as input for r.series.boxplot. Plot the boxplots horizontally, colour the boxplots orange, and print the labels horizontally. Also plot the outliers.
The second line gets a list of all the LAI layers using the g.list
command and assigns it to the variable A. The third line does the same, but pipes the resulting string with layer names to the sed
command. This command is used to remove the first part of the name, resulting in labels that only contain the date of the first day of the 10-day period. This string with names is assigned to the variable B. Variable A and B are used as input for the map names and text label respectively.
# Get a list of all layers for 2015
A=`g.list type=raster pattern=*LAI300_2020* sep=,`
B=$(echo `g.list type=raster \
pattern=*LAI300_2020* sep=,` | sed s/LAI300_//g)
# Run r.series.boxplot
r.series.boxplot -gho map=$A text_labels=$B \
rotate_labels=0 fontsize=8 \
bxcolor="#ff7f0e" \
plot_dimensions="10,4" \
flier_color="#FFA500" \
flier_marker="o" flier_size=1
# Load the required modules
from grass.pygrass.modules import Module
from subprocess import PIPE
# Get a list of input layers
lay = Module("g.list" , type="raster", pattern="*LAI300_2020*",
separator=",", stdout_=PIPE).outputs.stdout[:-1]
# Create shortened label names
lab = lay.replace("LAI300_", "")
# Create the boxplots
Module("r.series.boxplot", flags=["g", "h", "o"],
map=lay,
text_labels=lab,
rotate_labels=0,
bxcolor="#ff7f0e",
fontsize=8,
plot_dimensions="10,4",
flier_color="#FFA500",
flier_marker="o",
flier_size=1)
# Get a list of input layers
lay = execGRASS("g.list" , type="raster", pattern="*LAI300_2020*",
separator=",", inter=TRUE)
# Create shortened label names
lab = gsub("LAI300_", "", x=lay)
# Create the boxplots
execGRASS("r.series.boxplot",
flags=c("g", "h", "o"), map=lay,
text_labels=lab, rotate_labels=0,
bxcolor="#ff7f0e", fontsize=8,
plot_dimensions="10,4",
flier_color="#FFA500",
flier_marker="o",
flier_size=1)
We can see numerous outliers with very low LAI values in the LAI peak period. However, in general, the spatial variation in LAI in that period is noticeably lower than in 2015 and 2016 (and the other years, see Figure 8). Could it be that rainfall was more uniformly distributed across space in this year? Or was there more cloud coverage, resulting in more averaged values? That is the thing with boxplots, they show you patterns, but explaining those patterns isn’t always easy. We could check the original Sentinel-3/OLCI and PROBA-V data, but that goes beyond the scope of this post.
t.rast.boxplot
If you are working with time series, you should check out the temporal data processing toolbox in GRASS GIS, which offers a powerful set of functions to deal with temporal data. To use those functions, you first need to create a space-time raster dataset (strds) with t.create and register your layers in the strds with t.register.
After you have created a space-time raster datasets (strds) with the LAI data layers, you can use the t.rast.boxplot
function instead of the r.series.boxlot
. An advantage is that even with gaps in the time series, the boxplots will be plotted correctly on the time axis. In addition, you can use the temporal data processing functions to easily select, filter or create new time series, which you can then feed into t.rast.boxplot
.
Example 7
We will create boxplots for each 10-day period between 2014 and 2021 for Bardia National Park. For this example, I have already registered the LAI layers in the spatial-temporal raster dataset LAI_300. If you are not sure how to do this with your own data, see this Wiki page or this post.
First step is to set the region and mask to match respectively the extent and boundaries of Bardia National Park, while maintaining the resolution of the LAI layers.
Now we can create the boxplots representing the temporal and spatial distribution of LAI values for the period 2014-2021.
The bx_width parameter can be used to change the width of the boxplot. The default value is 0.75, which means that the boxplots will be 0.75 \(\times\) the maximum distance on the x-axis between two consecutive periods. Importantly, the function uses the time stamp unit of the strds to determine the maximum width. This means that if the time stamp unit is in days, but the raster layers represent, for example, 10-day periods, you need to multiply the bx_width value you would normally use by 10 to get the same relative width.
We are dealing with a large number of boxplots. It will be difficult to get the labels for all boxplots on the x-axis. We therefore use the -d flag option. With this option, the function attempts to figure out the best tick locations and format to use for the date, making it as compact as possible, while still being complete.
Create boxplots for each of the 10-day periods between 2014 and 202. Use the option -d flag option to let the function find the best location and formats for the x-axis labels.
Given the large number of boxplots, we will need to adjust some settings to achieve a readable result. See the manual page for an explanation of the different options used below.
Given the large number of boxplots, we will need to adjust some settings to achieve a readable result. See the manual page for an explanation of the different options used below.
The boxplot time series in Figure 8 provides a quick overview of the yearly and seasonal patterns, but also the spatial variation in Bardia National Park. As observed earlier, the low spatial variation in LAI values in 2020 stand out, as does the high spatial variation in the LAI peak period in 2014.
Out of curiosity, I also plotted the monthly rainfall, based on the CRU dataset6. Comparing the two graphs shows how the peak in LAI values follows the peak in rainfall, as one would expect. I wonder though where the small peak (or a slowdown in the decrease in LAI values) around the end/start of the year comes from?
Example 8
Let’s now compare the boxplot time series for March and August. To do so, we first need to compute the average monthly LAI, extract the LAI layers for the two months and store these in two new time-space raster data sets, which we can use as input for t.rast.boxplot
.
Create two strds for the average LAI values for March and August, respectively. Next, create boxplot time series for both months.
Compute the average monthly LAI values using t.rast.aggregate
function.
Extract the raster layers for March using t.rast.extract
, and use this as input in t.rast.boxplot
to create the boxplot time series for March.
# Extract March data layers
t.rast.extract input=LAI_monthly \
where="strftime('%m', start_time)='03'" \
output=LAI_March nprocs=4
# Create boxplot time series plot
t.rast.boxplot -g input=LAI_March dpi=300 \
plot_dimensions=3,3 \
bx_width=10 median_lw=1.2 \
date_format="%Y" \
bx_color=51:118:183:255 \
median_color=255:255:255:255
Extract the raster layers for August using t.rast.extract
, and use this as input in t.rast.boxplot
to create the boxplot time series for August.
# Extract August data layers
t.rast.extract input=LAI_monthly \
where="strftime('%m', start_time)='08'" \
output=LAI_August nprocs=4
# Create boxplot time series plot
t.rast.boxplot -g input=LAI_August dpi=300 \
plot_dimensions=3,3 \
bx_width=10 median_lw=1.2 \
date_format="%Y" \
bx_color=51:118:183:255 \
median_color=255:255:255:255
Compute the average monthly LAI values using t.rast.aggregate
function.
Extract the raster layers for March using t.rast.extract
, and use this as input in t.rast.boxplot
to create the boxplot time series for March.
# Extract March data layers
Module("t.rast.extract", input="LAI_monthly",
where="strftime('%m', start_time)='03'",
output="LAI_March", nprocs=4)
# Create boxplot time series plot
Module("t.rast.boxplot", flags="g",
input="LAI_March", median_lw=1.2,
plot_dimensions="3,3",
date_format="%Y", bx_width=10,
bx_color="51:118:183:255",
median_color="255:255:255:255")
Extract the raster layers for August using t.rast.extract
, and use this as input in t.rast.boxplot
to create the boxplot time series for August.
# Extract August data layers
Module("t.rast.extract", input="LAI_monthly",
where="strftime('%m', start_time)='08'",
output="LAI_August", nprocs=4)
# Create boxplot time series plot
Module("t.rast.boxplot", flags="g",
input="LAI_August", median_lw=1.2,
plot_dimensions="3,3",
date_format="%Y", bx_width=10,
bx_color="51:118:183:255",
median_color="255:255:255:255")
Compute the average monthly LAI values using t.rast.aggregate
function.
Extract the raster layers for March using t.rast.extract
, and use this as input in t.rast.boxplot
to create the boxplot time series for March.
# Extract March data layers
execGRASS("t.rast.extract", input="LAI_monthly",
where="strftime('%m', start_time)='03'",
output="LAI_March", nprocs=4)
# Create boxplot time series plot
execGRASS("t.rast.boxplot", flags="g",
input="LAI_March", median_lw=1.2,
plot_dimensions="3,3",
date_format="%Y", bx_width=10,
bx_color="51:118:183:255",
median_color="255:255:255:255")
Extract the raster layers for August using t.rast.extract
, and use this as input in t.rast.boxplot
to create the boxplot time series for August.
# Extract August data layers
execGRASS("t.rast.extract", input="LAI_monthly",
where="strftime('%m', start_time)='08'",
output="LAI_August", nprocs=4)
# Create boxplot time series plot
execGRASS("t.rast.boxplot", flags="g",
input="LAI_August", median_lw=1.2,
plot_dimensions="3,3",
date_format="%Y", bx_width=10,
bx_color="51:118:183:255",
median_color="255:255:255:255")
The results in (Figure 10) suggest that the conditions in 2015 seem to have been special, in that the LAI WAS relatively high in March, and high in August compared to the other years.
However, we must be careful when comparing March with August because the boxplots for these two months are plotted on different scales. To make it easier to compare the two plots, we can force them to use the same scale by using the axis_limits parameter. In this case, we should set the limits to 0 and 7.
Create boxplot time series for the average LAI values for March and August. Set for both plots the minimum and maximum value of the y-axis to respectively 0 and 7.
Use t.rast.boxplot
to create the boxplot time series for March. Set the minimum and maximum value of the y-axis to respectively 0 and 7.
Use t.rast.boxplot
to create the boxplot time series for August Set the minimum and maximum value of the y-axis to respectively 0 and 7.
Use t.rast.boxplot
to create the boxplot time series for March. Set the minimum and maximum value of the y-axis to respectively 0 and 7.
Use t.rast.boxplot
to create the boxplot time series for August Set the minimum and maximum value of the y-axis to respectively 0 and 7.
Use t.rast.boxplot
to create the boxplot time series for March. Set the minimum and maximum value of the y-axis to respectively 0 and 7.
Use t.rast.boxplot
to create the boxplot time series for August Set the minimum and maximum value of the y-axis to respectively 0 and 7.
It depends, of course, on what you want to show, but Figure 11 does more clearly show how the variation in LAI values across the parks of the TAL region is considerably larger in the season with high LAI values.
Another, perhaps more obvious example of why you might want to change the axis limits is when you want to compare two different areas, with very different value ranges.
Footnotes
I created the plugin for the project Save the tiger! Save the Grasslands! Save the Water!. Within the project, our Innovative Bio-Monitoring research group (IBM) focusses on the mapping of eco-hydrological conditions in the TAL region.↩︎
CGLS. (2022, update). Leaf Area Index. Copernicus Global Land Service. https://land.copernicus.eu/global/products/LAI↩︎
The layers TAL_PAS and TALregion were created by my students. The former is based on ProtectedPlanet, but also had to manually digitize some boundaries based on various other sources. The latter is used on various sources.↩︎
It might make more sense to limit comparisons to a specific vegetation types. To accomplish that, we would also need to mask out all but the selected vegetation type.↩︎
Note that you can also set the colours interactively using the options manage colour rules interactively from the raster > manage colours menu↩︎
CRU TS v. 4.06 from https://crudata.uea.ac.uk/cru/data/hrg/.↩︎