Import GBIF data using RGBIF

Introduction

The objective is to download GBIF data using R, and store the resulting data in a GRASS GIS database or save the data as geopackage. One can use GRASS GIS (GRASS Development Team 2020) and R (R Core Team 2020) in conjunction in two different ways. One is to use R within a GRASS GIS session. This means that we start R (or RStudio) from the GRASS GIS command line. The other option is to use GRASS GIS within a R session. This means we connect to a GRASS GIS location/mapset from within R (or RStudio). For more details, see the GRASS GIS Wiki. The first approach is arguably the easiest, and is what we will be using below. If you just want to download and use the data in R, you can simply skip the parts about GRASS GIS.

Kalmia latifolia distribution

Mountain Laurel (Kalmia latifolia) is a mid-sized shrub that can be found in acidic forests, bluffs, bogs, along sandhill streams, and in a wide range of other habitats, nearly ubiquitous in the mountains, up to at least 1600m, while more restricted in habitat in the lower Piedmont and Coastal Plain (Society 2020).

Kalmia latifolia flower close-up. Source: Liam Kaplan, CC BY-SA 3.0, via Wikimedia Commons.

Figure 1: Kalmia latifolia flower close-up. Source: Liam Kaplan, CC BY-SA 3.0, via Wikimedia Commons.

In this example, we will download the presence points of this species in the USA and write it to a GRASS GIS database. In essence, it only takes 4 lines of code.

KL <- occ_data(scientificName="Kalmia latifolia", hasCoordinate=TRUE,
               geometry=NCext, hasGeospatialIssue=FALSE)
coordinates(KL) <- cbind(KL$decimalLongitude, KL$decimalLatitude)
sppoints <- st_as_sf(x = KL$data,
                     coords = c("decimalLongitude", "decimalLatitude"),
                     crs=4326)
writeVECT(SDF=KLnew[,c(1,2,5)], vname="KL3", driver="SQLite")  

For those less familiar with R, GRASS GIS or the combination of the two, let’s go into a bit more detail. Below we’ll go over the code above step by step.

Import GBIF data in R

In this example we start GRASS GIS first and then started R (or Rstudio) from the GRASS GIS command line. After opening R, we load the libraries sf (Pebesma 2018) voor het werken met ruimtelijke vector data rgbif Chamberlain and Boettiger (2017) to get the gbif occurrence data, maptools (Bivand 2021) for displaying a map of the data in R, the rgrass7 (Bivand 2021) package to interact with GRASS GIS and the mapr (Chamberlain 2020) to plot the gbif data.

library(rgrass7)
library(sf)
library(rgbif)
library(mapr)
library(maptools)

Area of interest

The GBIF data set, from which we will download the data later, contains data for the whole world. If you are only interested in occurrence data in a specific region, you can define the bounding box of your area on interest, as in the first example below, or provide the name of the country, as in the second example. So, suppose we are interested in the occurrence data in North Carolina. The bounding box of the state is north: 36:36:30N, south: 33:36N, west: 84:20W, and east: 75:05W. We can set this extent in R by creating a vector with the minimum and maximum X-coordinates and the minimum and maximum Y-coordinates. Next, we can convert the vector to a bounding box.

# Define extent & boundding box
NCext <- c(xmin=-84.333, xmax=-75.08333, ymin=33.6, ymax=36.60833)

# Create a bounding box object
NCbbox <- st_bbox(obj=NCext, crs=4326)

Following these two steps makes the code arguably more readable. But if you prefer, you can combine the two lines in one. Matter of taste I guess.

We are running R from within a GRASS GIS session. If in GRASS the region extent is already correctly defined, you can extract the bounding box information by reading in the GRASS location metadata.

Click here to read more…

You can do this using the gmeta functions from the rgrass7 package. That will give you a list object with information about the location, mapset, region settings and projection.

Now, first you need to get the names of the list elements that define the bounding box. As it turns out, the names of the list elements for the north, south, east and west coordinates are n, s, e and w. You can use this to extract them and define the extent.

# Get the names of the list objects
NCmeta <- gmeta()

# Create a vector with the bounding box coordinates
NCext <- c(xmin=NCmeta$w, xmax=NCmeta$e, ymin=NCmeta$s, ymax=NCmeta$n)

# Create a bounding box object
NCbbox <- st_bbox(obj=NCext, crs=4326)

If you want to check if the extent is correctly defined, you can plot the bounding box on top of the world maps. In the script below, NCbbox object is first converted to a spatial data layer of the class sf with the st_as_sfc function. Next, we get the a vector layer with national boundaries using the function wrld_simpl from the maptools packages. As last step, both spatial layers are plotted.

# Convert bounding box to vector layer
NCreg <- st_as_sfc(NCbbox)

# Get national boundaries
data(wrld_simpl)

# Plot national boundaries and bounding box
plot(wrld_simpl, col = "white", axes = T)
plot(NCreg, col = "red", add=TRUE)
Global map with country boundaries and the red box indicating the area on interest (the state North Carolina in the USA).

Figure 2: Global map with country boundaries and the red box indicating the area on interest (the state North Carolina in the USA).

Import the data

OK, so now we have defined the extent, it is time to get the point data of our target species; Kalmia latifolia. The rgbif package has a very convenient function occ_data to do this. The area from which we want to download the data can be set by the parameter geometry, which is set to the bounding box, described in Well Known Text (WKT) format.

To convert the bounding box NCbbox we have defined earlier, we can use the gbif_bbox2wkt function. This requires a vector with coordinates of the corners of the bounding box in the order xmin, ymin, xmax, ymax.

NCext <- gbif_bbox2wkt(bbox=c(-84.333, 33.6, -75.08333, 36.60833))

Or perhaps easier, we can use the function st_as_text to convert the polygon layer NCreg we created earlier to a WKT format. Both method should yield the same result.

NCext <- st_as_text(NCreg)

Now we have the bounding box defined in the approprioate format, we can get the gbif occurrence data. As a minimum we need to give the scientific name of the species for which we want to download the data. In addition, we’ll need to set the parameter hasCoordinate to TRUE to ensure only records with coordinates are downloaded.

KL <- occ_data(scientificName="Kalmia latifolia", hasCoordinate=TRUE,
               geometry=NCext, hasGeospatialIssue=FALSE)
KL <- load("KL.RData")

There are a lot more parameters to further fine-tune the search. Make sure to read the help file for details. If you are not yet familiar with the GBIF data, it might be a good idea to do a search on the GBIF website itself to familiarize yourself with the different filters and options. Try e.g., to search for Kalmia latifolia.

Visualizing the data in R

Using mapr

The KL is a list with two elements. The first is a list with the metadata and the second a dataframe with the actual records, including the coordinates. We can quickly map the points in the dataframe using the map_ggplot() from the mapr package. This allows you to define extent of the map using the map parameter, with the following options: world, world2, state, usa, county, france, italy, or nz. The size parameter is used to set the size of the points.

# Plot the data using the map_ggplot function from the mapr package
mapggplot <- map_ggplot(KL, map="state", size=0.5)

The options state and county will give you a map of the USA with respectively the state and county boundaries. It thus gives you a simple world map of the USA with state boundaries and the location of Kalmia latifolia within the North Carolina region.

Map of contiguous USA, with the distribution of occurrences of Kalmia Latifolia within the bounding box of the state North Carolina.

Figure 3: Map of contiguous USA, with the distribution of occurrences of Kalmia Latifolia within the bounding box of the state North Carolina.

The mapr package contains various other mapping functions to plot maps with a background layer from e.g., Google or Openstreetmap, and to create interactive maps. See the manual pages for more information.

Using ggplot2

The mapping functions in mapr are fine when all you want is a quick overview. However, they don’t allow you to customize the map much. If you want something more involved, you can use for example ggplot2 to create your own map. In the example below, we are downloading the data for the whole of the USA. Instead of the geometry parameter, we can use the country parameter, which is a convenient way to define the country of interest (see the help file for details).

# Download Kalmia latifolia presence data for the USA
KL <- occ_data(scientificName="Kalmia latifolia", hasCoordinate=TRUE,
               country="US", hasGeospatialIssue=FALSE)

For the map with state borders of the contiguous USA, we use the map_data from the ggplot2 package. This gives us a data.frame with coordinates which can be plotted as a polygon with the geom_polygon function.

# Get USA state map
NCstate = map_data("state")

# Create the map
p <- ggplot() + 
  coord_fixed(1.2) + 
  geom_polygon(data = NCstate, aes(x=long, y = lat, group = group),
               fill="white", col="grey", lwd=0.3) +
  geom_point(data=KL$data, aes(x = decimalLongitude, y=decimalLatitude,
                               color="Presence points"), cex=0.5) +
  labs(color = "Kalmia latifolia") +
  theme(legend.position=c(0.14,0.15))
Map of contiguous USA, with the distribution of occurrences of Kalmia Latifolia. Map created with ggplot2.

Figure 4: Map of contiguous USA, with the distribution of occurrences of Kalmia Latifolia. Map created with ggplot2.

The map clearly shows that the species is largely confined to the east of the USA. Next step is to check the data for consistency and possible errors.

Cleaning data

You should always have a critical look at the quality of your data. For example, look at the map again, do all records look plausible? Are there any duplicates? And if so, how do we deal with them. A special tool for cleaning point occurrence records and identifying geographic outliers is provided by the CoordinateCleaner toolset for R (Zizka et al. 2019). Also the dismo package (Hijmans et al. 2020) provides some useful code to check and clean GBIF data.

library(CoordinateCleaner)
library(countrycode)
library(dplyr)

# First select columns of interest
KL <- KL$data %>%
  select(species, decimalLongitude, decimalLatitude, countryCode,
         individualCount, gbifID, family, taxonRank,
         coordinateUncertaintyInMeters, year,basisOfRecord,
         institutionCode, datasetName)

We can use the automatic cleaning algorithm of CoordinateCleaner to identify errors that are common to biological collections, including: sea coordinates, zero coordinates, coordinate - country mismatches, coordinates assigned to country and province centroids, coordinates within city areas, outlier coordinates and coordinates assigned to biodiversity institutions. You can switch on each test individually using logical flags, and modify the sensitivity of most individual tests using the “.rad” arguments. See ?clean_coordinates for help.

First step is to get the country code in the right format. The code in the gbif data is in ISO2 format, and we need to convert it to ISO3 format. We use the countrycode function from the countrycode package for that.

#convert country code from ISO2c to ISO3c
KL$countryCode <-  countrycode(KL$countryCode, origin =  'iso2c', destination = 'iso3c')

Now we can use the The clean_coordinates function to flag common problems in the data. The function carried out a lot of tests by default, but you can include or exclude tests using the tests parameter.

#flag problems
KLdf <- data.frame(KL)
flags <- clean_coordinates(x = KLdf, 
                           lon = "decimalLongitude", 
                           lat = "decimalLatitude",
                           countries = "countryCode",
                           species = "species",
                           tests = c("capitals", "centroids", "equal","gbif",
                                     "institutions", "countries"))

The result is the same data.frame as the input data.frame, but with one column added for each test. In that column, TRUE means it is a clean coordinate entry, FALSE means that the entry is potentially problematic. We can summarize the results with the summary function

summary(flags)
##     .val     .equ     .cap     .cen     .con     .gbf    .inst .summary 
##        0        0       12        1        2        0        1       16

The automatic test flagged 16 of the 500 records (3.2%) as potentially problematic. Of these 12 are flagged because they occur within a 10 km radius from the capital, one which is within a kilometer of the countries centroid, and two records with coordinates that do within the country in the country column. The plot.spatialvalid function offers a convenient way to plot the flagged records.

plot(flags, lon = "decimalLongitude", lat = "decimalLatitude")
Map of contiguous USA, with the distribution of occurrences of Kalmia Latifolia in yellow, and records with potential problems in purple.

Figure 5: Map of contiguous USA, with the distribution of occurrences of Kalmia Latifolia in yellow, and records with potential problems in purple.

See Zizka et al. (2019) for an interpretation of the different possible errors and how to deal with them. For the purpose of this tutorial we will keep the flagged records as well as the added columns so we can later explore these errors further in GRASS GIS.

Import in GRASS GIS

First step is to concert the data to a simple feature (sf) spatial object. We do this with the st_as_sf function. As input parameters, we need to give the data frame (x), the columns with the coordinates (coords) and the coordinate reference system (_crs+). The latter can be set using the EPGS code. The GBIF data is downloaded in latlon WGS84, which has the EPGS code 4326.

sps <- st_as_sf(x = flags, crs=4326,
                coords = c("decimalLongitude", "decimalLatitude"))

Next, we can use the writeVECT function from the rgrass7 package to export the point layer to the GRASS GIS database. Note, this only works if you started R from within the GRASS GIS session, as mentioned earlier. In the example below, only a few columns are exported, because I won’t be needing the other ones. If you want to retain all columns, note that some columns may contain characters causing problems when importing in GRASS. In that case you will need to check and clean the offending columns. Note, with the function use_sf the writeVECT function knows that the input data is a sf class vector objects rather than the older sp class.

use_sf()
writeVECT(SDF=sps[,c(1,2,5, 14, 15, 16, 18)], vname="Kallat", driver="SQLite")
Map of contiguous USA, with the distribution of occurrences of Kalmia Latifolia. Map created in GRASS GIS.

Figure 6: Map of contiguous USA, with the distribution of occurrences of Kalmia Latifolia. Map created in GRASS GIS.

Special case - reprojecting

What if the downloaded data is in another projection than the projection of the Location/Mapset in your GRASS database? Let’s say we want to import the data in a GRASS GIS database with the projection NAD83(HARN) / North Carolina (EPSG 3358). This is in fact the CRS of the North Carolina (NC) sample GRASS database.

This means that the data needs to be reprojected. There are various ways one can do this. (A) One option is to reproject the data in R before writing it to GRASS. (B) A second option is to import the data in a location/mapset with the same projection as the point data (as we have just done above), and than change to the target Location/mapset in GRASS and use use r.proj to import and reproject the data from the first location/mapset into the target location/mapset.

We will go for option A here. As said, the data downloaded earlier is in latlon WGS84. You can set the projection using th e

We can set the projetion using the EPGS code, or you can get the projection details of our target mapset using the getLocationProj function from the rgrass7 package (Note, I am assuming that we have started GRASS with the target Location/mapset). Next, to reproject the sf spatial data layers using the function st_transform.

proj <- getLocationProj()
spsnew <- st_transform(x=sps, crs=proj)

We can now import the data in the grass gis database, for which we use again the writeVECT function.

writeVECT(SDF=spsnew[,c(1,2,5, 14, 15, 16, 18)], vname="Kallat3", driver="SQLite")

Visualize the data in GRASS GIS

Now you can further analyze your data in GRASS GIS or create a map. GRASS offers different ways to create maps, including the ps.map and g.gui.psmap functions (command line and GUI respectively), the m.printws function (you’ll need to install the addon) and the d.* family of GRASS functions. The map below was created using the latter.

Map of contiguous USA, with the distribution of occurrences of Kalmia Latifolia. Map created in GRASS GIS with the d.* family of GRASS functions.

Figure 7: Map of contiguous USA, with the distribution of occurrences of Kalmia Latifolia. Map created in GRASS GIS with the d.* family of GRASS functions.

Click here for the code to create the map

The code below was used to create the map. Note, this is for on the command line (i.e., no R code). For more details, check out the help file of each of the functions used. For the GRASS environmental variables (GRASS_RENDER_* in line 21-27), see this help page.

# Set region (we want to set the region slightly larger than the extent
# of the usa
g.region vector=contiguousUSA res=25
g.region n=n+200000 s=s-50000 w=w-200000 e=e+50000

# Compute with/height ratio image
# Note, the bc function rounds down to the nearest integer. To round up
# to the nearest integer when computing the width, I added 1 (+1).
eval `g.region -g`
ratio=`bc <<< "scale=4; $rows / $cols"`
height=1200
width=`bc <<< "scale=0; $height / $ratio + 1"` 

# Create vector layer with an extent matching that of the region. This
# will serve as background layer
v.in.region --overwrite output=bckgrnd cat=1 --overwrite

# Set environmental variables
# The commands below will directly render the images to a file. If you want
# to render the map to a monitor, use d.mon.
export GRASS_RENDER_IMMEDIATE="cairo"
export GRASS_RENDER_FILE="Kalmia_latifolia.png"
export GRASS_RENDER_HEIGHT=$height
export GRASS_RENDER_WIDTH=$width
export GRASS_RENDER_BACKGROUNDCOLOR="#FFFFFF"
export GRASS_RENDER_TRANSPARENT="TRUE"
export GRASS_RENDER_FILE_READ="TRUE"

# Define the layers
d.vect map=bckgrnd type=area color=234:234:234 fcolor=234:234:234
d.grid -w size=5 color=255:255:255 border_color=160:160:160 fontsize=12
d.vect contiguousUSA type=area color=100💯100 fcolor=250:250:250
d.vect map=Kallat color=red fill_color=red icon=basic/point \
       size=10
d.text text="Mountain Laurel" at=80,85 size=4 align=cc color="black" font=arial
d.text text="(Kalmia latifolia)" at=80,80 size=3 align=cc color="black" font=ariali

If you think, that is a whole lot of code to create a simple map, I am with you. Creating as script is a sensible choice if you want to automate your work flow. And remember, you can run the code also from within e.g., R or Python. So also within those environments you can create maps using GRASS functionality.

If you have to create a map only one time, it will be easier to create the map in the Map Display and use the ‘save to file’ option to save the resulting map as a image file. Alternatively, after creating the map in the Map Display, save the workspace (menu: File → Workspace → Save as) and use that as input in m.printws to save the map as an image.

Export as geopackage

Instead of importing the data in a GRASS GIS database, you can export is as Geopackage (or any other suitable format), so you can use the data easily in e.g., QGIS. Also in this case, the first step is to concert the data to a simple feature (sf) spatial object. In case you skipped the previous section, let’s repeat this step below.

sps <- st_as_sf(x = flags, crs=4326,
                coords = c("decimalLongitude", "decimalLatitude"))

As we have seen above, if needed we can first reproject the data. Let’s project the data to NAD83(HARN) / North Carolina (EPSG 3358), and then export it as Geopackage using the st_write function. Note that I write the data as layer Kallatgbif in a (new) geopackage Kallat.gpkg.

spsnew <- st_transform(sps, crs=3358)
st_write(obj=spsnew, dsn="Kallat.gpkg", layer="Kallatgbif")

Other R packages

As usual, there are more than one way to skin the proverbial cat. You can also use the dismo, taxize and spocc packages to access GBIF occurrence data. Each have their own strengths and weaknesses, so best to try them out to see which one fits your work flow best.

References

Bivand, Roger. 2021. “Rgrass7: Interface Between GRASS 7 Geographical Information System and r.” Manual. https://CRAN.R-project.org/package=rgrass7.
Chamberlain, Scott. 2020. “Mapr: Visualize Species Occurrence Data.” Manual. https://CRAN.R-project.org/package=mapr.
Chamberlain, Scott, Vijay Barve, Dan Mcglinn, Damiano Oldoni, Peter Desmet, Laurens Geffert, and Karthik Ram. 2021. “Rgbif: Interface to the Global Biodiversity Information Facility API.” Manual. https://CRAN.R-project.org/package=rgbif.
Chamberlain, Scott, and Carl Boettiger. 2017. “R Python, and Ruby Clients for GBIF Species Occurrence Data.” PeerJ PrePrints. https://doi.org/10.7287/peerj.preprints.3304v1.
GRASS Development Team. 2020. Geographic Resources Analysis Support System (GRASS GIS) Software (version 7.8). USA: Open Source Geospatial Foundation. http://grass.osgeo.org.
Hijmans, Robert J., Steven Phillips, John Leathwick, and Jane Elith. 2020. “Dismo: Species Distribution Modeling.” Manual. https://CRAN.R-project.org/package=dismo.
Pebesma, Edzer. 2018. “Simple Features for r: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.
R Core Team. 2020. “R: A Language and Environment for Statistical Computing.” Manual. Vienna, Austria. https://www.R-project.org/.
Society, North Carolina Native Plant. 2020. “Kalmia Latifolia.” North Carolina Native Plant Society. 2020. https://www.ncwildflower.org/plant_galleries/details/kalmia-latifolia.
Zizka, Alexander, Daniele Silvestro, Tobias Andermann, Josue Azevedo, Camila Duarte Ritter, Daniel Edler, Harith Farooq, et al. 2019. “CoordinateCleaner: Standardized Cleaning of Occurrence Records from Biological Collection Databases.” Methods in Ecology and Evolution, no. 10: –7. https://doi.org/10.1111/2041-210X.13152.
Next