Working with the global Building Footprints data

Or, how to deal with very large GeoJSON files

open data
data import
data wrangling
The open building footprints data set from Bing Maps is a great resource. However, it is not an easy one to work with due to its size. Luckily, with just a few lines of code, you can split the file into manageable chunks that are easier to deal with.
Author

Paulo van Breugel

Published

June 5, 2022

Introduction

I am currently supervising students who are working on their final graduation assignment. Their goal is to map and characterize the environmental conditions in nature reserves in the Terai Arc Landscape (TAL). Because we want to account for anthropogenic pressures in these areas, we need to know where people live. This is where the global building footprints come in.

Global building footprints, source: GlobalMLBuildingFootprints

The open building footprints data set, released by Bing Maps, includes 777 million buildings, derived from Bing Maps, Maxar and Airbus imagery acquired between 2014 and 2021. The website reports a precision and recall for South Asia of 94.8% and 76.7%, respectively. They also claim that the quality is on par with, or exceeds, that of the hand-digitized buildings in OpenStreetMap. From my admittedly very limited experience, OpenStreetMap for Nepal is pretty good, so I am curious whether the building footprints will indeed improve much on what OpenStreetMap provides. In any case, a clear advantage of the building footprints data set is that it was developed using a uniform, standardized methodology. In contrast, the status of the OpenStreetMap data can vary significantly between countries.

There is no denying, however, that using the data is a challenge. You can download the data per country, but most data files are still huge. For example, for a relatively small country like Nepal, the file size is 465 MB zipped and 1.9 GB uncompressed. An additional problem is that the data is distributed as GeoJSON files, which are not indexed, making them slow to load in, e.g., QGIS.

There is a silver lining, though. The GeoJSON files come in a “line-by-line” format. This means that each feature appears on its own line in the file, as a standalone GeoJSON object. With a line-by-line format, only one line of the file (or a subset of lines) needs to be pulled into memory and parsed at a time, making it possible to support huge files (source). In our case, this means that the original GeoJSON file can be split into manageable chunks and converted to a format more suitable for further analysis and visualization, as explained in this YouTube video.
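To give an idea of what this buys you, below is a minimal sketch (not part of the workflow that follows) of how such a file can be streamed in Python, assuming each line holds one standalone GeoJSON object:

import json

# A minimal sketch: stream a line-by-line GeoJSON file one feature at
# a time, so only the current line has to be held in memory.
n_features = 0
with open("Nepal.geojsonl") as f:
    for line in f:
        feature = json.loads(line)  # each line is a standalone GeoJSON object
        n_features += 1
print(n_features, "features")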

Import data for the TAL region

In this post, I will show you how I got the buildings for the Nepali part of the TAL, using the method suggested in the video mentioned above. I had to add two steps: selecting the building footprints within the TAL, and merging the resulting files into one vector layer. Note that the main tools, split, ogr2ogr and ogrmerge, are command line programs. However, I am not very familiar with the command line syntax, so I ran them using the Python subprocess.call function instead. In the next sections, I show you how.
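For readers who have not used it before, subprocess.call takes a command, runs it, and returns the exit code (0 means success). A minimal sketch to check that the GDAL command line tools are available could look like this:

import subprocess

# With shell=True the command is passed as a single string to the shell,
# which later also takes care of expanding wildcards such as *.gpkg.
exit_code = subprocess.call("ogr2ogr --version", shell=True)
if exit_code != 0:
    print("ogr2ogr does not seem to be available on this system")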

Functions and parameters

To use the code below, you first have to download the building footprints data for your country and unzip the GeoJSON file into a folder (yes, I know, I could have scripted that part as well; maybe next time). After downloading the data, adapt the parameter settings below accordingly.
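If you do want to script that part, a minimal sketch could look like the one below. Note that the URL is a placeholder; look up the actual download link for your country on the GlobalMLBuildingFootprints page.

import urllib.request
import zipfile

# Placeholder URL: replace it with the real link for your country
url = "https://example.com/Nepal.geojsonl.zip"
zip_path = "/home/paulo/Downloads/Nepal/Nepal.geojsonl.zip"

urllib.request.urlretrieve(url, zip_path)         # download the archive
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall("/home/paulo/Downloads/Nepal")  # unzip into the folder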

# Load libraries
import glob
import subprocess
import os

# Parameters
wd = "/home/paulo/Downloads/Nepal"               # working directory
file_name = "Nepal.geojsonl"                     # name of the input GeoJSON file
output_file_name = "TAL.gpkg"                    # name of the merged output file
chunk_size = 100000                              # number of lines per chunk
clipsource = os.path.join(wd, "TALregion.gpkg")  # data set holding the clip layer
cliplayer = "Talregion"                          # name of the clip layer

Split the data

The next step is to split the original GeoJSON file into smaller chunks. I use chunks of 100,000 lines each (the chunk_size parameter above), but you can change this. The first two lines below define the name of the input file and the prefix of the names of the output files, respectively. The command on lines 3-5 constructs the split command with its parameters. On the last line, the command is run through the Python subprocess.call function. Note that you need to set shell=True for this to work.

file_in = os.path.join(wd, file_name)
files_out = os.path.join(wd, "chunk_")
cmd1 = ("split -l {} {} {} "
        "--additional-suffix=.geojsonl".
        format(chunk_size, file_in, files_out))
process = subprocess.call(cmd1, shell=True)

The result is a folder with, in this case, 64 GeoJSON files. To convert these to GeoPackages (you can use any other format supported by OGR), you first need a list with the names of the files you just created. The code below lists all GeoJSON files in the wd folder that start with chunk_.

# List the names of the chunk files created in the previous step
listOfFiles = [os.path.basename(f) for f in
               glob.glob(os.path.join(wd, "chunk_*.geojsonl"))]

Convert the data

The following code loops through the list of file names, converting each of the GeoJSON files to a GeoPackage, while clipping the geometries to the clip layer defined with the -clipsrclayer parameter. The -f GPKG option tells the ogr2ogr function that the output should be a GeoPackage. The file_out is the name of the output file, and file_in the name of the input file. The -clipsrc parameter points ogr2ogr to the data set that holds the clip layer, so that only the features within the boundaries of {cliplayer} are included.

If you get warnings that mixing polygons and multipolygons is not allowed, you can use the -nlt PROMOTE_TO_MULTI switch to automatically promote layers that mix polygons and multipolygons to multipolygons. And in case you are getting warnings about problematic geometry collections, you can try the -explodecollections switch. Lastly, use the -skipfailures switch in case the data set contains invalid features. That tells the function to continue after a failure, skipping the failed feature.

dirpath = os.path.join(wd, "gpkgs")  # folder for output files
os.makedirs(dirpath, exist_ok=True)  # create the folder defined above
for file in listOfFiles:
    file_in = os.path.join(wd, file)
    filename = file.replace("geojsonl", "gpkg")
    file_out = os.path.join(dirpath, filename)
    call = ("ogr2ogr -f GPKG {} {} "
            "-clipsrc {} -clipsrclayer {} "
            "-nlt PROMOTE_TO_MULTI "
            "-explodecollections "
            "-skipfailures".
            format(file_out, file_in,
                   clipsource, cliplayer))
    subprocess.call(call, shell=True)

Check the ogr2ogr documentation for a description of all the parameters. Make sure to read the section with performance hints. For example, for SQLite-based formats such as GeoPackage, explicitly defining -gt 65536 ensures optimal performance while populating tables that contain many hundreds of thousands or millions of rows. However, note that the -skipfailures switch overrides -gt and sets the size of transactions to 1.
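For illustration, this is what the ogr2ogr call from the previous block could look like with -gt added. The -skipfailures switch is dropped here, because it would override -gt anyway:

# Variant of the call above, grouping 65536 features per transaction;
# -skipfailures is left out because it would reset the transaction size to 1.
call = ("ogr2ogr -f GPKG {} {} "
        "-clipsrc {} -clipsrclayer {} "
        "-nlt PROMOTE_TO_MULTI "
        "-explodecollections "
        "-gt 65536".
        format(file_out, file_in,
               clipsource, cliplayer))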

Merge the data

The last step is to merge the GeoPackages created in the previous step. One way to do this is with ogrmerge.py, a script that takes several vector data sets as input and copies them into a target data set. By default, each input vector layer is copied as a separate layer into the target data set. So, to create one single target layer, use the -single switch.

path_gpkg = os.path.join(dirpath, "*.gpkg")  # pattern matching all chunk GeoPackages
out_gpkg = os.path.join(wd, output_file_name)
subprocess.call(("ogrmerge.py -single "
                 "-f GPKG -o {} {}".
                 format(out_gpkg, path_gpkg)),
                shell=True)

In my case, this last step took a bit of time. If you are dealing with very large, densely populated areas, check whether you can still handle the resulting data on your computer, and whether you really need to have all the data in one big file.
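A quick way to inspect the result is GDAL's ogrinfo tool. With the -so (summary only) and -al (all layers) switches it reports the feature count, extent, and attribute fields of each layer without listing every feature:

# Print a summary (feature count, extent, fields) of all layers
# in the merged GeoPackage
subprocess.call("ogrinfo -so -al {}".format(out_gpkg), shell=True)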

Acknowledgement

The work described here is part of the NWO-financed project Save the tiger! Save the Grasslands! Save the Water!, a multidisciplinary project carried out by a consortium of Dutch and Nepali partners. Within the project, the Innovative Bio-Monitoring research group (IBM) of HAS University of Applied Sciences focuses on mapping the variability in eco-hydrological boundary conditions for grasslands in TAL nature reserves.