Working with the Microsoft Road Detections dataset

Extracting and converting a subset of the data

machine learning
open source
Bing Maps released a data set with mined roads around the world. It is a great data set, but not an easy one to work with because of its size. Luckily, with just a few lines of code, you can extract the data for a user-defined region and import the result into a GeoPackage.

tags: GDAL, OGR, open data, Nepal, data import, data wrangling

Paulo van Breugel


January 2, 2023


Bing Maps recently released a data set with the roads around the world. The Microsoft Road Detections (MS roads) data set was created using automatic image recognition, using as input Bing Maps imagery between 2020 and 2022, including high resolution Maxar and Airbus imagery. The data is freely available for download and use under the Open Data Commons Open Database License (ODbL).

What I like about this data set is that it is meant to complement the OpenStreetMap (OSM) data. In fact, for several regions, you can choose between the complete data set, or the data set containing the machine learning derived roads missing from OSM.

The data

The data comes in an unusual format, a tab-separated values (tsv) file. It contains a column for country codes and a column containing GeoJSON objects. One can download files for different regions. Unfortunately, these files are still rather large. For example, the data set for South Asia is, when decompressed, 3 GB. Too much to handle in QGIS, on my computer at least. So, how to deal with this data?

As shown by John Bryant, it only requires a few lines of code to download the data, extract the records for your country of choice, and convert it to a GeoPackage. And with some small adaptations, you can extract the data for any user-defined area.
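Before extracting anything, it can help to check which country codes a regional file contains and how many records each one has, without opening the file in QGIS. The snippet below is a minimal sketch of that idea. The function name count_by_country is made up for illustration, and it assumes the file has already been decompressed and that the country code is the first tab-separated column.

```shell
# Count the records per country code (first tab-separated column) in a tsv file.
# Prints one line per code: the code, a tab, and the number of records.
# Usage (the file name is an assumption): count_by_country AsiaSouth-Full.tsv
count_by_country() {
  cut -f1 "$1" | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}
```

Because the file is streamed through cut and sort, it never has to be parsed as GeoJSON, which keeps this workable for a file of several gigabytes.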

Extract the data

As an example, I’ll show how to extract the roads for the Terai Arc Landscape (TAL), a region that straddles the border of Nepal and India. Both countries are part of South Asia, so the first step is to download the data for that region.

  # Step 1: download the data

The next step is to select and extract all records (lines) for Nepal and India. The codes used for these two countries are NPL and IND, respectively. To select and extract these, I use the zgrep command with an extended regular expression (-E 'IND|NPL') 1.

  # Assumption: the archive downloaded in step 1 is named AsiaSouth-Full.zip
  zgrep -E 'IND|NPL' AsiaSouth-Full.zip | cut -f2 > IndNpl.geojson
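The zgrep pattern above matches IND or NPL anywhere on a line, so in principle it could also pick up a record from another country whose GeoJSON text happens to contain one of those strings. That is probably rare in this data, but a stricter variant is to compare only the country code column. The following is a sketch of that alternative, not the approach used above. The function name extract_ind_npl is hypothetical, and it assumes a decompressed tsv file with the country code in the first column and the GeoJSON in the second.

```shell
# Print the GeoJSON column of the rows whose first column is exactly IND or NPL.
# Usage (the file name is an assumption):
#   extract_ind_npl AsiaSouth-Full.tsv > IndNpl.geojson
extract_ind_npl() {
  awk -F'\t' '$1 == "IND" || $1 == "NPL" { print $2 }' "$1"
}
```

The output has the same shape as that of the zgrep and cut pipeline, one GeoJSON object per line, so it can be passed to ogr2ogr in the same way.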

The resulting GeoJSON file can be converted to a GeoPackage using the ogr2ogr command-line tool. To only extract the roads of the TAL region, I use the -clipsrc and -clipsrclayer parameters to tell ogr2ogr to clip the geometries using the vector layer TALregion in TALregion.gpkg. This is a vector layer with the boundaries of the TAL region.

  ogr2ogr -f GPKG TALroads.gpkg  IndNpl.geojson \
    -clipsrc TALregion.gpkg -clipsrclayer TALregion \
    -nlt PROMOTE_TO_MULTI -explodecollections

Note, running the code above may take a while, so this would be a good time to grab a cup of coffee or whatever is your favorite beverage 2.

Comparison with the OSM data

To get an idea of how the resulting data set compares to the OSM data of the same region 3, I downloaded the OSM roads data from Geofabrik. Comparing both data sets shows that in the Indian part of the TAL region, some areas are poorly covered by OpenStreetMap (Figure 1). The Microsoft Road Detections data set, on the other hand, shows some gaps in the northwest of the Nepalese part of the TAL region. In these more rugged and forested areas, quite a few roads are not captured, or only partly captured, by the MS roads data set.

Figure 1: The road network according to OpenStreetMap (above) and based on the MS roads data (below)

Next, I created line density layers based on both data sets, using the line density function in QGIS. The settings I used were a search radius of 5 km and a raster cell size of 5 km. I then computed the difference between the road density estimates based on the OSM data and those based on the machine learning (ML) derived MS roads data (Figure 2).
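As far as the QGIS documentation describes it, the line density tool sums the length of the line segments that fall within a circular neighbourhood around each cell and divides that sum by the area of the neighbourhood. Under that assumption, the two steps can be written out as below. The symbols are introduced here for illustration only, and the sign convention of the difference (ML minus OSM) is an assumption.

```latex
% Road density at cell c: total length of the road segments \ell_i inside
% the search circle of radius r, divided by the area of that circle.
D(c) = \frac{\sum_i \ell_i(c)}{\pi r^2},
\qquad r = 5\ \text{km} \;\Rightarrow\; \pi r^2 \approx 78.5\ \text{km}^2

% Difference between the two density layers (sign convention assumed).
\Delta(c) = D_{\text{ML}}(c) - D_{\text{OSM}}(c)
```

With lengths in km and the area in km², the density has units of km/km².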

Figure 2: Differences in estimated road densities (km/km²) between the ML and OSM roads data. In the redder areas, the OSM estimates are higher, while in the darker blue areas, the ML estimates are higher.

Although the OSM data coverage is poor in several areas, there are other areas where the OSM map shows a much higher road density than the MS roads data layer (Figure 2). My first impression (admittedly based on a very limited comparison) is that the OSM data is often right here, as quite a few of these roads and tracks are fairly easy to distinguish on, for example, Google Earth imagery.

All in all, the MS roads data set seems to be more consistent at a regional scale, but one should probably be careful when comparing areas with very different topographic or land cover characteristics. The OSM data has the clear advantage that it distinguishes different types of roads and tracks.

On the website, Microsoft states that the data has been released because of its ‘continued interest in supporting a thriving OpenStreetMap ecosystem’. With both data sets having their strong and weak points, it will be interesting to see if, or to what extent, the MS roads data will be incorporated in OSM. For now, if you need data on roads, it might be a good idea to check both to see which is most suitable for your particular area and analysis.


  1. Note that I am running the above on my Linux computer. I am not sure how to accomplish the same on the command line in Windows. I only got as far as extracting the GeoJSON objects from the tsv file, using the PowerShell code below.

    # Split each line in two on the first whitespace and keep the second part (the GeoJSON)
    Get-Content .\AsiaSouth-Full.tsv |
      ForEach-Object { ($_ -split '\s+', 2)[1] } |
      Set-Content IndNpl.json

    If you know how to filter out the lines for Nepal and India using PowerShell on Windows, let me know in the comments.

  2. If the file is too large to handle, you can use the approach described in this post to parse the file line by line. In that post, I show how to import the open building footprint data into a GeoPackage.

  3. For South Asia, you can only download the complete data set, so there is no immediate way to compare the OSM and MS roads layers.