OpenStreetMap is a remarkable project. Before its inception, geoinformation production was an activity exclusive of governmental and comercial institutions; as such, users used to face limitations in the way of licenses or costs. Since 2004, however, there is a platform —powered by the very technologies that made it possible to freely share data through the internet— which gathers and delivers geoinformation in a free, ubiquitous manner.

Nevertheless, OpenStreetMap usage comes with certain inconvenients which stem from the very nature of the project. One such inconvenient is trying to know its present status: since the database is constantly changing —with aditions and modifications made by users all around the world— it is difficult to know, for instance, how many roads have been registered in the map so far. In any case, the webpage OpenStreetMap Wiki Stats presents many resources (some of them are updated daily) that allow having an idea of the current condition of the project.

While I was working on my undergraduate thesis, in some moment I needed a specific resource: a choropleth map of OpenStreetMap elements by country; that is, a map where every country is colored according to how many elements have been contributed within its territory. To understand how that kind of map would look like, I am sharing one that I found in one of my references (Jokar Arsanjani et al. 2015):

OpenStreetMap nodes created by country in October 2014

OpenStreetMap nodes created by country in October 2014; via Springer International Publishing

This map may only count the nodes created in a single month but, even so, it properly depicts the global panorama of OpenStreetMap contributions: the most active communities are in european contries; Canada, United States, Brazil, Russia and Australia recive huge amounts of contributions, because of their extense territories; and the contributions levels greatly vary in Latin America, Africa and Asia.

It is important to note that, in OpenStreetMap, there are three basic elements: nodes, ways and relations. Both ways and relations are ultimately composed of nodes; therefore, a map counting nodes would be accurate in representing the distribution of contributions. One way of producing that map is to download the data for each country —Geofabrik is an example of a webpage that has country wise extracts— and to add up the nodes in each one, but that would be a exhausting process. That is the motivation for introducing the following methodology to estimate the number of nodes per country, applying web scraping, a data mining technique which uses software to extract a webpage's content.

The data source is the website OSMstats developed by Pascal Neis, a person whose research about OpenStreetMap (Neis and Zipf 2012) also was significant in my thesis. This website offers statistics about different aspects of the project, with a daily periodicity since November 1, 2011; particularly, the Countries tab contains estimates of the number of contributors and created / modified / deleted nodes in each country, each day. The first step in this methodology is to write an extraction function; the packages httr, xml2 and purrr were used:

library(purrr)

get_day = possibly(function(day, url){
   
   message("Day: ", day)
   
   df = paste0(url, day) %>%
      httr::GET() %>%
      httr::content(encoding = "UTF-8") %>%
      xml2::as_list() %>%
      plucking() %>%
      map(framing) %>%
      reduce(rbind)
   
   df$date = day
   
   return(df)}, NULL)

The functions plucking() y framing() do not belong to any package; instead, I declared them previously in order to trim and tabulate the data, respectively. The crucial steps are actually between httr::GET() and xml2::as_list() because that is where the scraped webpage's content is extracted and transformed. Additionally, the entire extraction function is inserted in purrr::possibly(); this is a precaution that returns NULL in case an execution fails.

Now this function extracts data for a specific date, but its true usefulness will become apparent when we map (iterate) it through many dates. Since the data start on November 1, 2011, I decided to iterate for 2923 days, until November 1, 2019; as a consequence, the growth of OpenStreetMap during eight years will be measured.

neis = "https://osmstats.neis-one.org/?item=countries&date="

days = as.Date("2011-11-1") + 0:2922

countries = map(days, get_day, url = neis) %>% reduce(rbind)

The obtained data consist of 738001 observations (2917 days times 253 countries or territories). There are six days from the year 2018 which lack data: July 27, August 3 and 24, September 10 and 29, and October 22; we don't know the cause of this phenomenon. The next step is to sum the created nodes —while also subtracting the deleted ones— in each country; this is made possible by using dplyr:

library(dplyr)

countries_nodes = group_by(countries, country) %>%
   mutate(across(created:deleted, as.double)) %>%
   summarise(nodes = sum(created - deleted), dates = n_distinct(date))

summary(countries_nodes)
##    country              nodes               dates
##  Length:253         Min.   :       -2   Min.   :2917
##  Class :character   1st Qu.:   259873   1st Qu.:2917
##  Mode  :character   Median :  2756173   Median :2917
##                     Mean   : 16590188   Mean   :2917
##                     3rd Qu.: 13561870   3rd Qu.:2917
##                     Max.   :388753376   Max.   :2917

Finally, in order to generate the map, the values computed in the previous step must be joined to geometries that represent the countries. The World dataset contains simplified polygons for worldwide territories; since this dataset has class sf, the package of the same name is required to perform an adequate join. Then, the map can be plotted with ggplot2:

library(sf)
library(ggplot2)

data("World", package = "tmap")

left_join(World, countries_nodes, by = c("name" = "country")) %>%
   ggplot(aes(fill = nodes)) + geom_sf(color = NA) + theme_minimal()
First attempt to generate the map

Figure 1: First attempt to generate the map

In this first attempt it was detected that some countries appear with no data, for example the two Republics of Korea and the two of Congo. This problem is caused by the different names with which these countries figure in World and in OSMstats; to solve this, we first should find out which names are those, an we can do so by executing setdiff(World$name, countries_nodes$country):

##  [1] "Fr. S. Antarctic Lands" "Bahamas"                "Bosnia and Herz."
##  [4] "Central African Rep."   "Cote d'Ivoire"          "Dem. Rep. Congo"
##  [7] "Congo"                  "N. Cyprus"              "Czech Rep."
## [10] "Dominican Rep."         "Falkland Is."           "Gambia"
## [13] "Eq. Guinea"             "Korea"                  "Kosovo"
## [16] "Lao PDR"                "Myanmar"                "Dem. Rep. Korea"
## [19] "Palestine"              "W. Sahara"              "S. Sudan"
## [22] "Solomon Is."            "Somaliland"             "Timor-Leste"
## [25] "Tanzania"

Next, the names obtained from OSMstats have to be modified to coincide with the ones above.

countries_nodes = mutate(countries_nodes, across(country, ~case_when(
   . == "Bosnia and Herzegovina"              ~ "Bosnia and Herz.",
   . == "Central African Republic"            ~ "Central African Rep.",
   . == "Congo-Brazzaville"                   ~ "Congo",
   . == "Congo-Kinshasa"                      ~ "Dem. Rep. Congo",
   . == "Czech Republic"                      ~ "Czech Rep.",
   . == "Dominican Republic"                  ~ "Dominican Rep.",
   . == "Equatorial Guinea"                   ~ "Eq. Guinea",
   . == "Falkland Islands (Islas Malvinas)"   ~ "Falkland Is.",
   . == "French Southern and Antarctic Lands" ~ "Fr. S. Antarctic Lands",
   . == "Ivory Coast"                         ~ "Cote d'Ivoire",
   . == "Laos"                                ~ "Lao PDR",
   . == "Myanmar (Burma)"                     ~ "Myanmar",
   . == "North Korea"                         ~ "Dem. Rep. Korea",
   . == "Republic of Kosovo"                  ~ "Kosovo",
   . == "Solomon Islands"                     ~ "Solomon Is.",
   . == "South Korea"                         ~ "Korea",
   . == "South Sudan"                         ~ "S. Sudan",
   . == "The Bahamas"                         ~ "Bahamas",
   . == "The Gambia"                          ~ "Gambia",
   . == "United Republic of Tanzania"         ~ "Tanzania",
   . == "Western Sahara"                      ~ "W. Sahara",
   TRUE ~ .)))

setdiff(World$name, countries_nodes$country)
## [1] "N. Cyprus"   "Palestine"   "Somaliland"  "Timor-Leste"

Northern Cyprus, Palestine, Somaliland and Timor-Leste are countries that, perhaps due to limited recognition, do not appear in OSMstats and evidently they will remain with no data in the map. On the other hand, as many as 80 countries that do appear in OSMstats, are missing in World because they have scarce extension: Andorra, Singapore and the Lesser Antilles are some examples of this situation. As a result, from the total 4197.32 million obtained nodos, 4180.61 million are represented in the map, that is the 99.6 %. Having stated that, we can join again the values and plot the map with certain improvements.

Second attempt to generate the map 🎉

Figure 2: Second attempt to generate the map 🎉

To conclude this post, I think it is necessary to emphasize that OpenStreetMap started in 2004; hence this methodology is unable to count all the existing nodes. Then, how many are missing? Well, the OSMstats website reported that on November 1, 2019, there were 5559.59 million nodes in the database; if that is the case, this methodology counted up to 75.49 % of all the nodes. The same website reported 1248.29 million nodes on November 1, 2011 (that is 22.45 % of the 2019 value). Adding the two percentages we get 97.94 % and that is the amount of OpenStreetMap nodes whose existence we could measure.

By applying web scraping on Pascal Neis' website, not only did I plot the choropleth map I needed; I also generated a big, interesting dataset. Of course, in future posts I will further use it, so I can learn more about this wonderful project. For now, let's find out what are the top ten countries with more nodes:

slice_max(countries_nodes, nodes, n = 10)
## # A tibble: 10 x 3
##    country           nodes dates
##    <chr>             <dbl> <int>
##  1 United States 388753376  2917
##  2 Russia        294001095  2917
##  3 Canada        231464569  2917
##  4 France        230694245  2917
##  5 Germany       220406582  2917
##  6 Indonesia     154173689  2917
##  7 Italy         143362158  2917
##  8 Poland        120794989  2917
##  9 Brazil        112646013  2917
## 10 Japan          98296294  2917

Jokar Arsanjani, Jamal, Alexander Zipf, Peter Mooney, and Marco Helbich. 2015. “An Introduction to Openstreetmap in Geographic Information Science: Experiences, Research, and Applications.” In Lecture Notes in Geoinformation and Cartography, 1–15. https://doi.org/10.1007/978-3-319-14280-7_1.

Neis, Pascal, and Alexander Zipf. 2012. “Analyzing the Contributor Activity of a Volunteered Geographic Information Project — the Case of Openstreetmap.” ISPRS International Journal of Geo-Information 1 (December): 146–65. https://doi.org/10.3390/ijgi1020146.