The daily growth of OSM nodes
In my first blog post, I presented how to create a choropleth map of OpenStreetMap
contributions; by using the web scraping technique, I was able to count up to 75 % of
all the existing nodes, a good initial estimate. Even so, just summing the nodes in each
country is not a complete exploitation of the data; after all, we are talking about 738 thousand
observations. So now we are going to estimate the rate of growth of the contributions in each
country. We will use tidyverse
to simultaneously load several packages that we need.
library(sf)
library(tidyverse) # loads dplyr, ggplot2, purrr, tidyr
data("World", package = "tmap")
countries = select(st_drop_geometry(World), country = name, continent) %>%
left_join(countries) %>% # countries dataset as created in my first post
filter(date != as.Date("2019-11-01"))
In this first step I discarded the territories that cannot be represented in the map (reducing them from 253 to 173), and I also added the continent to which each one belongs. Before computing the rate of growth, a valid argument is that we have more observations than needed, because of the daily periodicity. One strategy to reduce the amount of observations is to aggregate (group) them into a less frequent periodicity; therefore, in the next step, I compute the sum1 and the cumulative sum of created nodes, by the end of each month.
countries_month = mutate(countries, across(date, lubridate::ceiling_date, "month")) %>%
group_by(continent, country, date) %>%
summarise(nodes_sum = sum(created)) %>%
mutate(nodes_cum = cumsum(nodes_sum), date = date - 1) # last day from each month
head(countries_month)
## # A tibble: 6 x 5
## # Groups: continent, country [1]
## continent country date nodes_sum nodes_cum
## <fct> <chr> <date> <dbl> <dbl>
## 1 Africa Algeria 2011-11-30 45598 45598
## 2 Africa Algeria 2011-12-31 60887 106485
## 3 Africa Algeria 2012-01-31 46259 152744
## 4 Africa Algeria 2012-02-29 33228 185972
## 5 Africa Algeria 2012-03-31 24461 210433
## 6 Africa Algeria 2012-04-30 15374 225807
After the aggregation there are 96 monthly observations for each territory. This amount of data is now adequate to fit a model, but it is still too large to plot. For the graphic below I aggregated the data even more: I computed the mean monthly created nodes in each continent2, except Antarctica.
Here we can verify that the greatest levels of contribution occur in european countries. And while in Africa, Asia and South America participation has historically been lesser, it has increased through the years. Now, if we wanted to plot the evolution for every country, that would be an illegible graphic. Instead, let's select, from every continent, the two countries with the most nodes; in decreasing order, these countries are:
- United States ๐บ๐ธ
- Russia ๐ท๐บ
- Canada ๐จ๐ฆ
- France ๐ซ๐ท
- Indonesia ๐ฎ๐ฉ
- Japan ๐ฏ๐ต
- Brazil ๐ง๐ท
- Tanzania ๐น๐ฟ
- Australia ๐ฆ๐บ
- Nigeria ๐ณ๐ฌ
- New Zealand ๐ณ๐ฟ
- Argentina ๐ฆ๐ท
As expected, european and north american countries are where created nodes reach the highest values; there were few times when they were surpassed by a country from another continent. If we look to cumulative sums, rather than just monthly sums, we will realize that they seem to have a linear evolution, with United States having the steepest slope, followed by a group comprised of Russia, Canada, France. Arguably in most projects, quantities such as number of members or contributions will always have linear or exponential growth; whether it is the former or the latter, it depends on how the project operates.
Prior research in OpenStreetMap have already revealed a linear behavior. Neis, Zielstra, and Zipf (2013), while comparing the contributions in twelve big cities around the globe, observed a linear increase of contributing members from January 2007 and September 2012; this was discovered when they normalized the number of members with respect to the total population or the area of the city. The cities, in decreasing order of growth of contributors, were:
- Berlin (๐ฉ๐ช)
- Paris (๐ซ๐ท)
- Moscow (๐ท๐บ)
- London (๐ฌ๐ง)
- Los Angeles (๐บ๐ธ)
- Sydney (๐ฆ๐บ)
- Johannesburg (๐ฟ๐ฆ)
- Buenos Aires (๐ฆ๐ท)
- Osaka (๐ฏ๐ต)
- Istanbul (๐น๐ท)
- Seoul (๐ฐ๐ท)
- Cairo (๐ช๐ฌ)
Again, the european cities occupy the first places, and then one city from North America. A similiar order appears in the graphic below (Neis, Zielstra, and Zipf 2013), which depicts the number of objects -normalized with respect to the area- that existed in those cities.
With these antecedents, I decided to fit a linear regression3
for each country; the cumulative sum of nodes will be the output variable, and time will be the
input, along the eight years (from November 1, 2011 to October 31, 2019) worth of observations.
A tidy way to do this task is to tidyr::nest()
(encase) the observations by
country, and then execute lm()
in each nested group.
countries_model = nest(countries_month) %>%
mutate(model = map(data, ~lm(nodes_cum ~ date, data = .)),
model = map(model, ~c(coef(.), summary(.)$r.squared)),
name = list(c("intercept", "rate", "rsquared"))) %>%
unnest(cols = c(model, name)) %>%
pivot_wider(values_from = model)
head(countries_model)
## # A tibble: 6 x 6
## # Groups: continent, country [6]
## continent country data intercept rate rsquared
## <fct> <chr> <list> <dbl> <dbl> <dbl>
## 1 Africa Algeria <tibble [96 x 3]> -65540524. 4211. 0.982
## 2 Africa Angola <tibble [96 x 3]> -38861484. 2443. 0.740
## 3 Africa Benin <tibble [96 x 3]> -28410417. 1791. 0.909
## 4 Africa Botswana <tibble [96 x 3]> -56111053. 3536. 0.898
## 5 Africa Burkina Faso <tibble [96 x 3]> -46220666. 2921. 0.927
## 6 Africa Burundi <tibble [96 x 3]> -13142257. 843. 0.857
To illustrate the results, here are the regressions for the previously selected countries:
In most cases a fair fit was achieved, according to the determination coefficient R2; the mean of this coefficient across the 173 territories was 0.8957. Only in nine countries (their regressions are plotted below) the R2 coefficient was less than 0.70; probably those countries were recently recognized in our data source, or received contributions in a sudden manner.
For those territories that had a good fit, the regression's slope will be the estimated parameter that characterizes the growth of contributions; this value represents the daily4 rate of nodes creation. The intercept, on the other hand, is not relevant in this analysis. The mean slope was 9275 and the maximum was 159555 daily nodes; the latter corresponds to the United States and is an outlier.
How can we measure the accuracy of these estimates? Neis and Zipf (2012) reported that, in January 2012, the global rate of contributions was about 1.2 million nodes, 130 thousand ways and 1500 relations created every day. If we add all the slopes that we estimated, a global rate of 1.605 million daily nodes is obtained; this is greater than the rate reported in 2012, which makes sense since now there are more contributors.
To finish this post I present the choropleth maps of slopes (which we already know represent the rates of growth of nodes) and R2 coefficients, of the fitted regressions for each country.
Neis, Pascal, Dennis Zielstra, and Alexander Zipf. 2013. โComparison of Volunteered Geographic Information Data Contributions and Community Development for Selected World Regions.โ Future Internet 5 (June): 282โ300. https://doi.org/10.3390/fi5020282.
Neis, Pascal, and Alexander Zipf. 2012. โAnalyzing the Contributor Activity of a Volunteered Geographic Information Project โ the Case of Openstreetmap.โ ISPRS International Journal of Geo-Information 1 (December): 146โ65. https://doi.org/10.3390/ijgi1020146.
I simply computed the sum of created nodes, without subtracting the deleted ones. As a consequence, the rates of growth I present by the end of this post cannot be used to compute the final amount of nodes in any country.โฉ๏ธ
Bear in mind that the number of countries amptly varies among continents, from 7 in Oceania, to 50 in Africa.โฉ๏ธ
Since the initial amount of nodes is zero, it makes sense to compute a regression through the origin (one where the intercept is fixed at zero). That option was assessed too, but it did not produce better fits, compared to the typical linear regression. Seems like this is one of those cases where the simpler model is adequate.โฉ๏ธ
It is true that the data became monthly but, due to the way R handles dates, the rate will indicate the change in nodes when the date increases by one day.โฉ๏ธ