Weekly OSM: updates of OSMstats data
When I started this blog, I decided to do so by sharing my first data analysis project: obtaining information about OpenStreetMap (OSM) by web scraping the OSMstats website, developed by Pascal Neis. Since then, while learning more about R and how to code properly, I have discovered new methods that allow me to replicate that project in a more concise and reproducible manner.
That’s why I am currently developing a GitHub repository with that very purpose: to maintain a database of summarized information from OSMstats. Data will be obtained every week and will reflect the activity recorded in all available countries. Before explaining the procedure, I must mention that OpenStreetMap, as well as anything derived from it, is open data: it belongs to its contributors (© OpenStreetMap contributors) and is distributed under the ODC Open Database License. The database I generate will therefore carry the same license.
In a nutshell, the procedure is:
- Create the repository in GitHub and git clone it
- Turn it into an RStudio project, with a renv private environment
- Program 😼
- Create the database, with data from the beginning of OSMstats up to today
- Update the repository and the database via git push
- Set up the mechanism for the repository to keep updating itself from then on
Regarding the first step, the repository Weekly OSM is already online; I chose that name to reflect the fact that it will be updated once per week. Such a frequency has two advantages: it reduces the granularity of the observations, and it makes it feasible to keep a cache of the required packages (the GitHub Actions service keeps caches for one week at most).
After cloning the repository, the second step was to launch RStudio and associate the Weekly-OSM
directory with a new project. Once this was done, the first commands I executed in the project were:
install.packages("renv")
renv::init(bare = TRUE)
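The bare = TRUE argument, as far as I understand it, tells renv to set up the project infrastructure without trying to discover and install dependencies from existing code, which suits a repository that is still empty.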
The purpose of renv
is to generate a reproducible
working environment. What this means is that, during development, my project will have
its own R package library, isolated from other projects that I may develop on the same
computer. It also means that I can easily reproduce this library on another computer,
which is crucial, since it allows the project's scripts to be executed as
part of a GitHub Actions workflow.
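Just as an illustration of that isolation (this check is not part of the Weekly OSM scripts), one can run the following from the project console:
# illustrative check from within the project; not part of the repository
.libPaths()     # the first entry should point to the project-local renv library
renv::status()  # compares the project library with renv.lock, once a snapshot exists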
The third step is, precisely, to write those scripts. In this post I will not delve into how Weekly OSM works; that information is in the README and in each of the scripts, which are thoroughly commented. However, I will mention that, after programming, one should run renv::snapshot() to write renv.lock, a file registering all the packages that have been used (dplyr and rvest in my case) together with their dependencies. The records it writes look like this:
"magrittr": {
"Package": "magrittr",
"Version": "2.0.1",
"Source": "Repository",
"Repository": "CRAN",
"Hash": "41287f1ac7d28a92f0a286ed507928d3"
}
To accomplish the fourth step I wrote the main_yearly.R script; when executed, it obtains from OSMstats all the observations of a given year, that is, of every date and every country that had activity during those dates. Then it computes a weekly summary of the observations, applying functions such as sum() and max(), and writes the results to .csv files; hence, there are 52 or 53 files per year. Once I had run the script for each year from 2011 to 2021, I added all those files to the online repository, and so accomplished the fifth step.
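To give a rough idea of what that summary step looks like, here is a minimal dplyr sketch; the column names and the fake daily data are my own illustration and do not come from the repository's code:
library(dplyr)

# hypothetical daily observations: one row per country and date
daily <- tibble(
  country = rep(c("Bolivia", "Chile"), each = 14),
  date    = rep(seq(as.Date("2021-01-04"), by = "day", length.out = 14), times = 2),
  edits   = sample(0:500, 28, replace = TRUE)
)

# weekly summary per country, applying sum() and max() as described above
weekly <- daily %>%
  mutate(week = as.character(cut(date, breaks = "week"))) %>%
  group_by(week, country) %>%
  summarise(total_edits = sum(edits), max_edits = max(edits), .groups = "drop")

# one .csv file per week, mirroring the 52 or 53 files written per year
for (w in unique(weekly$week)) {
  write.csv(filter(weekly, week == w), paste0(w, ".csv"), row.names = FALSE)
}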
Finally, the sixth step consisted in writing main.yaml, a workflow that runs approximately every Monday at 03:13 UTC, through GitHub Actions. I won’t go into the workflow syntax, but I do want to show some of its steps; for instance, the one below creates a cache with the installed packages. While the first workflow run took several minutes, thanks to the cache subsequent runs completed in less than a minute. Furthermore, as long as renv.lock doesn’t change, the cache remains valid.
# create/restore cache
- name: Cache packages
  uses: actions/cache@v2
  with:
    path: ${{ env.RENV_PATHS_ROOT }}
    key: renv-${{ hashFiles('renv.lock') }}
In the next step the renv::restore() function is executed, which installs (or, if a cache already exists, restores) the packages registered in renv.lock. And the last step shown here executes main.R, a script containing all the operations that need to be performed in the repository every week.
# install/restore packages
- name: Restore packages
  run: Rscript -e 'renv::restore()'

# source main script
- name: Obtain weekly data
  run: Rscript R/main.R
That’s about everything I can say about Weekly OSM so far. Since I added main.yaml, the database has been updated correctly; now I just need to keep in mind that the workflow has to be renewed every now and then, because otherwise GitHub Actions will disable it 😹. This post will have a continuation, in which I will show how to make use of the database.