District zone IDs in municipal overnight stays data

tabular data
incorrect data
zone IDs mismatch
importance: medium
Published

July 16, 2024

Modified

August 17, 2024


Status: ⚠️ active

Importance: 2 - medium

Note: According to the official methodology, this is by design. However, it still needs to be handled in workflows within the {spanishoddata} R package.

Summary: The zone IDs in the municipal overnight stays data are not consistent with the zone IDs in the municipal boundaries data. The residential location column zona_residencia contains a mix of district and municipal IDs, whereas the municipal boundaries data contains municipal IDs only.

Expected Results: The zone IDs in the municipal overnight stays data should be consistent with the zone IDs in the municipal boundaries data (unless the use of district IDs in the residential location column zona_residencia is intentional).



Steps to Reproduce

  1. Load Data

Load libraries and define the paths to the data files.

library(tidyverse)
library(sf)
library(here)
library(DT)


overnight_stays_file <- here("data/raw_data/v2/estudios_basicos/por-municipios/pernoctaciones/ficheros-diarios/2022-01/20220101_Pernoctaciones_municipios.csv.gz")
municipal_boundaries_data_file <- here("data/raw_data/v2/zonificacion/zonificacion_municipios/zonificacion_municipios.shp")
district_boundaries_data_file <- here("data/raw_data/v2/zonificacion/zonificacion_distritos/zonificacion_distritos.shp")

Load the data.

overnight_stays <- readr::read_delim(overnight_stays_file, delim = "|", show_col_types = FALSE, name_repair = "unique_quiet")
# municipal_boundaries <- read_sf(municipal_boundaries_data_file) |> filter(!grepl("PT|FR|externo", ID))
# district_boundaries <- read_sf(district_boundaries_data_file) |> filter(!grepl("PT|FR|externo", ID))
municipal_boundaries <- read_sf(municipal_boundaries_data_file)
district_boundaries <- read_sf(district_boundaries_data_file)
glimpse(overnight_stays)
Rows: 515,748
Columns: 4
$ fecha             <dbl> 20220101, 20220101, 20220101, 20220101, 20220101, 20…
$ zona_residencia   <chr> "01001", "01001", "01001", "01001", "01001", "01001"…
$ zona_pernoctacion <chr> "01001", "01017_AM", "01047_AM", "01051", "01059", "…
$ personas          <dbl> 2447.613, 9.000, 2.514, 5.780, 181.970, 3.266, 3.266…
glimpse(municipal_boundaries)
Rows: 2,735
Columns: 2
$ ID       <chr> "01001", "01002", "01004_AM", "01009_AM", "01010", "01017_AM"…
$ geometry <MULTIPOLYGON [m]> MULTIPOLYGON (((537856.7 47..., MULTIPOLYGON (((…
glimpse(district_boundaries)
Rows: 3,909
Columns: 2
$ ID       <chr> "01001", "01002", "01004_AM", "01009_AM", "01010", "01017_AM"…
$ geometry <MULTIPOLYGON [m]> MULTIPOLYGON (((538090.2 47..., MULTIPOLYGON (((…

Results

  1. Not all residence location IDs in the municipal level overnight stays dataset can be found in the municipal boundaries dataset.

Residence locations (zona_residencia) in the municipal level overnight stays dataset use district IDs in addition to municipal IDs, so the municipal boundaries dataset alone is not enough to match every residence location. The count below is the number of unique zona_residencia IDs that are missing from the municipal boundaries.

sum(!unique(overnight_stays$zona_residencia) %in% unique(municipal_boundaries$ID))
[1] 1596

Meanwhile, all residence location (zona_residencia) IDs in the municipal level overnight stays can be found in the district boundaries dataset.

sum(!unique(overnight_stays$zona_residencia) %in% unique(district_boundaries$ID))
[1] 0

Only municipal IDs are used in the zona_pernoctacion column of the overnight stays dataset.

sum(!unique(overnight_stays$zona_pernoctacion) %in% unique(municipal_boundaries$ID))
[1] 0
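
The three checks above follow the same pattern, so they can be gathered into one small coverage table. The following is a minimal sketch in base R, not part of the original report or of {spanishoddata}: `count_unmatched()` is a hypothetical helper, and the toy data frame and ID vectors are made-up stand-ins for `overnight_stays`, `municipal_boundaries$ID` and `district_boundaries$ID` (the 7-character IDs only illustrate what a district-level ID might look like).

```r
# Hypothetical helper: number of unique IDs in `ids` absent from `reference_ids`.
count_unmatched <- function(ids, reference_ids) {
  sum(!unique(ids) %in% unique(reference_ids))
}

# Toy stand-ins for the real data; all IDs are illustrative assumptions.
toy_stays <- data.frame(
  zona_residencia   = c("01001", "0105901", "0105902", "01051"),
  zona_pernoctacion = c("01001", "01059", "01059", "01051"),
  stringsAsFactors  = FALSE
)
toy_municipal_ids <- c("01001", "01051", "01059")
toy_district_ids  <- c("01001", "01051", "0105901", "0105902")

# One row per (column, boundaries dataset) combination.
coverage <- data.frame(
  column     = rep(c("zona_residencia", "zona_pernoctacion"), each = 2),
  boundaries = rep(c("municipal", "district"), times = 2),
  stringsAsFactors = FALSE
)

# Fill in the number of unmatched unique IDs for each combination.
coverage$n_unmatched <- mapply(
  function(col, b) {
    ref <- if (b == "municipal") toy_municipal_ids else toy_district_ids
    count_unmatched(toy_stays[[col]], ref)
  },
  coverage$column, coverage$boundaries,
  USE.NAMES = FALSE
)

print(coverage)
```

With the real objects, the toy inputs would be replaced by the corresponding columns of `overnight_stays` and the `ID` columns of the two boundaries datasets. A zero in the `n_unmatched` column means that boundaries dataset fully covers the column.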

  2. Sample of rows whose residence location IDs come from the district breakdown in the municipal level overnight stays dataset.

DT::datatable(overnight_stays |> filter(!zona_residencia %in% unique(municipal_boundaries$ID)) |> sample_n(100))
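
The same filter can be extended to label every residence ID by where it is found, which may help when deciding which boundaries to join against. This is a hedged sketch using {dplyr} on made-up data: the `id_type` column and its labels are assumptions introduced here, and the toy IDs are illustrative rather than taken from the real files.

```r
library(dplyr)

# Toy stand-ins for municipal_boundaries$ID and district_boundaries$ID
# (illustrative IDs only; the 7-character ones are hypothetical district IDs).
municipal_ids <- c("01001", "01051", "01059")
district_ids  <- c("01001", "01051", "0105901", "0105902")

# Toy stand-in for overnight_stays.
stays <- data.frame(
  zona_residencia = c("01001", "0105901", "0105902", "01051", "99999"),
  personas        = c(10, 5, 7, 3, 1),
  stringsAsFactors = FALSE
)

# Label each unique residence ID, checking municipal boundaries first.
id_summary <- stays |>
  distinct(zona_residencia) |>
  mutate(
    id_type = case_when(
      zona_residencia %in% municipal_ids ~ "municipal",
      zona_residencia %in% district_ids  ~ "district_only",
      TRUE                               ~ "unmatched"
    )
  ) |>
  count(id_type)

print(id_summary)
```

On the real data, the results reported above imply that no ID would be labelled "unmatched", since every zona_residencia ID appears in the district boundaries.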