The wcde package allows for R users to easily download
data from the Wittgenstein
Centre for Demography and Human Capital Data Explorer as well as
containing a number of helpful functions for working with education
specific demographic data.
You can install the released version of wcde from CRAN with:
install.packages("wcde")
Install the developmental version with:
library(devtools)
install_github("guyabel/wcde", ref = "main")
The get_wcde() function can be used to download data
from the Wittgenstein Centre Human Capital Data Explorer. It requires
three user inputs
indicator: a short code for the indicator of
interestscenario: a number referring to a SSP narrative, by
default 2 is used (for SSP2)country_code (or country_name):
corresponding to the country of interestlibrary(wcde)
# download education specific tfr data
get_wcde(indicator = "etfr",
         country_name = c("Brazil", "Albania"))
#> # A tibble: 204 × 6
#>    scenario name    country_code education          period     etfr
#>       <dbl> <chr>          <dbl> <chr>              <chr>     <dbl>
#>  1        2 Brazil            76 No Education       2015-2020  2.47
#>  2        2 Albania            8 No Education       2015-2020  1.88
#>  3        2 Brazil            76 Incomplete Primary 2015-2020  2.47
#>  4        2 Albania            8 Incomplete Primary 2015-2020  1.88
#>  5        2 Brazil            76 Primary            2015-2020  2.47
#>  6        2 Albania            8 Primary            2015-2020  1.88
#>  7        2 Brazil            76 Lower Secondary    2015-2020  1.89
#>  8        2 Albania            8 Lower Secondary    2015-2020  1.9 
#>  9        2 Brazil            76 Upper Secondary    2015-2020  1.37
#> 10        2 Albania            8 Upper Secondary    2015-2020  1.57
#> # … with 194 more rows
# download education specific survivorship rates
get_wcde(indicator = "eassr",
         country_name = c("Niger", "Korea"))
#> # A tibble: 8,976 × 8
#>    scenario name              country_code age     sex   education  period eassr
#>       <dbl> <chr>                    <dbl> <chr>   <chr> <chr>      <chr>  <dbl>
#>  1        2 Niger                      562 Newborn Male  No Educat… 2015-…  91.6
#>  2        2 Republic of Korea          410 Newborn Male  No Educat… 2015-…  99.4
#>  3        2 Niger                      562 Newborn Male  Incomplet… 2015-…  92  
#>  4        2 Republic of Korea          410 Newborn Male  Incomplet… 2015-…  99.5
#>  5        2 Niger                      562 Newborn Male  Primary    2015-…  92.5
#>  6        2 Republic of Korea          410 Newborn Male  Primary    2015-…  99.5
#>  7        2 Niger                      562 Newborn Male  Lower Sec… 2015-…  93.4
#>  8        2 Republic of Korea          410 Newborn Male  Lower Sec… 2015-…  99.6
#>  9        2 Niger                      562 Newborn Male  Upper Sec… 2015-…  95.2
#> 10        2 Republic of Korea          410 Newborn Male  Upper Sec… 2015-…  99.7
#> # … with 8,966 more rows
The indicator input must match the short code from the indicator
table. The find_indicator() function can be used to look up
short codes (given in the first column) from the
wic_indicators data frame:
find_indicator(x = "tfr")
#> # A tibble: 2 × 3
#>   indicator description                       definition                        
#>   <chr>     <chr>                             <chr>                             
#> 1 tfr       Total Fertility Rate              "The average number of children b…
#> 2 etfr      Total Fertility Rate by Education "The average number of children b…
By default, get_wdce() returns data for all years or
available periods or years. The filter() function in dplyr can
be used to filter data for specific years or periods, for example:
library(tidyverse)
get_wcde(indicator = "e0",
         country_name = c("Japan", "Australia")) %>%
  filter(period == "2015-2020")
#> # A tibble: 4 × 6
#>   scenario name      country_code sex    period       e0
#>      <dbl> <chr>            <dbl> <chr>  <chr>     <dbl>
#> 1        2 Japan              392 Male   2015-2020  80.7
#> 2        2 Australia           36 Male   2015-2020  81.3
#> 3        2 Japan              392 Female 2015-2020  87.2
#> 4        2 Australia           36 Female 2015-2020  85
get_wcde(indicator = "sexratio",
         country_name = c("China", "South Korea")) %>%
  filter(year == 2020)
#> # A tibble: 44 × 6
#>    scenario name              country_code age     year sexratio
#>       <dbl> <chr>                    <dbl> <chr>  <dbl>    <dbl>
#>  1        2 China                      156 All     2020     1.06
#>  2        2 Republic of Korea          410 All     2020     1   
#>  3        2 China                      156 0--4    2020     1.15
#>  4        2 Republic of Korea          410 0--4    2020     1.07
#>  5        2 China                      156 5--9    2020     1.16
#>  6        2 Republic of Korea          410 5--9    2020     1.07
#>  7        2 China                      156 10--14  2020     1.17
#>  8        2 Republic of Korea          410 10--14  2020     1.07
#>  9        2 China                      156 15--19  2020     1.16
#> 10        2 Republic of Korea          410 15--19  2020     1.1 
#> # … with 34 more rows
Past data is only available for selected indicators. These can be
viewed using the past indicator column:
wic_indicators %>%
  filter(past) %>%
  select(1:2)
#> # A tibble: 28 × 2
#>    indicator description                                     
#>    <chr>     <chr>                                           
#>  1 pop       Population Size (000's)                         
#>  2 bpop      Population Size by Broad Age (000's)            
#>  3 epop      Population Size by Education (000's)            
#>  4 prop      Educational Attainment Distribution             
#>  5 bprop     Educational Attainment Distribution by Broad Age
#>  6 growth    Average Annual Growth Rate                      
#>  7 nirate    Average Annual Rate of Natural Increase         
#>  8 sexratio  Sex Ratio                                       
#>  9 mage      Population Median Age                           
#> 10 tdr       Total Dependency Ratio                          
#> # … with 18 more rows
The filter() function can also be used to filter
specific indicators to specific age, sex or education groups
get_wcde(indicator = "sexratio",
         country_name = c("China", "South Korea")) %>%
  filter(year == 2020,
         age == "All")
#> # A tibble: 2 × 6
#>   scenario name              country_code age    year sexratio
#>      <dbl> <chr>                    <dbl> <chr> <dbl>    <dbl>
#> 1        2 China                      156 All    2020     1.06
#> 2        2 Republic of Korea          410 All    2020     1
Country names are guessed using the countrycode package.
get_wcde(indicator = "tfr",
         country_name = c("U.A.E", "Espania", "Österreich"))
#> # A tibble: 90 × 5
#>    scenario name                 country_code period      tfr
#>       <dbl> <chr>                       <dbl> <chr>     <dbl>
#>  1        2 United Arab Emirates          784 1950-1955  6.97
#>  2        2 Spain                         724 1950-1955  2.53
#>  3        2 Austria                        40 1950-1955  2.1 
#>  4        2 United Arab Emirates          784 1955-1960  6.97
#>  5        2 Spain                         724 1955-1960  2.7 
#>  6        2 Austria                        40 1955-1960  2.57
#>  7        2 United Arab Emirates          784 1960-1965  6.87
#>  8        2 Spain                         724 1960-1965  2.81
#>  9        2 Austria                        40 1960-1965  2.78
#> 10        2 United Arab Emirates          784 1965-1970  6.77
#> # … with 80 more rows
The get_wcde() functions accepts ISO alpha numeric codes
for countries via the country_code argument:
get_wcde(indicator = "etfr", country_code = c(44, 100))
#> # A tibble: 204 × 6
#>    scenario name     country_code education          period     etfr
#>       <dbl> <chr>           <dbl> <chr>              <chr>     <dbl>
#>  1        2 Bahamas            44 No Education       2015-2020  2.71
#>  2        2 Bulgaria          100 No Education       2015-2020  1.72
#>  3        2 Bahamas            44 Incomplete Primary 2015-2020  2.71
#>  4        2 Bulgaria          100 Incomplete Primary 2015-2020  1.72
#>  5        2 Bahamas            44 Primary            2015-2020  2.71
#>  6        2 Bulgaria          100 Primary            2015-2020  1.72
#>  7        2 Bahamas            44 Lower Secondary    2015-2020  2.09
#>  8        2 Bulgaria          100 Lower Secondary    2015-2020  1.73
#>  9        2 Bahamas            44 Upper Secondary    2015-2020  1.76
#> 10        2 Bulgaria          100 Upper Secondary    2015-2020  1.44
#> # … with 194 more rows
A full list of available countries and region aggregates, and their
codes, can be found in the wic_locations data frame.
wic_locations
#> # A tibble: 230 × 5
#>    name                            isono continent region             dim    
#>    <chr>                           <dbl> <chr>     <chr>              <chr>  
#>  1 World                             900 <NA>      <NA>               area   
#>  2 Africa                            903 <NA>      <NA>               area   
#>  3 Asia                              935 <NA>      <NA>               area   
#>  4 Europe                            908 <NA>      <NA>               area   
#>  5 Latin America and the Caribbean   904 <NA>      <NA>               area   
#>  6 Northern America                  905 <NA>      <NA>               area   
#>  7 Oceania                           909 <NA>      <NA>               area   
#>  8 Afghanistan                         4 Asia      South-Central Asia country
#>  9 Albania                             8 Europe    Southern Europe    country
#> 10 Algeria                            12 Africa    Northern Africa    country
#> # … with 220 more rows
By default get_wcde() returns data for Medium (SSP2)
scenario. Results for different SSP scenarios can be returned by passing
a different (or multiple) scenario values to the scenario
argument in get_data().
get_wcde(indicator = "growth",
         country_name = c("India", "China"),
         scenario = c(1:3, 21, 22)) %>%
  filter(period == "2095-2100")
#> # A tibble: 10 × 5
#>    scenario name  country_code period    growth
#>       <dbl> <chr>        <dbl> <chr>      <dbl>
#>  1        1 India          356 2095-2100   -0.7
#>  2        1 China          156 2095-2100   -1.1
#>  3        2 India          356 2095-2100   -0.5
#>  4        2 China          156 2095-2100   -1  
#>  5        3 India          356 2095-2100    0.2
#>  6        3 China          156 2095-2100   -0.2
#>  7       21 India          356 2095-2100   -0.5
#>  8       21 China          156 2095-2100   -0.9
#>  9       22 India          356 2095-2100   -0.5
#> 10       22 China          156 2095-2100   -1
Set include_scenario_names = TRUE to include a columns
with the full names of the scenarios
get_wcde(indicator = "tfr",
         country_name = c("Kenya", "Nigeria", "Algeria"),
         scenario = 1:3,
         include_scenario_names = TRUE) %>%
  filter(period == "2045-2050")
#> # A tibble: 9 × 7
#>   scenario scenario_name            scenario_abb name  country_code period   tfr
#>      <dbl> <chr>                    <chr>        <chr>        <dbl> <chr>  <dbl>
#> 1        1 Rapid Development (SSP1) SSP1         Kenya          404 2045-…  1.62
#> 2        1 Rapid Development (SSP1) SSP1         Nige…          566 2045-…  2.29
#> 3        1 Rapid Development (SSP1) SSP1         Alge…           12 2045-…  1.53
#> 4        2 Medium (SSP2)            SSP2         Kenya          404 2045-…  2.36
#> 5        2 Medium (SSP2)            SSP2         Nige…          566 2045-…  3.37
#> 6        2 Medium (SSP2)            SSP2         Alge…           12 2045-…  1.77
#> 7        3 Stalled Development (SS… SSP3         Kenya          404 2045-…  3.33
#> 8        3 Stalled Development (SS… SSP3         Nige…          566 2045-…  4.65
#> 9        3 Stalled Development (SS… SSP3         Alge…           12 2045-…  2.41
Additional details of the pathways for each scenario numeric code can
be found in the wic_scenarios object. Further background
and links to the corresponding literature are provided in the Data
Explorer
wic_scenarios
#> # A tibble: 5 × 3
#>   scenario_name                         scenario scenario_abb
#>   <chr>                                    <dbl> <chr>       
#> 1 Rapid Development (SSP1)                     1 SSP1        
#> 2 Medium (SSP2)                                2 SSP2        
#> 3 Stalled Development (SSP3)                   3 SSP3        
#> 4 Medium - Zero Migration (SSP2 - ZM)         21 SSP2ZM      
#> 5 Medium - Double Migration (SSP2 - DM)       22 SSP2DM
Data for all countries can be obtained by not setting
country_name or country_code
get_wcde(indicator = "mage")
#> # A tibble: 7,099 × 5
#>    scenario name                     country_code  year  mage
#>       <dbl> <chr>                           <dbl> <dbl> <dbl>
#>  1        2 Bulgaria                          100  1950  27.3
#>  2        2 Myanmar                           104  1950  22.8
#>  3        2 Burundi                           108  1950  19.5
#>  4        2 Belarus                           112  1950  27.2
#>  5        2 Cambodia                          116  1950  18.7
#>  6        2 Algeria                            12  1950  19.4
#>  7        2 Cameroon                          120  1950  20.8
#>  8        2 Canada                            124  1950  27.7
#>  9        2 Cape Verde                        132  1950  23  
#> 10        2 Central African Republic          140  1950  22.5
#> # … with 7,089 more rows
The get_wdce() function needs to be called multiple
times to download multiple indicators. This can be done using the
map() function in purrr
mi <- tibble(ind = c("odr", "nirate", "ggapedu25")) %>%
  mutate(d = map(.x = ind, .f = ~get_wcde(indicator = .x)))
mi
#> # A tibble: 3 × 2
#>   ind       d                    
#>   <chr>     <list>               
#> 1 odr       <tibble [7,099 × 5]> 
#> 2 nirate    <tibble [6,870 × 5]> 
#> 3 ggapedu25 <tibble [41,346 × 6]>
mi %>%
  filter(ind == "odr") %>%
  select(-ind) %>%
  unnest(cols = d)
#> # A tibble: 7,099 × 5
#>    scenario name                     country_code  year   odr
#>       <dbl> <chr>                           <dbl> <dbl> <dbl>
#>  1        2 Bulgaria                          100  1950  0.1 
#>  2        2 Myanmar                           104  1950  0.05
#>  3        2 Burundi                           108  1950  0.06
#>  4        2 Belarus                           112  1950  0.13
#>  5        2 Cambodia                          116  1950  0.05
#>  6        2 Algeria                            12  1950  0.06
#>  7        2 Cameroon                          120  1950  0.06
#>  8        2 Canada                            124  1950  0.12
#>  9        2 Cape Verde                        132  1950  0.13
#> 10        2 Central African Republic          140  1950  0.09
#> # … with 7,089 more rows
mi %>%
  filter(ind == "nirate") %>%
  select(-ind) %>%
  unnest(cols = d)
#> # A tibble: 6,870 × 5
#>    scenario name                     country_code period    nirate
#>       <dbl> <chr>                           <dbl> <chr>      <dbl>
#>  1        2 Bulgaria                          100 1950-1955   11.1
#>  2        2 Myanmar                           104 1950-1955   19.1
#>  3        2 Burundi                           108 1950-1955   24.1
#>  4        2 Belarus                           112 1950-1955   10.1
#>  5        2 Cambodia                          116 1950-1955   25.9
#>  6        2 Algeria                            12 1950-1955   27.1
#>  7        2 Cameroon                          120 1950-1955   17.6
#>  8        2 Canada                            124 1950-1955   18.9
#>  9        2 Cape Verde                        132 1950-1955   26.9
#> 10        2 Central African Republic          140 1950-1955   10.7
#> # … with 6,860 more rows
mi %>%
  filter(ind == "ggapedu25") %>%
  select(-ind) %>%
  unnest(cols = d)
#> # A tibble: 41,346 × 6
#>    scenario name                     country_code  year education    ggapedu25
#>       <dbl> <chr>                           <dbl> <dbl> <chr>            <dbl>
#>  1        2 Bulgaria                          100  1950 No Education       -20
#>  2        2 Myanmar                           104  1950 No Education       -13
#>  3        2 Burundi                           108  1950 No Education        -6
#>  4        2 Belarus                           112  1950 No Education       -10
#>  5        2 Cambodia                          116  1950 No Education       -21
#>  6        2 Algeria                            12  1950 No Education        -2
#>  7        2 Cameroon                          120  1950 No Education       -13
#>  8        2 Canada                            124  1950 No Education        -2
#>  9        2 Cape Verde                        132  1950 No Education        -9
#> 10        2 Central African Republic          140  1950 No Education        -1
#> # … with 41,336 more rows
Population data for a range of age-sex-educational attainment
combinations can be obtained by setting indicator = "pop"
in get_wcde() and specifying a pop_age,
pop_sex and pop_edu arguments. By default each
of the three population breakdown arguments are set to “total”
get_wcde(indicator = "pop", country_name = "India")
#> # A tibble: 31 × 5
#>    scenario name  country_code  year     pop
#>       <dbl> <chr>        <dbl> <dbl>   <dbl>
#>  1        2 India          356  1950 376325.
#>  2        2 India          356  1955 409276.
#>  3        2 India          356  1960 449604.
#>  4        2 India          356  1965 497830.
#>  5        2 India          356  1970 553787.
#>  6        2 India          356  1975 621525.
#>  7        2 India          356  1980 697040.
#>  8        2 India          356  1985 781904.
#>  9        2 India          356  1990 870422.
#> 10        2 India          356  1995 960733.
#> # … with 21 more rows
The pop_age argument can be set to all to
get population data broken down in five-year age groups. The
pop_sex argument can be set to both to get
population data broken down into female and male groups. The
pop_edu argument can be set to four,
six or eight to get population data broken
down into education categorizations with different levels of detail.
get_wcde(indicator = "pop", country_code = 900, pop_edu = "four")
#> # A tibble: 155 × 6
#>    scenario name  country_code  year education          pop
#>       <dbl> <fct>        <dbl> <dbl> <fct>            <dbl>
#>  1        2 World          900  1950 Under 15       868844.
#>  2        2 World          900  1950 No Education   763612.
#>  3        2 World          900  1950 Primary        549510.
#>  4        2 World          900  1950 Secondary      329182.
#>  5        2 World          900  1950 Post Secondary  30143.
#>  6        2 World          900  1955 Under 15       984764.
#>  7        2 World          900  1955 No Education   762022.
#>  8        2 World          900  1955 Primary        600299.
#>  9        2 World          900  1955 Secondary      392261.
#> 10        2 World          900  1955 Post Secondary  38199.
#> # … with 145 more rows
The population breakdown arguments can be used in combination to provide further breakdowns, for example sex and education specific population totals
get_wcde(indicator = "pop", country_code = 900, pop_edu = "six", pop_sex = "both")
#> # A tibble: 434 × 7
#>    scenario name  country_code  year sex    education              pop
#>       <dbl> <fct>        <dbl> <dbl> <fct>  <fct>                <dbl>
#>  1        2 World          900  1950 Male   Under 15           443968.
#>  2        2 World          900  1950 Male   No Education       317636.
#>  3        2 World          900  1950 Male   Incomplete Primary 116692.
#>  4        2 World          900  1950 Male   Primary            194902 
#>  5        2 World          900  1950 Male   Lower Secondary    104160 
#>  6        2 World          900  1950 Male   Upper Secondary     69384.
#>  7        2 World          900  1950 Male   Post Secondary      21102.
#>  8        2 World          900  1950 Female Under 15           424877.
#>  9        2 World          900  1950 Female No Education       445976.
#> 10        2 World          900  1950 Female Incomplete Primary  81231.
#> # … with 424 more rows
The full age-sex-education specific data can also be obtained by
setting indicator = "epop" in get_wcde().
Create population pyramids by setting male population values to negative equivalent to allow for divergent columns from the y axis.
w <- get_wcde(indicator = "pop", country_code = 900,
              pop_age = "all", pop_sex = "both", pop_edu = "four")
w
#> # A tibble: 6,510 × 8
#>    scenario name  country_code  year age   sex    education          pop
#>       <dbl> <fct>        <dbl> <dbl> <fct> <fct>  <fct>            <dbl>
#>  1        2 World          900  1950 0--4  Male   Under 15       172362.
#>  2        2 World          900  1950 0--4  Male   No Education        0 
#>  3        2 World          900  1950 0--4  Male   Primary             0 
#>  4        2 World          900  1950 0--4  Male   Secondary           0 
#>  5        2 World          900  1950 0--4  Male   Post Secondary      0 
#>  6        2 World          900  1950 0--4  Female Under 15       166026.
#>  7        2 World          900  1950 0--4  Female No Education        0 
#>  8        2 World          900  1950 0--4  Female Primary             0 
#>  9        2 World          900  1950 0--4  Female Secondary           0 
#> 10        2 World          900  1950 0--4  Female Post Secondary      0 
#> # … with 6,500 more rows
w <- w %>%
  mutate(pop_pm = ifelse(test = sex == "Male", yes = -pop, no = pop),
         pop_pm = pop_pm/1e3)
w
#> # A tibble: 6,510 × 9
#>    scenario name  country_code  year age   sex    education          pop pop_pm
#>       <dbl> <fct>        <dbl> <dbl> <fct> <fct>  <fct>            <dbl>  <dbl>
#>  1        2 World          900  1950 0--4  Male   Under 15       172362.  -172.
#>  2        2 World          900  1950 0--4  Male   No Education        0      0 
#>  3        2 World          900  1950 0--4  Male   Primary             0      0 
#>  4        2 World          900  1950 0--4  Male   Secondary           0      0 
#>  5        2 World          900  1950 0--4  Male   Post Secondary      0      0 
#>  6        2 World          900  1950 0--4  Female Under 15       166026.   166.
#>  7        2 World          900  1950 0--4  Female No Education        0      0 
#>  8        2 World          900  1950 0--4  Female Primary             0      0 
#>  9        2 World          900  1950 0--4  Female Secondary           0      0 
#> 10        2 World          900  1950 0--4  Female Post Secondary      0      0 
#> # … with 6,500 more rows
Use standard ggplot code to create population pyramid with
scale_x_symmetric() from the lemon
package to allow for equal male and female x-axiswic_col4 object in the wcde
package which contains the names of the colours used in the Wittgenstein
Centre Human Capital Data Explorer Data Explorer.Note wic_col6 and wic_col8 objects also
exist for equivalent plots of population data objects with corresponding
numbers of categories of education.
library(lemon)
w %>%
  filter(year == 2020) %>%
  ggplot(mapping = aes(x = pop_pm, y = age, fill = fct_rev(education))) +
  geom_col() +
  geom_vline(xintercept = 0, colour = "black") +
  scale_x_symmetric(labels = abs) +
  scale_fill_manual(values = wic_col4, name = "Education") +
  labs(x = "Population (millions)", y = "Age") +
  theme_bw()
Add male and female labels on the x-axis by
geom_blank() to allow for equal x-axis and
additional space at the end of largest columns.w <- w %>%
  mutate(pop_max = ifelse(sex == "Male", -max(pop/1e3), max(pop/1e3)))
w %>%
  filter(year == 2020) %>%
  ggplot(mapping = aes(x = pop_pm, y = age, fill = fct_rev(education))) +
  geom_col() +
  geom_vline(xintercept = 0, colour = "black") +
  scale_x_continuous(labels = abs, expand = c(0, 0)) +
  scale_fill_manual(values = wic_col4, name = "Education") +
  labs(x = "Population (millions)", y = "Age") +
  facet_wrap(facets = "sex", scales = "free_x", strip.position = "bottom") +
  geom_blank(mapping = aes(x = pop_max * 1.1)) +
  theme(panel.spacing.x = unit(0, "pt"),
        strip.placement = "outside",
        strip.background = element_rect(fill = "transparent"),
        strip.text.x = element_text(margin = margin( b = 0, t = 0)))
Animate the pyramid through the past data and projection periods
using the transition_time() function in the gganimate
package
library(gganimate)
ggplot(data = w,
       mapping = aes(x = pop_pm, y = age, fill = fct_rev(education))) +
  geom_col() +
  geom_vline(xintercept = 0, colour = "black") +
  scale_x_continuous(labels = abs, expand = c(0, 0)) +
  scale_fill_manual(values = wic_col4, name = "Education") +
  facet_wrap(facets = "sex", scales = "free_x", strip.position = "bottom") +
  geom_blank(mapping = aes(x = pop_max * 1.1)) +
  theme(panel.spacing.x = unit(0, "pt"),
        strip.placement = "outside",
        strip.background = element_rect(fill = "transparent"),
        strip.text.x = element_text(margin = margin(b = 0, t = 0))) +
  transition_time(time = year) +
  labs(x = "Population (millions)", y = "Age",
       title = 'SSP2 World Population {round(frame_time)}')