gutenbergr: Search and download public domain texts from Project Gutenberg

David Robinson, Myfanwy Johnston

The gutenbergr package helps you download and process public domain works from the Project Gutenberg collection. It includes tools for downloading books (and stripping header/footer information) and a complete dataset of Project Gutenberg metadata that can be used to find works of interest.

Project Gutenberg Metadata

This package contains metadata for all Project Gutenberg works as R datasets, so that you can search and filter for particular works before downloading.

The dataset gutenberg_metadata contains information about each work, pairing each Gutenberg ID with its title, author, language, and other fields:

library(gutenbergr)
library(dplyr)
gutenberg_metadata
#> # A tibble: 69,199 × 8
#>    gutenberg_id title              author guten…¹ langu…² guten…³ rights has_t…⁴
#>           <int> <chr>              <chr>    <int> <chr>   <chr>   <chr>  <lgl>  
#>  1            1 "The Declaration … Jeffe…    1638 en      Politi… Publi… TRUE   
#>  2            2 "The United State… Unite…       1 en      Politi… Publi… TRUE   
#>  3            3 "John F. Kennedy'… Kenne…    1666 en      <NA>    Publi… TRUE   
#>  4            4 "Lincoln's Gettys… Linco…       3 en      US Civ… Publi… TRUE   
#>  5            5 "The United State… Unite…       1 en      United… Publi… TRUE   
#>  6            6 "Give Me Liberty … Henry…       4 en      Americ… Publi… TRUE   
#>  7            7 "The Mayflower Co… <NA>        NA en      <NA>    Publi… TRUE   
#>  8            8 "Abraham Lincoln'… Linco…       3 en      US Civ… Publi… TRUE   
#>  9            9 "Abraham Lincoln'… Linco…       3 en      US Civ… Publi… TRUE   
#> 10           10 "The King James V… <NA>        NA en      Banned… Publi… TRUE   
#> # … with 69,189 more rows, and abbreviated variable names ¹​gutenberg_author_id,
#> #   ²​language, ³​gutenberg_bookshelf, ⁴​has_text

For example, you could find the Gutenberg ID(s) of Jane Austen’s Persuasion by doing:


gutenberg_metadata %>%
  filter(title == "Persuasion")
#> # A tibble: 3 × 8
#>   gutenberg_id title      author       gutenber…¹ langu…² guten…³ rights has_t…⁴
#>          <int> <chr>      <chr>             <int> <chr>   <chr>   <chr>  <lgl>  
#> 1          105 Persuasion Austen, Jane         68 en      <NA>    Publi… TRUE   
#> 2        22963 Persuasion Austen, Jane         68 en      <NA>    Publi… FALSE  
#> 3        36777 Persuasion Austen, Jane         68 fr      FR Lit… Publi… TRUE   
#> # … with abbreviated variable names ¹​gutenberg_author_id, ²​language,
#> #   ³​gutenberg_bookshelf, ⁴​has_text

In many analyses, you may want to filter just for English works, avoid duplicates, and include only books that have text that can be downloaded. The gutenberg_works() function does this pre-filtering:

gutenberg_works()
#> # A tibble: 53,840 × 8
#>    gutenberg_id title              author guten…¹ langu…² guten…³ rights has_t…⁴
#>           <int> <chr>              <chr>    <int> <chr>   <chr>   <chr>  <lgl>  
#>  1            1 "The Declaration … Jeffe…    1638 en      Politi… Publi… TRUE   
#>  2            2 "The United State… Unite…       1 en      Politi… Publi… TRUE   
#>  3            3 "John F. Kennedy'… Kenne…    1666 en      <NA>    Publi… TRUE   
#>  4            4 "Lincoln's Gettys… Linco…       3 en      US Civ… Publi… TRUE   
#>  5            5 "The United State… Unite…       1 en      United… Publi… TRUE   
#>  6            6 "Give Me Liberty … Henry…       4 en      Americ… Publi… TRUE   
#>  7            7 "The Mayflower Co… <NA>        NA en      <NA>    Publi… TRUE   
#>  8            8 "Abraham Lincoln'… Linco…       3 en      US Civ… Publi… TRUE   
#>  9            9 "Abraham Lincoln'… Linco…       3 en      US Civ… Publi… TRUE   
#> 10           10 "The King James V… <NA>        NA en      Banned… Publi… TRUE   
#> # … with 53,830 more rows, and abbreviated variable names ¹​gutenberg_author_id,
#> #   ²​language, ³​gutenberg_bookshelf, ⁴​has_text
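
To make the pre-filtering concrete, here is a rough manual approximation using dplyr on gutenberg_metadata. It is only a sketch: the name manual_works is made up for illustration, and the real gutenberg_works() may apply additional conditions (for example on the rights column), so the row counts are not expected to match exactly.

```r
library(gutenbergr)
library(dplyr)

# Approximate the default pre-filtering by hand (illustrative only):
# keep English works that have downloadable text, then drop duplicate
# title/author combinations.
manual_works <- gutenberg_metadata %>%
  filter(language == "en", has_text) %>%
  distinct(title, gutenberg_author_id, .keep_all = TRUE)

nrow(manual_works)
```

In practice, calling gutenberg_works() directly is shorter and keeps the filtering consistent with the package defaults.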

It also allows you to perform filtering as an argument:

gutenberg_works(author == "Austen, Jane")
#> # A tibble: 10 × 8
#>    gutenberg_id title              author guten…¹ langu…² guten…³ rights has_t…⁴
#>           <int> <chr>              <chr>    <int> <chr>   <chr>   <chr>  <lgl>  
#>  1          105 "Persuasion"       Auste…      68 en      <NA>    Publi… TRUE   
#>  2          121 "Northanger Abbey" Auste…      68 en      Gothic… Publi… TRUE   
#>  3          141 "Mansfield Park"   Auste…      68 en      <NA>    Publi… TRUE   
#>  4          158 "Emma"             Auste…      68 en      <NA>    Publi… TRUE   
#>  5          161 "Sense and Sensib… Auste…      68 en      <NA>    Publi… TRUE   
#>  6          946 "Lady Susan"       Auste…      68 en      <NA>    Publi… TRUE   
#>  7         1212 "Love and Freinds… Auste…      68 en      <NA>    Publi… TRUE   
#>  8         1342 "Pride and Prejud… Auste…      68 en      Best B… Publi… TRUE   
#>  9        31100 "The Complete Pro… Auste…      68 en      <NA>    Publi… TRUE   
#> 10        42078 "The Letters of J… Auste…      68 en      <NA>    Publi… TRUE   
#> # … with abbreviated variable names ¹​gutenberg_author_id, ²​language,
#> #   ³​gutenberg_bookshelf, ⁴​has_text

# or with a regular expression

library(stringr)
gutenberg_works(str_detect(author, "Austen"))
#> # A tibble: 17 × 8
#>    gutenberg_id title              author guten…¹ langu…² guten…³ rights has_t…⁴
#>           <int> <chr>              <chr>    <int> <chr>   <chr>   <chr>  <lgl>  
#>  1          105 "Persuasion"       Auste…      68 en      <NA>    Publi… TRUE   
#>  2          121 "Northanger Abbey" Auste…      68 en      Gothic… Publi… TRUE   
#>  3          141 "Mansfield Park"   Auste…      68 en      <NA>    Publi… TRUE   
#>  4          158 "Emma"             Auste…      68 en      <NA>    Publi… TRUE   
#>  5          161 "Sense and Sensib… Auste…      68 en      <NA>    Publi… TRUE   
#>  6          946 "Lady Susan"       Auste…      68 en      <NA>    Publi… TRUE   
#>  7         1212 "Love and Freinds… Auste…      68 en      <NA>    Publi… TRUE   
#>  8         1342 "Pride and Prejud… Auste…      68 en      Best B… Publi… TRUE   
#>  9        17797 "Memoir of Jane A… Auste…    7603 en      <NA>    Publi… TRUE   
#> 10        31100 "The Complete Pro… Auste…      68 en      <NA>    Publi… TRUE   
#> 11        33513 "The Frightened P… Auste…   36446 en      <NA>    Publi… TRUE   
#> 12        39897 "Discoveries Amon… Layar…   40288 en      <NA>    Publi… TRUE   
#> 13        42078 "The Letters of J… Auste…      68 en      <NA>    Publi… TRUE   
#> 14        54010 "The Younger Sist… Hubba…   47662 en      <NA>    Publi… TRUE   
#> 15        54011 "The Younger Sist… Hubba…   47662 en      <NA>    Publi… TRUE   
#> 16        54012 "The Younger Sist… Hubba…   47662 en      <NA>    Publi… TRUE   
#> 17        54066 "The Younger Sist… Hubba…   47662 en      <NA>    Publi… TRUE   
#> # … with abbreviated variable names ¹​gutenberg_author_id, ²​language,
#> #   ³​gutenberg_bookshelf, ⁴​has_text
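
Filter conditions can also be combined with the other arguments of gutenberg_works(). The sketch below assumes the languages argument, which selects works by language code in place of the English default; check ?gutenberg_works for the exact arguments in your installed version. The name french_austen is illustrative.

```r
library(gutenbergr)
library(stringr)

# Works whose author field mentions "Austen", restricted to French
# instead of the default English (assumes the `languages` argument).
french_austen <- gutenberg_works(
  str_detect(author, "Austen"),
  languages = "fr"
)

french_austen
```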

The metadata currently in the package was last updated on 04 November 2022.

Downloading books by ID

The function gutenberg_download() downloads one or more works from Project Gutenberg based on their ID. For example, we saw earlier that one version of Persuasion has ID 105, so gutenberg_download(105) downloads this text.

persuasion <- gutenberg_download(105)
persuasion
#> # A tibble: 8,328 × 2
#>    gutenberg_id text         
#>           <int> <chr>        
#>  1          105 "Persuasion" 
#>  2          105 ""           
#>  3          105 ""           
#>  4          105 "by"         
#>  5          105 ""           
#>  6          105 "Jane Austen"
#>  7          105 ""           
#>  8          105 "(1818)"     
#>  9          105 ""           
#> 10          105 ""           
#> # … with 8,318 more rows

Notice it is returned as a tbl_df (a type of data frame) with two variables: gutenberg_id (useful if multiple books are returned) and text, a character vector with one row per line of the book.

You can also provide gutenberg_download() a vector of IDs to download multiple books. For example, to download Renascence, and Other Poems (book 109) along with Persuasion, do:

books <- gutenberg_download(c(109, 105), meta_fields = "title")
books
#> # A tibble: 9,550 × 3
#>    gutenberg_id text          title     
#>           <int> <chr>         <chr>     
#>  1          105 "Persuasion"  Persuasion
#>  2          105 ""            Persuasion
#>  3          105 ""            Persuasion
#>  4          105 "by"          Persuasion
#>  5          105 ""            Persuasion
#>  6          105 "Jane Austen" Persuasion
#>  7          105 ""            Persuasion
#>  8          105 "(1818)"      Persuasion
#>  9          105 ""            Persuasion
#> 10          105 ""            Persuasion
#> # … with 9,540 more rows

Notice that the meta_fields argument allows us to add one or more additional fields from gutenberg_metadata to the downloaded text, such as title or author.

books %>%
  count(title)
#> # A tibble: 2 × 2
#>   title                           n
#>   <chr>                       <int>
#> 1 Persuasion                   8328
#> 2 Renascence, and Other Poems  1222
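
Since meta_fields accepts more than one field, several metadata columns can be attached in a single call. The following is a minimal sketch that needs an internet connection; the name books_meta is illustrative.

```r
library(gutenbergr)
library(dplyr)

# Download both books and attach the title and author columns
# from gutenberg_metadata to every line of text.
books_meta <- gutenberg_download(
  c(109, 105),
  meta_fields = c("title", "author")
)

# Count lines per author and title
books_meta %>%
  count(author, title)
```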

Other meta-datasets

You may want to select books based on information other than their title or author, such as their genre or topic. gutenberg_subjects contains pairings of works with Library of Congress subjects and topics. “lcc” means Library of Congress Classification, while “lcsh” means Library of Congress subject headings:

gutenberg_subjects
#> # A tibble: 230,993 × 3
#>    gutenberg_id subject_type subject                                            
#>           <int> <chr>        <chr>                                              
#>  1            1 lcsh         United States -- History -- Revolution, 1775-1783 …
#>  2            1 lcsh         United States. Declaration of Independence         
#>  3            1 lcc          E201                                               
#>  4            1 lcc          JK                                                 
#>  5            2 lcsh         Civil rights -- United States -- Sources           
#>  6            2 lcsh         United States. Constitution. 1st-10th Amendments   
#>  7            2 lcc          JK                                                 
#>  8            2 lcc          KF                                                 
#>  9            3 lcsh         United States -- Foreign relations -- 1961-1963    
#> 10            3 lcsh         Presidents -- United States -- Inaugural addresses 
#> # … with 230,983 more rows

This is useful for extracting texts from a particular topic or genre, such as detective stories, or a particular character, such as Sherlock Holmes. The gutenberg_id column can then be used to download these texts or to link with other metadata.

gutenberg_subjects %>%
  filter(subject == "Detective and mystery stories")
#> # A tibble: 810 × 3
#>    gutenberg_id subject_type subject                      
#>           <int> <chr>        <chr>                        
#>  1          170 lcsh         Detective and mystery stories
#>  2          173 lcsh         Detective and mystery stories
#>  3          244 lcsh         Detective and mystery stories
#>  4          305 lcsh         Detective and mystery stories
#>  5          330 lcsh         Detective and mystery stories
#>  6          481 lcsh         Detective and mystery stories
#>  7          547 lcsh         Detective and mystery stories
#>  8          863 lcsh         Detective and mystery stories
#>  9          905 lcsh         Detective and mystery stories
#> 10         1155 lcsh         Detective and mystery stories
#> # … with 800 more rows

gutenberg_subjects %>%
  filter(grepl("Holmes, Sherlock", subject))
#> # A tibble: 54 × 3
#>    gutenberg_id subject_type subject                                           
#>           <int> <chr>        <chr>                                             
#>  1          108 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
#>  2          221 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
#>  3          244 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
#>  4          834 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
#>  5         1661 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
#>  6         2097 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
#>  7         2343 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
#>  8         2344 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
#>  9         2345 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
#> 10         2346 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
#> # … with 44 more rows
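
One way to link these subject matches with the other metadata is a filtering join on gutenberg_id. This sketch keeps only the pre-filtered works that carry a Sherlock Holmes subject heading; the names holmes_ids and holmes_works are illustrative.

```r
library(gutenbergr)
library(dplyr)

# IDs of works tagged with a Sherlock Holmes subject heading
holmes_ids <- gutenberg_subjects %>%
  filter(grepl("Holmes, Sherlock", subject)) %>%
  distinct(gutenberg_id)

# Keep only the pre-filtered works whose ID appears in holmes_ids
holmes_works <- gutenberg_works() %>%
  semi_join(holmes_ids, by = "gutenberg_id")

holmes_works %>%
  select(gutenberg_id, title, author)
```

The resulting gutenberg_id column could then be passed to gutenberg_download() to fetch the texts.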

gutenberg_authors contains information about each author, such as aliases and birth/death year:

gutenberg_authors
#> # A tibble: 21,323 × 7
#>    gutenberg_author_id author              alias birth…¹ death…² wikip…³ aliases
#>                  <int> <chr>               <chr>   <int>   <int> <chr>   <chr>  
#>  1                   1 United States       U.S.…      NA      NA https:… <NA>   
#>  2                   3 Lincoln, Abraham    <NA>     1809    1865 https:… United…
#>  3                   4 Henry, Patrick      <NA>     1736    1799 https:… <NA>   
#>  4                   5 Adam, Paul          <NA>     1849    1931 https:… <NA>   
#>  5                   7 Carroll, Lewis      Dodg…    1832    1898 https:… <NA>   
#>  6                   8 United States. Cen… <NA>       NA      NA https:… Agency…
#>  7                   9 Melville, Herman    Melv…    1819    1891 https:… <NA>   
#>  8                  10 Barrie, J. M. (Jam… <NA>     1860    1937 https:… Barrie…
#>  9                  12 Smith, Joseph, Jr.  Smit…    1805    1844 https:… <NA>   
#> 10                  14 Madison, James      Unit…    1751    1836 https:… <NA>   
#> # … with 21,313 more rows, and abbreviated variable names ¹​birthdate,
#> #   ²​deathdate, ³​wikipedia
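
Author information can be combined with the works metadata through gutenberg_author_id. As a sketch, the code below selects works by authors born in the 1800s; the names author_years and works_1800s are illustrative, and works with a missing author or birth year are dropped by the join and filter.

```r
library(gutenbergr)
library(dplyr)

# Keep only the join key and the year columns, so the author name
# column is not duplicated in the joined result.
author_years <- gutenberg_authors %>%
  select(gutenberg_author_id, birthdate, deathdate)

works_1800s <- gutenberg_works() %>%
  inner_join(author_years, by = "gutenberg_author_id") %>%
  filter(birthdate >= 1800, birthdate < 1900)

works_1800s %>%
  select(gutenberg_id, title, author, birthdate, deathdate)
```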

Analysis

What’s next after retrieving a book’s text? Well, having the book as a data frame is especially useful for working with the tidytext package for text analysis.

library(tidytext)

words <- books %>%
  unnest_tokens(word, text)

words
#> # A tibble: 90,532 × 3
#>    gutenberg_id title      word      
#>           <int> <chr>      <chr>     
#>  1          105 Persuasion persuasion
#>  2          105 Persuasion by        
#>  3          105 Persuasion jane      
#>  4          105 Persuasion austen    
#>  5          105 Persuasion 1818      
#>  6          105 Persuasion chapter   
#>  7          105 Persuasion 1         
#>  8          105 Persuasion sir       
#>  9          105 Persuasion walter    
#> 10          105 Persuasion elliot    
#> # … with 90,522 more rows

word_counts <- words %>%
  anti_join(stop_words, by = "word") %>%
  count(title, word, sort = TRUE)

word_counts
#> # A tibble: 6,620 × 3
#>    title      word          n
#>    <chr>      <chr>     <int>
#>  1 Persuasion anne        447
#>  2 Persuasion captain     303
#>  3 Persuasion elliot      254
#>  4 Persuasion lady        214
#>  5 Persuasion wentworth   191
#>  6 Persuasion charles     155
#>  7 Persuasion time        152
#>  8 Persuasion sir         149
#>  9 Persuasion miss        125
#> 10 Persuasion walter      123
#> # … with 6,610 more rows
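
A common next step with per-title word counts is tf-idf, which highlights words that are frequent in one book but rare in the others. The sketch below uses a tiny made-up data frame (toy_books, not real Gutenberg text) so that it runs without a download; the same pipeline could be applied to a data frame shaped like books above.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for downloaded text: one row per line, with a title column.
toy_books <- tibble(
  title = c("Book A", "Book A", "Book B", "Book B"),
  text = c(
    "the sea was calm", "the sea was grey",
    "the hill was green", "the hill was steep"
  )
)

toy_tf_idf <- toy_books %>%
  unnest_tokens(word, text) %>%      # one row per word
  count(title, word, sort = TRUE) %>% # word counts per title
  bind_tf_idf(word, title, n) %>%     # adds tf, idf and tf_idf columns
  arrange(desc(tf_idf))

toy_tf_idf
```

Words that appear in every title, such as "the" and "was" here, receive a tf-idf of zero, so the distinctive words rise to the top without a stop word list.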

You may also find these resources useful: