The gutenbergr package helps you download and process public domain works from the Project Gutenberg collection. This includes both tools for downloading books (and stripping header/footer information) and a complete dataset of Project Gutenberg metadata that can be used to find works of interest. It includes:
A function gutenberg_download() that downloads one or more works from Project Gutenberg by ID: e.g., gutenberg_download(84) downloads the text of Frankenstein.
A dataset gutenberg_metadata that contains information about each work, pairing Gutenberg ID with title, author, language, etc.
A dataset gutenberg_authors that contains information about each author, such as aliases and birth/death year.
A dataset gutenberg_subjects that contains pairings of works with Library of Congress subjects and topics.
This package contains metadata for all Project Gutenberg works as R datasets, so that you can search and filter for particular works before downloading.
The dataset gutenberg_metadata contains information about each work, pairing Gutenberg ID with title, author, language, etc.:
library(gutenbergr)
library(dplyr)
gutenberg_metadata
#> # A tibble: 69,199 × 8
#> gutenberg_id title author guten…¹ langu…² guten…³ rights has_t…⁴
#> <int> <chr> <chr> <int> <chr> <chr> <chr> <lgl>
#> 1 1 "The Declaration … Jeffe… 1638 en Politi… Publi… TRUE
#> 2 2 "The United State… Unite… 1 en Politi… Publi… TRUE
#> 3 3 "John F. Kennedy'… Kenne… 1666 en <NA> Publi… TRUE
#> 4 4 "Lincoln's Gettys… Linco… 3 en US Civ… Publi… TRUE
#> 5 5 "The United State… Unite… 1 en United… Publi… TRUE
#> 6 6 "Give Me Liberty … Henry… 4 en Americ… Publi… TRUE
#> 7 7 "The Mayflower Co… <NA> NA en <NA> Publi… TRUE
#> 8 8 "Abraham Lincoln'… Linco… 3 en US Civ… Publi… TRUE
#> 9 9 "Abraham Lincoln'… Linco… 3 en US Civ… Publi… TRUE
#> 10 10 "The King James V… <NA> NA en Banned… Publi… TRUE
#> # … with 69,189 more rows, and abbreviated variable names ¹gutenberg_author_id,
#> # ²language, ³gutenberg_bookshelf, ⁴has_text
For example, you could find the Gutenberg ID(s) of Jane Austen’s Persuasion by doing:
gutenberg_metadata %>%
  filter(title == "Persuasion")
#> # A tibble: 3 × 8
#> gutenberg_id title author gutenber…¹ langu…² guten…³ rights has_t…⁴
#> <int> <chr> <chr> <int> <chr> <chr> <chr> <lgl>
#> 1 105 Persuasion Austen, Jane 68 en <NA> Publi… TRUE
#> 2 22963 Persuasion Austen, Jane 68 en <NA> Publi… FALSE
#> 3 36777 Persuasion Austen, Jane 68 fr FR Lit… Publi… TRUE
#> # … with abbreviated variable names ¹gutenberg_author_id, ²language,
#> # ³gutenberg_bookshelf, ⁴has_text
In many analyses, you may want to filter just for English works,
avoid duplicates, and include only books that have text that can be
downloaded. The gutenberg_works() function does this pre-filtering:
gutenberg_works()
#> # A tibble: 53,840 × 8
#> gutenberg_id title author guten…¹ langu…² guten…³ rights has_t…⁴
#> <int> <chr> <chr> <int> <chr> <chr> <chr> <lgl>
#> 1 1 "The Declaration … Jeffe… 1638 en Politi… Publi… TRUE
#> 2 2 "The United State… Unite… 1 en Politi… Publi… TRUE
#> 3 3 "John F. Kennedy'… Kenne… 1666 en <NA> Publi… TRUE
#> 4 4 "Lincoln's Gettys… Linco… 3 en US Civ… Publi… TRUE
#> 5 5 "The United State… Unite… 1 en United… Publi… TRUE
#> 6 6 "Give Me Liberty … Henry… 4 en Americ… Publi… TRUE
#> 7 7 "The Mayflower Co… <NA> NA en <NA> Publi… TRUE
#> 8 8 "Abraham Lincoln'… Linco… 3 en US Civ… Publi… TRUE
#> 9 9 "Abraham Lincoln'… Linco… 3 en US Civ… Publi… TRUE
#> 10 10 "The King James V… <NA> NA en Banned… Publi… TRUE
#> # … with 53,830 more rows, and abbreviated variable names ¹gutenberg_author_id,
#> # ²language, ³gutenberg_bookshelf, ⁴has_text
It also allows you to perform filtering as an argument:
gutenberg_works(author == "Austen, Jane")
#> # A tibble: 10 × 8
#> gutenberg_id title author guten…¹ langu…² guten…³ rights has_t…⁴
#> <int> <chr> <chr> <int> <chr> <chr> <chr> <lgl>
#> 1 105 "Persuasion" Auste… 68 en <NA> Publi… TRUE
#> 2 121 "Northanger Abbey" Auste… 68 en Gothic… Publi… TRUE
#> 3 141 "Mansfield Park" Auste… 68 en <NA> Publi… TRUE
#> 4 158 "Emma" Auste… 68 en <NA> Publi… TRUE
#> 5 161 "Sense and Sensib… Auste… 68 en <NA> Publi… TRUE
#> 6 946 "Lady Susan" Auste… 68 en <NA> Publi… TRUE
#> 7 1212 "Love and Freinds… Auste… 68 en <NA> Publi… TRUE
#> 8 1342 "Pride and Prejud… Auste… 68 en Best B… Publi… TRUE
#> 9 31100 "The Complete Pro… Auste… 68 en <NA> Publi… TRUE
#> 10 42078 "The Letters of J… Auste… 68 en <NA> Publi… TRUE
#> # … with abbreviated variable names ¹gutenberg_author_id, ²language,
#> # ³gutenberg_bookshelf, ⁴has_text
# or with a regular expression
library(stringr)
gutenberg_works(str_detect(author, "Austen"))
#> # A tibble: 17 × 8
#> gutenberg_id title author guten…¹ langu…² guten…³ rights has_t…⁴
#> <int> <chr> <chr> <int> <chr> <chr> <chr> <lgl>
#> 1 105 "Persuasion" Auste… 68 en <NA> Publi… TRUE
#> 2 121 "Northanger Abbey" Auste… 68 en Gothic… Publi… TRUE
#> 3 141 "Mansfield Park" Auste… 68 en <NA> Publi… TRUE
#> 4 158 "Emma" Auste… 68 en <NA> Publi… TRUE
#> 5 161 "Sense and Sensib… Auste… 68 en <NA> Publi… TRUE
#> 6 946 "Lady Susan" Auste… 68 en <NA> Publi… TRUE
#> 7 1212 "Love and Freinds… Auste… 68 en <NA> Publi… TRUE
#> 8 1342 "Pride and Prejud… Auste… 68 en Best B… Publi… TRUE
#> 9 17797 "Memoir of Jane A… Auste… 7603 en <NA> Publi… TRUE
#> 10 31100 "The Complete Pro… Auste… 68 en <NA> Publi… TRUE
#> 11 33513 "The Frightened P… Auste… 36446 en <NA> Publi… TRUE
#> 12 39897 "Discoveries Amon… Layar… 40288 en <NA> Publi… TRUE
#> 13 42078 "The Letters of J… Auste… 68 en <NA> Publi… TRUE
#> 14 54010 "The Younger Sist… Hubba… 47662 en <NA> Publi… TRUE
#> 15 54011 "The Younger Sist… Hubba… 47662 en <NA> Publi… TRUE
#> 16 54012 "The Younger Sist… Hubba… 47662 en <NA> Publi… TRUE
#> 17 54066 "The Younger Sist… Hubba… 47662 en <NA> Publi… TRUE
#> # … with abbreviated variable names ¹gutenberg_author_id, ²language,
#> # ³gutenberg_bookshelf, ⁴has_text
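To make the pre-filtering idea concrete, here is a minimal base R sketch of the three steps described above (English only, has text, no duplicates), applied by hand to the three Persuasion rows shown earlier. The `meta` data frame is a toy stand-in and not the real dataset, and the duplicate rule (keep the first row per title) is an assumption for illustration. The actual rules inside gutenberg_works() may differ.

```r
# Toy stand-in for three rows of gutenberg_metadata; values are copied from
# the Persuasion output above, and only a subset of columns is included.
meta <- data.frame(
  gutenberg_id = c(105L, 22963L, 36777L),
  title        = c("Persuasion", "Persuasion", "Persuasion"),
  language     = c("en", "en", "fr"),
  has_text     = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Keep English works that have downloadable text
english_with_text <- meta[meta$language == "en" & meta$has_text, ]

# Drop repeated titles, keeping the first occurrence
# (assumed duplicate rule, for illustration only)
works <- english_with_text[!duplicated(english_with_text$title), ]

works$gutenberg_id  # 105
```

Only ID 105 survives: 22963 has no downloadable text and 36777 is the French edition.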
The metadata currently in the package was last updated on 04 November 2022.
The function gutenberg_download() downloads one or more works from Project Gutenberg based on their ID. For example, we earlier saw that one version of Persuasion has ID 105 (see the URL here), so gutenberg_download(105) downloads this text.
persuasion <- gutenberg_download(105)
persuasion
#> # A tibble: 8,328 × 2
#> gutenberg_id text
#> <int> <chr>
#> 1 105 "Persuasion"
#> 2 105 ""
#> 3 105 ""
#> 4 105 "by"
#> 5 105 ""
#> 6 105 "Jane Austen"
#> 7 105 ""
#> 8 105 "(1818)"
#> 9 105 ""
#> 10 105 ""
#> # … with 8,318 more rows
Notice it is returned as a tbl_df (a type of data frame) including two variables: gutenberg_id (useful if multiple books are returned), and a character vector of the text, one row per line.
You can also provide gutenberg_download() a vector of IDs to download multiple books. For example, to download Renascence, and Other Poems (book 109) along with Persuasion, do:
books <- gutenberg_download(c(109, 105), meta_fields = "title")
books
#> # A tibble: 9,550 × 3
#> gutenberg_id text title
#> <int> <chr> <chr>
#> 1 105 "Persuasion" Persuasion
#> 2 105 "" Persuasion
#> 3 105 "" Persuasion
#> 4 105 "by" Persuasion
#> 5 105 "" Persuasion
#> 6 105 "Jane Austen" Persuasion
#> 7 105 "" Persuasion
#> 8 105 "(1818)" Persuasion
#> 9 105 "" Persuasion
#> 10 105 "" Persuasion
#> # … with 9,540 more rows
Notice that the meta_fields argument allows us to add one or more additional fields from gutenberg_metadata to the downloaded text, such as title or author.
books %>%
  count(title)
#> # A tibble: 2 × 2
#> title n
#> <chr> <int>
#> 1 Persuasion 8328
#> 2 Renascence, and Other Poems 1222
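Conceptually, adding a metadata field amounts to looking up each line's gutenberg_id in a metadata table. The base R sketch below shows that lookup on toy data. Both data frames are hypothetical stand-ins, and the text of book 109 is a labeled placeholder rather than the real first line. It is not a description of how gutenberg_download() is implemented.

```r
# Toy stand-in for downloaded text: one row per line, keyed by gutenberg_id
text_lines <- data.frame(
  gutenberg_id = c(105L, 105L, 109L),
  text = c("Persuasion", "by", "(placeholder line from book 109)"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the metadata lookup table
meta <- data.frame(
  gutenberg_id = c(105L, 109L),
  title = c("Persuasion", "Renascence, and Other Poems"),
  stringsAsFactors = FALSE
)

# Attach each line's title by matching on gutenberg_id; row order is preserved
text_lines$title <- meta$title[match(text_lines$gutenberg_id, meta$gutenberg_id)]

# Count lines per title, analogous to count(title)
line_counts <- table(text_lines$title)
line_counts
```

Using match() rather than merge() keeps the lines in their original order, which matters when the text is later read sequentially.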
You may want to select books based on information other than their
title or author, such as their genre or topic.
gutenberg_subjects contains pairings of works with Library of Congress subjects and topics. “lcc” means Library of Congress Classification, while “lcsh” means Library of Congress subject headings:
gutenberg_subjects
#> # A tibble: 230,993 × 3
#> gutenberg_id subject_type subject
#> <int> <chr> <chr>
#> 1 1 lcsh United States -- History -- Revolution, 1775-1783 …
#> 2 1 lcsh United States. Declaration of Independence
#> 3 1 lcc E201
#> 4 1 lcc JK
#> 5 2 lcsh Civil rights -- United States -- Sources
#> 6 2 lcsh United States. Constitution. 1st-10th Amendments
#> 7 2 lcc JK
#> 8 2 lcc KF
#> 9 3 lcsh United States -- Foreign relations -- 1961-1963
#> 10 3 lcsh Presidents -- United States -- Inaugural addresses
#> # … with 230,983 more rows
This is useful for extracting texts from a particular topic or genre, such as detective stories, or a particular character, such as Sherlock Holmes. The gutenberg_id column can then be used to download these texts or to link with other metadata.
gutenberg_subjects %>%
  filter(subject == "Detective and mystery stories")
#> # A tibble: 810 × 3
#> gutenberg_id subject_type subject
#> <int> <chr> <chr>
#> 1 170 lcsh Detective and mystery stories
#> 2 173 lcsh Detective and mystery stories
#> 3 244 lcsh Detective and mystery stories
#> 4 305 lcsh Detective and mystery stories
#> 5 330 lcsh Detective and mystery stories
#> 6 481 lcsh Detective and mystery stories
#> 7 547 lcsh Detective and mystery stories
#> 8 863 lcsh Detective and mystery stories
#> 9 905 lcsh Detective and mystery stories
#> 10 1155 lcsh Detective and mystery stories
#> # … with 800 more rows
gutenberg_subjects %>%
  filter(grepl("Holmes, Sherlock", subject))
#> # A tibble: 54 × 3
#> gutenberg_id subject_type subject
#> <int> <chr> <chr>
#> 1 108 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
#> 2 221 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
#> 3 244 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
#> 4 834 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
#> 5 1661 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
#> 6 2097 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
#> 7 2343 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
#> 8 2344 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
#> 9 2345 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
#> 10 2346 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
#> # … with 44 more rows
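As a sketch of how the gutenberg_id column can be carried forward from a subject filter, the base R example below pulls out the matching IDs and combines two subject filters. The `subjects` data frame is a toy stand-in built from four rows that appear in the outputs above. The final download call is left as a comment because it needs the package and network access.

```r
# Toy stand-in for a few rows of gutenberg_subjects (taken from the outputs above)
subjects <- data.frame(
  gutenberg_id = c(244L, 244L, 1661L, 170L),
  subject_type = "lcsh",
  subject = c(
    "Detective and mystery stories",
    "Holmes, Sherlock (Fictitious character) -- Fiction",
    "Holmes, Sherlock (Fictitious character) -- Fiction",
    "Detective and mystery stories"
  ),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats the pattern as a literal string, not a regular expression
is_holmes <- grepl("Holmes, Sherlock", subjects$subject, fixed = TRUE)
holmes_ids <- unique(subjects$gutenberg_id[is_holmes])

is_detective <- subjects$subject == "Detective and mystery stories"
detective_ids <- unique(subjects$gutenberg_id[is_detective])

# Works tagged with both subjects
both_ids <- intersect(holmes_ids, detective_ids)
both_ids  # 244

# With gutenbergr loaded, the IDs could then be passed on, e.g.:
# gutenberg_download(holmes_ids)   # not run here: requires network access
```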
gutenberg_authors contains information about each author, such as aliases and birth/death year:
gutenberg_authors
#> # A tibble: 21,323 × 7
#> gutenberg_author_id author alias birth…¹ death…² wikip…³ aliases
#> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 1 United States U.S.… NA NA https:… <NA>
#> 2 3 Lincoln, Abraham <NA> 1809 1865 https:… United…
#> 3 4 Henry, Patrick <NA> 1736 1799 https:… <NA>
#> 4 5 Adam, Paul <NA> 1849 1931 https:… <NA>
#> 5 7 Carroll, Lewis Dodg… 1832 1898 https:… <NA>
#> 6 8 United States. Cen… <NA> NA NA https:… Agency…
#> 7 9 Melville, Herman Melv… 1819 1891 https:… <NA>
#> 8 10 Barrie, J. M. (Jam… <NA> 1860 1937 https:… Barrie…
#> 9 12 Smith, Joseph, Jr. Smit… 1805 1844 https:… <NA>
#> 10 14 Madison, James Unit… 1751 1836 https:… <NA>
#> # … with 21,313 more rows, and abbreviated variable names ¹birthdate,
#> # ²deathdate, ³wikipedia
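One possible use of the birth and death years is selecting authors by period. The base R sketch below does this on a toy stand-in built from five rows of the output above, using a subset of columns. The `lifespan_years` column is a hypothetical name introduced here, and it is only a difference of years, not an exact age.

```r
# Toy stand-in for a few rows of gutenberg_authors (values from the output above)
authors <- data.frame(
  gutenberg_author_id = c(1L, 3L, 4L, 7L, 9L),
  author = c("United States", "Lincoln, Abraham", "Henry, Patrick",
             "Carroll, Lewis", "Melville, Herman"),
  birthdate = c(NA, 1809L, 1736L, 1832L, 1819L),
  deathdate = c(NA, 1865L, 1799L, 1898L, 1891L),
  stringsAsFactors = FALSE
)

# Authors born in the 1800s; which() drops rows where birthdate is NA
in_1800s <- which(authors$birthdate >= 1800 & authors$birthdate < 1900)
born_1800s <- authors[in_1800s, ]

# Approximate lifespan as a difference of years (hypothetical column name)
born_1800s$lifespan_years <- born_1800s$deathdate - born_1800s$birthdate

born_1800s[, c("author", "lifespan_years")]
```

Wrapping the condition in which() matters because corporate authors such as "United States" have NA years, and a bare logical index containing NA would return an all-NA row.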
What’s next after retrieving a book’s text? Well, having the book as a data frame is especially useful for working with the tidytext package for text analysis.
library(tidytext)
words <- books %>%
  unnest_tokens(word, text)
words
#> # A tibble: 90,532 × 3
#> gutenberg_id title word
#> <int> <chr> <chr>
#> 1 105 Persuasion persuasion
#> 2 105 Persuasion by
#> 3 105 Persuasion jane
#> 4 105 Persuasion austen
#> 5 105 Persuasion 1818
#> 6 105 Persuasion chapter
#> 7 105 Persuasion 1
#> 8 105 Persuasion sir
#> 9 105 Persuasion walter
#> 10 105 Persuasion elliot
#> # … with 90,522 more rows
word_counts <- words %>%
  anti_join(stop_words, by = "word") %>%
  count(title, word, sort = TRUE)
word_counts
#> # A tibble: 6,620 × 3
#> title word n
#> <chr> <chr> <int>
#> 1 Persuasion anne 447
#> 2 Persuasion captain 303
#> 3 Persuasion elliot 254
#> 4 Persuasion lady 214
#> 5 Persuasion wentworth 191
#> 6 Persuasion charles 155
#> 7 Persuasion time 152
#> 8 Persuasion sir 149
#> 9 Persuasion miss 125
#> 10 Persuasion walter 123
#> # … with 6,610 more rows
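For readers who want to see what the pipeline above does step by step, here is a rough base R approximation of the same three stages: tokenize, remove stop words, count. It is only a sketch. The tokenizer used by unnest_tokens() is more sophisticated than a regular-expression split, the stop list here is a tiny hypothetical one rather than tidytext's stop_words, and the last input line is invented purely to give a repeated word.

```r
# First lines of Persuasion as shown above, plus one invented line
lines <- c(
  "Persuasion", "", "by", "", "Jane Austen", "", "(1818)",
  "Anne met Captain Wentworth; Anne smiled."  # invented, for illustration only
)

# Tokenize: lowercase, then split on runs of non-alphanumeric characters
tokens <- unlist(strsplit(tolower(lines), "[^[:alnum:]]+"))
tokens <- tokens[tokens != ""]  # drop empty strings left by leading punctuation

# Remove stop words (tiny hypothetical list, not tidytext's stop_words)
stop_list <- c("by", "the", "of")
kept <- tokens[!tokens %in% stop_list]

# Count and sort, analogous to count(word, sort = TRUE)
word_counts <- sort(table(kept), decreasing = TRUE)
head(word_counts, 3)
```

The most frequent token here is "anne" with a count of 2; every other token appears once.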
You may also find these resources useful:
The wikipedia column in gutenberg_authors can be matched to Wikipedia content with the WikipediR package or to pageview statistics with the wikipediatrend package.
The humaniformat package has a format_reverse function for reversing “Last, First” names.