The goal of srt is to read SubRip text files as tabular data for easy analysis and manipulation.
You can install the development version of srt from GitHub with:
# install.packages("remotes")
remotes::install_github("kiernann/srt")
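The .srt standard is used to identify the subtitle components for the columns of a data frame: a numeric counter for each sequential subtitle, the time the subtitle should appear on screen followed by --> and the time it should disappear, and the subtitle text itself.
library(srt)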
library(tidyverse)
library(tidytext)
srt <- srt_example()
#> 1
#> 00:01:25,210 --> 00:01:28,004
#> I owe everything to George Bailey.
#>
#> 2
#> 00:01:28,422 --> 00:01:30,298
#> Help him, dear Father.
#>
#> 3
#> 00:01:30,674 --> 00:01:33,718
#> Joseph, Jesus and Mary,
These subtitle files are parsed as data frames with separate columns.
(wonderful_life <- read_srt(path = srt, collapse = " "))
#> # A tibble: 2,268 x 4
#>        n start   end subtitle
#>    <int> <dbl> <dbl> <chr>
#>  1     1  85.2  88.0 I owe everything to George Bailey.
#>  2     2  88.4  90.3 Help him, dear Father.
#>  3     3  90.7  93.7 Joseph, Jesus and Mary,
#>  4     4  93.8  96.4 help my friend Mr. Bailey.
#>  5     5  96.9  99.5 Help my son George tonight.
#>  6     6 100.  102.  He never thinks about himself, God.
#>  7     7 102.  104.  That's why he's in trouble.
#>  8     8 104.  105.  George is a good guy.
#>  9     9 106.  108.  Give him a break, God.
#> 10    10 108.  110.  I love him, dear Lord.
#> # … with 2,258 more rows
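To illustrate how a SubRip timestamp relates to the numeric start and end columns, here is a minimal base R sketch. The helper timestamp_to_seconds() is hypothetical and written only for this example; it is not part of the srt package and may differ from how read_srt() parses times internally.

```r
# Hypothetical helper (not part of the srt package): convert SubRip
# timestamps of the form "HH:MM:SS,mmm" to numeric seconds.
timestamp_to_seconds <- function(x) {
  # Split each timestamp on ":" and "," into hours, minutes, seconds, ms
  parts <- strsplit(x, "[:,]")
  vapply(parts, function(p) {
    p <- as.numeric(p)
    p[1] * 3600 + p[2] * 60 + p[3] + p[4] / 1000
  }, numeric(1))
}

# Timestamps from the first subtitle in the example file
timestamp_to_seconds(c("00:01:25,210", "00:01:28,004"))
```

The two values correspond to the first row of the tibble above, which prints them rounded to three significant digits (85.2 and 88.0).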
This makes it easy to perform various kinds of text analysis on the subtitles.
wonderful_life %>%
  unnest_tokens(word, subtitle) %>%
  count(word, sort = TRUE) %>%
  anti_join(stop_words)
#> # A tibble: 1,651 x 2
#>    word       n
#>    <chr>  <int>
#>  1 george   216
#>  2 mary      85
#>  3 bailey    74
#>  4 hey       56
#>  5 harry     53
#>  6 yeah      50
#>  7 gonna     45
#>  8 potter    45
#>  9 home      34
#> 10 money     34
#> # … with 1,641 more rows
Or uniformly manipulate the numeric time stamps:
wonderful_life <- srt_shift(wonderful_life, seconds = 9.99)
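Conceptually, a uniform shift adds the same constant to both time columns. The base R sketch below shows that arithmetic on a small stand-in data frame and formats the result back into SubRip timestamps. The data frame subs and the helper seconds_to_timestamp() are assumptions made for this example; they are not part of the srt package and do not necessarily reflect how srt_shift() or write_srt() are implemented.

```r
# Stand-in for the parsed subtitles, using the first two rows of the example
subs <- data.frame(
  n     = 1:2,
  start = c(85.210, 88.422),
  end   = c(88.004, 90.298)
)

# A uniform shift is the same constant added to both time columns
shifted <- transform(subs, start = start + 9.99, end = end + 9.99)

# Hypothetical helper: format numeric seconds as "HH:MM:SS,mmm"
seconds_to_timestamp <- function(x) {
  # Work in whole milliseconds to avoid floating point drift
  ms <- round(x * 1000)
  h  <- ms %/% 3600000
  m  <- (ms %% 3600000) %/% 60000
  s  <- (ms %% 60000) %/% 1000
  sprintf(
    "%02d:%02d:%02d,%03d",
    as.integer(h), as.integer(m), as.integer(s), as.integer(ms %% 1000)
  )
}

paste(seconds_to_timestamp(shifted$start), "-->", seconds_to_timestamp(shifted$end))
```

Rounding to whole milliseconds before splitting into fields keeps sums such as 85.21 + 9.99 from being truncated to the wrong millisecond.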
The subtitle data frames can be easily re-written as valid SubRip files.
tmp <- tempfile(fileext = ".srt")
write_srt(wonderful_life, tmp, wrap = FALSE)
#> 1
#> 00:01:35,200 --> 00:01:37,994
#> I owe everything to George Bailey.
#>
#> 2
#> 00:01:38,412 --> 00:01:40,288
#> Help him, dear Father.
#>
#> 3
#> 00:01:40,664 --> 00:01:43,708
#> Joseph, Jesus and Mary,