There are several utilities in the R ecosystem for reproducible research. The package repana (for Reproducible Analysis) help in having a common directory structure where to save the files that should be consider part of the main stream of production, and files that are products of the main stream as well as modified files no longer part of the main stream but need to be kept such as re-formatted reports or presentations.
The aspiration of this package is that you can set up an analysis
with the make_structure()
function, have access to the
database using get_con()
function, have your tables in the
database documented with the update_table()
function, and
reproduce a complete analysis by running the master()
function
The function make_structure()
reads the
config.yml
files and ensure all entries on the
dirs
section exists. The config.yml
created by
default will produce the following directories for the data, functions,
database, logs, reports and handmade entries of the
config.yml
:
_data to keep all data sources need to reproduce the analysis.
_functions to keep all functions programmed for the analysis
database to keep all secondary datasets and objects
logs to keep the log of the scripts
reports to keep all secondary analysis reports and sheets
handmade to keep all modified files and reports that should be kept
The directories can be used in your programs by the
get_dirs()
function.
The information in data
, functions
,
handmade
as well as the scripts in the root directory
should be preserved as they are the core of your analysis. The files in
logs
, reports
and database
are
created and recreated as result of your analysis’ scripts. Those are the
results of your analysis.
The function clean_structure()
clean those directories
included in the list clean_before_new_analysis
so a new
analysis could be re-run without worries that new and old results are
mixed. If you use git
, those directories are candidates to
be excluded from the control version by having them in the
.gitignore
file. (make_structure()
take care
of create a .gitignore
if it does not exist or include
those directories if they have not been yet included).
The function make_structure()
writes a
config.yml
if it does not exists. This file is used by the
config::get()
function. It contains a the following entries
under the default: tree
dirs: to define the directories that make_structure
will maintain. It have the entry values for data
,
functions
, database
, reports
,
logs
, handmade
directories. If you prefer
other options than the defaults values you may change it. You may access
those directories in your programs using get_dirs()
Note
that the name for data
and function
does not
have a underscore but by default the values are _data
and
_function
respectively. make_structure
will
create the paths with the value of the entries but you access them in
your program with the name of the entry. This will provide the freedom
to direct the real path of the directory to any place you need.
clean_before_new_analysis: to define the list of
directories that should be cleaned every time you want to repeat the
analysis from zero. This directories are included in the
.gitignore
defaultdb: is written with the parameters for a
RSQLite SQLite
driver
You may add other entries that your analysis may requires. The
config.yml
itself should also be included in your
.gitignore
file as it is something that change from system
to system (i.e. driver parameters) so you should include in the
documentation of your analysis what entries should be defined so you can
reproduce the analysis in other machine.
DBI and Pool connections are used as a way to keep data as well as
results in a database system. You must provide, at least, values for the
package
and dbconnection
entries corresponding
to the package that host the dbConnection and the name of the
dbConnection function. Notice entry names are lowercase. The rest of the
entries must correspond for parameters for your driver connection or
pool connection.
Example to use RSQLite
with a results.db
file in the database directory
defaultdb:
package: RSQLite
dbconnect: SQLite
dbname: dbname/results.db
Example to use RPostgres
defaultdb:
package: RPostgres
dbconnection: Postgres
dbname: testdb
host: localhost
port: 5432
user: username
password: password
You can define several configuration to use different databases in
the same analysis, but the defaultdb
will be used by
default for the update_table()
function.
The function update_table
will save a data.frame into
the database, and will keep a log in the log_table
table
with the timestamps the file was updated in the database. The
log_table
keep a record of when was the table included in
the database and a comment that will help to trace the origin. You may
include the date data was obtained or the source of the data.
update_table(p_con,"iris", "from system)
The master(pattern, start, stop, logdir, rscript_path)
function execute in a plain vanilla R process each one of the files
identified by the pattern. By default use the pattern is
"^[0-9][0-9].*\\.R$"
, which include all files like
01_read_data.R
, 02_process_data.R
,
03_analysis_data.R
, 04_make_report.R
but not
report1.rmd
, exploratory.R
etc.. Files are run
on the order starting from the first but if for any reason you need to
omit the first files you may skip them with the start
parameter.
logdir
is the directory for the logs, by default
get_dirs()$logs
rscript_path
is the full path to the
Rscript
program which is at the end the one that process
the R
file. The current default is for a OS system. But
implementation for Linux and Windows will soon be implemented.
The master function use functions from the processx
package.
Here is an example of the config.yml
created by
make_structure()
default:
dirs:
data: _data
functions: _functions
handmade: handmade
database: database
reports: reports
logs: logs
clean_before_new_analysis:
- database
- reports
- logs
defaultdb:
package: RSQLite
dbconnect: SQLite
dbname: ":memory:"
driver: /usr/local/lib/libsqlite3odbc.dylib
database: database/results.db
You may download the package from git-hub. Within R you may use
devtools::install_github()
as:
devtools::install_github("johnaponte/repana", build_manual = T, build_vignettes = T)