This package was born out of my own frustration when trying to automate the genomic data retrieval process and to write computationally reproducible scripts for large-scale genomics studies. Since I couldn’t find easy-to-use and fully reproducible software libraries, I sat down and tried to implement a framework that would enable anyone to automate and standardize the genomic data retrieval process. I hope that this package is useful to others as well and that it helps to promote reproducible research in genomics studies.
I happily welcome anyone who wishes to contribute to this project :) Just drop me an email.
Please find detailed documentation here.
Please cite biomartr if it was helpful for your research. This will allow me to continue maintaining this project in the future.
Drost HG, Paszkowski J. Biomartr: genomic data retrieval with R. Bioinformatics (2017) 33(8): 1216-1217. doi:10.1093/bioinformatics/btw821.
The vastly growing number of sequenced genomes allows us to perform a new type of biological research. Using a comparative approach these genomes provide us with new insights on how biological information is encoded on the molecular level and how this information changes over evolutionary time.
The first step, however, of any genome-based study is to retrieve genomes and their annotation from databases. To automate the retrieval process of this information on a meta-genomic scale, the biomartr package provides interface functions for genomic sequence retrieval and functional annotation retrieval. The major aim of biomartr is to facilitate computational reproducibility and large-scale handling of genomic data for (meta-)genomic analyses. In addition, biomartr aims to address the genome version crisis. With biomartr, users can now control and be informed about the genome versions they retrieve automatically. Many large-scale genomics studies lack this information, and thus reproducibility and data interpretation become nearly impossible when documentation of genome version information gets neglected.
In detail, biomartr automates genome, proteome, CDS, RNA, Repeats, GFF/GTF (annotation), genome assembly quality, and metagenome project data retrieval from the major biological databases such as NCBI and ENSEMBL (ENSEMBL and ENSEMBLGENOMES were joined; see details here). Furthermore, an interface to the Ensembl Biomart database allows users to retrieve functional annotation for genomic loci using a novel and organism-centric search strategy. In addition, users can download entire databases such as NCBI RefSeq, NCBI nr, NCBI nt, NCBI Genbank, and ENSEMBL with only one command.
The main difference between the biomaRt package and the biomartr package is that biomartr extends the functional annotation retrieval procedure of biomaRt and in addition provides useful retrieval functions for genomes, proteomes, coding sequences, gff files, RNA sequences, Repeat Masker annotation files, and functions for the retrieval of entire databases such as NCBI nr.
Please consult the Tutorials section for more details.
In the context of functional annotation retrieval, the biomartr package allows users to screen available marts using only the scientific name of an organism of interest instead of first searching for marts and datasets which support a particular organism of interest (which is required when using the biomaRt package). Furthermore, biomartr allows you to search for particular topics when searching for attributes and filters. I am aware that the similar naming of the packages is unfortunate, but it arose for historical reasons (please find a detailed explanation here: https://github.com/ropensci/biomartr/blob/master/FAQs.md and here #11). I also dedicated an entire vignette to comparing the biomaRt and biomartr package functionality in the context of Functional Annotation (where their functionality overlaps, which comprises only about 20% of the overall functionality of the biomartr package).
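As a rough illustration of the organism-centric search described above, the following sketch wraps three functions from the biomartr function list (organismBM(), organismAttributes() and organismFilters()). The wrapper name find_annotation_fields() and the example topic "homolog" are assumptions made purely for illustration, and the calls need an internet connection to the Ensembl BioMart service.

```r
# Hypothetical helper (not part of biomartr): collect the marts, attributes
# and filters available for one organism, optionally narrowed to a topic.
find_annotation_fields <- function(organism, topic = NULL) {
  list(
    # marts and datasets that support the organism
    marts = biomartr::organismBM(organism = organism),
    # attributes matching the topic (all attributes if topic is NULL)
    attributes = biomartr::organismAttributes(organism, topic = topic),
    # filters matching the topic (all filters if topic is NULL)
    filters = biomartr::organismFilters(organism, topic = topic)
  )
}

# Example call (requires internet access and an installed biomartr):
# fields <- find_annotation_fields("Homo sapiens", topic = "homolog")
# head(fields$attributes)
```

Only the scientific name is passed in; no mart or dataset has to be looked up beforehand, which is the point of the organism-centric strategy.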
I truly value your opinion and improvement suggestions. Hence, I would be extremely grateful if you could take this 1-minute, 3-question survey (https://goo.gl/forms/Qaoxxjb1EnNSLpM02) so that I can learn how to improve biomartr in the best possible way. Many thanks in advance.
The biomartr package relies on some Bioconductor tools and thus requires installation of the following packages:
# Install core Bioconductor packages
if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")
BiocManager::install()
# Install package dependencies
BiocManager::install("Biostrings")
BiocManager::install("biomaRt")
Now users can install biomartr from CRAN:
# install biomartr 1.0.2
install.packages("biomartr", dependencies = TRUE)
With an activated Bioconda channel (see 2. Set up channels), install with:
conda install r-biomartr
and update with:
conda update r-biomartr
or use the docker container:
docker pull quay.io/biocontainers/r-biomartr:<tag>
(check r-biomartr/tags for valid values for <tag>)
The automated retrieval of collections (= Genome, Proteome, CDS, RNA, GFF, Repeat Masker, AssemblyStats files) makes sure that the genome file of an organism matches the CDS, proteome, RNA, GFF, etc. files and that all of them were generated using the same genome assembly version. One reason why genomics studies fail in computational and biological reproducibility is that it is not clear whether the CDS, proteome, RNA, GFF, etc. files used in a proposed analysis were generated from the same genome assembly file denoting the same genome assembly version. To avoid this seemingly trivial mistake, we encourage users to retrieve genome file collections using the biomartr function getCollection() and to attach the corresponding output as Supplementary Data to the respective genomics study to ensure computational and biological reproducibility.
# download collection for Saccharomyces cerevisiae
biomartr::getCollection(db = "refseq", organism = "Saccharomyces cerevisiae")
Internally, the getCollection() function will now generate a folder named refseq/Collection/Saccharomyces_cerevisiae and will store all genome and annotation files for Saccharomyces cerevisiae in the same folder. In addition, the exact genome and annotation version will be logged in the doc folder.
Internally, a text file named doc_Saccharomyces_cerevisiae_db_refseq.txt is generated.
The information stored in this log file is structured as follows:
File Name: Saccharomyces_cerevisiae_assembly_stats_refseq.txt
Organism Name: Saccharomyces_cerevisiae
Database: NCBI refseq
URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_assembly_stats.txt
Download_Date: Wed Jun 27 15:21:51 2018
refseq_category: reference genome
assembly_accession: GCF_000146045.2
bioproject: PRJNA128
biosample: NA
taxid: 559292
infraspecific_name: strain=S288C
version_status: latest
release_type: Major
genome_rep: Full
seq_rel_date: 2014-12-17
submitter: Saccharomyces Genome Database
In an ideal world this reference file could then be included as supplementary information in any life science publication that relies on genomic information so that reproducibility of experiments and analyses becomes achievable.
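To make such a log file machine-readable, for example to collect assembly accessions across many organisms, a small parser can help. The following is only a sketch: the function name parse_collection_log() is hypothetical, and it assumes one entry per line with key and value separated by the first ": ", as in the example above.

```r
# Hypothetical helper (not part of biomartr): parse "key: value" log lines
# into a named character vector.
# Assumption: key and value are separated by the FIRST ": " on each line,
# so values that contain colons (URLs, time stamps) stay intact.
parse_collection_log <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]        # drop empty lines
  pos <- regexpr(": ", lines, fixed = TRUE)    # first separator per line
  keep <- pos > 0                              # skip lines without separator
  lines <- lines[keep]
  pos <- as.integer(pos[keep])
  values <- trimws(substring(lines, pos + 2))
  names(values) <- trimws(substring(lines, 1, pos - 1))
  values
}

# A few lines copied from the example log above
example_log <- c(
  "Organism Name: Saccharomyces_cerevisiae",
  "Database: NCBI refseq",
  "Download_Date: Wed Jun 27 15:21:51 2018",
  "assembly_accession: GCF_000146045.2"
)
log_info <- parse_collection_log(example_log)
log_info[["assembly_accession"]]
```

In practice the input would come from readLines() applied to the generated log file.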
Download all mammalian vertebrate genomes from NCBI RefSeq via:
# download all mammalian vertebrate genomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "genome")
All genomes are stored in a folder named according to the kingdom, in this case vertebrate_mammalian. Alternatively, users can specify the out.folder argument to define a custom output folder path.
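Because a kingdom-wide download can run for a long time, it may be worth validating the kingdom name first. The sketch below combines getKingdoms() and meta.retrieval() from the function list; the wrapper retrieve_kingdom_genomes() and its available argument are hypothetical and only serve as an illustration.

```r
# Hypothetical wrapper (not part of biomartr): validate the kingdom name
# before starting a long-running bulk download.
retrieve_kingdom_genomes <- function(kingdom, db = "refseq",
                                     available = biomartr::getKingdoms(db = db)) {
  # 'available' is evaluated lazily; by default it queries biomartr
  if (!kingdom %in% available) {
    stop("Unknown kingdom '", kingdom, "' for db '", db, "'. Choose one of: ",
         paste(available, collapse = ", "))
  }
  biomartr::meta.retrieval(kingdom = kingdom, db = db, type = "genome")
}

# Example call (requires internet access; downloads many large files):
# retrieve_kingdom_genomes("vertebrate_mammalian")
```

A misspelled kingdom then fails immediately with a list of valid names instead of partway through the retrieval.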
Please find all FAQs here.
I would be very happy to learn more about potential improvements of the concepts and functions provided in this package.
Furthermore, in case you find some bugs or need additional (more flexible) functionality of parts of this package, please let me know:
https://github.com/HajkD/biomartr/issues
Getting Started with biomartr:
Users can also read the tutorials within RStudio:
# load the biomartr package
library(biomartr)
# look for all tutorials (vignettes) available in the biomartr package
# this will open your web browser
browseVignettes("biomartr")
The current status of the package as well as a detailed history of the functionality of each version of biomartr can be found in the NEWS section.
Some bug fixes or new functionality will not be available on CRAN yet, but in the developer version here on GitHub. To download and install the most recent version of biomartr run:
# install the current version of biomartr on your system
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ropensci/biomartr")
meta.retrieval(): Perform Meta-Genome Retrieval from NCBI of species belonging to the same kingdom of life or to the same taxonomic subgroup
meta.retrieval.all(): Perform Meta-Genome Retrieval from NCBI of the entire kingdom of life
getMetaGenomes(): Retrieve metagenomes from NCBI Genbank
getMetaGenomeAnnotations(): Retrieve annotation *.gff files for metagenomes from NCBI Genbank
listMetaGenomes(): List available metagenomes on NCBI Genbank
getMetaGenomeSummary(): Helper function to retrieve the assembly_summary.txt file from NCBI Genbank metagenomes
clean.retrieval(): Format meta.retrieval output
listGenomes(): List all genomes available on NCBI and ENSEMBL servers
listKingdoms(): List the number of available species per kingdom of life on NCBI and ENSEMBL servers
listGroups(): List the number of available species per group on NCBI and ENSEMBL servers
getKingdoms(): Retrieve available kingdoms of life
getGroups(): Retrieve available groups for a kingdom of life
is.genome.available(): Check genome availability on NCBI and ENSEMBL servers
getCollection(): Retrieve a Collection: Genome, Proteome, CDS, RNA, GFF, Repeat Masker, AssemblyStats
getGenome(): Download a specific genome stored on NCBI and ENSEMBL servers
getGenomeSet(): Genome Retrieval of multiple species
getProteome(): Download a specific proteome stored on NCBI and ENSEMBL servers
getProteomeSet(): Proteome Retrieval of multiple species
getCDS(): Download a specific CDS file (genome) stored on NCBI and ENSEMBL servers
getCDSSet(): CDS Retrieval of multiple species
getRNA(): Download a specific RNA file stored on NCBI and ENSEMBL servers
getRNASet(): RNA Retrieval of multiple species
getGFF(): Genome Annotation Retrieval from NCBI (*.gff) and ENSEMBL (*.gff3) servers
getGTF(): Genome Annotation Retrieval (*.gtf) from ENSEMBL servers
getRepeatMasker(): Repeat Masker TE Annotation Retrieval
getAssemblyStats(): Genome Assembly Stats Retrieval from NCBI
getKingdomAssemblySummary(): Helper function to retrieve the assembly_summary.txt files from NCBI for all kingdoms
getSummaryFile(): Helper function to retrieve the assembly_summary.txt file from NCBI for a specific kingdom
getENSEMBLInfo(): Retrieve ENSEMBL info file
getGENOMEREPORT(): Retrieve GENOME_REPORTS file from NCBI
read_genome(): Import genomes as Biostrings or data.table object
read_proteome(): Import proteome as Biostrings or data.table object
read_cds(): Import CDS as Biostrings or data.table object
read_gff(): Import GFF file
read_rna(): Import RNA file
read_rm(): Import Repeat Masker output file
read_assemblystats(): Import Genome Assembly Stats File
listNCBIDatabases(): Retrieve a List of Available NCBI Databases for Download
download.database(): Download an NCBI database to your local hard drive
download.database.all(): Download a complete NCBI Database such as NCBI nr to your local hard drive
biomart(): Main function to query the BioMart database
getMarts(): Retrieve All Available BioMart Databases
getDatasets(): Retrieve All Available Datasets for a BioMart Database
getAttributes(): Retrieve All Available Attributes for a Specific Dataset
getFilters(): Retrieve All Available Filters for a Specific Dataset
organismBM(): Function for organism specific retrieval of available BioMart marts and datasets
organismAttributes(): Function for organism specific retrieval of available BioMart attributes
organismFilters(): Function for organism specific retrieval of available BioMart filters
getGO(): Function to retrieve GO terms for a given set of genes
# On Windows, this won't work - see ?build_github_devtools
devtools::install_github("HajkD/biomartr", build_vignettes = TRUE, dependencies = TRUE)
# When working with Windows, first you need to install Rtools
# (a standalone toolchain installer available from CRAN, not an R package).
# Afterwards you can install devtools -> install.packages("devtools")
# and then you can run:
devtools::install_github("HajkD/biomartr", build_vignettes = TRUE, dependencies = TRUE)
# and then call it from the library
library("biomartr", lib.loc = "C:/Program Files/R/R-3.1.1/library")
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.