New function check_annotation_biomartr()
helps to
check whether downloaded GFF or GTF files are corrupt. Find more details
here
new function getCollectionSet()
allows users to
retrieve a Collection: Genome, Proteome, CDS, RNA, GFF, Repeat Masker,
AssemblyStats of multiple species
Example:
# define scientific names of species for which
# collections shall be retrieved
organism_list <- c("Arabidopsis thaliana",
                   "Arabidopsis lyrata",
                   "Capsella rubella")
# download the collections of all listed species from refseq
# and store them in the folder 'set_collections'
getCollectionSet( db = "refseq",
organism = organism_list,
path = "set_collections")
getGFF()
function receives a new argument
remove_annotation_outliers
to enable users to remove
corrupt lines from a GFF file
Example:
Ath_path <- biomartr::getGFF(organism = "Arabidopsis thaliana", remove_annotation_outliers = TRUE)
the getGFFSet()
function receives a new argument
remove_annotation_outliers
to enable users to remove
corrupt lines from a GFF file
the getGTF()
function receives a new argument
remove_annotation_outliers
to enable users to remove
corrupt lines from a GTF file
adding a new message system to
biomartr::organismBM()
,
biomartr::organismAttributes()
, and
biomartr::organismFilters()
so that large API queries don’t
seem so unresponsive
getCollection()
receives new arguments
release
, remove_annotation_outliers
, and
gunzip
that will now be passed on to downstream retrieval
functions
the getGTF()
, getGenome()
and
getGenomeSet()
functions receive a new argument
assembly_type = "toplevel"
to enable users to choose
between toplevel and primary assembly when using ensembl database.
Setting assembly_type = "primary_assembly"
will save a lot
of space on hard drives for people using large ensembl genomes.
all get*()
functions with release
argument now check if the ENSEMBL release is >45 (Many thanks to
@Roleren #31
#61)
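A minimal sketch of what such a release check could look like. The helper name `check_ensembl_release()`, the error message, and the treatment of `NULL` as "no specific release requested" are assumptions for illustration, not the package's actual implementation.

```r
# Hypothetical validator; name, message and NULL handling are assumptions.
check_ensembl_release <- function(release = NULL) {
  # Assumption: NULL means that no specific release was requested.
  if (is.null(release)) return(invisible(TRUE))
  if (!is.numeric(release) || length(release) != 1 || release <= 45) {
    stop("ENSEMBL release must be a single number greater than 45.",
         call. = FALSE)
  }
  invisible(TRUE)
}

check_ensembl_release(100)   # passes silently

# A release of 45 or lower is rejected; capture the error as FALSE.
ok_45 <- tryCatch(check_ensembl_release(45), error = function(e) FALSE)
```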
in all get*()
functions, the
readr::write_tsv(path = )
was exchanged to
readr::write_tsv(file = )
, since the readr
package version > 1.4.0 is deprecating the path
argument.
tbl_df()
was deprecated in dplyr 1.0.0. Please use
tibble::as_tibble()
instead. -> adjusted
organismBM()
accordingly
custom_download()
, getGENOMEREPORT()
,
and other download functions now have specified
withr::local_options(timeout = max(30000000, getOption("timeout")))
which extends the default 60sec timeout to 30000000sec
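Because of the `max()`, a timeout that a user has already set higher is never reduced. The same idea can be sketched in base R with `options()` and `on.exit()` standing in for `withr::local_options()`; the helper name `with_long_timeout()` is hypothetical.

```r
# Hypothetical helper: raise the download timeout while `code` is evaluated.
with_long_timeout <- function(code) {
  old <- options(timeout = max(30000000, getOption("timeout")))
  on.exit(options(old), add = TRUE)   # restore the previous value on exit
  force(code)                         # `code` is evaluated only here (lazy evaluation)
}

before <- getOption("timeout")
inside <- with_long_timeout(getOption("timeout"))
after  <- getOption("timeout")
```

Here `inside` holds the extended timeout, while `before` and `after` are identical because the previous option is restored when the function exits.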
Fixing bug where genome availability check in
getCollection()
was only performed in
NCBI RefSeq
and not in other databases due to a constant
used in is.genome.available()
rather than a variable (Many
thanks to Takahiro Yamada for catching the bug) #53
fixing an issue that caused the read_cds()
function
to fail in data.table
mode (Many thanks to Clement Kent)
#57
fixing an SSL
bug that was found on
Ubuntu 20.04
systems #66 (Many thanks to Håkon
Tjeldnes)
fixing global variable issue that caused
clean.retrieval()
to fail when no documentation file was in
a meta.retrieval()
folder
The NCBI recently started adding NA
values as FTP
file paths in their species summary files
for species
without reference genomes. As a result meta.retrieval()
stopped working, because no FTP paths were found for some species. This
issue has now been fixed by adding the filter rule
!is.na(ftp_path)
into all get*()
functions
(Many thanks to Ashok Kumar Sharma #34
and Dominik Merges #72 for making me aware of this issue)
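The effect of the `!is.na(ftp_path)` rule can be illustrated on a toy table. The data frame below is made up and only mimics the `ftp_path` column named above; it is not a real NCBI summary file.

```r
# Toy stand-in for a species summary table; all rows are invented.
summary_tbl <- data.frame(
  organism_name = c("Species A", "Species B", "Species C"),
  ftp_path      = c("ftp://example.org/a", NA, "ftp://example.org/c"),
  stringsAsFactors = FALSE
)

# Keep only entries that actually have an FTP path.
downloadable <- summary_tbl[!is.na(summary_tbl$ftp_path), ]
nrow(downloadable)
```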
Fixing an issue in custom_download()
where the
method
argument was causing issues when downloading from
https
directed ftp
sites (Many thanks to @cmatKhan) #76
Fixing issue when trying to combine multiple summary-stats files
where NA’s were present in the list item that was passed along for
combination in meta.retrieval()
#73 (Many thanks to Dominik
Merges)
Fixing a bug in download.database.all()
where
failing to remove the listed file *-metadata.json
caused
corruption of the download process (Many thanks to Jaruwatana
Lotharukpong)
biomartr 0.9.2 - minor changes to comply with CRAN policy regarding Internet access failure -> Instead of using warnings or error messages, only gentle messages are allowed to be used
Please be aware that as of April 2019, ENSEMBLGENOMES was
retired (see
details here). Hence, all biomartr
functions were
updated and won’t support data retrieval from
ENSEMBLGENOMES
servers anymore.
clean.retrieval()
enables formatting and
automatic unzipping of meta.retrieval output (find out more here:
https://docs.ropensci.org/biomartr/articles/MetaGenome_Retrieval.html#un-zipping-downloaded-files)
getGenomeSet()
allows users to easily
retrieve genomes of multiple specified species. In addition, the genome
summary statistics for all retrieved species will be stored as well to
provide users with insights regarding the genome assembly quality of
each species. This file can be used as Supplementary Information file in
publications to facilitate reproducible research.
getProteomeSet()
allows users to easily
retrieve proteomes of multiple specified species
getCDSSet()
allows users to easily
retrieve coding sequences of multiple specified species
getGFFSet()
allows users to easily
retrieve GFF annotation files of multiple specified species
getRNASet()
allows users to easily
retrieve RNA sequences of multiple specified species
summary_genome()
allows users to retrieve
summary statistics for a genome assembly file to assess the influence of
genome assembly qualities when performing comparative genomics
tasks
summary_cds()
allows users to retrieve
summary statistics for a coding sequence (CDS) file. We noticed that
many CDS files stored in NCBI or ENSEMBL databases contain sequences
that aren’t divisible by 3 (division into codons). This makes it
difficult to divide CDS into codons for e.g. codon alignments or
translation into protein sequences. In addition, some CDS files contain
a significant amount of sequences that do not start with AUG (start
codon). This function enables users to quantify how many of these
sequences exist in a downloaded CDS file to process these files
according to the analyses at hand.
reference
in
meta.retrieval()
changed from reference = TRUE
to reference = FALSE
. This way all genomes (reference AND
non-reference) will be downloaded by default. This is what users
seem to prefer.
getCollection()
now also retrieves GTF
files when db = 'ensembl'
getAssemblyStats()
now also performs md5 checksum
test
getGTF()
: users can now specify the NCBI Taxonomy ID or
Accession ID in addition to the scientific name in argument ‘organism’
to retrieve genome assemblies
getGFF()
: users can now specify the NCBI Taxonomy ID or
Accession ID for ENSEMBL in addition to the scientific name in argument
‘organism’ to retrieve genome assemblies
getMarts()
will now throw an error when BioMart servers
cannot be reached (#36)
getGenome()
now also stores the genome summary
statistics (see ?summary_genome()
) for the retrieved
species in the documentation
folder to provide users with
insights regarding the genome assembly quality
reference
is now set from reference = TRUE
to
reference = FALSE
(= new default)
get*()
functions now received a new argument
release
which allows users to retrieve specific release
versions of genomes, proteomes, etc from ENSEMBL
and
ENSEMBLGENOMES
get*()
functions received two new arguments
clean_retrieval
and gunzip
which allow users
to unzip the downloaded files directly in the get*()
function call and rename the file for more convenient downstream
analyses
getCollection()
for retrieval of a
collection: the genome sequence, protein sequences, gff files, etc for a
particular species
getProteome()
can now retrieve proteomes from the UniProt database by specifying
getProteome(db = "uniprot")
.
is.genome.available()
now prints out more useful
interactive messages when searching for available organisms
is.genome.available()
can now handle
taxids
and assembly_accession ids
in addition
to the scientific name when specifying argument
organism
is.genome.available()
can now check for organism
availability in the UniProt database
getGenome()
: users can now specify the NCBI Taxonomy
ID or Accession ID in addition to the scientific name in argument
‘organism’ to retrieve genome assemblies
getProteome()
: users can now specify the NCBI
Taxonomy ID or Accession ID in addition to the scientific name in
argument ‘organism’ to retrieve proteomes
getCDS()
: users can now specify the NCBI Taxonomy ID
or Accession ID in addition to the scientific name in argument
‘organism’ to retrieve CDS
getRNA()
: users can now specify the NCBI Taxonomy ID
or Accession ID in addition to the scientific name in argument
‘organism’ to retrieve RNAs
is.genome.available()
: argument order was changed
from is.genome.available(organism, details, db) to
is.genome.available(db, organism, details) to be logically more
consistent with all get*()
functions
meta.retrieval
receives a new argument
restart_at_last
to indicate whether or not the download
process when re-running the meta.retrieval
function shall
pick up at the last species or whether it should crawl through all
existing files to check the md5checksum
meta.retrieval
now generates a csv overview file in
the doc
folder which stores genome version, date, origin,
etc information for all downloaded organisms and can be directly used as
Supplementary Data file in publications to increase computational and
biological reproducibility of the genomics study
download.database.all()
can now skip already
downloaded files and internally removes corrupted files with
non-matching md5 checksum. Re-downloading of corrupted files can be
performed by simply re-running the download.database.all()
command
the function meta.retrieval()
will now pick up the
download at the organism where it left off and will report which species
have already been retrieved
all get*()
functions and the
meta.retrieval()
function receive a new argument
reference
which allows users to retrieve non-reference or
non-representative genome versions when downloading from NCBI RefSeq or
NCBI Genbank
the argument order in meta.retrieval()
changed from
meta.retrieval(kingdom, group, db, ...)
to
meta.retrieval(db,kingdom, group, ...)
to make the argument
order more consistent with the get*()
functions
the argument order in getGroups()
changed from
getGroups(kingdom, db)
to
getGroups(db, kingdom)
to make the argument order more
consistent with the get*()
and
meta.retrieval()
functions
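Reordered arguments silently change the meaning of positional calls, so naming the arguments is the safer way to call these functions after the change. The sketch below uses a hypothetical stand-in function, not the real `getGroups()`, to show the effect.

```r
# Hypothetical stand-in with the new (db, kingdom) argument order.
get_groups_new <- function(db, kingdom) paste(db, kingdom, sep = ":")

# A positional call written for the old (kingdom, db) order is now swapped.
positional <- get_groups_new("Archaea", "refseq")

# A named call does not depend on the argument order.
named <- get_groups_new(kingdom = "Archaea", db = "refseq")
```

Here `positional` becomes `"Archaea:refseq"` (kingdom and database mixed up), while `named` is `"refseq:Archaea"` as intended.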
existingOrganisms()
and
existingOrganisms_ensembl()
which check the organisms that
have already been downloaded
fixing a bug in exists.ftp.file()
and
getENSEMBLGENOMES.Seq()
that caused bacterial genome,
proteome, etc retrieval to fail due to the wrong construction of a query
ftp request https://github.com/ropensci/biomartr/issues/7 (Many thanks
to @dbsseven)
fix a major bug in which organisms having no representative
genome would generate NULL paths that subsequently crashed the
meta.retrieval()
function when it tried to print out the
result paths.
new function getRepeatMasker()
for retrieval of
Repeat Masker output files
new function getGTF()
for genome annotation
retrieval from ensembl
and ensemblgenomes
in
gtf
format (Thanks for suggesting it Ge Tan)
new function getRNA()
to perform RNA Sequence
Retrieval from NCBI and ENSEMBL databases (Thanks for suggesting it
@carlo-berg)
new function read_rna()
for importing RNA downloaded
with getRNA()
as Biostrings or data.table object
new function read_rm()
for importing Repeat Masker
output files downloaded with getRepeatMasker()
new helper function custom_download()
that aims to
make the download process more robust and stable -> In detail, the
download process is now adapting to the operating system, e.g. using
either curl
(macOS), wget
(Linux), or
wininet
(Windows)
function name listDatabases()
has been renamed to
listNCBIDatabases()
. In biomartr
version 0.6.0
the function name listDatabases()
will be
deprecated
meta.retrieval()
and meta.retrieval.all()
now allow the bulk retrieval of GTF files for
db = 'ensembl'
and db = 'ensemblgenomes'
via type = "gtf"
. See getGTF()
for more
details.
meta.retrieval()
and meta.retrieval.all()
now allow the bulk retrieval of RNA files via type = "rna"
.
See getRNA()
for more details.
meta.retrieval()
and meta.retrieval.all()
now allow the bulk retrieval of Repeat Masker output files via
type = "rm"
. See getRepeatMasker()
for more
details.
all get*()
retrieval functions now skip the download
of a particular file if it already exists in the specified file
path
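The skip-if-present behaviour can be sketched in base R as follows. The helper `fetch_if_missing()` and the fake download function are hypothetical and only illustrate the pattern; the package's internal logic may differ.

```r
# Hypothetical helper: only call `fetch` when `dest` does not exist yet.
fetch_if_missing <- function(dest, fetch) {
  if (file.exists(dest)) {
    message("File '", dest, "' exists already. Skipping download.")
    return(invisible(dest))
  }
  fetch(dest)
  invisible(dest)
}

calls <- 0
fake_fetch <- function(dest) {        # stands in for a real download
  calls <<- calls + 1
  writeLines(c(">seq1", "ACGT"), dest)
}

dest <- tempfile(fileext = ".fa")
fetch_if_missing(dest, fake_fetch)    # writes the file
fetch_if_missing(dest, fake_fetch)    # skipped: the file is already there
```

After both calls, `calls` is 1 because the second call finds the existing file and does not fetch again.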
download.database()
and
download.database.all()
now internally perform md5
checksum tests to make sure that the file download was successful
download.database()
and
download.database.all()
now return the file paths of the
downloaded file so that it is easier to use these functions when
constructing pipelines, e.g. download.database() %>% ...
or download.database.all() %>% ...
.
meta.retrieval()
and
meta.retrieval.all()
now return the file paths of the
downloaded file so that it is easier to use these functions when
constructing pipelines, e.g. meta.retrieval() %>% ...
or
meta.retrieval.all() %>% ...
.
getGenome()
, getProteome()
,
getCDS()
, getRNA()
, getGFF()
, and
getAssemblyStats()
now internally perform md5 checksum
tests to make sure that files are retrieved intact.
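An md5 checksum test of this kind can be sketched with `tools::md5sum()` from base R. The helper `md5_matches()` and the file content are made up for illustration; in practice the expected checksum would come from the checksum listing published next to the file.

```r
# Write a small file and record its checksum as the "expected" value.
f <- tempfile(fileext = ".fa")
writeLines(c(">seq1", "ATGGCC"), f)
expected <- unname(tools::md5sum(f))

# Hypothetical check: TRUE if the file on disk matches the expected checksum.
md5_matches <- function(path, expected_md5) {
  identical(unname(tools::md5sum(path)), expected_md5)
}

intact <- md5_matches(f, expected)

# Simulate a corrupted download by appending to the file.
cat("garbage\n", file = f, append = TRUE)
corrupt <- md5_matches(f, expected)
```

`intact` is `TRUE` for the untouched file and `corrupt` is `FALSE` once the content has changed.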
fixing a bug in the get*()
(genome, proteome, gff,
etc.) and meta.retrieval*()
functions: the meta retrieval
process errored and terminated whenever NCBI or ENSEMBL didn’t store all
types of sequences for a particular organism: genome, proteome, cds,
etc. This has been fixed now and function calls such as
meta.retrieval(kingdom = "bacteria", db = "genbank", type = "proteome")
should work properly now (Thanks to @ARamesh123 for making me aware of this
bug). Hence, this bug affected all attempts to download all proteome
sequences e.g. for bacteria and viruses, because NCBI does not store
genome AND proteome information for all bacterial or viral species.
new function getAssemblyStats()
allows users to
retrieve the genome assembly stats file from NCBI RefSeq or Genbank,
e.g. ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.36_GRCh38.p10/GCF_000001405.36_GRCh38.p10_assembly_stats.txt
new function read_assemblystats()
allows users to import
the genome assembly stats file from NCBI RefSeq or Genbank that was
retrieved using the getAssemblyStats()
function
meta.retrieval()
and
meta.retrieval.all()
can now also download genome assembly
stats for all selected species
meta.retrieval()
receives a new argument
group
that allows users to retrieve species belonging to a
subgroup instead of the entire kingdom. Available groups can be
retrieved with getGroups()
.
functions getSubgroups()
and
listSubgroups()
have been removed and their initial
functionality has been merged and integrated into
getGroups()
and listGroups()
listGroups()
receives a new argument
details
that allows users to retrieve the organism names
that belong to the corresponding subgroups
getGroups()
is now based on
listGroups()
internal function getGENOMEREPORT()
is now exported
and available to the user
all organism*()
functions now also support Ensembl
Plants, Ensembl Metazoa, Ensembl Protist, and Ensembl Fungi (Thanks to
Alex Gabel
for pointing this out)
getMarts()
and getDatasets()
now also
support Ensembl Plants, Ensembl Metazoa, Ensembl Protist, and Ensembl
Fungi (Thanks to Alex Gabel
for pointing this out)
Meta-Genome Retrieval
has more examples of how to
download genomes of species that belong to the same subgroup
fixing a bug in the getSummaryFile()
,
getKingdomAssemblySummary()
,
getMetaGenomeSummary()
, getENSEMBL.Seq()
and
getENSEMBLGENOMES.Seq()
functions causing quoted lines in
the assembly_summary.txt
to be omitted when reading these
files. As a result, e.g. instead of information on 80,000
Bacteria genomes only 40,000 (those without quotations) were read (Thanks to
Xin Wu).
In this version of biomartr
the organism*()
functions were adapted to the new ENSEMBL
87 release in which organism name specification in the Biomart
description column was changed
from a scientific name convention to a mix of common name and scientific
name convention.
all organism*()
functions have been adapted to the
new ENSEMBL 87 release organism name notation that is used in the
Biomart description
fixing error handling bug that caused commands such as
download.database(db = "nr.27.tar.gz")
to not execute
properly
In this version, biomartr
was extended to now retrieve
genome, proteome, CDS, GFF and meta-genome data also from ENSEMBL and ENSEMBLGENOMES. Furthermore, all
NCBI retrieval functions were updated to the new server folder structure
standards of NCBI.
new meta-retrieval function meta.retrieval.all()
allows users to download all individual genomes of all kingdoms of life
with one command
new metagenome retrieval function getMetaGenomes()
allows users to retrieve metagenome projects from NCBI Genbank
new metagenome retrieval function
getMetaGenomeAnnotations()
allows users to retrieve
annotation files for genomes belonging to a metagenome project stored at
NCBI Genbank
new retrieval function getGFF()
allows users to
retrieve annotation (*.gff) files for specific genomes from NCBI and
ENSEMBL databases
new import function read_gff()
allowing users to
import GFF files downloaded with getGFF()
new internal functions to check for availability of ENSEMBL or ENSEMBLGENOMES databases
new database retrieval function
download.database.all()
allows users to download entire
NCBI databases with one command
new function listMetaGenomes()
allowing users to
list available metagenomes on NCBI Genbank
new external helper function getSummaryFile()
to
retrieve the assembly_summary.txt file from NCBI
new external helper function
getKingdomAssemblySummary()
to retrieve the
assembly_summary.txt files from NCBI for all kingdoms and combine them
into one big data.frame
new function listKingdoms()
allows users to list the
number of available species per kingdom of life
new function listGroups()
allows users to list the
number of available species per group
new function listSubgroups()
allows users to list
the number of available species per subgroup
new function getGroups()
allows users to retrieve
available groups for a kingdom of life
new function getSubgroups()
allows users to retrieve
available subgroups for a kingdom of life
new external helper function getMetaGenomeSummary()
to retrieve the assembly_summary.txt files from NCBI genbank
metagenomes
new internal helper function getENSEMBL.Seq()
acting
as main interface function to communicate with the ENSEMBL database API
for sequence retrieval
new internal helper function getENSEMBLGENOMES.Seq()
acting as main interface function to communicate with the ENSEMBL
database API for sequence retrieval
new internal helper function getENSEMBL.Annotation()
acting as main interface function to communicate with the ENSEMBL
database API for GFF retrieval
new internal helper function
getENSEMBLGENOMES.Annotation()
acting as main interface
function to communicate with the ENSEMBL database API for GFF
retrieval
new internal helper function
get.ensemblgenome.info()
to retrieve general organism
information from ENSEMBLGENOMES
new internal helper function get.ensembl.info()
to
retrieve general organism information from ENSEMBL
new internal helper function getGENOMEREPORT()
to
retrieve the genome reports file from
ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/overview.txt
new internal helper function connected.to.internet()
enabling internet connection check
functions getGenome()
, getProteome()
,
and getCDS()
can now, in addition to NCBI, also retrieve
genomes, proteomes or CDS from ENSEMBL and ENSEMBLGENOMES
the functions getGenome()
,
getProteome()
, and getCDS()
were completely
re-written and now use the assembly_summary.txt files provided by NCBI
to retrieve the download path to the corresponding genome. Furthermore,
these functions now lost the kingdom
argument. Users now
only need to specify the organism name and not the kingdom anymore.
Furthermore, all get*
functions now return the path to the
downloaded genome so that this path can be used as input to all
read_*
functions.
download_databases()
has been renamed to
download.databases()
to be more consistent with other
function notation
the argument db_format
was removed from
listDatabases()
and download.database()
because it was misleading
the command listDatabases("all")
now returns all
available NCBI databases that can be retrieved with
download.database()
download.database()
now internally checks if input
database specified by the user is actually available on NCBI
servers
the documentary file generated by getGenome()
,
getProteome()
, and getCDS()
is now extended to
store more details about the downloaded genome
argument database
in
is.genome.available()
and listGenomes()
has
been renamed to db
to be consistent with all other sequence
retrieval functions
is.genome.available()
now also checks availability
of organisms in ENSEMBL. See db = "ensembl"
the argument db_name
in listDatabases()
has been renamed db
to be more consistent with the notation
in other functions
the argument name
in
download.database()
has been renamed db
to be
more consistent with the notation in other functions
getKingdoms()
now retrieves also kingdom information
for ENSEMBL and ENSEMBLGENOMES
getKingdoms()
received new argument db
to specify from which database (e.g. refseq
,
genbank
, ensembl
or
ensemblgenomes
) kingdom information shall be
retrieved
getKingdoms(db = "refseq")
received one more member:
"viral"
, allowing the genome retrieval of all
viruses
argument out.folder
in meta.retrieval()
has been renamed to path
to be more consistent with other
retrieval functions
all read_*
functions now received a new argument
obj.type
allowing users to choose between storing input
genomes as Biostrings object or data.table object
all read_*
functions now have
format = "fasta"
as default
the kingdom
argument in the
listGenomes()
function was renamed to type
,
now allowing users to specify not only kingdoms, but also groups
and subgroups. Use: listGenomes(type = "kingdom")
or
listGenomes(type = "group")
or
listGenomes(type = "subgroup")
the listGenomes()
function receives a new argument
subset
to specify a subset of the selected
type
argument. E.g. subset = "Eukaryota"
when
specifying type = "kingdom"
Meta-Genome Retrieval
Introduction Vignette
Database Retrieval Vignette
Sequence Retrieval Vignette
Functional Annotation Vignette
fixing a parsing error of the file
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/assembly_summary.txt
The problem was that comment lines were introduced and columns couldn’t
be parsed correctly anymore. As a result, genomes, proteomes, and
CDS files could not be downloaded properly. This has been fixed
now.
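How the package handles these comment lines internally is not described here. As a rough sketch of one possible approach, the base R code below separates `#` comment lines from the data rows of a toy excerpt and recovers the column names from the last comment line. The excerpt is invented and only imitates the layout of an assembly_summary.txt file.

```r
# Toy excerpt in the style of an assembly_summary.txt file; values are invented.
raw <- c(
  "#   See the README for a description of the columns",
  "# assembly_accession\ttaxid\torganism_name",
  "GCF_000000001.1\t1\tSpecies A",
  "GCF_000000002.1\t2\tSpecies B"
)

is_comment <- startsWith(raw, "#")

# Recover column names from the last comment line, then parse the data rows.
header    <- sub("^#\\s*", "", raw[max(which(is_comment))])
col_names <- strsplit(header, "\t", fixed = TRUE)[[1]]

assemblies <- read.delim(
  text = paste(raw[!is_comment], collapse = "\n"),
  header = FALSE, col.names = col_names,
  quote = "", comment.char = "", stringsAsFactors = FALSE
)
```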
genomes, proteomes, and CDS as well as meta-genomes can now be
retrieved from RefSeq and Genbank (not only RefSeq); only
getCDS()
does not have genbank access, because genbank
does not provide CDS sequences
adding new function meta.retrieval()
to mass
retrieve genomes for entire kingdoms of life
fixed a major bug in organismBM()
causing the
function to fail. The failure of this function affected all downstream
organism*()
functions. Bug is now fixed and everything
works properly
updated Vignettes
updating unit tests for new API
fixing API problems that caused all BioMart related functions to fail
fixing retrieval problems in getCDS()
,
getProteome()
, and getGenome()
the listDatabases()
function now has a new option
db_name = "all"
allowing users to list all available
databases stored on NCBI
Release Version