An R package for evaluation of pair-wise synteny conservation at the genome-wide scale. It takes a table of orthologs and genome annotation files formatted as BED to automatically infer significantly conserved linkage groups, and order them on an oxford grid.
It has 5 functions :
Function | description |
---|---|
load_orthologs() | integrates genomic coordinates (bedfiles) of the orthologs of the two species to compare |
compute_macrosynteny() | compares all the chromosomes to each other and identifies the significantly conserved linkage groups |
reorder_macrosynteny() | takes an mbh_df (from load_orthologs()) and outputs an mbh_df with chromosome levels reordered by cluster and amount of orthologs (run alone or by setting plot_oxford_grid(…,auto_order_clusters = TRUE) ) |
plot_macrosynteny() | draws a dotplot displaying the significant linkage groups with their relative amount of orthologs |
plot_oxford_grid() | draws an oxford grid from an orthologs_df (output of either load_orthologs() or reorder_macrosynteny() |
We demonstrate the usage of the package using a subset of publicly available data.
Branchiostoma floridae (Simakov et al. 2020)
data download : https://www.ncbi.nlm.nih.gov/genome/?term=txid7739)
Paraescarpia echinospica (Sun et al. 2021)
(data download : https://doi.org/10.6084/m9.figshare.15050478.v1)
This package doesn’t compute the orthologs. I recommend to compute it as reciprocal best hits using rbhxpress. It is fast and accurate as it uses diamond blast.
Drawing the plots using this package require to have the following data :
let’s say I have the following orthologs :
sp1.gene.x1 sp2.gene.y1
sp1.gene.X2 sp2.gene.y2
...
sp1.gene.xn sp2.gene.yn
then, the genomic coordinates files must look like (BED format) :
chr4 200 600 sp1.gene.x1
chr8 10 400 sp1.gene.x2
...
chr12 900 980 sp1.gene.xn
chr1 100 200 sp2.gene.y1
chr6 50 200 sp2.gene.y2
...
chr8 300 480 sp2.gene.yn
I’m going to show usage of this package by comparing the data of the lancelet Branchiostoma floridae (Simakov et al. 2020) with the data of the vestimentifera (giant tubeworm) Paraescarpia echinospica (Sun et al. 2021).
Download the sequences of proteins (fasta format) and their genomic coordinates :
B.floridae : The data are available on ncbi at https://www.ncbi.nlm.nih.gov/genome/?term=txid7739
get the protein sequences at by clicking the “Download sequences in
FASTA format for protein”.
get the genomic coordinates by clicking “Download genome annotation in
tabular format” and further click download as csv.
P.echinospica :
The data are available on figshare under the doi : 10.6084/m9.figshare.15050478
Compute the reciprocal best hits of the fasta sequences. Using rbhXpress you can achieve it by typing the following in your terminal :
# In the terminal :
# call mbhXpress with using 6 threads :
bash mbhXpress -a GCF_000003815.2_Bfl_VNyyK_protein.faa -b Pec_ragoo_v1.0.pep.fasta -o Bflo_vs_Pech.tab -t 6
To convert the genome annotation to the bed file format, I’m using the following command lines (if unfamiliar with this you can use a spreadsheet software). The concept is to keep the chrom, chromStart, chromEnd mandatory fields plus the name optional field that links the genomic region with the protein product :
# In the terminal :
# B.floridae CSV file to bed
tail -n +2 proteins_75_971579.csv | cut -d "," -f1,3,4,9 | \
sed -e 's/\"//g' -e 's/,/\t/g' -e 's/chromosome /BFL/g' > Bflo.protein_products.bed
# P.echinospica gff file to bed
fgrep "gene" Pec_genes.ragoo_v1.0.gff | cut -f1,4,5,9 | cut -d ";" -f 1 | \
fgrep "Superscaffold" | sed -e 's/ID=//g' -e 's/Superscaffold/PEC/g' > Pech.protein_products.bed
Please note that I generated a subset of these datasets, and kept only 2500 ortholog pairs to lower the compilation time.
Now the data are ready to be loaded into R using macrosyntR :
library(macrosyntR)
<- load_orthologs(orthologs_table = system.file("extdata","Bflo_vs_Pech.tab",package="macrosyntR"),
my_orthologs_table sp1_bed = system.file("extdata","Bflo.protein_products.bed",package="macrosyntR"),
sp2_bed = system.file("extdata","Pech.protein_products.bed",package="macrosyntR"))
head(my_orthologs_table)
#> # A tibble: 6 × 10
#> sp1.ID sp1.Chr sp1.S…¹ sp1.End sp1.I…² sp2.ID sp2.Chr sp2.S…³ sp2.End sp2.I…⁴
#> <chr> <fct> <int> <int> <int> <chr> <fct> <int> <int> <int>
#> 1 XP_035… BFL19 6.91e6 6.93e6 18 PE_Sc… PEC12 82420 107047 1
#> 2 XP_035… BFL5 5.55e6 5.56e6 11 PE_Sc… PEC1 179174 209873 1
#> 3 XP_035… BFL5 1.86e7 1.86e7 105 PE_Sc… PEC1 213205 227811 2
#> 4 XP_035… BFL5 1.69e7 1.69e7 90 PE_Sc… PEC1 253998 296275 3
#> 5 XP_035… BFL7 4.12e5 4.17e5 3 PE_Sc… PEC13 276772 278568 1
#> 6 XP_035… BFL4 1.40e7 1.40e7 67 PE_Sc… PEC4 312995 335349 1
#> # … with abbreviated variable names ¹sp1.Start, ²sp1.Index, ³sp2.Start,
#> # ⁴sp2.Index
Let’s compute the pairs of chromosomes that have a significant amount of orthologs using compute_macrosynteny(). We can visualize the results on a dot plot using plot_macrosnyteny() and see the distributions of orthologs on an oxford grid using plot_oxford_grid()
# compute significance :
<- compute_macrosynteny(my_orthologs_table)
macrosynteny_df head(macrosynteny_df)
#> sp1.Chr sp2.Chr orthologs pval significant pval_adj
#> 1 BFL1 PEC1 7 9.999998e-01 no 1.000000e+00
#> 2 BFL1 PEC10 42 3.228837e-08 yes 6.941999e-06
#> 3 BFL1 PEC11 3 9.998885e-01 no 1.000000e+00
#> 4 BFL1 PEC12 5 9.999247e-01 no 1.000000e+00
#> 5 BFL1 PEC13 5 9.981700e-01 no 1.000000e+00
#> 6 BFL1 PEC14 1 9.965970e-01 no 1.000000e+00
# visualize the loaded data on a oxford grid :
plot_oxford_grid(my_orthologs_table,
sp1_label = "B.floridae",
sp2_label = "P.echinospica")
# Visualize the results of the test of significance :
plot_macrosynteny(macrosynteny_df,
sp1_label = "B.floridae",
sp2_label = "P.echinospica")
Reordering the chromosomes using a network based greedy algorithm can be performed by calling the function reorder_macrosynteny. It returns an orthologs_df with reordered levels in sp1.Chr and sp2.Chr. These columns are factors where the levels determine the plotting order. You’ll see the results of the clustering, when drawing the oxford grid of this newly generated orthologs data.frame
# visualize the loaded data on a oxford grid :
<- reorder_macrosynteny(my_orthologs_table)
my_orthologs_table_reordered plot_oxford_grid(my_orthologs_table_reordered,
sp1_label = "B.floridae",
sp2_label = "P.echinospica")
# compute significance and visualize on a dotplot :
<- compute_macrosynteny(my_orthologs_table_reordered)
macrosynteny_df_reordered plot_macrosynteny(macrosynteny_df_reordered,
sp1_label = "B.floridae",
sp2_label = "P.echinospica")
If you would like to subset some chromosomes of interest and manually reorder them you can still take advantage of functions implemented to handle data.frames. This task is out of the scope of this package, and can achieved using base R :
# select only the orthologs falling in the chromosomes of interest and plot:
<- subset(my_orthologs_table, sp1.Chr %in% c("BFL13","BFL15","BFL2","BFL3") & sp2.Chr %in% c("PEC2","PEC5","PEC11"))
subset_of_orthologs
plot_oxford_grid(subset_of_orthologs,
sp1_label = "B.floridae",
sp2_label = "P.echinospica")
# reorder :
$sp2.Chr <- factor(subset_of_orthologs$sp2.Chr,levels = c("PEC5","PEC11","PEC2"))
subset_of_orthologsplot_oxford_grid(subset_of_orthologs,
sp1_label = "B.floridae",
sp2_label = "P.echinospica")
# Compute and plot macrosynteny :
<- compute_macrosynteny(subset_of_orthologs)
macrosynteny_of_subset plot_macrosynteny(macrosynteny_of_subset,
sp1_label = "B.floridae",
sp2_label = "P.echinospica")
The reordering can be performed on the row when calling plot_oxford_grid() by setting the reorder argument to TRUE.
# visualize the loaded data on a oxford grid with reordering and coloring by cluster :
plot_oxford_grid(my_orthologs_table,
sp1_label = "B.floridae",
sp2_label = "P.echinospica",
reorder = TRUE,
color_by = "clust")
# redo and color by sp1.Chr instead :
plot_oxford_grid(my_orthologs_table,
sp1_label = "B.floridae",
sp2_label = "P.echinospica",
reorder = TRUE,
color_by = "sp1.Chr")