macrosyntR

Sami El Hilali

2022/11/09

An R package for evaluation of pair-wise synteny conservation at the genome-wide scale. It takes a table of orthologs and genome annotation files formatted as BED to automatically infer significantly conserved linkage groups, and order them on an oxford grid.

Overview :

It has 5 functions :

Function description
load_orthologs() integrates genomic coordinates (bedfiles) of the orthologs of the two species to compare
compute_macrosynteny() compares all the chromosomes to each other and identifies the significantly conserved linkage groups
reorder_macrosynteny() takes an mbh_df (from load_orthologs()) and outputs an mbh_df with chromosome levels reordered by cluster and amount of orthologs (run alone or by setting plot_oxford_grid(…,auto_order_clusters = TRUE) )
plot_macrosynteny() draws a dotplot displaying the significant linkage groups with their relative amount of orthologs
plot_oxford_grid() draws an oxford grid from an orthologs_df (output of either load_orthologs() or reorder_macrosynteny()

Step-by-step tutorial :

We demonstrate the usage of the package using a subset of publicly available data.

1 - Pre-process and Load the data :

Foreword :

This package doesn’t compute the orthologs. I recommend to compute it as reciprocal best hits using rbhxpress. It is fast and accurate as it uses diamond blast.

Drawing the plots using this package require to have the following data :

  • A two columns table of orthologs (reciprocal best hits). Each gene must appear only once in the table.
  • genomic coordinates on species 1 for each gene having an ortholog.
  • genomic coordinates on species 2 for each gene having an ortholog.

let’s say I have the following orthologs :

sp1.gene.x1  sp2.gene.y1   
sp1.gene.X2  sp2.gene.y2   
...   
sp1.gene.xn  sp2.gene.yn   

then, the genomic coordinates files must look like (BED format) :

  • species1 :
chr4    200    600    sp1.gene.x1   
chr8     10    400    sp1.gene.x2   
...   
chr12   900    980    sp1.gene.xn   
  • species2 :
chr1    100    200    sp2.gene.y1   
chr6     50    200    sp2.gene.y2   
...   
chr8   300    480    sp2.gene.yn   

Example :

I’m going to show usage of this package by comparing the data of the lancelet Branchiostoma floridae (Simakov et al. 2020) with the data of the vestimentifera (giant tubeworm) Paraescarpia echinospica (Sun et al. 2021).

Download the sequences of proteins (fasta format) and their genomic coordinates :

  • B.floridae : The data are available on ncbi at https://www.ncbi.nlm.nih.gov/genome/?term=txid7739
    get the protein sequences at by clicking the “Download sequences in FASTA format for protein”.
    get the genomic coordinates by clicking “Download genome annotation in tabular format” and further click download as csv.

  • P.echinospica :
    The data are available on figshare under the doi : 10.6084/m9.figshare.15050478

Compute the reciprocal best hits of the fasta sequences. Using rbhXpress you can achieve it by typing the following in your terminal :

 # In the terminal :
 # call mbhXpress with using 6 threads :
 bash mbhXpress -a GCF_000003815.2_Bfl_VNyyK_protein.faa -b Pec_ragoo_v1.0.pep.fasta -o Bflo_vs_Pech.tab -t 6
 

To convert the genome annotation to the bed file format, I’m using the following command lines (if unfamiliar with this you can use a spreadsheet software). The concept is to keep the chrom, chromStart, chromEnd mandatory fields plus the name optional field that links the genomic region with the protein product :

 # In the terminal :
 # B.floridae CSV file to bed
tail -n +2 proteins_75_971579.csv | cut -d "," -f1,3,4,9  | \
sed -e 's/\"//g' -e 's/,/\t/g' -e 's/chromosome /BFL/g' > Bflo.protein_products.bed

 # P.echinospica gff file to bed
fgrep "gene" Pec_genes.ragoo_v1.0.gff | cut -f1,4,5,9 | cut -d ";" -f 1 | \
fgrep "Superscaffold" | sed -e 's/ID=//g' -e 's/Superscaffold/PEC/g' > Pech.protein_products.bed

Please note that I generated a subset of these datasets, and kept only 2500 ortholog pairs to lower the compilation time.

Now the data are ready to be loaded into R using macrosyntR :

library(macrosyntR)
my_orthologs_table <- load_orthologs(orthologs_table = system.file("extdata","Bflo_vs_Pech.tab",package="macrosyntR"),
                                     sp1_bed = system.file("extdata","Bflo.protein_products.bed",package="macrosyntR"),
                                     sp2_bed = system.file("extdata","Pech.protein_products.bed",package="macrosyntR"))

head(my_orthologs_table)
#> # A tibble: 6 × 10
#>   sp1.ID  sp1.Chr sp1.S…¹ sp1.End sp1.I…² sp2.ID sp2.Chr sp2.S…³ sp2.End sp2.I…⁴
#>   <chr>   <fct>     <int>   <int>   <int> <chr>  <fct>     <int>   <int>   <int>
#> 1 XP_035… BFL19    6.91e6  6.93e6      18 PE_Sc… PEC12     82420  107047       1
#> 2 XP_035… BFL5     5.55e6  5.56e6      11 PE_Sc… PEC1     179174  209873       1
#> 3 XP_035… BFL5     1.86e7  1.86e7     105 PE_Sc… PEC1     213205  227811       2
#> 4 XP_035… BFL5     1.69e7  1.69e7      90 PE_Sc… PEC1     253998  296275       3
#> 5 XP_035… BFL7     4.12e5  4.17e5       3 PE_Sc… PEC13    276772  278568       1
#> 6 XP_035… BFL4     1.40e7  1.40e7      67 PE_Sc… PEC4     312995  335349       1
#> # … with abbreviated variable names ¹​sp1.Start, ²​sp1.Index, ³​sp2.Start,
#> #   ⁴​sp2.Index

2 - Compute linkage groups and plot :

Let’s compute the pairs of chromosomes that have a significant amount of orthologs using compute_macrosynteny(). We can visualize the results on a dot plot using plot_macrosnyteny() and see the distributions of orthologs on an oxford grid using plot_oxford_grid()


# compute significance :
macrosynteny_df <- compute_macrosynteny(my_orthologs_table)
head(macrosynteny_df)
#>   sp1.Chr sp2.Chr orthologs         pval significant     pval_adj
#> 1    BFL1    PEC1         7 9.999998e-01          no 1.000000e+00
#> 2    BFL1   PEC10        42 3.228837e-08         yes 6.941999e-06
#> 3    BFL1   PEC11         3 9.998885e-01          no 1.000000e+00
#> 4    BFL1   PEC12         5 9.999247e-01          no 1.000000e+00
#> 5    BFL1   PEC13         5 9.981700e-01          no 1.000000e+00
#> 6    BFL1   PEC14         1 9.965970e-01          no 1.000000e+00

# visualize the loaded data on a oxford grid :
plot_oxford_grid(my_orthologs_table,
                 sp1_label = "B.floridae",
                 sp2_label = "P.echinospica")

# Visualize the results of the test of significance :
plot_macrosynteny(macrosynteny_df,
                  sp1_label = "B.floridae",
                  sp2_label = "P.echinospica")

3 - Reorder chromosome levels to group the linkage groups in clusters :

3.1 - Automatic reordering using a network-based greedy algorithm :

Reordering the chromosomes using a network based greedy algorithm can be performed by calling the function reorder_macrosynteny. It returns an orthologs_df with reordered levels in sp1.Chr and sp2.Chr. These columns are factors where the levels determine the plotting order. You’ll see the results of the clustering, when drawing the oxford grid of this newly generated orthologs data.frame


# visualize the loaded data on a oxford grid :
my_orthologs_table_reordered <- reorder_macrosynteny(my_orthologs_table)
plot_oxford_grid(my_orthologs_table_reordered,
                 sp1_label = "B.floridae",
                 sp2_label = "P.echinospica")

# compute significance and visualize on a dotplot :
macrosynteny_df_reordered <- compute_macrosynteny(my_orthologs_table_reordered)
plot_macrosynteny(macrosynteny_df_reordered,
                  sp1_label = "B.floridae",
                  sp2_label = "P.echinospica")

3.2 - Manually reorder/subset the Chromosomes :

If you would like to subset some chromosomes of interest and manually reorder them you can still take advantage of functions implemented to handle data.frames. This task is out of the scope of this package, and can achieved using base R :

# select only the orthologs falling in the chromosomes of interest and plot: 
subset_of_orthologs <- subset(my_orthologs_table, sp1.Chr %in% c("BFL13","BFL15","BFL2","BFL3") & sp2.Chr %in% c("PEC2","PEC5","PEC11"))

plot_oxford_grid(subset_of_orthologs,
                 sp1_label = "B.floridae",
                 sp2_label = "P.echinospica")

# reorder :
subset_of_orthologs$sp2.Chr <- factor(subset_of_orthologs$sp2.Chr,levels = c("PEC5","PEC11","PEC2"))
plot_oxford_grid(subset_of_orthologs,
                 sp1_label = "B.floridae",
                 sp2_label = "P.echinospica")

# Compute and plot macrosynteny :
macrosynteny_of_subset <- compute_macrosynteny(subset_of_orthologs)
plot_macrosynteny(macrosynteny_of_subset,
                 sp1_label = "B.floridae",
                 sp2_label = "P.echinospica")

4 - Plot directly with reordering the linkage groups and coloring :

The reordering can be performed on the row when calling plot_oxford_grid() by setting the reorder argument to TRUE.


# visualize the loaded data on a oxford grid  with reordering and coloring by cluster :
plot_oxford_grid(my_orthologs_table,
                 sp1_label = "B.floridae",
                 sp2_label = "P.echinospica",
                 reorder = TRUE,
                 color_by = "clust")

# redo and color by sp1.Chr instead :
plot_oxford_grid(my_orthologs_table,
                 sp1_label = "B.floridae",
                 sp2_label = "P.echinospica",
                 reorder = TRUE,
                 color_by = "sp1.Chr")