Transcript discovery using RNA-seq is an important step in refining eukaryotic genome annotation: it complements in-silico gene predictions with experimental evidence for genes, transcripts, splice variants, and their expression levels. Best practice is to include evidence from both long- and short-read NGS data. Combining the two reduces the difficulty of inferring transcripts and their splice variants from partial observations, or from rarely used splice junctions that result in read-through events.
The new Transcript Discovery Plugin for CLC Genomics Workbench and CLC Genomics Server allows joint analysis of long PacBio reads and short Illumina reads from RNA-seq libraries. The plugin first maps the RNA-seq reads to a genomic reference while permitting large-gap alignments (to account for introns). A transcript discovery process then infers transcripts and genes from the read mappings. If annotations already exist for the genome at hand, they can be updated based on the experimental evidence from the RNA-seq data, generating new transcript and gene annotation tracks.
Central to the Transcript Discovery Plugin is the Large Gap Read Mapping step, which models introns: these are present in the genome reference sequence but are expected to be absent from the corresponding transcriptome. RNA-seq reads are iteratively mapped to the reference genome using the CLC Genomics Workbench read mapper. The Transcript Discovery step then predicts genes, transcripts, and coding regions (CDS) from the read mapping results, taking into account prior knowledge of gene and transcript models and the coding potential of the predictions. The estimation is also aided by several filters applied to the mapped reads: coverage, splice evidence, and splice signatures are considered, while duplicated, non-specifically matching, read-through, and chimeric reads are disregarded.
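To make the idea of splice evidence and splice signatures concrete, here is a minimal Python sketch. It is not the plugin's algorithm or file format, which are not described here. It assumes alignments given as a 0-based start position plus a SAM-style CIGAR string in which `N` marks a skipped reference region (an intron), and it assumes forward-strand transcripts with the canonical GT...AG intron signature. The function names and the `min_reads` threshold are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter

# SAM-style CIGAR operations: a length followed by a one-letter operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def introns_from_cigar(start, cigar):
    """Return introns as (start, end) pairs, 0-based and half-open.

    Assumption: 'N' marks an intron-sized gap in the alignment.
    """
    pos = start
    introns = []
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((pos, pos + length))
            pos += length
        elif op in "MD=X":
            # These operations consume reference bases.
            pos += length
        # I, S, H and P do not advance the reference position.
    return introns


def has_canonical_signature(genome, intron):
    """Check for a GT donor and AG acceptor (forward strand only)."""
    start, end = intron
    return genome[start:start + 2] == "GT" and genome[end - 2:end] == "AG"


def supported_junctions(alignments, genome, min_reads=2):
    """Keep junctions seen in at least `min_reads` reads with a GT...AG signature.

    `min_reads` is a hypothetical threshold, not a plugin setting.
    """
    counts = Counter()
    for start, cigar in alignments:
        counts.update(introns_from_cigar(start, cigar))
    return {
        junction: n
        for junction, n in counts.items()
        if n >= min_reads and has_canonical_signature(genome, junction)
    }
```

Calling `supported_junctions` with a list of `(start, cigar)` tuples and the reference sequence as a string returns a dictionary that maps each retained intron to its read count. A junction is dropped if too few reads span it or if its flanking bases do not match the assumed signature, which mirrors in a very simplified way how weakly supported events can be rejected.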
Rich tables of predicted features, with columns describing the evidence supporting each one, can be viewed as genome tracks and inspected together with the read mappings. Furthermore, examining the Rejected Events table can show which filter settings to adjust, so that they can be fine-tuned until sensitivity and specificity are in balance. An example of some of the output from this plugin can be seen in Figure 1.
The Transcript Discovery Plugin for the CLC Genomics Workbench comes with a fully documented manual and a quick-start tutorial. The manual documents the use of the plugin and contains tips on how to obtain the best results by a) optimizing the order in which iterative analysis steps are performed and b) choosing between combined and sequential processing of multiple replicates and tissues of origin.
The quick-start tutorial provides a step-by-step guide to performing RNA transcript discovery from your RNA-seq libraries. Part of a recently published dataset (1) containing both PacBio and Illumina reads from the wild strawberry Fragaria vesca is provided in the tutorial and can be used to re-annotate the reference genome. In their paper, Li et al. (2018) annotated approximately 33,000 genes using the full dataset (available from SRA). To enable users to walk through the tutorial on a broad range of computers, we provide a link to a smaller subset of the reads. Nonetheless, this subset is still enough to annotate over 14,000 genes, including several new genes and transcripts that were not presented in the original paper.
The Transcript Discovery Plugin is a free add-on for the CLC Genomics Workbench that provides an easy-to-use toolset for evidence-based RNA-seq annotation of genomes. The plugin can be downloaded and installed directly from the Plugin Manager within CLC Genomics Workbench. Alternatively, an installer file can be downloaded from our Workbench Plugins page, where a CLC Genomics Server version of the plugin is also available.