Dual isoform sequencing reveals complex

Analysis of AcMNPV transcriptome using third-generation sequencing

In this study, PacBio Sequel and ONT MinION LRS platforms were used to characterize the structure of AcMNPV transcriptome and epitranscriptome (Fig. 1). Sequel sequencing yielded a total of 47,880 Circular Consensus Sequences (CCS), 25,371 of which mapped to the viral genome and 23,884 to the insect host (Sf9 cells) genome. The total read count was less than the sum of the two mapped read counts because of the eliminated chimeric reads formed during library preparation mapped to both of the genomes. The Cap-selected samples yielded a total of 1,830,476 reads, 198,516 of which mapped to the AcMNPV genome and 1,631,960 to the host genome, whereas the non-Cap selected samples yielded 1,119,716 reads, 290,039 mapping to the viral and 760,533 to the host genome. Sequel sequencing yielded longer mean mapped read lengths than the ONT, while the Cap-selected and non-Cap selected ONT reads had similar mapped read lengths (Table 1). The difference in mean read-length between the two platforms is explained by a step during Sequel library preparation used to mitigate the loading bias of PacBio sequencers, resulting in the loss of short cDNAs. Further details on read counts and read lengths are shown in Supplementary Table 1, and the boxplot of Supplementary Fig. 1 presents the distribution of read lengths per sample.

Table 1 Read counts and read lengths of the samples used in this study.

The LoRTIA software suit (developed in our laboratory) was used to annotate viral transcripts and also our in-house script augmented by manual checking. The criterion of acceptance of TSSs and TESs as true transcript ends was to identify them by the LoRTIA software suite in two amplified ONT samples and in another technique that was either sequel or Cap-sequencing. A more stringent criterion was applied for the non-coding transcripts and 5′-truncated transcripts: in addition to 2 amplified ONT samples, Cap-selection had to be confirmed.

After the screening, a total of 311 TSSs and 261 TESs (Supplementary Table 2A,B) as well as 13 splice junctions were obtained (Supplementary Table 2C). TATA boxes were identified for 60 TSSs. The mean distance of a TATA box from the TSS was 32 nts. Twenty-two GC boxes were identified, and their average distance from the TSSs was 66 nts. The average distance of the identified 15 CAAT boxes from the TSSs was 108 nucleotides (nts). The canonical CAGT initiators were present in only 6% of TSSs, the TAAG initiators in 61%, and the non-TAAG initiators in 33% of the cases (Fig. 2a). A total of 875 transcript species were detected (Table 2). The full transcript list is available in Supplementary Table 3A, the abundance of transcripts is available in Supplementary Fig. 2 and Supplementary Table 3B, and the transcripts themselves are depicted in Fig. 3.

Table 2 The number of previously annotated and novel transcripts of AcMNPV.

Approximately 80% of TESs were found to contain a canonical PAS at an average distance of 27.23 nts upstream from the TES. In line with previous findings describing arthropod polyadenylation signals⁵¹, the ± 50 nts surrounding of the viral TESs was characterized by A/U-rich sequences with an increased adenine content immediately upstream of the cleavage site. Intriguingly, sequences harboring a PAS showed a slight increase of adenines between − 26 and − 12 nts upstream from the TES, whereas in the ones without a PAS, this phenomenon could not be observed (Fig. 2b).

UTR isoforms

In this part of the study, 330 transcript isoforms were found to have longer or shorter 5′-UTRs than the previously annotated transcript encoded by the same genes. Less than half (32.06%) of them showed an E/L initiation region shift when compared to the initiator element (Inr) of their previously annotated transcript. 7.06% of TSS transcript isoforms were revealed to be controlled by TAAG-Inr in the previously annotated isoforms being controlled by non-TAAG-Inr and encoded by the same gene. However, we identified non-TAAG-Inr isoforms in 25% of the transcripts that were previously annotated as TAAG-Inr. This phenomenon suggests that several AcMNPV genes are transcribed by both the host and the viral RNP, resulting in altered 5′-UTR lengths. In addition to the genes described in a previous work¹², we detected 57 genes, which were transcribed by both the host and the viral RNP (Supplementary Table 4A). Polymorphism in the 5′-UTR length probably has any biological relevance, but we could not exclude that it represents mere transcriptional noise. In many cases, longer 5′-UTRs harbor upstream ORFs (uORFs), which have been shown to alter the translation of the protein coding sequence by ribosome reinitiation, ribosome stalling or disassociation and ribosome bypass^51,52. We identified 75 gene products containing at least one uORF (Supplementary Table 4B).

All in all, 340 novel TES isoforms were identified in this work, 76.35% of which contained a canonical PAS upstream of their 5′-ends. The phenomenon of nontemplated adenine-addition by the viral RNP has previously been described¹⁸, and this in-vitro study has also suggested the presence of a T-rich termination signal for this enzyme, and nontemplated thymine addition preceding adenine incorporation. In concordance with this work, we found that 51.85% of the 3′-UTR isoforms with a LIS terminated in the near vicinity (± 3 nts from the TES) of a T-rich region. This finding is in contrast with the 22.51% of the 3′-UTR isoforms with non-TAAG-Inr. However, we could not confirm the presence of nontemplated thymines upstream of the poly(A) tail.

The mean read length of transcripts was 1423.7 nts (σ = 913.190) (Supplementary Table 1). Intriguingly, RNAs transcribed by the viral RNP were on an average of 500 nts longer than those transcribed by the host RNP. The mean 5′-UTR length was 153.06 nts (σ = 270.438), and the mean 3′-UTR length was 529.09 nts (σ = 729.266) both measured from the first ORF overlapped by the transcript. The difference was significant for the transcript length and 3′-UTR length, suggesting the tendency of viral RNP to produce longer RNA molecules.

Monocistronic transcripts

Several AcMNPV genes lack a precise transcript annotation because of the challenges facing SRS when assembling a genomic region with a complex transcriptional overlapping pattern. Using LRS, we annotated 14 novel monocistronic transcript species with single-nucleotide precision (Supplementary Table 3A). Canonical TATA boxes were observed upstream of the TSSs of ORF85 and ORF112-113. These transcripts started from a non-TAAG-Inr and harbored a canonical PAS upstream of their TES. The transcripts coding for DNA polymerase were initiated at a canonical arthropod initiator (GCATA), while helicase was initiated at a similar but non-canonical sequence (GCAATA). Both DNAPOL and HEL harbored a canonical PAS upstream of their TESs. Nine of the transcripts (ORF1629, P47, ORF72, ORF84, 38K, ORF108, PP34, P49 and ORF154) originated at a TAAG-Inr, PP34 with a canonical CAAT sequence (CCAATC) 87 nts upstream from its TSS, and five of these transcripts had PASs.

Non-coding transcripts

We detected 101 novel transcript isoform species that did not contain any previously annotated ORFs, two-thirds being longer than 200 nts representing long non-coding RNAs (lncRNAs), while one-third of them fell in the size range of short non-coding RNAs (sncRNAs). We identified 41 sense ncRNAs all of which overlapped a canonical transcript but lacked the stop codons or both the stop codons and a certain part of the 5′-UTRs. Only 10.3% of the ncRNAs had a canonical TATA promoter upstream of their TSSs, whereas 70.5% of them started at a TAAG initiator, which may suggest their late transcription. Twenty-three antisense RNAs (asRNAs) were identified being controlled by their own promoters. These asRNAs were encoded by the complementary DNA strands of 11 genes (Supplementary Table 3A). ORF-60-AS-1 was the only asRNA the promoter of which contained TATA box, whereas 86% of them contained TAAG initiator sequence. All of the ncRNAs were also present in the Cap-selected samples.

Putative 5′-truncated mRNAs with in-frame ORFs

We detected 41 putative novel genes, which produced 5′-truncated version of the canonical mRNAs and contained shorter in-frame ORFs. Nineteen of these transcripts were initiate at TAAG sequence, 9 of which had a previously annotated isoform (EGT, DNAPOL, HCF1, PNK/PNL, HEL, HE65, 94K, IE1 and IE2) initiated at a non-TAAG-Inr, which may suggest that early genes were partially transcribed by the viral RNP at late time-points. Intriguingly, eleven of the previously annotated transcripts (AC-BRO, POLH, ORF19, PP31, ORF66, ORF84, ODV-E25, BV/ODV-C42, ORF117, CHIT, ODV-EC27) originating at a TAAG sequence had 5′-truncated isoforms that were initiated at non-TAAG-Inrs. It implies the transcription of 5′-truncated isoforms of some late genes by the cellular RNP. These transcripts were all present in the Cap-selected samples.

Multigenic transcripts

Several long (more than 760 nts) multigenic RNA molecules were detected in the viral transcriptome. We designated polycistronic transcript species to the ones that exclusively contained tandem ORFs, whereas complex transcripts were defined as multigenic transcripts that contained at least one ORF in the opposite orientation as the rest of the ORFs. In this study, we detected 241 polycistronic transcript species containing at least two ORFs. The main initiator motif of these long transcripts was LIS, as 81.74% of the transcripts started at a TAAG sequence. A total of 79 complex transcript species were detected, 21 of which were transcript isoforms, while the rest of them were mapped to unique genomic locations. The longest complex transcript P10-74-ME53-C-1 had only a single sense and two antiparallel ORFs, while ORF51-52-53-LEF-10-ORF54-55-56-C-1 had the highest number of ORFs (6 sense and 1 antiparallel ORFs).

Replication origin-associated transcripts

The homologous repeat (hr) regions are located in multiple genomic positions in AcMNPV. They are believed to contain the replication origins (Oris). Our LRS approach detected overlapping transcriptional activity at all of the 9 h sequences. However, in the case of h5, LoRTIA did not identify transcripts. Despite this fact, we could detect reads without exact TSSs and TESs. Altogether, 55 transcript species were detected at the hr regions, 50 of which contained a TAAG initiator sequence. Fifteen of these RNAs were polygenic, 32 were TSS variants and 3 were TES isoforms, while 8 were monocistronic transcripts. Most of the overlapping transcripts (12) were transcribed at the genomic junction (at hr1) of the circular viral DNA, 7 of which were complex transcripts, 4 were TSS isoforms, and 1 was a monocistronic RNA.

Splice isoforms and transcripts with retained introns

Chen et al.¹² have previously reported twelve introns with an abundance above 1%. We detected five additional introns (Table 3). Twelve of the introns detected in this study contained the canonical GT/AG splice junctions, while a single one a less common GC/AG. Chen and colleagues have associated a spliced antisense transcript to ORF115. We detected 2 similarly positioned RNAs (ORF117-L-SP-1 and ORF117-L-SP-2), and ORF117-L-SP-1 had matching introns with the previously annotated transcript. ORF117-L-SP-2 had the same acceptor site position, but its donor site was located 85 nts downstream from the previously annotated genomic location. We could not precisely annotate the TSS of these transcripts; however, according to our data, it was located upstream of ORF117-L-1’s TSS. Splicing of the ORF117-L-SP-2 led to frame shifting within the previously annotated ORF, and to the generation of a novel 246 nt-long ORF, originated upstream of the original ATG.

Table 3 TSS, TES, splice junction, ATG and stop codon positions of the spliced AcMNPV transcripts.

Transcriptional overlaps

Theoretically, transcripts can overlap each other in three different ways: convergently, divergently, and parallelly. The AcMNPV genome contains 37 convergent gene pairs. Our LRS approach demonstrated that all of the convergent gene pairs produced transcriptional readthroughs, only 3 pairs of which overlapped exclusively at their 3′-UTRs, while the others overlapped the ORFs (Supplementary Table 5). We detected 32 divergent transcriptional overlap out of the 34 gene pairs, and 84 parallel overlaps out of the 87 gene pairs. We assume that a higher data coverage would detect overlaps in every transcript.

5-mC methylation

We used dRNA-Seq and bisulfite conversion data for the detection of methylated nucleotides of AcMNPV transcripts.

Tombo analysis

Tombo is a software package used for the identification of modified nucleotides from nanopore sequencing data. In order to decrease false positive results in the dRNA-Seq sample, transcripts were filtered with a coverage lower than 30, and the ones the modified fraction of which was less than 30%. We found no significant correlations between the coverage and the number of methylated nucleotides in the raw fraction (Fig. 4a). Using the Tombo software, a possible methylation consensus sequence (UUAC*CG) (the modified C letter is labelled with asterisk) was identified, which indicated the right distribution of log-likelihood ratios (Fig. 4b). Our bisulfite conversion experiment confirmed the methylation of this consensus sequence. The deviation from the canonical C sites could also be clearly detected (Fig. 4c). After identifying the potential false positive sites, we obtained 325 putative 5-mC methylation positions in 12 viral genes (ac-39k, ac-bro, ac-ctl, ac-odv-e25, ac-orf-58, ac-orf-73, ac-orf-74, ac-orf-75, ac-p40, ac-p6.9, ac-polyhedryn, and ac-vp39). The deviation from the canonical C sites could also be clearly detected (Fig. 4c).

Bisulfite conversion analysis

Besides the Tombo analysis of dRNA-Seq data, we also carried out bisulfite conversion experiments. While low read counts (2710 viral reads) were obtained in the dRNA-Seq, a much higher read count (125,448 reads) was generated by the bisulfite sequencing. Furthermore, in the latter method, positive control (non-converted samples) was also used. We could confirm 234 methylated positions out of the 325 (identified by Tombo analysis) with the bisulfate sequencing. In order to decrease the false positive results, we set a coverage of 25 as the threshold for the bisulfate analysis. Altogether, 7897 putative methylation positions were identified in the transcripts of 99 genes (Supplementary Table 6 and Supplementary Fig. 3). We detected 31 potential cytosine positions in the 3′-UTR of the ac-Orf-12 transcript, which were always unconverted, and therefore methylated. Altogether, 88% of the examined potential methylation positions (positions the coverage of which was at least 25) were located at the coding regions and 21% at the UTRs.

A to I RNA hyper-editing

Reads of ORF19-L showed a high frequency of A to I (read as A to G by the sequencing) substitution, which was not present in overlapping reads. We found that 50% of all substitutions were A to G (Fig. 5a,b) for ORF19, which was significantly higher than the 16.9% for overlapping transcripts in the same region (p < 0.0001, sided Fisher’s exact test) (Fig. 5c). A substitution threshold of 16.9% was set to distinguish possible edited bases from the noise of sequencing inaccuracy. Our results showed that 18% of all adenines of ORF19-L presented a high level (x̅ = 0.839, σ = 0.153), while 4% of adenines of the overlapping reads presented a low level of A to G editing (x̅ = 0.224, σ = 0.051). To identify the presence of a possible editing motif recognized by ADAR, we calculated the base frequency in the ± 5 nts surrounding the edited A. It has been previously demonstrated that a G-enriched neighborhood and an upstream U stabilize the RNA-ADAR complex in mammalian cells⁵⁴. We detected a significantly higher frequency of Us (χ²(1, N = 79,455) = 79,338.023, p < 0.01) right upstream of the edited base, while the frequency of Gs was only slightly higher (χ²(1, N = 79,454) = 79,340.021, p < 0.05) at the + 5 position downstream of the edited base.

Dual isoform sequencing reveals complex transcriptomic and epitranscriptomic landscapes of a prototype baculovirus | Scientific