Detecting purifying selection upstream of annotated coding regions
Purifying selection is a typical evolutionary signature of protein-coding sequences. One would expect that if translation initiates upstream of the annotated start codon, this upstream region should evolve as a protein-coding sequence. We used this indicator for the identification of N-terminally extended proteoforms in the human genome utilising PhyloCSF score. PhyloCSF is a state-of-the-art method for assessing the evolutionary protein-coding potential of a genomic region based on multiple sequence alignment38. However, the protein-coding evolution of an upstream region does not necessarily mean that the underlying mechanism is the non-AUG initiation or alternative transcription (Fig. 1c).
Determining the exact start of non-AUG extensions in practice can be quite challenging because there might be several non-AUG starts upstream of AUG like in the case of PTEN. So we focused on predicting the genes which are most likely to have alternative extended non-AUG proteoforms irrespective of our ability to identify specific locations of start codons. The final set of candidates (PhyloSET, see Supplementary Data 1A) was obtained via the following steps (see Methods for further detail) shown in Fig. 2, steps 1–4a. PhyloCSF score is ranging from 0.1452 (NRXN1) to 2693.8893 (CCDC8) with a median value of 155.9191 (TRPC1). PhyloCSF scores per codon were calculated to observe how the selection changes over the selected upstream region. Ideally, we would expect that the start of extension is clearly separated from the non-coding sequence, in other words, PhyloCSF score becomes positive at the border between the extension and preceding non-extension part as it happens for the CCDC8 gene (Fig. 3a). However, for the majority of genes, no such clear change in scores can be spotted perhaps because evolutionary selection on N-terminus is relaxed in comparison with internal parts.
Detecting translation upstream of annotated coding regions with Ribo-seq and proteomics data
Another set of candidates (RiboSET) was selected solely based on ranked translated extensions predicted using ribosome profiling data (both elongating and initiating) with Trips-Viz, (Fig. 2, steps 1–3, 4b and Supplementary Data 1B). Trips-viz is a computational data environment for analysing Ribo-seq data on a transcriptome level39,40. It contains thousands of uniformly processed public Ribo-seq data and provides tools for analysis and visualisation of translation. Our recent addition to the Trips-viz platform is Ribo-seq ORF predictor which outputs a list of ranked translated ORFs. The biggest advantage of the tool is that a large number of processed public Ribo-seq data is already available to users and they can apply ORF predictor and visualise results immediately. The tool is tailored to detect different types of ORFs including uORFs, N-terminal extensions (NTEs), nORFs, CDSs and dORFs. The algorithm for NTEs is based on triplet periodicity present across the entire length of the extension and utilises patterns of ORF translation such as consistency of ribosome footprint triplet periodicity within the reading frame, the increase of footprint density at the potential start, non-zero coverage and average read density (see Supplementary Methods). Of note, it automatically filters out regions that overlap with coding exons. We employed this algorithm to detect non-AUG N-terminally extended proteoforms using aggregated elongating and initiating ribosome profiling data with high triplet periodicity scores (Supplementary Data 8).
In the absence of an objective threshold for discriminating genuine translation from biological and technical noise, we decided to incorporate 500 top scoring extensions into RiboSET which upon further filtering (Methods) was reduced to 390 genes. Although important to mention that the actual number of translated non-AUG extensions is much higher (multiple extensions ranked below 5000 are reliably translated). Proteomics data available in Trips-Viz40 supported extensions in 90 genes in this RiboSET (18 in PhyloSET). Only eight genes (CCDC8, CYTH2, FXR2, H1FX, HNRNPA0, MARCKS, RPTOR and SFPQ) are common between PhyloSET and RiboSET (Fig. 3a, b and Supplementary Fig. 1). Overlap between RiboSET and PhyloSET is statistically significant (hypergeom. padj = 3.59e-05).
The small overlap between PhyloSET and RiboSET requires an explanation (more detailed in Discussion). Genes may occur in PhyloSET exclusively either because they are not expressed in the cells for which Ribo-seq data are available or because the 500 top-ranking threshold is too conservative. To explore the first possibility we studied a relationship between the rank of predicted extension and CDS coverage—the average number of footprints at CDS. We observed that the lower the average number of footprints at CDS, the lower extension is ranked (Fig. 3c). To show whether a threshold of 500 top-ranking candidates is too conservative, we explored the size of the overlap increase (we use step = 500) between PhyloSET and RiboSET depending on the ranking threshold (Fig. 3d). It appears that until the rank reaches 6500, there is a statistically significant (p = 0.0016, Mann–Whitney U-test) increase in the overlap size in comparison with what would be expected by chance suggesting that Ribo-seq signal above our selected threshold has a clear positive value at predicting genuine N-terminal extensions. Genes with high CDS coverage but with no rank may still have extensions translated under certain conditions (examples can be found in Supplementary Fig. 2). Thus the top-500 cutoff is very conservative. While this threshold avoids false positives, it also generates many false negatives. Hence, we also generated an extended dataset with a more relaxed threshold equal to 5000 which after filtration resulted in 3451 genes (Supplementary Data 2).
Comparison with previously identified non-AUG proteoforms
We also wanted to know how well translation detected with ribosome profiling concords with phylogenetic approaches. We performed comparisons of a set of predicted proteoforms from the previous study22 with PhyloSET and RiboSET. In brief, the study utilised alignments of human and mouse RefSeq transcript sequences which resulted in the prediction of 59 genes with evolutionary conserved extensions. PhyloSET and RiboSET are meant to have only new non-AUG proteoforms which have not been described in GENCODE v35 and the latest RefSeq annotation (due to the exclusion of overlapping coding exons). In GENCODE v35, 24 non-AUG proteoforms from study22 have been annotated (we called this set ‘ann_24’); 28 genes have not been annotated with non-AUG proteoforms and retained intact extension sequences detected in the previous study (‘un_28’) and 7 genes remained without annotated near-cognate initiated proteoforms or intact extensions (‘diff_utr’, HELZ2, ANKRD42, WDR26, ZFP62, C1QL1, PTEN, TIAL1, where WT1 was shown to be annotated and intact extension in TIAL1 still corresponds to only nonsense-mediated decay transcript, (Supplementary Data 3). Among 28 genes there were four genes found in PhyloSET and two genes in RiboSET. Among seven genes one gene and zero genes were shown in PhyloSET and RiboSET correspondingly (Fig. 3f). As could be expected overlap between ‘un_28’ and PhyloSET is statistically significant in comparison to all protein-coding genes (hypergeom padj = 2.838e-06) while the overlap between ‘un_28’ and RiboSET is not significant (hypergeom padj = 0.122).
We also decided to test whether a PhyloCSF-based approach is able to re-identify those 24 genes (‘ann_24’) which have been already annotated in GENCODE v35. First, we created a set of transcripts with non-AUG proteoforms for these genes where the start of CDS is moved to downstream AUG (Fig. 3e). It turned out that the PhyloCSF score is positive for less than half of genes (11/24, Supplementary Data 3). The remaining 13 genes have not shown a positive PhyloCSF score. It might be explained by the extended part being much shorter than 50 codons which may have led to an excessive codons’ impact on negative scores (MYC, YPEL1, HCK and TRPV6). We also compared the distribution of PhyloCSF scores for upstream regions of ‘ann_24’ genes with the moved start of CDS and ‘un_28’ genes (Fig. 3g). The already annotated genes (‘ann_24’) have significantly higher PhyloCSF score (Mann–Whitney, p value = 0.0086) than not yet annotated ones (‘un_28’).
When compared to the previous study (ref. 22), we observed a statistically significant but nevertheless small overlap (four genes) with PhyloSET. One would expect that due to both studies using similar approaches (phylogenetic analysis in mammalian species), the overlap should be higher. This can be explained by multiple reasons including that 24 genes out of a total of 60 have already been annotated in GENCODE v35 and thus were excluded from our analysis and seven genes have unmatching sequences of 5’UTRs in comparison to RefSeq which was used in a previous study. Only 28 genes were left available for discovery. We used multiple sequence alignments of 120 mammals in contrast to ref. 22 where only paired alignments of human and mouse were studied. It could explain why such a small overlap was observed – the probability of upstream regions to be conserved in a wider range of species is generally lower than in just two species. Also, the length of the region we took for evaluation is 50 codons-long which simply might be longer than the actual extension thus leading to a negative PhyloCSF score derived from excessive upstream non-conserved triplets not included in the actual extension. The same reason may explain why we could re-identify only half of the genes already annotated in GENCODE v35 using our phylogenetic approach. Nevertheless, PhyloCSF score of extensions in already annotated genes from ref. 22 study is significantly higher than in upstream regions of not yet annotated genes (Fig. 3g).
We revisited Ribo-seq profiles for ‘ann_24’ genes (Supplementary Figs. 7, 8). Lack of Ribo-Seq data coverage was shown for the entire mRNA of 7 genes (FNDC5, NR1I2, PRPS1L1, TRPV6, HCK, YPEL1 and OAZ3); extensions are clearly supported by Ribo-seq data in 16 genes, although for KCTD1 it is not clear whether extension starts where it is annotated.
In addition, we also compared our gene sets with Van Damme et al. study31 which identified 17 human genes with non-AUG N-terminal extension using ribosome profiling and N-terminal proteomics (Supplementary Data 3). Among 17 candidates, two genes from PhyloSET (FXR2 and HNRNPA0) and seven genes from RiboSET (NARS, HDGF, HNRNPA0, FXR2, SYAP1, KAT7 and BAG6) were present. Both overlaps between Van Damme et al. study and PhyloSET or RiboSET are significant (hypergeom. padj = 0.0016 and 1.002e-07 correspondingly). We also compared RiboSET and PhyloSET gene lists with extensions detected with the N-terminal-peptide-enrichment method from the study Yeom L. et al41. Only 1 gene from PhyloSET and 17 genes from RiboSET were present among 171 genes with N-terminal extensions (Supplemental Data 3). Overlap between Yeom L. et al study and RiboSET is significant (hypergeom. padj = 1.322e-07) in contrast to PhyloSET (hypergeom. padj = 0.403).
Of note, there was a significant discrepancy between RefSeq and GENCODE gene annotations in PTEN. In the latest RefSeq mRNA the CUG-initiated proteoform is annotated correctly, while in GENCODE v35 this proteoform has not been annotated yet and 5′ leader of the only one available transcript ENST00000371953. This can be explained by the incorrect sequence of the reference genome (assembly GRCh38)—it has the variant which is known as NC_000010.11:g.87864104delT and its global minor allele frequency in 1000 Genomes 0.00000 (T). This variant introduces frameshift into the 5′ leader of a transcript thus disrupting the sequence of CUG-extension. In RefSeq gene annotation this variant is cut from the transcript sequence and shown as 1nt-intron in Genome Browser (Supplementary Fig. 9).
Characterisation of genes with predicted non-AUG initiation
Firstly, we studied the distribution of start codon type across starts predicted by Trips-viz in RiboSET. As expected, the most frequent non-AUG start in extensions was CUG, followed by GUG and ACG (Fig. 4a and Supplementary Fig. 10e). In RiboSET only one non-AUG initiation site per transcript is predicted based on internal probability ranking relying on features associated with the intensity of Ribo-seq signal. Therefore, one would expect that the initiation efficiency of such starts may be facilitated by certain features including the optimality of Kozak context and downstream secondary mRNA structures. It has been shown that certain non-AUG start codons with the appropriate sequence context can initiate translation comparable to that of AUG start codons42. The efficiencies of TIS (Translation Initiation Site) including the start codon and four positions upstream and downstream were previously measured with FACS-seq. For this purpose, a library of fluorescent reporters under control of all possible contexts surrounding near-cognate initiation codons was transfected into cells and then cells were sorted based on fluorescence. The efficiency of a specific TIS was measured based on its enrichment within a specific fraction and scaled relative to the optimal TIS (CACCAUGG) efficiency score set to 10042 (the scores of non-AUG starts were found in the range from 0.2 to 50.4). We compared TIS scores of predicted non-AUG starts from RiboSET (Supplementary Data 4) to all other non-AUG codons in theoretical extensions in RiboSET and PhyloSET (Fig. 4c and Supplementary Fig. 10d). It is clearly seen that predicted non-AUG starts in RiboSET have more favourable initiation contexts in comparison to all theoretical non-AUG codons in primary extension sequences thus endorsing Trips-viz-based prediction method (Kolmogorov–Smirnov two-sample test, p value 1.19e-18, statistic = 0.245). Similarly to AUG, the optimal sequence context of predicted non-AUG starts tends towards having guanosine in −4 and +4 positions (Fig. 4d and Supplementary Fig. 10f).
The next step was to assess the stability and presence of mRNA secondary structures located downstream of predicted start codons. It has been shown that a strong RNA secondary structure located downstream of the initiation site significantly increases the efficiency of initiation at non-AUG codons19,43. We selected 400 genes with high-scored AUG-containing TISs and 400 genes with low-scored AUG-containing TISs as well as all predicted non-AUG TISs from RiboSET (390 genes). We then used RNAfold44 to calculate the free energy of predicted RNA secondary) within a sliding window of 22nt with the step of 1nt in the region surrounding the potential start codon (10nt upstream and 100nt downstream), see Fig. 4b and Supplementary Fig. 10c. As expected, more stable mRNA secondary structures were present on transcripts with less optimal TIS codons.
Next, we compared PhyloCSF scores of upstream regions of RiboSET genes and genes with no translation upstream. In brief, we created a set called UntranslSET containing 384 genes with no translation upstream in theoretical N-terminal extension but with well-translated CDS and a theoretical extension length is at least 20 codons (Methods). Interestingly, it turned out that both PhyloCSF score distribution of RiboSET and UntranslSET are skewed towards negative values (Fig. 3h), however, we can clearly see (Kolmogorov–Smirnov, p value = 2.23e-06) that PhyloCSF score of RiboSET genes is generally lower than of UntranslSET. More importantly, the PhyloCSF score of upstream regions of RiboSET_ext is also skewed towards large negative values providing us with evidence of a lack of evolutionary selection among mammals for thousands of translated non-AUG extensions (Supplementary Fig. 10a). Such a significant difference in PhyloCSF scores between UntranslSET and RiboSETs might be explained by the length of the theoretical extensions (Supplementary Fig. 10g, h)—upstream regions in UntransSET are shorter than in RiboSET (RiboSET_ext) therefore PhyloCSF score might be higher for them.
One would expect that proteoforms with different alternative N-termini (PANTs) may possess different functional properties. For instance, longer proteoforms may contain a signal of subcellular localisation in their extended part for alternative compartmentalisation33,45,46. Functional gene enrichment analysis was performed using the Gene Ontology resource47,48 using simplified GO terms49. Significant enrichment for genes from RiboSET was shown in 9 terms in ‘cellular component’ and 4 terms in ‘molecular function’. We also observed overlaps between terms, e.g. 58 genes associated both with term ‘nucleoplasm‘ and ‘cytoplasm’, 29 common genes between ‘cytoplasm’ and ‘membrane’ and 16 genes between ‘nucleoplasm’ and ‘membrane’ terms (Supplementary Data 5, Fisher’s exact test, p values are adjusted with Benjamini–Hochberg correction). No enrichment was shown for PhyloSET genes. We also repeated the analysis for RiboSET_ext and found similar patterns (Supplementary Data 5, Fisher’s exact test, p values are adjusted with Benjamini–Hochberg correction).
In eukaryotes, N-terminal targeting signals include mitochondrial targeting signal and the signal sequence for the secretory pathway (signal peptides)50. Membrane proteins may also contain a signal peptide, but most often the N-terminal transmembrane (TM) region functions as the signal sequence51. First, we extracted the main subcellular location of proteins based on immunofluorescently stained cells from The Human Protein Atlas (HPA)52,53. We split genes into two groups based on the number of alternative cell compartments: 1 or at least 2 (Fig. 4e and Supplementary Fig. 10b). We found no significant enrichment in multiple localisation for genes in RiboSET or PhyloSET (and RiboSET_ext) in comparison to UntranslSET (Supplementary Data 6).
Next, we utilised algorithms for the prediction of localisation signals in the extended proteoforms from RiboSET. SignalP 5.0 is a deep neural network approach that detects the presence of signal peptides and the location of their cleavage sites in proteins54. We utilised the web-server interface for extended proteins from RiboSET, PhyloSET, UntranslSET and RiboSET_ext. Given that signal peptide length varies from 16 aa to 30 aa55, 10 genes from RiboSET (53 genes from RiboSET_ext) turned out to have signal peptide at least partially within extension; no genes in PhyloSET with signal peptide within theoretical extensions were found (Supplementary Data 6). Among genes with predicted signal peptides, in RiboSET there were 7 genes (44 in RiboSET_ext) with detected signal peptides residing entirely within the N-terminal extension including RAE1 (Fig. 4g).
We also explored the possibility that extended parts of proteoforms can target them to mitochondria. We applied TargetP 2.056 and given the length of mitochondrial presequences is 20–60 amino, we found 18 genes in RiboSET (83 in RiboSET_ext) and no genes in PhyloSET with mitochondrial signal in their extensions (Supplementary Data 6). For 12 genes (57 in RiboSET_ext) including RFK (Fig. 4f) the cleavage site of mitochondrial transfer peptide was located within the extension part.
Next we explored the existence of N-terminal transmembrane helices within predicted extensions using the transmembrane domain prediction ability of DeepTMHMM57. We found two genes from RiboSET and two genes from PhyloSET (12 genes in RiboSET_ext, Supplementary Data 6) with the first TM helix located at least partially within the extension (Fig. 4h).
Next, we tested whether SignalP, TargetP, and DeepTMHMM detection of compartmentalisation signals is more common in translated or conserved upstream extensions than in equivalent genes with no upstream translation (UntranslSET). The occurrence of mitochondrial presequence predicted with TargetP in RiboSET was shown to have significant enrichment of presequences in comparison to vs UntranslSET (p value = 0.022) although after multiple testing correction the p value did not retain significance. Also, no TM helices were found in UntranslSET unlike 2 and 2 genes in RiboSET and PhyloSET (12 genes in RiboSET_ext). For other predictors, there was no significant enrichment of upstream regions of RiboSET (RiboSET_ext) and PhyloSET genes in predicted localisation signals in comparison to genes with no detected translation (Fisher exact test, alternative = ’greater’, padj > 0.05, Supplementary Data 6). Taken together, no strong association between alternative localisation signals and non-AUG extensions were found.
Exclusive non-AUG initiation
One intriguing aspect of non-AUG initiation is that it can be exclusive, which means that unlike in case of PANTs non-AUG initiated proteoform is the only one synthesised from a transcript. This suggests that the non-AUG initiation function might be different from the production of alternative proteoforms. There are very few known examples of sole non-AUG initiation e.g. EIF4G2, TRPV6, TEAD1 and STIM2 and the reason why non-AUG is preferred evolutionary over AUG has not been elucidated yet. Here we reported several examples of most likely exclusive non-AUG initiation according to their Ribo-seq profiles (Fig. 5, Supplementary Figs. 11, 12, Supplementary Data 7 and Supplementary Notes).
Reannotation of non_AUG proteoforms
One of the goals of this study is updating human gene annotation with newly identified non-AUG proteoforms as well as reannotation of incomplete or incorrect transcript isoforms discovered along the way. GENCODE also maintains annotation of mouse genes, so human orthologs in mouse have also been incorporated. Of note, annotation requires not only the information about extension being present for that gene, but also an exact position of translation initiation which is challenging to infer precisely due to multiple reasons.
Therefore, the initial phase of annotation includes several immediate cases. Thirty genes have maintained canonical AUG start is their models in human, while additional models with non-AUG start have been introduced: SFPQ (CUG), VANGL2 (AUA), CCDC8 (CUG), PELI2 (CUG), CYTH2 (CUG), FXR2 (GUG), H1F or H1-10 (CUG), RPTOR (CUG), USP19 (AUA), SLC6A1 (CUG), NPLOC4 (GUG), SLC25A32 (UUG), ADO (CUG), JUN (CUG), HDGF (AUU), POGZ (ACG), BRD7 (CUG), PIM2 (CUG), PTMS (CUG), CARM1 (CUG), TARBP2 (ACG), CHTOP (GUG), TNKS2 (UUG), KAT7 (ACG), FHIP2A (CUG), SYNCRIP (CUG), KCTD9 (GUG), PPP4R2 (CUG), ZNF384 (CUG) and SNRNP25 (CUG). Similarly, non-AUG extensions will be annotated in the next GENCODE release for mice orthologs (Sfpq, Vangl2, Ccdc8, Peli2, Cyth2, Fxr2, H1fx or H1f10, Rptor, Usp19, Slc6a1, Nploc4, Slc25a32, Ado, Jun, Hdgf, Pogz, Brd7, Pim2, Ptms, Carm1, Tarbp2, Chtop, Tnks2, Kat7, Fhip2a, Syncrip, Kctd9, Ppp4r2, Zfp384 and Snrnp25). One more gene (XRRA1) which was not included in RiboSET due to its rank (rank 711, below 500 threshold) has also gotten new transcript models both for mouse and human (GUG). Additionally, AUG-extended proteoform in human (PTPRJ) has been introduced.
Having a comprehensive and accurate gene annotation is crucially important for a wide research community, especially for clinicians who heavily rely on gene annotation with regard to variant interpretation. Since we discovered novel protein-coding regions, we overlapped ClinVar variants with predicted non-AUG extensions from RiboSET_ext and found 124 genes (201 variants) with either pathogenic, likely pathogenic or conflicting interpretations of pathogenicity (Supplementary Data 9). This is likely to be an underestimation of variation simply because only annotated coding regions are generally used for variant calling. Similarly, for primary extensions in PhyloSET we found gene GDF5OS carrying 4 variants (pathogenic and likely pathogenic, Supplementary Data 9). Therefore this set can be used for assessing variation occurring in 5′UTRs.