In a recent study published in Cell Reports, researchers performed a genomic analysis to investigate the origination of human microproteins of biological importance.
Studies have reported that sORFs (small open reading frames) encode functional microproteins essential for several biological processes. However, the origin and conservation of such microproteins have not been well characterized. Genomic analysis of microproteins could deepen understanding of the human genomic characteristics that are critical for function.
About the study
In the present study, researchers investigated the origin of functional human microproteins. They focused on cases in which the proteins evolved from non-coding sequences and acquired biological importance.
The study comprised open reading frames (ORFs) that were translated in a previous study (Chen et al.) and reported in the human FANTOM-CAT transcriptome dataset by Hon et al. The analysis was restricted to ORFs situated on noncoding transcripts (‘new’), located upstream of coding ORFs (‘upstream’), located downstream of coding ORFs (‘downstream’), or situated on transcripts devoid of coding ORFs but belonging to transcript families with one coding member (‘new_iso’). The team matched ORFs from the two previous studies on the basis of chromosomal coordinate similarity, 100% sequence identity, and comparable lengths.
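The matching step described above can be pictured with a small Python sketch. It is an illustration only, not the authors' pipeline: the `Orf` record, the `min_overlap` and `max_len_ratio` thresholds, and the rule that the shorter sequence must occur verbatim inside the longer one are all assumptions made here to give "coordinate similarity", "100% sequence identity", and "comparable lengths" a concrete form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orf:
    """Hypothetical ORF record; coordinates are half-open [start, end)."""
    orf_id: str
    chrom: str
    start: int
    end: int
    strand: str
    seq: str


def overlap_fraction(a: Orf, b: Orf) -> float:
    """Fraction of the shorter interval that is covered by the overlap."""
    if a.chrom != b.chrom or a.strand != b.strand:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    shorter = min(a.end - a.start, b.end - b.start)
    return overlap / shorter


def is_match(a: Orf, b: Orf, min_overlap: float = 0.9,
             max_len_ratio: float = 1.1) -> bool:
    """Assumed matching rule: similar coordinates, identical sequence over
    the shorter ORF, and lengths within max_len_ratio of each other."""
    short, long_ = sorted((a.seq, b.seq), key=len)
    if not short:
        return False
    identical = short in long_                     # 100% identity, no mismatches
    comparable = len(long_) / len(short) <= max_len_ratio
    return identical and comparable and overlap_fraction(a, b) >= min_overlap
```

Under these assumed thresholds, two records with the same locus and sequence match, while a single-nucleotide difference or a different chromosome rejects the pair.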
In total, 715 ORFs, situated on 527 transcripts, were analyzed. Data on fitness effects, phenotypic scores, and significance-based classification in induced pluripotent stem cells were obtained from previous studies. CPAT (coding potential assessment tool) was applied to ORF sequences to determine coding probability scores. Ribonucleic acid sequencing (RNA-seq) data were mapped to the relevant genome assemblies. Orthologous transcription was inferred from reference transcriptomes and expression data.
Further, orthologous genomic regions were identified, and the presence of ancestral ORFs was inferred, following which functional signatures were assessed. To estimate the origination timing of every ORF (i.e., the most ancient ancestor with an intact ORF), the team searched for orthologous chromosomal regions of the human ORFs in the genomes of 99 vertebrate species. The team aligned the orthologous sequences of all ORFs and subjected them to PhyloCSF (phylogenetic codon substitution frequencies) analysis. ASR (ancestral sequence reconstruction) analysis was performed to infer the absence or presence of ORFs at human ancestor nodes based on ORF lengths.
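As a rough illustration of dating an ORF along the human lineage, the sketch below scans reconstructed ancestral sequences for an intact ORF and reports the oldest node at which one is still found. Everything concrete in it is an assumption for illustration: the forward-strand ATG-to-stop scan, the `min_fraction` length criterion, the rule that the walk stops at the first node without an ORF, and the toy sequences. It is not the study's ASR procedure.

```python
from typing import List, Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq: str) -> int:
    """Length in nucleotides (start codon through stop codon) of the
    longest ATG..stop ORF in the three forward reading frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def origin_node(lineage: List[Tuple[str, str]], human_len: int,
                min_fraction: float = 0.5) -> Optional[str]:
    """lineage lists (node_name, reconstructed_sequence) from human toward
    the root. Returns the oldest node in the unbroken run, starting at
    human, whose longest ORF reaches min_fraction of the human ORF length
    (an assumed presence criterion)."""
    oldest = None
    for name, seq in lineage:
        if longest_orf_length(seq) >= min_fraction * human_len:
            oldest = name
        else:
            break
    return oldest


# Toy lineage with made-up sequences, ordered from human toward the root.
toy_lineage = [
    ("human", "ATGAAATTTGGGTAA"),
    ("Hominoidea", "ATGAAATTTGGGTAA"),
    ("Simiiformes", "ATGAAATAGGGGTAA"),   # premature stop shortens the ORF
    ("Primates", "CTGAAATTTGGGTAA"),      # start codon absent
]
```

With the toy data, a lenient length criterion places the origin at an older node than a demanding one, which is why the choice of threshold matters for the inferred timing.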
The origination timing of a microprotein was defined as the first node at which both the ORF and the transcript were detectable (putative origin), independent of the origination mode. Where ancestors lacking an intact ORF preceded ancestors possessing one, the origination mode was termed de novo. Data on the origination timings of ORFs and transcripts were combined to infer the origination timing of microproteins with de novo origin. To evaluate the effect of ORF lengths, strict (50%) and relaxed (80%) de novo attribution values were assessed. The team investigated the biological importance and functionality of the de novo-emerged microproteins. All known single-nucleotide polymorphisms (SNPs) annotated as pathogenic or likely pathogenic were surveyed.
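The definitions above can be expressed as a short Python sketch. It assumes, purely for illustration, that presence or absence of the ORF and of the transcript has already been called at each node of the human lineage and stored as boolean lists ordered from human (index 0) toward the root. The function names, the list encoding, and the "simultaneous" label are hypothetical and not taken from the study.

```python
from typing import Dict, Optional, Sequence, Union


def oldest_present(presence: Sequence[bool]) -> Optional[int]:
    """Index of the oldest node in the unbroken run of True values that
    starts at human (index 0); None if the feature is absent in human."""
    oldest = None
    for i, present in enumerate(presence):
        if not present:
            break
        oldest = i
    return oldest


def classify_origin(orf_presence: Sequence[bool],
                    rna_presence: Sequence[bool]
                    ) -> Optional[Dict[str, Union[int, bool, str]]]:
    """Combine ORF and transcript presence calls along the lineage."""
    orf_origin = oldest_present(orf_presence)
    rna_origin = oldest_present(rna_presence)
    if orf_origin is None or rna_origin is None:
        return None
    # De novo: at least one older ancestor lacks an intact ORF.
    de_novo = orf_origin < len(orf_presence) - 1
    # Putative origin: first node at which ORF and transcript co-occur,
    # i.e. the younger (smaller index) of the two origin nodes.
    putative_origin = min(orf_origin, rna_origin)
    if orf_origin > rna_origin:
        order = "ORF-first"
    elif rna_origin > orf_origin:
        order = "RNA-first"
    else:
        order = "simultaneous"
    return {
        "orf_origin": orf_origin,
        "rna_origin": rna_origin,
        "putative_origin": putative_origin,
        "de_novo": de_novo,
        "order": order,
    }
```

For instance, an ORF detectable at three successive nodes combined with a transcript detectable only in human would be labelled de novo and ORF-first, with the putative origin at the human node.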
Of the 715 ORFs analyzed, the team inferred de novo origination for 155, with similar origination nodes for 148 and 102 ORFs under the relaxed and strict cut-offs, respectively. De novo-origin upstream and downstream ORFs showed RNA-first origin. The findings indicated a continuing de novo birth of functional microproteins since the early evolutionary period of mammals.
The team identified 19 putative origin functional microproteins that emerged de novo, of which 12 and seven were encoded on long non-coding RNAs (lncRNAs) and coding transcripts, respectively. Two biologically important microproteins, CATP00001296115.1 and CATP00000751060.1, were found to have a putative origin after the human-chimpanzee split. Both proteins were expressed from lncRNAs and had ORF-first origin, with short time intervals between the origination of the ORFs (at Simiiformes and Hominoidea) and that of the human-specific transcripts.
The findings indicated that de novo-emerged microproteins could become functional within short evolutionary periods. Of 44 de novo-origin functional microproteins, none were found to be coding based on PhyloCSF and RNAcode analysis, although ribosome profiling scores predicted four of them as coding. Two ‘new’ ORFs with putative origination at Euteleostomi were determined to be coding based on PhyloCSF and CPAT analysis.
Of seven ‘upstream’ ORFs, the young CATP00000415540.1 was non-coding and had de novo origin at Simiiformes. Three SNPs were identified as pathogenic or likely pathogenic. Functional ORF CATP00000063293.1 (upstream, de novo origin, putative origin at Simiiformes) contained a pathogenic SNP [SNP database (dbSNP): rs1555735545] related to limb-girdle muscular dystrophy. Another SNP was found on the ‘new’ coding ORF CATP00000005301.1 (dbSNP: rs1238109100) and was likely pathogenic in association with retinitis pigmentosa. The third SNP (dbSNP: rs1560929898) overlapped ORF CATP00000363722.1, which was non-coding, and was related to Alazami syndrome.
The CATP00001771233.1 ORF exemplified rapid gain of functionality among de novo-emerged ORFs, with origination timing at the human-chimpanzee ancestor. In chimpanzees, the locus was transcriptionally active in cardiac tissues only. In humans, the gene was strongly expressed during the induction of melanocytes. Three lines of evidence indicated de novo origin: the orthologous genomic region lacked the ORF in evolutionarily distant species such as armadillos, the ASR findings were consistent with this, and no matches were found among vertebrate proteomes or elsewhere in the NCBI (National Center for Biotechnology Information) database.
Overall, the study findings highlighted functional microproteins originating de novo from noncoding sequences in the human lineage.