Chloroplast genome structures of Ligustrum species
The cp genomes of all four Ligustrum species were covalently closed, double-stranded circular molecules comprising a pair of inverted repeats with the same sequence but opposite orientation (IRa and IRb), one LSC region, and one SSC region. No large-segment or regional deletions were detected. Genome length ranged from 162,272 to 166,358 bp (Fig. 1). Heteroplasmy was observed, and comparison of each species with L. sinense yielded a different set of SNPs. The cp genomes of L. obtusifolium and L. sinense differed in length by 815 bp, with 291 SNPs in total; those of L. vicaryi and L. sinense differed by 3996 bp, with 274 SNPs; and those of L. ovalifolium ‘Aureum’ and L. sinense differed by 4086 bp, with 284 SNPs (Supplemental file-SNP). Despite this heteroplasmy, the types and numbers of cp genes differed little among species (Table 1), indicating that the cp genomes of the four Ligustrum species are relatively conserved.
Next, the basic characteristics of the cp genomes of ten Ligustrum species were evaluated. The total length of Ligustrum cp genomes ranged from 162,185 bp (L. vulgare) to 166,800 bp (L. ovalifolium). The length of the LSC region ranged from 86,885 bp (L. sinense) to 90,106 bp (L. ovalifolium); the SSC region length ranged from 11,446 bp (L. ovalifolium, L. ovalifolium ‘Aureum’) to 11,499 bp (L. gracile); the IR region length ranged from 31,608 bp (L. vulgare) to 32,624 bp (L. ovalifolium); the coding region length ranged from 84,903 bp (L. vicaryi) to 89,070 bp (L. ovalifolium); and the non-coding region length ranged from 75,662 bp (L. vulgare) to 81,365 bp (L. vicaryi) (Table 1). A total of 132–137 cp genes were detected, comprising 89–90 protein-coding genes, 35–39 tRNA genes, and 8 rRNA genes. GC content differed among regions of the cp genome and among genes of different functional classes. It was generally higher in the coding region (38.00–38.22%) than in the non-coding region (37.70–37.91%), and was highest in the IR region (41.16–41.40%), followed by the LSC region (36.17–36.33%) and the SSC region (32.68–32.81%). Within the coding region, the GC content of the rRNA genes was 55.22–55.37%; the total GC content (37.93–38.06%) was lower than that of the IR region but higher than those of the SSC and LSC regions. Within protein-coding sequences, GC content was higher at the first codon position than at the second and third positions (Fig. 2).
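The per-position GC comparison reported above can be illustrated with a minimal sketch. It assumes an in-frame coding sequence given as a plain string; the function name gc_by_codon_position and the toy sequence are hypothetical and are not taken from the study's pipeline.

```python
def gc_by_codon_position(cds: str) -> tuple:
    """Return the GC percentage at codon positions 1, 2 and 3.

    Assumes `cds` starts in frame; a trailing partial codon is ignored.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # drop an incomplete last codon
    if usable == 0:
        raise ValueError("sequence is shorter than one codon")
    percentages = []
    for offset in range(3):
        # Every third base, starting at this codon position.
        bases = seq[offset:usable:3]
        gc = sum(base in "GC" for base in bases)
        percentages.append(100.0 * gc / len(bases))
    return tuple(percentages)


# Toy in-frame sequence (hypothetical): ATG GCC GAT TAA
gc1, gc2, gc3 = gc_by_codon_position("ATGGCCGATTAA")
print(f"GC1={gc1:.1f}%  GC2={gc2:.1f}%  GC3={gc3:.1f}%")
```

Applied to the concatenated protein-coding sequences of one genome, the three values would correspond to the kind of per-position comparison summarized in Fig. 2.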
With duplicate genes counted only once, a total of 114 genes were annotated in the cp genomes of ten Ligustrum species, including 82 protein-coding genes, 4 rRNA genes, and 28 tRNA genes (Table 2). Introns play an important role in the regulation of gene expression. A total of 22 genes in the cp genomes of ten Ligustrum species contained introns, among which ndhA, ndhB, petB, petD, atpF, rpl2, rpl16, rps12, rps16, rpoC1, accD, trnA-UGC, trnG-GCC, trnG-UCC, trnI-GAU, trnL-CAA, trnL-UAA, trnL-UAG, trnV-GAC, and trnV-UAC each contained one intron, and ycf3 and clpP each contained two introns. The accD gene contained one intron only in L. obtusifolium and L. vicaryi, whereas the accD gene of all other Ligustrum species had no introns; similarly, the trnV gene of L. sinense, L. obtusifolium, L. vicaryi, and L. ovalifolium ‘Aureum’ contained one intron, whereas the trnV gene of all other Ligustrum species had none. These differences suggest that intron loss has occurred during the evolution of Ligustrum species (Supplementary Table 1).
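The distinction between the total gene count and the count with duplicates collapsed can be sketched as follows. The gene list is a hypothetical toy annotation, not the study's data, and summarize_unique_genes is an illustrative helper name.

```python
from collections import Counter


def summarize_unique_genes(annotated_genes):
    """Return (total annotations, unique genes, sorted names of duplicated genes)."""
    copies = Counter(annotated_genes)
    duplicated = sorted(name for name, n in copies.items() if n > 1)
    return len(annotated_genes), len(copies), duplicated


# Hypothetical annotation in which four genes occur once in each IR copy.
genes = ["psbA", "rpl2", "rrn16", "trnI-GAU", "ndhB",
         "rpl2", "rrn16", "trnI-GAU", "ndhB"]
total, unique, duplicated = summarize_unique_genes(genes)
print(total, unique, duplicated)
```

Genes located in the inverted repeats are annotated in both IRa and IRb, which is why the unique count is lower than the total.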
Codon usage indices
Investigation of the codon usage preferences of Ligustrum species showed that the codon adaptation index (CAI), codon bias index (CBI), frequency of optimal codons (FOP), and GC content at the third codon position (GC3) were similar among the ten Ligustrum species, whereas the effective number of codons (ENc) was slightly higher in L. lucidum than in the other species (Fig. 3). ENc values range from 20 (complete preference) to 61 (no preference)16. The ENc values of all Ligustrum cp protein-coding genes in this study were > 40, indicating that the overall codon usage preference among Ligustrum cp protein-coding genes was weak.
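To make the 20 to 61 scale concrete, the following is a minimal sketch of one common ENc estimator, Wright's formula, ENc = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6, where Fk is the mean codon homozygosity of amino acids with k synonymous codons. The source does not state which estimator or software was used, so this is an assumption; the handling of sparse data (skipping amino acids with fewer than two codons or zero homozygosity) is also a simplification, and all names are illustrative.

```python
from collections import Counter, defaultdict

# Standard genetic code in TCAG order; the plastid code (table 11) assigns
# the same amino acids to codons and differs only in start codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def effective_number_of_codons(cds: str) -> float:
    """Estimate ENc for an in-frame coding sequence (Wright's formula, assumed)."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))

    # Codon counts grouped by amino acid (stop codons excluded).
    per_amino_acid = defaultdict(list)
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid != "*":
            per_amino_acid[amino_acid].append(counts[codon])

    # Homozygosity F for each amino acid, grouped by degeneracy class.
    homozygosity = defaultdict(list)
    for codon_counts in per_amino_acid.values():
        k, n = len(codon_counts), sum(codon_counts)
        if k == 1 or n < 2:
            continue  # Met/Trp, or too few observations
        f = (n * sum((c / n) ** 2 for c in codon_counts) - 1) / (n - 1)
        if f > 0:
            homozygosity[k].append(f)

    mean_f = {k: sum(v) / len(v) for k, v in homozygosity.items()}
    # If isoleucine is uninformative, approximate F3 by the mean of F2 and F4.
    if 3 not in mean_f and 2 in mean_f and 4 in mean_f:
        mean_f[3] = (mean_f[2] + mean_f[4]) / 2
    missing = {2, 3, 4, 6} - set(mean_f)
    if missing:
        raise ValueError(f"no usable amino acids in degeneracy classes {sorted(missing)}")

    enc = 2 + 9 / mean_f[2] + 1 / mean_f[3] + 5 / mean_f[4] + 3 / mean_f[6]
    return min(enc, 61.0)  # estimates above 61 are capped
```

A sequence that always uses a single codon per amino acid gives the lower bound of 20, and uniform use of all synonymous codons gives values at or near 61, which is the basis for reading values above 40 as weak preference.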
IR contraction and expansion
The cp genome is a circular structure consisting of the LSC, SSC, IRa, and IRb regions, with four boundaries: LSC–IRb, IRb–SSC, SSC–IRa, and IRa–LSC. Expansion and contraction of the IR region of the cp genome is an important event in plant evolutionary history and causes changes in the size and gene content of the cp genome. In this study, we compared the LSC/IRb/SSC/IRa boundaries of cp genomes from ten Ligustrum species (Fig. 4). The gene types at the IR–LSC and IR–SSC boundaries were essentially the same, with relatively conserved IR lengths among all ten species (31,608–32,624 bp) and no significant expansion or contraction events. The exact positions of the IR–SC boundaries nevertheless differed among the cp genomes of the ten Ligustrum species; seven genes (rps19, rpl2, ndhH, ndhF, ndhA, rpl22, and trnH) were present at the LSC–IR and SSC–IR boundaries. The LSC–IRb boundary of L. lucidum was located between trnH and rpl2, with trnH located 14 bp to the left and rpl2 located 59 bp to the right. In all other species, the LSC–IRb boundary was located between rps19 and rpl2 and extended into rps19 with a length variation of 1–2 bp, except in L. vulgare, where it was immediately adjacent to rps19. In L. obtusifolium, L. sinense, and L. vicaryi, ndhH was 1 bp to the left of the IRb–SSC boundary; in the other species, the IRb–SSC boundary extended into ndhH, with a length variation of 22–98 bp. The IRb–SSC boundary extended into ndhF by 26 bp in L. ovalifolium ‘Aureum’ and L. ovalifolium, was immediately adjacent to ndhF in L. obtusifolium and L. quihoui, and was located 4–10 bp to the right of ndhF in the other Ligustrum species. The SSC–IRa boundary of all Ligustrum species extended into ndhH, with a length variation of 74–83 bp; the ndhA gene was located 56–84 bp to the left of this boundary. The IRa–LSC boundary of L. lucidum was between rpl2 and trnH, with rpl2 located 59 bp from the boundary; rpl22 was located 500 bp to the right of the IRa–LSC boundary. In the other Ligustrum species, the IRa–LSC boundary was also between rpl2 and trnH; rpl2 was located 58–63 bp to the left of the IRa–LSC boundary and trnH was located 13–15 bp to its right.
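Statements such as "14 bp to the left", "extended into" and "immediately adjacent" can be derived from gene coordinates and a junction position. The sketch below assumes 1-based inclusive gene coordinates and defines the junction as the last base before the boundary; the coordinates in the example are hypothetical, and the helper name is illustrative.

```python
from typing import NamedTuple


class JunctionRelation(NamedTuple):
    relation: str      # "left", "right" or "spans"
    gap_bp: int        # distance between gene and boundary (0 if adjacent or spanning)
    bases_left: int    # gene bases on the left side of the boundary
    bases_right: int   # gene bases on the right side of the boundary


def describe_gene_at_junction(gene_start: int, gene_end: int, junction: int) -> JunctionRelation:
    """Describe a gene relative to a boundary lying between `junction` and `junction + 1`."""
    if gene_start > gene_end:
        raise ValueError("gene_start must not exceed gene_end")
    length = gene_end - gene_start + 1
    if gene_end <= junction:
        # Gene lies entirely to the left; a gap of 0 means immediately adjacent.
        return JunctionRelation("left", junction - gene_end, length, 0)
    if gene_start > junction:
        # Gene lies entirely to the right of the boundary.
        return JunctionRelation("right", gene_start - junction - 1, 0, length)
    # The boundary falls inside the gene.
    return JunctionRelation("spans", 0, junction - gene_start + 1, gene_end - junction)


# Hypothetical coordinates: a gene ending 2 bp beyond the junction at 1276.
print(describe_gene_at_junction(1000, 1278, 1276))
```

In this convention, a "spans" result with bases_right equal to 2 corresponds to a boundary that extends 2 bp into the gene.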
Repeat sequence analysis and simple sequence repeats (SSRs)
Because SSRs are highly polymorphic at the species level, they have become an important source of molecular markers and have been extensively used in phylogenetic and population genetics studies. In this study, SSRs were mainly distributed in the LSC and SSC regions (Fig. 5A), which together account for most of the cp genome, with few SSRs in the two IR regions. According to SSR location analysis, most SSRs were distributed in the non-coding regions of the genome, i.e., the intergenic and intronic regions (Fig. 5B). A total of 164 (L. gracile, L. lucidum, L. japonicum, and L. vulgare) to 170 (L. obtusifolium) SSRs were detected in the cp genomes of Ligustrum species. Mononucleotide repeats were by far the most numerous (140–155); dinucleotide (3–6), trinucleotide (5–13), tetranucleotide (2–4), pentanucleotide (1–3), and hexanucleotide (1–4) repeats were much less frequent (Fig. 5C). Mononucleotide repeats may therefore play a more important role in genetic variation than other types of SSRs. The SSRs were dominated by mononucleotide (A/T)n repeats (Fig. 5G), suggesting that the base composition of SSRs is biased toward A/T.
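How SSRs of each motif length are counted can be sketched with a regular-expression scan. The minimum repeat numbers below (10, 5, 4, 3, 3 and 3 for mono- to hexanucleotide motifs) are assumed values of the kind commonly used for cp genomes; the source does not state its thresholds or software, and find_ssrs is a hypothetical helper.

```python
import re

# Assumed thresholds: motif length -> minimum number of repeats.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repetition of a shorter motif."""
    k = len(motif)
    return not any(k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k))


def find_ssrs(sequence: str, min_repeats=MIN_REPEATS):
    """Return sorted (0-based start, motif, repeat count) tuples for perfect SSRs."""
    seq = sequence.upper()
    hits = []
    for unit_len, minimum in min_repeats.items():
        # A captured unit followed by at least (minimum - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, minimum - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip e.g. "AA" or "ATAT", which belong to a shorter motif class.
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return sorted(hits)


# Toy sequence (hypothetical) with one (A)12 and one (AT)6 repeat.
toy = "GG" + "A" * 12 + "CC" + "AT" * 6 + "GG"
print(find_ssrs(toy))
```

Counting the hits by motif length, and by whether their start positions fall in the LSC, SSC or IR regions, would give tallies comparable in kind to those in Fig. 5A and 5C. This simple scan detects perfect repeats only and does not merge compound SSRs.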
Long repetitive sequences (≥ 30 bp) may promote cp genome rearrangement and increase the genetic diversity of species. A total of 223 (L. sinense) to 1,062 (L. ovalifolium) long repeat sequences were predicted in the Ligustrum cp genomes, including 142–862 forward repeats, 1–8 reverse repeats, 1–8 complementary repeats, and 40–194 palindromic repeats (Fig. 5D). Repeats of 30–34 bp were the most numerous, and repeats of 65–69 bp were the least numerous (Fig. 5E). Among these, L. ovalifolium ‘Aureum’ had the highest number of long repeat sequences (Fig. 5F). We also detected 44 (L. vulgare) to 88 (L. ovalifolium) tandem repeats.
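The four repeat categories can be distinguished by how the second copy relates to the first. The sketch below uses the usual definitions (forward: identical; reverse: reversed; complement: complemented; palindromic: reverse-complemented) and exact matching only; the repeat-finding software used in the study is not named in this section, and classify_repeat is an illustrative name.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def classify_repeat(first: str, second: str) -> str:
    """Classify an exact repeat pair by the orientation of the second copy."""
    a, b = first.upper(), second.upper()
    if b == a:
        return "forward"
    if b == a[::-1]:
        return "reverse"
    if b == a.translate(COMPLEMENT):
        return "complement"
    if b == a.translate(COMPLEMENT)[::-1]:
        return "palindromic"  # second copy is the reverse complement
    return "none"


# Short toy copies (hypothetical); real long repeats are at least 30 bp.
for second in ("ACGGT", "TGGCA", "TGCCA", "ACCGT"):
    print(second, classify_repeat("ACGGT", second))
```

Sequences that satisfy more than one definition (for example a homopolymer) are assigned to the first matching category in the order tested.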
Comparative genomic divergence and hotspot regions
To determine the sequence differences among the ten Ligustrum cp genomes, we compared them using the mVISTA software, with L. sinense as the reference genome. The classes, numbers, and arrangement of genes in the Ligustrum cp genomes were highly consistent among species. Variation among sequences occurred mainly in non-coding intergenic regions, and coding regions were generally more conserved (Fig. 6).
Next, we calculated the nucleotide diversity (Pi) of the ten Ligustrum species. The high-variation regions of the Ligustrum cp genomes were mainly concentrated in the LSC and IR regions. Six regions, i.e., one intergenic region (rbcL_accD) and five genic regions (accD, clpP1-exon3, clpP1-exon2, and two regions within ycf1), were considered hotspot regions (Pi > 0.06), among which the accD gene region had the highest nucleotide diversity (0.2552083), followed by the intergenic region rbcL_accD (0.172619) (Fig. 7, Table 3). Four of these hotspot regions were located in the LSC region and two in the IR region. Further analysis of the six hotspot regions showed that the rbcL_accD intergenic region contained a large number of insertion and deletion events. The accD gene showed large fragment deletions and intron loss, resulting in large sequence differences that make alignment difficult; accD is therefore not recommended as a candidate DNA barcode for Ligustrum. In contrast, the ycf1 gene region and the clpP1 exon regions not only show high sequence variability but are also coding sequences, so their alignments can be checked against the triplet codon reading frame. The ycf1 gene region and the clpP1 exon regions can therefore serve as potential DNA barcodes for species identification and phylogenetic analysis of Ligustrum.
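Nucleotide diversity for one aligned region can be sketched as the mean number of pairwise differences per site. The version below assumes equal-length aligned sequences and excludes any column containing a gap or ambiguous base (complete deletion); the window size, step, and software used in the study are not given in this section, so these choices are assumptions, and the toy alignment is hypothetical.

```python
from itertools import combinations


def nucleotide_diversity(aligned):
    """Return Pi: mean pairwise differences per site over gap-free columns."""
    if len(aligned) < 2:
        raise ValueError("at least two sequences are required")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")
    seqs = [s.upper() for s in aligned]
    # Keep only columns in which every sequence has an unambiguous base.
    columns = [col for col in zip(*seqs) if all(base in "ACGT" for base in col)]
    if not columns:
        raise ValueError("no gap-free columns to compare")
    n = len(seqs)
    pair_count = n * (n - 1) // 2
    differences = sum(a != b for col in columns for a, b in combinations(col, 2))
    return differences / (pair_count * len(columns))


# Toy alignment (hypothetical): one variable site among four gap-free columns.
toy_alignment = ["ACGT-", "ACGAA", "ACGTA"]
print(round(nucleotide_diversity(toy_alignment), 4))
```

Applying such a function to successive windows of a whole-genome alignment and flagging windows above a threshold (0.06 in this study) would identify candidate hotspot regions. Regions rich in insertions and deletions lose many columns under complete deletion, which is one reason alignment quality matters for regions such as rbcL_accD and accD.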
Pairwise comparison of species Ka/Ks ratios and positive selection analyses
The Ka/Ks ratios of Ligustrum species were calculated to provide information on the selection pressure acting on individual sequences. Of the ten Ligustrum species, L. lucidum, L. gracile, and L. quihoui had higher Ka/Ks ratios (Fig. 8). Positive selection analyses of 78 single-copy protein-coding genes from the ten Ligustrum species revealed four genes (accD, clpP, ycf1, and ycf2) subject to significant positive selection (P < 0.05). Bayes empirical Bayes (BEB) analysis revealed significant posterior probabilities for the accD and rpl20 genes, with 49 positively selected sites in accD and four in rpl20 (Supplementary Table 2).
We used the maximum likelihood (ML) method to construct a phylogenetic tree of 37 species belonging to 13 genera in Oleaceae. The relationships among the genera in this family were well resolved, and the 13 genera clustered into one branch with high support for each node, which was consistent with the botanical classification (Fig. 9). Ligustrum species clustered into a single monophyletic clade with high support. The European species L. vulgare was the first to diverge. Ligustrum vicaryi, L. ovalifolium ‘Aureum’, and L. ovalifolium formed one branch, and L. obtusifolium formed another. Ligustrum sinense and L. quihoui clustered together. Ligustrum was more closely related to Syringa than to the other genera in Oleaceae.