An overview of the sequential steps of the study is provided in Supplementary Fig. 1. We first evaluated genetic relatedness and population structure based on genome-wide SNP-array derived genotypes among the Racing breeds (Arabian, Mongolian Racing and Thoroughbred) in the context of non-Racing breeds using a principal component analysis (PCA) (Fig. 1) and an admixture structure plot (Fig. 2). In the PCA plot of PC1 and PC2, Thoroughbreds were tightly clustered with no overlap with any other breed, while Arabian and Mongolian Racing horses were more loosely distributed across PC1 (32.5% of the variance) and there was some overlap between the two populations although a shared recent ancestry was not expected.
Since Thoroughbred admixture has been observed in track racing Arabians7 and there has been an introduction of Thoroughbred stallions to Mongolia in recent years, we assessed the contribution of Thoroughbred ancestry in the Arabian and Mongolian Racing populations (Fig. 2). In the admixture structure analysis, the Belgian breed was used as a comparator population as it is distantly related to all three Racing breeds5. The lowest cross validation error for K = 2–6 modelled ancestral populations was observed for K = 4 (Fig. 2); therefore, this value was considered the most suitable for evaluating ancestry and quantifying admixture41,42. The Belgian and Thoroughbred displayed minimal evidence of admixture arising from the other breeds. Five Arabian samples exhibited >50% Thoroughbred genetic ancestry and eight had no Thoroughbred contribution. Except for five animals, the Mongolian Racing samples had some shared ancestry with the other breeds and Thoroughbred ancestry was >50% in one sample. There was minimal sharing of genetic background between the Mongolian Racing and Arabian populations. Based on the structure plot, it is clear that there is Thoroughbred ancestry in many of the Mongolian Racing and Arabian animals, and this is reflected in their position along PC1. Supplementary Data 1 details the individual ancestry contributions at K = 4 modelled ancestral populations for animals in the study.
Separate PCA plots were also generated for the Arabian and Mongolian Racing populations genotyped in this study compared to other Arabian horses7 (Supplementary Fig. 2) and Mongolian horses indigenous to Inner Mongolia, China4 (Supplementary Fig. 3). The Arabian horses were genetically diverse and were distributed predominantly across PC2 (16.2% of the variance), which encompassed most of the Arabian variation to the exclusion of the Straight Egyptian. The Mongolian Racing horses did not overlap with the Chinese Mongolian horses and were distributed mainly across PC2 (13.2% of the variance).
In summary, although there was some shared ancestry among the Racing populations, this was not widespread among individual animals suggesting that the observed Thoroughbred admixture most likely reflects recent breeding practices. Therefore, it is unlikely to influence detection of long-standing selection signals due to persistent selection over relatively extended time frames. Furthermore, the composite selection signals (CSS) approach used in this study combines component signals to detect only strongly selected regions that have a common signal across the constituent tests39. By contrast there is considerable Thoroughbred gene flow in other racehorse breeds such as Quarter Horse, which has a distinct subpopulation bred for racing5,16. Consequently, selection signals identified here among the Racing populations (Arabian, Mongolian Racing, and Thoroughbred) were hypothesised to reveal genomic regions contributing to similar genomic architectures that result from convergent evolution towards the racing phenotype and not gene flow.
Genomic signals of selection among Racing breeds
To identify genomic regions targeted by selection for the racing phenotype, we compared allele frequency distribution variation among two data sets comprising horses from Racing (n = 90) and non-Racing (n = 483) breeds (Supplementary Data 2) using a CSS test that combines the XP-EHH, FST and ∆SAF statistics39. Genome-wide distribution of the smoothed CSS score test statistics (-log10 P) for comparison of the Racing versus non-Racing populations identified 14 genomic regions with signals of selection, defined as clusters of SNPs (>5) among the top 1% SNPs, on ECA1, ECA2, ECA4, ECA5, ECA7, ECA9, ECA14, ECA17, ECA18, and ECA22 (Supplementary Data 3, Fig. 3). Signals on ECA1, ECA7, ECA17 and ECA18 were the highest ranking according to the CSS score. The top ranked region (ECA1, 45.34–46.54 Mb) contained the PCDH15 and ZWINT genes, which supports results obtained for these genes in two different Thoroughbred population samples15,16. A selection signature at this genomic region was also previously detected in the same Thoroughbred sample cohort used here with different statistical approaches (di, H, H12, and Tajima’s D)5,7.
There was considerable overlap with selection signals previously reported in a range of athletic horse breeds and the selection signals also overlapped with QTLs identified in GWAS for racing performance traits in Thoroughbreds43,44 (Table 1). Notably, the second ranked region (ECA7: 40.44–42.86 Mb) containing the NTM gene, coincided with the top GWAS peak identified from a comparison of Thoroughbreds that had raced and Thoroughbreds that had never had a racecourse start44, and a selection signal on ECA2 (ECA2: 100.3–101.78 Mb) overlapped with a GWAS peak for measured speed traits in juvenile Thoroughbreds43.
Functional enrichment among genes in selected regions
To assess enrichment of functional ontologies in selected regions for Racing, we assigned functional annotation to all the genes in the regions defined by the top 1% SNPs (including those with <5 SNPs) using the DAVID functional annotation tool45 (Table 2, Supplementary Data 4). A challenge to employing functional enrichment tools to such gene sets is the presence of gene family clusters in the same chromosomal region; for example, the gamma-aminobutyric acid signalling pathway (GO:0007214) and GABAergic synapse (GO:0098982) genes (GABRA1, GABRA6, GABRB2, GABRG2, SLC12A2) are, except for SLC12A2, located at a single locus on ECA14. Nonetheless, there were several exercise-relevant gene ontology terms enriched among the genes that were located on different chromosomes including heart looping (GO:0001947; BBS4, BBS5, SETDB2, KIF3A), cardiac muscle tissue morphogenesis (GO:0055008; BMP2, MYLK2, XIRP2), cellular respiration (GO:0045333; FASTKD1, COX4I2, TBRG4), skeletal muscle satellite cell differentiation (GO:0014816; MEGF10, MYLK2), and glycolysis/gluconeogenesis (GO:0006094; ADPGK, ALDH7A1, G6PC2, PKM) (Table 2).
Genomic signals of selection in Arabian, Thoroughbred, and Mongolian Racing breeds
We evaluated the overlap between the Racing selection signals and selection signals identified when each of the Racing breeds was analysed separately (Table 1, Supplementary Data 3). The overlap among the Racing selection signals with selection in the Thoroughbred (only) was clear, with nine of the 14 clusters also detected in the Thoroughbred versus other breeds analysis (Supplementary Fig. 4, Supplementary Data 3). There were six selected regions unique to Thoroughbred on ECA1, ECA21, ECA28 and ECA30.
There was also considerable overlap among the Racing selection signals with selection in the Arabian (only), with seven of the 14 clusters also detected in the Arabian versus other breeds analysis (Supplementary Fig. 5, Supplementary Data 3). There were 11 selected regions unique to Arabian on ECA2, ECA3, ECA8, ECA12, ECA19 and ECA23. The Arabian (only) selected regions contained some recognised equine exercise relevant genes including COX4I129,46,47, PPARGC1A47,48 and DMRT349; all three of these genes have been identified within runs of homozygosity in several horse breeds50.
Only two Racing selection signals overlapped with Mongolian Racing (only) selection signals, with 15 clusters unique to Mongolian Racing. Three regions unique to Mongolian Racing stood out as having very strong signals of selection (ECA5, 26, 28) (Fig. 4, Supplementary Data 3). The top ranked region spanned 5.6 Mb on ECA5 (43.32–48.93 Mb) and contained an eSNP (rs69550318) for the SELENBP1 gene that has been identified among the top 10 trans eQTL among genes expressed in Thoroughbred skeletal muscle51. In human endurance athletes SELENBP1 is differentially expressed in blood in response to administration of human recombinant erythropoietin suggesting a potential role in the regulation of haematopoiesis52,53. The top ranked SNP for the overall CSS score and XP-EHH statistic was located within the TBX15 gene that functions in skeletal development of the limb, vertebral column54, and shoulder and pelvic girdles55. Conformation is a key phenotype on which racehorses are selected and the axis of the pelvis has been shown to be associated with injury risk and performance in Thoroughbreds56. TBX15 also plays a major role in skeletal muscle fibre type differentiation and regulates the metabolism of glycolytic myofibres57 and white adipocytes58 especially in the browning of adipocytes and has been considered a target for the treatment of obesity59,60. In this regard, we previously proposed that adipocyte browning may be a key contributor to the equine athletic phenotype22. Furthermore, TBX15 is among the top ranked differentially expressed downregulated genes in Thoroughbred skeletal muscle in response to training24, implicating it as central to adaptation to the exercise stimulus.
The second ranked CSS region contained the top ranked SNP according to the ∆SAF statistic that was 16 kb from the closest gene, PPARA, which has a major role in exercise and training and is associated with elite human endurance athlete status61,62. The highest-ranking SNPs on ECA26 encompassed three genes (NRIP1, BTG3, and CHODL) all of which may be candidate genes for exercise adaptation63,64,65,66,67,68. The highest-ranking SNPs according to the FST statistic were on ECA7 within the NDUFB7 and CACNA1A genes. NDUFB7 encodes a structural subunit of complex I of the mitochondrial respiratory chain and mutations in the gene have been observed to cause hypertrophic cardiomyopathy and lactic acidosis69. In a cancer model, NDUFB7 expression is directly modulated by PPARA70 and mutations in CACNA1A cause congenital ataxia in humans71,72.
Challenges to identifying candidate genes
It is difficult to rely solely on selection signals arising from SNP genotyping array data to pinpoint the gene(s) that may have been subject to natural or human-mediated selection. Almost half of the selection signals identified in the four analyses spanned >1 Mb; the largest region was for Mongolian Racing (ECA5: 43.32–48.93 Mb, 5.6 Mb), followed by three regions on ECA17, ECA7, and ECA18 for Thoroughbred (3.5 Mb to 2.9 Mb in size); the largest selection signal for Arabian was 1.8 Mb on ECA3 (37.45–39.25 Mb); and the largest region for the Racing breeds was the second highest ranked selection signal on ECA7 (2.4 Mb). For all analyses, there was a positive relationship between CSS score and the size of the region identified (r2 = 0.6). Therefore, the most strongly selected regions (and regions of greatest interest) were the largest and in most cases these regions contained sizeable numbers of genes. Proximity of a top ranked SNP (using CSS or any of the individual test statistics) to a gene may be informative to some degree; however, linkage disequilibrium extends across large regions in horse populations17, the equine SNP genotyping arrays exhibit ascertainment bias73,74 being designed to assay genetic variation among the main European and North American breeds and, as illustrated above, many genes in each detected region exhibit biological functions easily interpretable as affecting exercise physiology. Therefore, to better characterise genes subject to selection for exercise traits, we used additional methods that leveraged transcriptomics data to prioritise SNPs proximal to DEGs in equine skeletal muscle, and WGS data from a cohort of (mostly) Asian horses.
Differential patterns of gene expression are the key determinants of phenotype, and integration of transcriptomics and genetic data has been successfully applied to understand the molecular basis of exercise adaptation75,76. Here, to refine the SNPs from the population genomics analyses, we integrated these data with DEG sets derived from Thoroughbred skeletal muscle RNA-seq data that distinguish exercise (untrained exercise, UE) and training (trained rest, TR) response transcriptomes24. For computational efficiency, the gene lists were refined to include DEGs with Padj. < 10−12 (UE) and Padj. < 10−4 (TR), which resulted in 407 (UE) (Supplementary Data 5) and 230 (TR) (Supplementary Data 6) DEGs .
The R software package gwinteR77 was used to determine whether genomic regions containing SNPs that are proximal to genes within the DEG sets were enriched for significance in the CSS analysis for Racing versus non-Racing breeds. The numbers of statistically significant SNPs pre- and post-data integration are summarised in Supplementary Data 7. In terms of SNP enrichment (Pperm. < 0.1), the integrative analysis was effective for the two input DEG sets. Using a search window that iteratively increased in size from 10–100 kb up and downstream of the genes of interest, a search space of ±100 kb produced the highest number of significantly enriched SNPs with the lowest probability of being significant by chance when compared to a null distribution of 1,000 sets of SNPs randomly sampled from the CSS dataset. SNPs within ±100 kb of a DEG were therefore selected as the target SNP sets to generate new q-values. Gene loci associated with enriched CSS SNPs are provided in Supplementary Data 8. Two genes (LPIN1, LRRC3B) were enriched for SNPs (q < 0.05) in the exercise response (UE) gene set and three genes (CBR4, SYNDIG1, MYOM2) were enriched for SNPs (q < 0.05) in the training response (TR) gene set. Five genes (HMOX1, KTN1, MYLK2, NEO1, TUBA4A) were common to both outputs (q < 0.1) and two of these (NEO1, MYLK2) were located within the selected regions defined by the top 1% CSS SNPs.
Since there was less overlap between the Mongolian Racing selection signals and the Racing selection signals than there was for the Thoroughbred and the Arabian, we separately integrated the Mongolian Racing CSS SNPs in the context of the skeletal muscle DEGs to refine the gene sets (Supplementary Data 9). Again, using 100 kb windows, three genes (PPP2R3A, PELO, GLB1) were enriched for SNPs (q < 0.05) in the exercise response (UE) gene set and three genes (TBX15, KHDRBS3, VEGFA) were enriched for SNPs (q < 0.05) in the training response (TR) gene set (Supplementary Data 10). In addition, three genes (MAP7D1, STAC3, VEGFA) were common among the two outputs (q < 0.1); however, none of these was located within the selected regions defined by the top 1% CSS SNPs. Four DEGs with localised SNP enrichment were located in the top-ranked CSS region for the UE and TR gene sets (APH1A, ATP1A1, UE; CA14, TBX15, TR) and four were among the other CSS regions (ANKRD23, IFI30, RAB30, UE; NXN, TR). The five most significant SNPs for the TR gene set were upstream and within the TBX15 gene. The most significant SNP for the UE gene set was in PPP2R3A.
Whole genome resequencing and variant calling
To identify gene variants with putative functional effects that may be the targets of selection we generated WGS data for 70 horse samples (Supplementary Fig. 6, Supplementary Data 11). A total of ~652 billion 150 bp paired-end reads were generated, with an average depth of 30.13× per individual animal and an average genome coverage of 99.58% (Supplementary Data 12). We obtained 3,846,455 and 3,511,329 polymorphic variants on average per sample after mapping with SAMtools and GATK, respectively, of which 3,177,005 were identified using both methods (Supplementary Data 13). After combining all SNPs from 70 animals, a final set of 24.41 million unique SNPs was retained (3.18 million/individual animal), along with 2.03 million insertion/deletion polymorphisms (indels). Among the ~2 million SNPs on the MNEc2M equine high-density SNP genotyping array74, on average 315,491 SNPs were identified in the sequenced samples with an average of 99.83% genotyping concordance, which demonstrates the reliability of our SNP calling (Supplementary Data 14).
Identification of sequence polymorphisms in exercise relevant genes
To generate a panel of sequence polymorphisms to test for alleles with significant deviations in frequency between different subgroups of horses, we focused on identifying protein-coding variants in candidate genes within the selection signal regions and genes identified from the integrative analysis. We focused on polymorphisms (SNPs and small indels) with moderate minor allele frequencies (MAF >0.1) as we did not expect this approach to identify rare small-effect variants. In addition, we did not expect to identify severely deleterious mutations, and therefore the search was not limited to variants with a predicted high effect on modifying gene function.
For the Racing breeds, the eight highest ranked selected regions and 11 significant regions from the integrative analysis (five common to UE and TR, including two that also overlapped with CSS; three unique to UE; three unique to TR) were used to search for putative functional variants with the WGS data (Table 3). Among the searched regions, for validation we chose high-effect variants in four candidate genes and moderate effect variants in 14 candidate genes (Supplementary Data 15). Three regions did not contain any variants that met the prioritisation criteria. Notably absent were variants in the top ranked CSS region on ECA1 that contained PCDH15 and ZWINT. PCDH15 has been associated with lipid phenotypes78, but is best known for association with deafness79 and is not a compelling candidate gene. On the other hand, the ZW10 interactor protein, encoded by ZWINT, functions in neurotransmitter release and in rodents mediates negative behaviour induced by neuropathic pain80, which may be relevant to exercise81. We previously reported a sequence tag <5 kb from ZWINT among the most differentially expressed downregulated transcripts in the training-response skeletal muscle transcriptome in the horse25 implicating the locus as functionally relevant to exercise. The absence of identified gene-specific variants in this region may be explained by the focus here on the identification of common protein-coding variants, which precludes the identification of sequence variants in genomic regulatory elements, copy number variants, and chromatin state modifications that also contribute to the gene regulatory networks underlying complex traits82,83.
A similar approach was taken to identify putative functional variants within regions identified for the Mongolian Racing analyses. For Mongolian Racing the seven highest ranked selected regions and seven significant regions from the integrative analysis (one common to UE and TR that also overlapped with CSS; four unique to UE including one that overlapped with CSS; two unique to TR) were prioritised (Table 3). For validation, we chose high effect variants in four candidate genes and moderate effect variants in eight candidate genes (Supplementary Data 15).
In total, 32 polymorphisms in 27 genes were selected for validation genotyping on the basis that the variants disrupt the sequence of proteins with central roles in exercise physiology—including key functions associated with muscle, heart, angiogenesis/blood, limb development, metabolism, and neurological tissues. The known biological functions of the genes are summarised in Supplementary Note 1. Of the 32 polymorphisms, 23 SNPs met the assay design criteria and passed post-genotyping quality control and were used in tests of genetic association. Genotypes were generated for independent validation sample sets that were not used for the selection signals analyses.
Genetic association with the racing phenotype
We hypothesised that genetic variants targeted by selection for the racing phenotype segregate among horse breeds to influence underlying endophenotypic variation. Genotypes for the panel of 23 SNPs were generated for n = 267 horses from six breeds (Arabian, French Trotter, Mongolian Racing, Quarter Horse, Standardbred, and Thoroughbred) chosen to represent racing breeds, and n = 249 horses from eleven breeds (putatively ancestral to Thoroughbred—Akhal Teke, Egyptian Arabian, Moroccan Barb; Chinese Mongolian landrace—Baerhu, Baicha Iron Hoof, Keerqin, Wushen, Wuzhumuqin; sport horses—Connemara, Irish Draught, Dutch Warmblood) representing non-racing breeds (Supplementary Data 16). Additional detail for the breeds is provided in Supplementary Note 2.
In tests of genetic association, SNPs in nine genes were significantly (Bonferroni-adjusted P < 3.57 × 10−3) associated with the racing phenotype (Table 4). Eight were missense variants predicted to have a moderate effect on the protein and one (SLC16A1) that introduces a stop codon was predicted to have a high effect on the protein. We did not expect to identify loss of function mutations, since we expected here to detect alleles that are advantageous for exercise. The introduction of a stop codon may not always disrupt the function of a protein if there is limited truncation or if there is stop codon read through84.
Biological functions relevant to exercise among genes significantly associated with racing
The functional relevance of this gene set is supported by the integrative analyses in which three of the genes (KTN1, MYLK2, and SYNDIG1) were enriched for SNPs among DEGs in the skeletal muscle exercise and training response. A literature search and review of associated gene ontology functions, indicated that this set of genes have roles in muscle (HDAC9, MYLK2), metabolism (FASKD1, G6PC2, GLB1, SLC16A1) and neurobiological (KTN1, NTM, SYNDIG1) functions that are linked to exercise-relevant phenotypes.
Skeletal muscle is a highly plastic tissue that responds to exercise and training stimuli by increasing muscle mass and changing fibre type composition with concomitant mitochondrial functional adaptations85. Here, we identified two genes relevant to muscle function that were significantly associated with the racing phenotype, and we consider to be core genes. The HDAC9 gene encodes a protein that inhibits skeletal myogenesis and is involved in heart development86,87,88,89. In Thoroughbred skeletal muscle HDAC9 is among the top five most significant DEGs downregulated in the exercise response (log2FC = −2.67, P = 1.21 × 10−20)24. In humans, HDAC9 gene variants are associated with the maximal oxygen uptake (VO2max) response to training90. Among the racing breeds, the Thoroughbred had the highest frequency (0.65) of the A-allele, which was more than twice the frequency of the allele among the other racing breeds (mean = 0.31) and 3.4× that among the sport horse breeds (0.19). Allele frequencies for all SNPs in each breed are shown in Supplementary Data 17.
MYLK2 encodes a myosin light chain kinase (MYL2) expressed in skeletal muscle. The enzyme has a critical role in muscle contraction, and functions in neuromuscular synaptic transmission, skeletal muscle satellite cell differentiation, regulation of muscle filament sliding and skeletal muscle cell differentiation91,92. MYLK2 was the most significantly downregulated gene (log2FC = −1.31, P = 1.37 × 10−22) among the 3,241 DEGs in Thoroughbred skeletal muscle following exercise and ranked 6th following a period of training (log2FC = −1.04, P = 1.13×10−6)24, higher than MSTN (14th, log2FC = −2.56, P = 1.43 × 10−6), a gene with a well-established functional role in exercise19,29,31,33. In humans, genetic variants in MYLK are associated with phenotypic responses to exercise-induced muscle damage93. Here, the A-allele that was significantly different between racing (0.38) and non-racing breeds (0.25), had, among the racing breeds, the highest frequency in Arabian (0.63) and French Trotter (0.45) and the lowest frequency in Quarter Horse (0.28) and Standardbred (0.28) (Supplementary Data 17). Based on the considerable functional evidence, we propose that genetic variation in HDAC9 and MYLK2 has a critical role in determining the muscle phenotype of racehorses.
The metabolic properties of skeletal muscle are largely influenced by the proportion of slow (type I) and fast (type II) muscle fibres, which are defined by the myosin heavy chain isoforms and characterised by the different densities and functional properties of mitochondria94. Within the different fibre types, the glycolytic and oxidative pathways are tightly regulated to ensure an adequate supply of ATP to meet energy demands. The G6PC2 gene encodes a major component of glycolysis95,96,97. Protein-coding variants in the gene are associated with fasting glucose levels in humans and there is strong evidence that G6PC2 is an effector gene for glucose regulation97. Among breeds, the G-allele occurred at the highest frequency in Thoroughbred (0.95) and was lowest in Connemara (0.50). The glycolytic requirements for high intensity exercise are likely responsible for the observed variation in this gene among the breeds. The FASTKD1 and PPIG genes are also located in the region exhibiting the selection signal for G6PC2. The FAST kinase domains 1 protein, encoded by FASTKD1, supports mitochondrial homeostasis, and has a critical protective role against oxidant-induced cell death98,99,100,101. However, the strongest association in racing breeds was with the G6PC2 SNP and its well-established biological function in the regulation of glucose suggests that it could underpin the selection signal at this locus.
One of the best characterised genes for racing performance in Arabian horses is the SLC16A1 gene encoding the solute carrier family 16 member 1 protein that catalyses the movement of lactate and pyruvate across the plasma membrane8,9,102. In humans, genetic variants in the gene are used to predict athletic performance, in particular high-intensity exercise, and power ability103,104. Here, we have identified a novel variant that is predicted to have a major effect on the resulting protein through the introduction of a stop codon. The value of this variant in prediction of racing performance among Arabian horses requires testing in horses phenotyped for economically relevant racing traits.
Neurobiological functions have regularly featured in equine exercise transcriptomics and genomics research24,43,105. Here we identified SNPs in three genes with functions in neurobiology, KTN1, NTM and SYNDIG1. Of particular note is NTM encoding neurotrimin, which functions in brain development, regulates neural growth and synapse formation, and influences learning and memory106,107,108,109,110,111. A GWAS in Thoroughbreds previously identified this locus as the most significantly associated with the number of racecourse starts44. NTM also ranks among the top 10 genes positively selected during horse domestication112 suggesting that equine neurological systems associated with domestication may overlap with adaptive traits that are required for racing. Here, the NTM SNP was the most significantly associated (P = 7.49 × 10−14) with the racing breeds and among all breeds the highest frequency of the racing allele was in the Thoroughbred (0.89).
In humans, KTN1 gene variants are strongly associated with KTN1 gene expression in the putamen and the volume of the putamen113, a region of the forebrain belonging to the basal ganglion that influences motor behaviours including motor planning and execution, motor preparation, amplitudes of movement and sequences of movement113,114,115,116,117,118,119,120. Here, the KTN1 G-allele was <1% in Thoroughbreds but had an average frequency of 0.22 in the other racing breeds, was 0.34 in the ancestral breeds and 0.21 in the sport horse breeds. Selection for the racing allele in breeds other than the Thoroughbred may be valuable in improving locomotor functions critical to racing. For SYNDIG1, the product of which regulates the development of excitatory synapses121,122,123, we observed that selection may already have fixed the exercise-favoured variant in racing breeds; the T-allele was absent in Thoroughbred, Arabian and Akhal Teke and was observed at a low frequency in the other breeds, with the highest occurrence in Connemara (0.29) and Irish Draught (0.25).
Genes associated with racing in Mongolian horses
Considering the difference in the selection signals profile of the Mongolian Racing horses (compared to the results from the Racing breeds), we also performed tests of genetic association for eight SNPs in a cohort of Mongolian horses selected by herdsmen for racing by comparing the genotypes to a set of Chinese Mongolian horses that are not used for racing (Supplementary Data 16). The GLB1 SNP was significantly (Bonferroni-adjusted P < 0.006) associated with the racing phenotype among Mongolian horses (Table 5). The protein encoded by GLB1, beta-galactosidase, has a role in several metabolic pathways and is the most widely used biomarker for senescent and aging cells124. There are a number of GLB1 related disorders125 including a disruption of normal skeletal morphologies126 and cardiomyopathies.
Genes associated with racing performance in Thoroughbred horses
To test for genetic association with racing traits among Thoroughbreds, we partitioned a large archive of samples (n = 1134) into three groups: horses classified as elite, horses that had raced but had never won a race, and horses that were unraced (Supplementary Data 18). Among a cohort of horses that had raced in North America, the MYLK2 SNP was significantly (P < 0.005) associated with elite racing performance, but it was not associated with the trait among Australian (P = 0.43) or European (P = 0.47) horses (Supplementary Data 19). We have previously observed regional-specific variation for racing performance among Thoroughbreds23, which may be due to different selection pressures for the various dynamics in each racing ecosystem. Among European Thoroughbreds, the NTM SNP was suggestive of association with the occurrence of a racecourse start (P = 0.01), and although it did not meet the threshold for significance following correction for multiple testing, the occurrence of this locus in a previous GWAS44, and the observation of the highest frequency of the SNP among Thoroughbreds, strongly implicates NTM as an economically important gene in the Thoroughbred.