Small Indels
After point mutations, insertions and deletions (collectively “indels”) of multiple nucleotides are the second most common source of individual variation within and between populations. Until relatively recently, however, indels were largely ignored as “nuisance events” in analyses of DNA and amino acid sequences.
Given their importance to genome evolution, population-scale variation, and function, reliable and efficient identification of indels from genomic sequence data is crucial. With Kateryna Makova at Penn State University, we developed software to identify and polarize indel mutations in multiple-species alignments, that is, to classify each event as an insertion or a deletion relative to the inferred ancestral state.
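To make the idea of polarization concrete, here is a minimal sketch of how a gap in a three-species alignment might be classified by parsimony. It is an illustration only, not the software described above: the function name polarize_indels, the labels "target", "sister", and "outgroup", and the rule of grouping adjacent columns with the same gap pattern into one event are all assumptions made for this example. Gaps shared by the target and sister species are left uncalled, because three sequences alone cannot distinguish a deletion in their common ancestor from an insertion in the outgroup.

```python
from itertools import groupby

# Presence pattern per alignment column: (target, sister, outgroup),
# where True means a base is present and False means a gap ('-').
# Only patterns that a single outgroup can polarize are listed here
# (assumption: one event per lineage, simple parsimony).
CALLS = {
    (False, True, True): ("deletion", "target"),
    (True, False, False): ("insertion", "target"),
    (True, False, True): ("deletion", "sister"),
    (False, True, False): ("insertion", "sister"),
}


def polarize_indels(target, sister, outgroup):
    """Return a list of (start, length, kind, lineage) tuples.

    start is the 0-based alignment column where the event begins and
    length is its size in alignment columns. All three inputs must be
    rows of the same alignment, with '-' marking gaps.
    """
    if not (len(target) == len(sister) == len(outgroup)):
        raise ValueError("aligned sequences must have equal length")

    # Reduce every column to its presence/absence pattern.
    columns = [
        tuple(base != "-" for base in column)
        for column in zip(target, sister, outgroup)
    ]

    calls = []
    position = 0
    # Consecutive columns with an identical pattern are treated as one event.
    for pattern, run in groupby(columns):
        length = len(list(run))
        if pattern in CALLS:
            kind, lineage = CALLS[pattern]
            calls.append((position, length, kind, lineage))
        position += length
    return calls


if __name__ == "__main__":
    for call in polarize_indels(
        "ACGT--ACGTACGT",
        "ACGTTTACG--CGT",
        "ACGTTTACGTACGT",
    ):
        print(call)
```

In the example call, the target lacks two bases that both other species carry, so the event is called a deletion on the target lineage; the second gap is polarized the same way for the sister species. Real alignments would also need handling for missing data, alignment uncertainty, and overlapping events, which this sketch ignores.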
Applying these tools to the primate phylogeny, we now recognize that indels have both shaped, and been shaped by, regional differences in genomic landscape features.
Indel mutation rates are also context dependent and highly variable in non-coding, non-repetitive DNA.
The availability of population-scale data from the 1000 Genomes Project helped us demonstrate that indel population frequencies have been influenced by selective constraint, revealing the underlying functions of genomic regions such as genes and potentially functional conserved non-coding regions that affect individual phenotypes.
Patterns of indel mutation and fixation dynamics provide clues to cryptic indel hotspots that lie outside known hotspot contexts such as microsatellites and short tandem repeats. We showed that cryptic hotspots can even mimic signatures of natural selection, which are widely used to identify functional elements in non-coding DNA and to distinguish disease-conferring from neutral loci.