Transposable Element Next Generation Sequencing

Transposable elements (TEs) are sequences that can transpose, or “jump”, from position to position within genomes. In humans, this jumping has generated sequence diversity among individuals, with phenotypic consequences that include genetic disease. With Gerton Lunter, we have developed genomics tools for TE-enriched next generation sequencing and detection of polymorphic TEs in human genomes.

About

Why target TEs?

TEs are ubiquitous: about half of the human genome is annotated as TE, and up to ~2/3 can be identified as ancient relics of TEs
TEs are rare, but active: 3 subfamilies account for the vast majority (>95%) of polymorphic TEs in humans, but combined they make up <0.2% of the genome
TEs are tricky to sequence: an estimated minimum of 140x coverage is needed to identify TEs reliably (>90% sensitivity) from whole genome sequencing

We developed TE-NGS to overcome these challenges. TE-NGS is

High-throughput: NGS sequencing-based
Comprehensive: targets the 3 most active TE subfamilies (L1HS, AluYa5/8, and AluYb8/9) simultaneously
Practical: built from routine molecular genomics techniques

The method consists of two parts:

a molecular genomics protocol for generating TE-enriched NGS libraries
- get the detailed procedures
a computational pipeline for detecting TE insertions
- get the source code

Implementation

TE-NGS consists of two principal steps to identify TE calls:

(i) clustering of reads based on genomic coordinates
(ii) annotation of clusters via comparison to public and local TE databases
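
As a rough illustration of step (i), the sketch below groups read positions into clusters using base R only. It is not the TE-NGS implementation, which relies on the packages listed in this section. The function name `cluster_reads`, the toy data frame, and the rule that a gap of at least `min_dist` bp between consecutive reads opens a new cluster are all assumptions made for this example.

```r
# Hypothetical helper (not part of TE-NGS): cluster reads by coordinate.
# Assumption: a new cluster starts at the first read of each chromosome and
# whenever the gap to the previous read is >= min_dist (bp).
cluster_reads <- function(reads, min_dist = 201) {
  reads$chr <- as.character(reads$chr)
  # Sort by chromosome, then by position
  reads <- reads[order(reads$chr, reads$pos), ]
  # Gap to the previous read on the same chromosome (Inf for the first read)
  gap <- ave(as.numeric(reads$pos), reads$chr,
             FUN = function(p) c(Inf, diff(p)))
  # Every read that opens a new cluster increments the running cluster id
  reads$cluster <- cumsum(gap >= min_dist)
  # Summarise each cluster as one interval with a read count
  data.frame(
    chr     = as.vector(tapply(reads$chr, reads$cluster, function(x) x[1])),
    start   = as.vector(tapply(reads$pos, reads$cluster, min)),
    end     = as.vector(tapply(reads$pos, reads$cluster, max)),
    n_reads = as.vector(tapply(reads$pos, reads$cluster, length)),
    stringsAsFactors = FALSE
  )
}

# Toy input with made-up coordinates
reads <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr1", "chr2"),
  pos = c(1000, 1050, 1120, 5000, 1100),
  stringsAsFactors = FALSE
)
clusters <- cluster_reads(reads, min_dist = 201)
print(clusters)
```

With the toy input, the three chr1 reads within 201 bp of each other collapse into one cluster, while the distant chr1 read and the chr2 read each form their own.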

TE-NGS is implemented in R and distributed using packrat for management of packages, dependencies, etc. The snapshot is built on R version 3.1.2.

The R tool employs the following packages:

ShortRead for efficient containers to read/process bam files
Rsamtools for bam manipulation
GenomicRanges for manipulation of genomic coordinates
Biostrings for fast implementation of regular expression pattern matching
data.table for binary search

Getting started

Navigate to /distrib and unbundle the packrat tarball:

> packrat::unbundle(bundle='scripts.tar.gz', where='/foo/TE-NGS/distrib')

where /foo points to wherever the TE-NGS repo lives locally.

You should expect to see a progress message:

Untarring 'scripts.tar.gz' in directory '/foo/TE-NGS/distrib'...  

...  

Done! The project has been unbundled and restored at:  
- "/foo/TE-NGS/distrib/scripts"  

Next, navigate to /scripts, where the packrat snapshot is built:

$ cd scripts

The first time you launch R, packrat checks the build of the local library of required packages and dependencies. You should see a message such as:

$ R
...
Packrat mode on. Using library in directory:  
- "/foo/TE-NGS/distrib/scripts/packrat/lib"  

Requirements

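The R script requires the following arguments (note that the example commands under Examples pass them in the order: bam directory, seq_opt_table.txt, cluster distance, mode):

seq_opt_table.txt # a file describing the experimental conditions per sample, e.g. sample ID, TE, primers used, sequencing index ID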

$ cat seq_opt_table.txt
experiment	category	TE_target_primers	TE_nested	index
NA12878_A	element_A	"Alu_target","L1HS_target"	L1HS_nested	237
NA12878_C	element_C	"Alu_target","L1HS_target"	AluYb89_nested	277

path to bam directory # a directory containing bam(s) generated by TE-NGS workflow
cluster distance (bp) # minimum distance required between neighboring genomic clusters [201 default]
mode # one of [test,sample] indicating the run mode
- test mode generates TE calls on NA12878 bams provided in /test
- sample mode generates TE calls on sample bams generated by TE-NGS workflow
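
To make the layout of seq_opt_table.txt concrete, here is a minimal base R sketch that reads a table of this shape and splits the primer column. It assumes the file is tab-delimited and that TE_target_primers holds comma-separated, double-quoted names, as in the listing above. The object names (`opt`, `primers`) are illustrative and not taken from the TE-NGS script.

```r
# Toy copy of the table shown above; for a real file, pass its path to
# read.delim() instead of using the `text` argument.
lines <- c(
  "experiment\tcategory\tTE_target_primers\tTE_nested\tindex",
  "NA12878_A\telement_A\t\"Alu_target\",\"L1HS_target\"\tL1HS_nested\t237",
  "NA12878_C\telement_C\t\"Alu_target\",\"L1HS_target\"\tAluYb89_nested\t277"
)

# quote = "" keeps the embedded double quotes as literal characters, so the
# primer field is read as one string and can be cleaned up explicitly.
opt <- read.delim(text = lines, quote = "", stringsAsFactors = FALSE)

# Check that the expected columns are present before going further
required <- c("experiment", "category", "TE_target_primers", "TE_nested", "index")
missing_cols <- setdiff(required, names(opt))
if (length(missing_cols) > 0) {
  stop("seq_opt_table is missing columns: ", paste(missing_cols, collapse = ", "))
}

# Strip the quotes and split the primer list into one vector per experiment
primers <- strsplit(gsub('"', "", opt$TE_target_primers, fixed = TRUE),
                    ",", fixed = TRUE)
names(primers) <- opt$experiment
print(primers)
```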

Examples

Running the script in test mode checks that the output is as expected for the test sample provided in /distrib/test, e.g.:

$ R --no-restore --no-save --no-readline --quiet < R_get_TE_calls_v0.1.r --args ../test/ ../test/seq_opt_table.txt 201 test

Run the script in sample mode to generate TE calls:

$ R --no-restore --no-save --no-readline --quiet < R_get_TE_calls_v0.1.r --args /bamdir/ /bamdir/seq_opt_table.txt 201 sample  
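
Both commands pass four positional values after `--args`. As a sketch of how a script might pick these up and validate them, the base R function below uses `commandArgs(trailingOnly = TRUE)`. The function name `parse_te_args` and its error messages are hypothetical, and this is not a copy of the argument handling in R_get_TE_calls_v0.1.r.

```r
# Hypothetical argument parser, assuming the positional order used in the
# example commands: <bam_dir> <seq_opt_table> <cluster_distance> <mode>
parse_te_args <- function(args = commandArgs(trailingOnly = TRUE)) {
  if (length(args) != 4) {
    stop("expected 4 arguments: <bam_dir> <seq_opt_table> <cluster_distance> <mode>")
  }
  # Cluster distance must parse as a positive integer number of bp
  dist <- suppressWarnings(as.integer(args[3]))
  if (is.na(dist) || dist <= 0) {
    stop("cluster distance must be a positive integer (bp)")
  }
  # Only the two documented run modes are accepted
  if (!args[4] %in% c("test", "sample")) {
    stop("mode must be one of: test, sample")
  }
  list(bam_dir = args[1], opt_table = args[2],
       cluster_dist = dist, mode = args[4])
}

# Usage with the values from the test-mode command above
parsed <- parse_te_args(c("../test/", "../test/seq_opt_table.txt", "201", "test"))
str(parsed)
```

Passing the arguments explicitly, as in the usage line, makes the function easy to try interactively. Inside a script run with `--args`, the default picks them up from the command line.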

Troubleshooting

Ensure packrat is on

> packrat::on()

Check that packrat is up to date

> packrat::status()

CPU, memory, runtime

Runtime is fast: TE-NGS takes ~1 minute to process a typical TE-NGS bam on a single Linux CPU (8 GB RAM).

Distribution

Along with the source code, several additional files are provided:

The directory /annotations contains genomic interval flat files in build GRCh37 (hg19) for annotation of TE calls with known insertions.

polyTEdb - we compiled an extensive local database of known polymorphic TE insertions obtained from public databases and published TE-targeting protocols. Each TE is cross-referenced, giving the number of sources, the subfamily annotation in each source (for disambiguation), and the primary source reference.
```
  $ gunzip -c polyTEdb_window3flank600bp.interval.gz | more  
  chr1    645109  645710  1       ALU,    AluYa4_5,       phase3, 
  chr1    697981  698582  1       NA,     NA,     Witherspoon, 
  chr1    812282  812883  1       LINE1,  LINE1,  phase3, 
```
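
To illustrate how intervals like these could be used to annotate calls, the base R sketch below flags each call that overlaps a known insertion. The column names, the closed-interval overlap rule, and the `calls` coordinates are assumptions made for the example: the file shown above has no header, and the call coordinates are made up. A full-scale implementation would more likely use an overlap routine such as `findOverlaps` from GenomicRanges.

```r
# Known intervals: the first columns of the three rows shown above.
# Column names are assumptions.
known <- data.frame(
  chr   = c("chr1", "chr1", "chr1"),
  start = c(645109, 697981, 812282),
  end   = c(645710, 698582, 812883),
  stringsAsFactors = FALSE
)

# Hypothetical TE calls with made-up coordinates
calls <- data.frame(
  chr   = c("chr1", "chr1", "chr2"),
  start = c(645300, 700000, 645300),
  end   = c(645400, 700100, 645400),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where a call overlaps any known interval on the
# same chromosome. Two closed intervals overlap when each one starts before
# the other ends.
annotate_calls <- function(calls, known) {
  calls$known <- vapply(seq_len(nrow(calls)), function(i) {
    any(known$chr == calls$chr[i] &
          known$start <= calls$end[i] &
          known$end >= calls$start[i])
  }, logical(1))
  calls
}

annotated <- annotate_calls(calls, known)
print(annotated)
```

Only the first toy call falls inside a known interval. The second lies outside every interval on chr1, and the third is on a chromosome with no known entries in this subset.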

The directory /examples contains

NA12878_TE-NGS_calls_annotated_v0.1.bed - a flat file containing all TE-NGS calls made for NA12878 as described in the manuscript.

Citing

Kvikstad, E.M., Piazza, P., Taylor, J.C., Lunter, G. BMC Genomics (2018) 19: 115. https://doi.org/10.1186/s12864-018-4485-4

Lander ES et al: Initial sequencing and analysis of the human genome. Nature 2001, 409(6822):860-921.
de Koning AP et al: Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet 2011, 7(12):e1002384.
Sudmant PH et al: An integrated map of structural variation in 2,504 human genomes. Nature 2015, 526(7571):75-81.
