TE-NGS

Transposable Element Next Generation Sequencing

Transposable elements (TEs) are sequences that can transpose, or “jump”, from position to position within genomes. In humans, this jumping has produced sequence diversity among individuals and, in some cases, phenotypes such as genetic disease. With Gerton Lunter, we have developed genomics tools for TE-enriched next generation sequencing and for detection of polymorphic TEs in human genomes.


About

Why target TEs?

  1. TEs are ubiquitous: about half of the human genome is annotated as TE, and up to about two-thirds can be identified as ancient relics of TEs
  2. Active TEs are rare: three subfamilies account for the vast majority (>95%) of polymorphic TEs in humans, yet together they make up <0.2% of the genome
  3. TEs are tricky to sequence: an estimated minimum of 140x coverage is needed to identify TEs reliably (>90% sensitivity) from whole-genome sequencing

We developed TE-NGS to overcome these challenges.

The method, described here, consists of two parts:

  1. a molecular genomics protocol for generating TE-enriched NGS libraries
  2. a computational pipeline for detecting TE insertions
    • get the source code

Implementation

TE-NGS identifies TE calls in two principal steps:

(i) clustering of reads based on genomic coordinates
(ii) annotation of clusters via comparison to public and local TE databases
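
The two steps can be illustrated with a minimal base R sketch. This is not the TE-NGS implementation: the function names `cluster_reads` and `annotate_clusters`, the `max_gap` parameter, the column names, and the toy data are all assumptions made for illustration. Reads on the same chromosome are merged into one cluster while consecutive read starts lie within `max_gap` bp of each other, and a cluster is labeled with the first known interval it overlaps, or "novel" if there is none.

```r
# Hypothetical helper: group reads into clusters by genomic coordinate.
# reads:   data.frame with columns chr (character) and pos (integer read start).
# max_gap: assumed maximum distance (bp) between consecutive reads in a cluster.
cluster_reads <- function(reads, max_gap = 100L) {
  stopifnot(nrow(reads) > 0)
  reads <- reads[order(reads$chr, reads$pos), ]
  n <- nrow(reads)
  # A new cluster starts at the first read, at a chromosome change,
  # or when the gap to the previous read exceeds max_gap.
  new_cluster <- c(TRUE,
                   reads$chr[-1] != reads$chr[-n] | diff(reads$pos) > max_gap)
  id <- cumsum(new_cluster)
  data.frame(
    chr     = as.vector(tapply(reads$chr, id, function(x) x[1])),
    start   = as.vector(tapply(reads$pos, id, min)),
    end     = as.vector(tapply(reads$pos, id, max)),
    n_reads = as.vector(tapply(reads$pos, id, length)),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: annotate clusters against a table of known TE intervals.
# known: data.frame with columns chr, start, end, name.
annotate_clusters <- function(clusters, known) {
  clusters$annotation <- vapply(seq_len(nrow(clusters)), function(i) {
    hit <- known$chr == clusters$chr[i] &
      known$start <= clusters$end[i] &
      known$end >= clusters$start[i]
    if (any(hit)) as.character(known$name[which(hit)[1]]) else "novel"
  }, character(1))
  clusters
}

# Toy data (invented for this example)
reads <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr1", "chr2"),
  pos = c(1000L, 1040L, 1090L, 5000L, 1020L),
  stringsAsFactors = FALSE
)
known <- data.frame(chr = "chr1", start = 950L, end = 1100L,
                    name = "known_TE_1", stringsAsFactors = FALSE)

calls <- annotate_clusters(cluster_reads(reads), known)
print(calls)
```

With this toy input, the three nearby chr1 reads form one cluster that overlaps the known interval, while the isolated chr1 read and the chr2 read each form a single-read cluster labeled "novel". A real pipeline would likely also filter clusters on read support before annotation.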

TE-NGS is implemented in R and distributed with packrat, which manages the required packages and their dependencies. The snapshot is built on R version 3.1.2.

The R tool employs the following packages:


Getting started

Navigate to /distrib and unbundle the packrat tarball:

> packrat::unbundle(bundle='scripts.tar.gz', where='/foo/TE-NGS/distrib')  

where /foo points to wherever the TE-NGS repo lives locally.

You should expect to see a progress message:

Untarring 'scripts.tar.gz' in directory '/foo/TE-NGS/distrib'...  

...  

Done! The project has been unbundled and restored at:  
- "/foo/TE-NGS/distrib/scripts"  

Next, navigate to /scripts, where the packrat snapshot is built:

$ cd scripts 

The first time you launch R, packrat checks the build of the local library of required packages and dependencies. You should see a message such as:

$ R
...
Packrat mode on. Using library in directory:  
- "/foo/TE-NGS/distrib/scripts/packrat/lib"  

Requirements

The R script requires the following arguments:


Examples

Running the script in test mode checks that the output is as expected for the test sample provided in /distrib/test, e.g.:

$  R --no-restore --no-save --no-readline --quiet < R_get_TE_calls_v0.1.r --args ../test/ ../test/seq_opt_table.txt 201 test  

Run the script in sample mode to generate TE calls:

$ R --no-restore --no-save --no-readline --quiet < R_get_TE_calls_v0.1.r --args /bamdir/ /bamdir/seq_opt_table.txt 201 sample  
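
Both commands pass four positional values after `--args`, which an R script can read with `commandArgs(trailingOnly = TRUE)`. The sketch below shows one way such values might be validated. It is an assumption, not the parsing code in R_get_TE_calls_v0.1.r: the function name `parse_te_args` and the field names are placeholders, and the third value is passed through as a string because its meaning is not described here.

```r
# Hypothetical argument parser; all names are illustrative, not from TE-NGS.
parse_te_args <- function(args = commandArgs(trailingOnly = TRUE)) {
  if (length(args) != 4) {
    stop("expected 4 arguments after --args, got ", length(args))
  }
  mode <- args[4]
  # The two modes shown in the examples above
  if (!mode %in% c("test", "sample")) {
    stop("mode must be 'test' or 'sample', got '", mode, "'")
  }
  list(
    input_dir  = args[1],  # e.g. ../test/ or /bamdir/
    table_file = args[2],  # e.g. ../test/seq_opt_table.txt
    third_arg  = args[3],  # kept as given; meaning not documented here
    mode       = mode
  )
}

# Usage with the values from the test-mode command
opts <- parse_te_args(c("../test/", "../test/seq_opt_table.txt", "201", "test"))
str(opts)
```

Failing early on a wrong argument count or an unknown mode gives a clearer error than a failure later in the run.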

Troubleshooting

Ensure packrat mode is on:

> packrat::on()

Check that the packrat library is in sync with the snapshot:

> packrat::status()

CPU, memory, runtime

Runtime is short: TE-NGS takes ~1 minute to process a typical TE-NGS BAM file on a single Linux CPU (8 GB RAM).


Distribution

Along with the source code, several additional files are provided:

The directory /annotations contains genomic interval flat files in build GRCh37 (hg19) for annotating TE calls with known insertions.

The directory /examples contains


Citing

Kvikstad, E.M., Piazza, P., Taylor, J.C., Lunter, G. BMC Genomics (2018) 19: 115. https://doi.org/10.1186/s12864-018-4485-4


Read more

  1. Lander ES et al: Initial sequencing and analysis of the human genome. Nature 2001, 409(6822):860-921.

  2. de Koning AP et al: Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet 2011, 7(12):e1002384.

  3. Sudmant PH et al: An integrated map of structural variation in 2,504 human genomes. Nature 2015, 526(7571):75-81.