Transposable Element Next Generation Sequencing
Transposable elements (TEs) are DNA sequences that can transpose, or “jump”, from position to position within a genome. In humans, this jumping has generated sequence diversity among individuals and can give rise to phenotypes such as genetic disease. With Gerton Lunter, we have developed genomics tools for TE-enriched next generation sequencing and for detecting polymorphic TEs in human genomes.
Why target TEs?
We developed TE-NGS to overcome these challenges. TE-NGS is
The method, described here, consists of two parts:
TE-NGS consists of two principal steps to identify TE calls:
(i) clustering of reads based on genomic coordinates
(ii) annotation of clusters via comparison to public and local TE databases
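Step (i) can be illustrated with a minimal base-R sketch. This is not the TE-NGS implementation: the function name `cluster_reads`, the input columns `chr` and `pos`, and the rule itself (a new cluster starts whenever the gap to the previous read on the same chromosome is at least `min_dist` bp) are assumptions made for illustration.

```r
# Hypothetical helper (not part of TE-NGS): group read positions into clusters.
# Assumption: a new cluster begins when the gap to the previous read on the
# same chromosome is at least `min_dist` bp.
cluster_reads <- function(reads, min_dist = 201) {
  reads$chr <- as.character(reads$chr)
  n <- nrow(reads)
  if (n == 0) {
    return(data.frame(chr = character(0), start = numeric(0),
                      end = numeric(0), n_reads = integer(0),
                      stringsAsFactors = FALSE))
  }
  # Sort by chromosome, then position, so neighbouring reads are adjacent rows
  reads <- reads[order(reads$chr, reads$pos), , drop = FALSE]
  # TRUE wherever a row opens a new cluster
  new_chr <- c(TRUE, reads$chr[-1] != reads$chr[-n])
  big_gap <- c(TRUE, diff(reads$pos) >= min_dist)
  cluster <- cumsum(new_chr | big_gap)
  # Summarise each cluster as one interval with its read count
  data.frame(
    chr     = as.vector(tapply(reads$chr, cluster, function(x) x[1])),
    start   = as.vector(tapply(reads$pos, cluster, min)),
    end     = as.vector(tapply(reads$pos, cluster, max)),
    n_reads = as.vector(tapply(reads$pos, cluster, length)),
    stringsAsFactors = FALSE
  )
}

# Toy input: three reads on chr1 and one on chr2 (made-up positions)
reads <- data.frame(chr = c("chr1", "chr1", "chr1", "chr2"),
                    pos = c(100, 250, 1000, 120),
                    stringsAsFactors = FALSE)
clusters <- cluster_reads(reads)
```

With the toy input, the first two chr1 reads are 150 bp apart and fall into one cluster, while the read at position 1000 and the chr2 read each form their own cluster.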
TE-NGS is implemented in R and distributed using packrat for management of packages, dependencies, etc. The snapshot is built on R version 3.1.2.
The R tool employs the following packages:
Navigate to /distrib and unbundle the packrat tarball:
> packrat::unbundle(bundle='scripts.tar.gz', where='/foo/TE-NGS/distrib')
where /foo points to wherever the TE-NGS repo lives locally.
You should expect to see a progress message:
Untarring 'scripts.tar.gz' in directory '/foo/TE-NGS/distrib'...
...
Done! The project has been unbundled and restored at:
- "/foo/TE-NGS/distrib/scripts"
Next, navigate to /scripts, where the packrat snapshot is built:
$ cd scripts
The first time you launch R, packrat checks the build of the local library of required packages and dependencies. You should see a message such as:
$ R
...
Packrat mode on. Using library in directory:
- "/foo/TE-NGS/distrib/scripts/packrat/lib"
The R script requires the following arguments:
$ cat seq_opt_table.txt
experiment category TE_target_primers TE_nested index
NA12878_A element_A "Alu_target","L1HS_target" L1HS_nested 237
NA12878_C element_C "Alu_target","L1HS_target" AluYb89_nested 277
path to bam directory # a directory containing bam(s) generated by TE-NGS workflow
cluster distance (bp) # minimum distance required between neighboring genomic clusters [201 default]
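To make the layout of the options table concrete, here is a minimal base-R sketch that reads the two example rows shown above. It assumes the file is whitespace-delimited and that the primer field is a quoted, comma-separated list; the variable names `opt_text`, `opts`, and `primers` are illustrative and not taken from the TE-NGS script.

```r
# The two example rows from the text, inlined so the sketch is self-contained
opt_text <- 'experiment category TE_target_primers TE_nested index
NA12878_A element_A "Alu_target","L1HS_target" L1HS_nested 237
NA12878_C element_C "Alu_target","L1HS_target" AluYb89_nested 277'

# quote = "" keeps the double quotes as literal characters, so the whole
# comma-separated primer field stays in a single whitespace-delimited column
opts <- read.table(text = opt_text, header = TRUE, quote = "",
                   stringsAsFactors = FALSE)

# Strip the quotes and split the primer field into one character vector per row
primers <- strsplit(gsub('"', "", opts$TE_target_primers, fixed = TRUE),
                    ",", fixed = TRUE)
```

For a real file, `text = opt_text` would be replaced by the path to `seq_opt_table.txt`.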
Running the script in test mode checks that the output is as expected for the test sample provided in /distrib/test, e.g.:
$ R --no-restore --no-save --no-readline --quiet < R_get_TE_calls_v0.1.r --args ../test/ ../test/seq_opt_table.txt 201 test
Run the script in sample mode to generate TE calls:
$ R --no-restore --no-save --no-readline --quiet < R_get_TE_calls_v0.1.r --args /bamdir/ /bamdir/seq_opt_table.txt 201 sample
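The four values after `--args` in the commands above are positional. As a hedged sketch of how a script could read and check them (not necessarily how `R_get_TE_calls_v0.1.r` does it), the hypothetical helper `parse_te_args` below validates a character vector such as the one returned by `commandArgs(trailingOnly = TRUE)`. The argument order is inferred from the example commands, and the restriction to the two modes `test` and `sample` is an assumption.

```r
# Hypothetical helper (not part of TE-NGS): validate the positional arguments.
# Assumed order, inferred from the example commands:
#   <bam dir> <options table> <cluster distance> <mode>
parse_te_args <- function(args) {
  if (length(args) != 4) {
    stop("expected 4 arguments: <bam dir> <options table> <distance> <mode>")
  }
  dist <- suppressWarnings(as.integer(args[3]))
  if (is.na(dist) || dist < 1) {
    stop("cluster distance must be a positive integer (bp)")
  }
  if (!(args[4] %in% c("test", "sample"))) {
    stop("mode must be 'test' or 'sample'")
  }
  list(bam_dir = args[1], opt_table = args[2],
       cluster_dist = dist, mode = args[4])
}

# In a script started with `--args`, the vector would come from
# commandArgs(trailingOnly = TRUE); here it is written out by hand.
example <- parse_te_args(c("../test/", "../test/seq_opt_table.txt", "201", "test"))
```

Failing early with a clear message is a design choice of this sketch, so that a misplaced argument is caught before any bam file is opened.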
Ensure packrat is on:
> packrat::on()
Check that packrat is up to date:
> packrat::status()
Runtime is fast! TE-NGS takes ~1 minute to process a typical TE-NGS bam on a single Linux CPU (8 GB RAM).
Along with the source code, several additional files are provided:
The directory /annotations contains genomic interval flat files in build GRCh37 (hg19) for annotating TE calls with known insertions.
$ gunzip -c polyTEdb_window3flank600bp.interval.gz | more
chr1 645109 645710 1 ALU, AluYa4_5, phase3,
chr1 697981 698582 1 NA, NA, Witherspoon,
chr1 812282 812883 1 LINE1, LINE1, phase3,
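Annotating calls against intervals like these comes down to an overlap test. The base-R sketch below is illustrative only: `annotate_calls` is a hypothetical helper, the intervals are treated as closed (both ends included), which is an assumption, and the fourth column of the file is left out because its meaning is not described here. The call coordinates are made up.

```r
# The three example intervals from the text (fourth column omitted)
known <- data.frame(
  chr   = c("chr1", "chr1", "chr1"),
  start = c(645109, 697981, 812282),
  end   = c(645710, 698582, 812883),
  label = c("ALU, AluYa4_5, phase3", "NA, NA, Witherspoon",
            "LINE1, LINE1, phase3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each call, return the label of the first known
# interval it overlaps on the same chromosome, or NA if there is none.
# Assumption: closed intervals, so sharing a single base counts as overlap.
annotate_calls <- function(calls, known) {
  vapply(seq_len(nrow(calls)), function(i) {
    hit <- which(known$chr == calls$chr[i] &
                 known$start <= calls$end[i] &
                 known$end >= calls$start[i])
    if (length(hit) > 0) known$label[hit[1]] else NA_character_
  }, character(1))
}

# Made-up calls: one inside a known interval, one between intervals,
# one on a chromosome with no known intervals
calls <- data.frame(chr   = c("chr1", "chr1", "chr2"),
                    start = c(645300, 700000, 645300),
                    end   = c(645400, 700100, 645400),
                    stringsAsFactors = FALSE)
calls$known <- annotate_calls(calls, known)
```

Only the first call receives a label; the other two stay `NA`, which in this sketch marks a call with no matching known insertion.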
The directory /examples contains
Kvikstad, E.M., Piazza, P., Taylor, J.C., Lunter, G. BMC Genomics (2018) 19: 115. https://doi.org/10.1186/s12864-018-4485-4
Lander ES et al: Initial sequencing and analysis of the human genome. Nature 2001, 409(6822):860-921.
de Koning AP et al: Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet 2011, 7(12):e1002384.
Sudmant PH et al: An integrated map of structural variation in 2,504 human genomes. Nature 2015, 526(7571):75-81.