adVNTR - A tool for genotyping VNTRs

adVNTR is a tool for genotyping Variable Number Tandem Repeats (VNTRs) from sequence data. It works with both NGS short reads (Illumina HiSeq) and SMRT reads (PacBio), finds diploid repeat unit (RU) counts for VNTRs, and identifies possible mutations in the VNTR sequences.

FAQ

How do I cite adVNTR?

If you found adVNTR useful, we would appreciate it if you could cite our manuscript describing adVNTR:

Bakhtiari, M., Shleizer-Burko, S., Gymrek, M., Bansal, V. and Bafna, V., 2018. Targeted genotyping of variable number tandem repeats with adVNTR. Genome Research, 28(11), pp.1709-1719.

Can adVNTR work with repeating units that are shorter than 6bp?

Tandem repeats with a period below 6bp are classified as Short Tandem Repeats (STRs). Although adVNTR can detect STR expansions, we do not recommend using it on STRs.

Can I run adVNTR to study expansion in other organisms?

You can run adVNTR on other organisms if you add custom VNTRs to its database. However, it always returns diploid RU counts, so homozygous RU counts are expected for haploid organisms.

What sequencing platforms does adVNTR support?

adVNTR is designed to analyze Illumina or PacBio sequencing data. We generally do not recommend using it on sequencing data from other technologies, as their error models are different.

Installation

To use adVNTR, we recommend that you (1) install adVNTR using the conda package manager and (2) download the predefined models for the human genome from the Data Requirements section. However, you can also install it from source and/or use custom models.

Install adVNTR with conda

If you are using the conda package manager (e.g. miniconda or anaconda), you can install adVNTR from the bioconda channel:

conda config --add channels bioconda
conda install advntr

adVNTR can then be invoked from the command line with advntr.

Data Requirements

To genotype VNTRs, you need to either train models for loci of interest or use pre-trained models (recommended). To run adVNTR on pre-trained VNTR models:

  • Download vntr_data_recommended_loci.zip and extract it inside the project directory. This includes a set of pre-trained VNTR models for Illumina (6719 loci) and PacBio (8960 loci) sequencing data.
  • You can also download and use vntr_data_genic_loci.zip, which covers 158522 VNTRs but results in a much longer running time.

Alternatively, you can add a model for a custom VNTR. See Add Custom VNTR for more information about training models for custom VNTRs.
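The extraction step can also be scripted. The following is a minimal sketch using only the Python standard library; the helper name extract_models is hypothetical and not part of adVNTR, and it assumes the archive has already been downloaded.

```python
import zipfile
from pathlib import Path


def extract_models(archive_path, project_dir="."):
    """Extract a downloaded model archive into project_dir.

    Hypothetical helper (not part of adVNTR). Returns the paths of the
    files that the archive contains, relative to project_dir.
    """
    project_dir = Path(project_dir)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(project_dir)
        # Skip directory entries so that only real files are reported.
        names = [name for name in archive.namelist() if not name.endswith("/")]
    return [project_dir / name for name in names]
```

For example, extract_models("vntr_data_recommended_loci.zip") would unpack the recommended models into the current directory and return the list of extracted files.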

Execution:

Use the following command to see the help for running the tool:

advntr --help

The program outputs the RU count genotypes of VNTRs. To specify a single VNTR by its ID, use the --vntr_id <id> option. A list of some known VNTRs and their IDs is available on the Disease-linked-VNTRs page in the wiki.

See the demo execution below or the Quick Start page for an example data set with step-by-step genotyping commands.

Demo: input in BAM format

  • --alignment_file specifies the alignment file containing mapped and unmapped reads:
advntr genotype --alignment_file aligned_illumina_reads.bam --working_directory ./log_dir/
  • With --pacbio, adVNTR assumes the alignment file contains PacBio sequencing data:
advntr genotype --alignment_file aligned_pacbio_reads.bam --working_directory ./log_dir/ --pacbio
  • Use --frameshift to find the possible frameshifts in VNTR:
advntr genotype --alignment_file aligned_illumina_reads.bam --working_directory ./log_dir/ --frameshift
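When genotyping many samples, it can help to assemble these command lines programmatically. The sketch below builds the argument list from the options shown in this document only (--alignment_file, --working_directory, --vntr_id, --pacbio, --frameshift); the function name build_genotype_command is an illustrative assumption, not part of adVNTR.

```python
import shlex


def build_genotype_command(alignment_file, working_directory,
                           vntr_ids=None, pacbio=False, frameshift=False):
    """Assemble an `advntr genotype` argument list (hypothetical helper)."""
    command = [
        "advntr", "genotype",
        "--alignment_file", alignment_file,
        "--working_directory", working_directory,
    ]
    if vntr_ids:
        # --vntr_id accepts a comma-separated list of VNTR IDs.
        command += ["--vntr_id", ",".join(str(v) for v in vntr_ids)]
    if pacbio:
        command.append("--pacbio")
    if frameshift:
        command.append("--frameshift")
    return command


if __name__ == "__main__":
    cmd = build_genotype_command("aligned_pacbio_reads.bam", "./log_dir/",
                                 pacbio=True)
    # Print the command as it would be typed in a shell.
    print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run(cmd, check=True), which avoids shell quoting issues with unusual file names.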

Quick Start

To help verify the installation and show the workflow, we include a small simulated data set and the commands to genotype it. If you have already installed adVNTR, jump to Genotype Predefined VNTR in Simulated Data.

Install

The easiest way to get started is to Install adVNTR with conda. To install adVNTR, run these commands:

conda config --add channels bioconda
conda install advntr

Genotype Predefined VNTR in Simulated Data

To genotype a VNTR in the simulated dataset, one option is to use predefined models. Download vntr_data_recommended_loci.zip and extract it inside the project directory to use these models for the human genome. Here, we genotype the VNTR with ID 301645, which is a disease-linked VNTR. A list of some known VNTRs and their IDs is available on the Disease-linked-VNTRs page in the wiki.

Then, download the simulated sequencing data of a human sample. It only includes reads around a VNTR in the CSTB gene, which is known to be linked to progressive myoclonus epilepsies. Run this command to get the 2/5 genotype for this VNTR:

advntr genotype --vntr_id 301645 --alignment_file CSTB_2_5_testdata.bam --working_directory working_dir

Genotype Custom VNTR

You can train a new model for a VNTR that doesn’t exist in the predefined models. Instead of downloading vntr_data_recommended_loci.zip, you need the reference genome of the organism (here, human) to train a model for a specific VNTR. Download chromosome 21 of hg19 and extract it. We recommend using the full reference genome of the organism when adding a model; this quick start uses a single chromosome because it is easier to download and runs faster. Run this command to add the VNTR in the CSTB gene and train VNTR-specific scores:

advntr addmodel -r chr21.fa -p CGCGGGGCGGGG -s 45196324 -e 45196360 -c chr21

If you run the above command without using predefined models, this VNTR gets the first ID. Run the genotype command to get the 2/5 genotype:

advntr genotype --vntr_id 1 --alignment_file CSTB_2_5_testdata.bam --working_directory working_dir
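As a quick sanity check on the addmodel arguments, the coordinate span can be compared with the pattern length. This is only a sketch: it assumes the span end - start covers whole copies of the pattern, which may not hold for imperfect repeats, and the function name is hypothetical.

```python
def reference_repeat_count(pattern, start, end):
    """Estimate the reference copy number as (end - start) / len(pattern).

    Assumption: the interval covers back-to-back copies of `pattern`.
    """
    span = end - start
    return span / len(pattern)


# Values taken from the addmodel command above: a 36bp span and a 12bp pattern.
print(reference_repeat_count("CGCGGGGCGGGG", 45196324, 45196360))  # 3.0
```

The result of 3 copies agrees with the RefCopy value shown for this locus in the BED example in the Tutorial.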

Tutorial

Inputs

  • NGS short reads (Illumina HiSeq)
  • SMRT reads (PacBio)

Outputs

Currently, the genotyping output is available in two formats, text and BED; VCF output is under construction:

  • Text

    Writes two lines in the output for each VNTR. The first contains the VNTR ID and the second line contains R1/R2 as the repeating unit counts. Below is an example output in text format for one VNTR:

    301645
    2/3

  • BED

    BED output is tab-delimited, with one line per locus and 9 columns: 1. Name of the chromosome, 2. Start position of the VNTR, 3. End position of the VNTR, 4. VNTR ID, 5. Name of the gene that contains the VNTR, 6. Repeating motif, 7. Number of repeats in the reference genome, 8 and 9. Number of repeats in the sample. Below is an example output in BED format for one VNTR:

    #CHROM  Start     End       VNTR_ID  Gene  Motif         RefCopy  R1  R2
    chr21   45196324  45196360  301645   CSTB  CGCGGGGCGGGG  3        2   3

  • VCF
    (Under construction)
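Both available formats are simple to read back into a script. The sketch below parses them as described above; the names Genotype, parse_text_output and parse_bed_output are illustrative and not part of adVNTR. It assumes that every locus reports two integer counts and that BED fields are separated by tabs.

```python
from typing import NamedTuple


class Genotype(NamedTuple):
    vntr_id: str
    r1: int
    r2: int


def parse_text_output(text):
    """Parse text output: a VNTR ID line followed by an R1/R2 line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    genotypes = []
    # Lines come in pairs: even positions are IDs, odd positions are counts.
    for id_line, count_line in zip(lines[0::2], lines[1::2]):
        r1, r2 = count_line.split("/")
        genotypes.append(Genotype(id_line, int(r1), int(r2)))
    return genotypes


def parse_bed_output(text):
    """Parse BED output: column 4 is the VNTR ID, columns 8 and 9 the counts."""
    genotypes = []
    for line in text.splitlines():
        # Skip blank lines and the '#CHROM ...' header line.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        genotypes.append(Genotype(fields[3], int(fields[7]), int(fields[8])))
    return genotypes
```

A genotype is homozygous when r1 == r2, which is the expected result for haploid organisms as noted in the FAQ.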

Usage

adVNTR runs as follows:

usage: advntr <command> [options]

There are four commands:

genotype
Determine repeat unit count and sequence variation within VNTRs.
viewmodel
Show the structure and information about the VNTRs in the database.
addmodel
Add a custom VNTR to the database.
delmodel
Delete a VNTR from the database.

Each of these commands and their options is described below.

Genotype

Use advntr genotype [options] to genotype a VNTR using sequencing data. An alignment file and a working directory are required.

Summary of options:

--frameshift: Use this option to identify frameshifts instead of finding the copy number of a VNTR.

--pacbio: Use this flag to genotype VNTRs using PacBio sequencing data.

--update: Use this option to iteratively update the model using real data before finding the genotype.

Input/output options:

-a, --alignment_file <file>
 alignment file in SAM/BAM/CRAM format
-r, --reference_filename <file>
 path to a FASTA-formatted reference file for CRAM files.
-f, --fasta <file>
 Fasta file containing raw reads
-p, --pacbio set this flag if input file contains PacBio reads instead of Illumina reads
-n, --nanopore set this flag if input file contains Nanopore MinION reads instead of Illumina
-o, --outfile <file>
 file to write results. adVNTR writes output to stdout if outfile is not specified
-of, --outfmt <format>
 output format. Allowed values are {text, bed} [text]

Algorithm options:

-fs, --frameshift
 set this flag to search for frameshifts in VNTR instead of copy number.
-e, --expansion
 set this flag to determine long expansion from PCR-free data
-c, --coverage <float>
 average sequencing coverage in PCR-free sequencing
--haploid set this flag if the organism is haploid
-naive, --naive
 use naive approach for PacBio reads

Other options:

-h, --help show this help message and exit
--working_directory <path>
 working directory for creating temporary files needed for computation
-m, --models <file>
 file containing VNTRs information [vntr_data/hg19_VNTRs.db]
-t, --threads <int>
 number of threads [4]
-u, --update set this flag to iteratively update the model
-vid, --vntr_id <text>
 comma-separated list of VNTR IDs

View VNTRs

Under construction …

Add Custom VNTR

Use advntr addmodel [options] to add a VNTR to the database. The structure of the VNTR and its genomic coordinates are required.

Required arguments:

-r, --reference <text>
 Reference genome
-c, --chromosome <text>
 Chromosome (e.g. chr1)
-p, --pattern <text>
 First repeating pattern of VNTR in forward (5’ to 3’) direction
-s, --start <int>
 Start coordinate of VNTR in forward (5’ to 3’) direction
-e, --end <int>
 End coordinate of VNTR in forward (5’ to 3’) direction

Other options:

-g, --gene <text>
 Gene name
-a, --annotation <text>
 Annotation of VNTR region

-m, --models <file>
 VNTR models file [vntr_data/hg19_selected_VNTRs_Illumina.db]
-h, --help show this help message and exit

You can use --update in the genotyping step to iteratively update the model using real data.

Delete a VNTR

Use advntr delmodel --vntr_id <ID> to remove a VNTR from the database.

Required arguments:

-vid, --vntr_id <text>
 VNTR ID

Other options:

-m, --models <file>
 VNTR models file [vntr_data/hg19_selected_VNTRs_Illumina.db]
-h, --help show this help message and exit

Publication

Bakhtiari, M., Shleizer-Burko, S., Gymrek, M., Bansal, V. and Bafna, V., 2018. Targeted genotyping of variable number tandem repeats with adVNTR. Genome Research, 28(11), pp.1709-1719.