Tabix tracks


LocusZoom.js is able to fetch data directly from tabix files. This is very helpful if you want to render a plot from your own data, without transforming the files into an intermediate format (like creating JSON files or loading the data into an API first). This is particularly important for very large datasets, because it allows interactive region-based queries, without having to transfer the entire dataset to the user's web browser first. Only the index file is loaded, plus the data needed for a particular region of interest.

LocusZoom provides an extension to help parse tabix data files for several common data formats. Additional parsers and index types may be supported in the future. Working directly with tabix files is a quick way to get started with LocusZoom, because it does not require building or maintaining a server backend. It is helpful for sharing a small number of datasets, or for groups that don't want to maintain their own infrastructure. Larger data sharing portals typically do not use this approach: as the number of datasets grows, they will want to run their own web server to support features like searching, complex queries, and harmonizing many datasets into a single standard format. Tools like PheWeb are a popular way to handle this sort of use case.

A key feature of LocusZoom is that each track is independent. This means it is very straightforward to define layouts in which some tracks come from a tabix file (like a BED track), while others are fetched from a remote web server that handles standard well-known datasets (like genes or 1000G LD).

In this demo, tabix loaders (and built-in file format parsers) are used for association, LD, and BED track data. Genes and recombination rate are fetched from the UM PortalDev API. View the source code of this page for details.

Data formatting guidelines

Below are some tips on formatting your data files. If you are using a static file storage provider like Amazon S3 or Google Cloud Storage, note that you may need to configure some additional request headers before Tabix will work properly.

GWAS Summary statistics

There is no single standard for GWAS summary statistics. As such, the LocusZoom.js parser exposes many sets of options for how to read the data. The general instructions from are a useful starting point, since the same basic parsing logic is shared across multiple tools.

BED files

BED files are a standard format with 3-12 columns of data. The default LocusZoom panel layout for a BED track will use chromosome, position, line name, and (optionally) color to draw the plot. Score will be shown as a tooltip field (if present); it may have a different meaning and scale depending on the contents of your BED file. As with any LocusZoom track, custom layouts can be created to render data in different ways, or to use more or fewer columns based on the data of interest.

The following command is helpful for preparing BED files for use in the browser:
$ sort -k1,1 -k2,2n input.bed | bgzip > input-sorted.bed.gz && tabix -p bed input-sorted.bed.gz
Some BED files will have one or more header rows; in this case, modify the tabix command with: --skip-lines N (where N is the number of headers).
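
Because sort treats header rows like any other line, one option is to sort only the data rows and leave the headers at the top, so that the count passed to --skip-lines stays valid. The sketch below assumes that header lines begin with "browser", "track", or "#". The function names and the sample file are illustrative and are not part of LocusZoom or tabix.

```shell
# Count leading header lines (assumed to start with "browser", "track", or "#").
count_bed_header_lines() {
    awk '
        /^(browser|track|#)/ { n++; next }  # still inside the header block
        { exit }                            # first data row reached
        END { print n + 0 }
    ' "$1"
}

# Print the headers unchanged, then the data rows sorted by chromosome and start.
sort_bed_keep_header() {
    n=$(count_bed_header_lines "$1")
    if [ "$n" -gt 0 ]; then head -n "$n" "$1"; fi
    tail -n "+$((n + 1))" "$1" | sort -k1,1 -k2,2n
}

# Tiny illustrative input: two header lines and three unsorted rows.
workdir=$(mktemp -d)
printf 'track name=demo\n#chrom\tstart\tend\nchr2\t5\t10\nchr1\t300\t400\nchr1\t100\t200\n' > "$workdir/input.bed"

header_count=$(count_bed_header_lines "$workdir/input.bed")
sort_bed_keep_header "$workdir/input.bed" > "$workdir/input-sorted.bed"
echo "header lines: $header_count"
cat "$workdir/input-sorted.bed"
```

With real data, the output of sort_bed_keep_header could be piped to bgzip, and the reported count used as N for --skip-lines.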

Linkage Disequilibrium is an important tool for interpreting LocusZoom.js plots. In order to support viewing any region, most LocusZoom.js usages take advantage of the Michigan LD server to calculate region-based LD relative to a particular reference variant, based on the well-known 1000G reference panel.

We recognize that the 1000G reference panel (and its sub-populations) is not suited to all cases, especially for studies with ancestry-specific results or large numbers of rare variants not represented in a public panel. For many groups, setting up a private LD Server instance is not an option. As a fallback, we support parsing a file format derived from PLINK 1.9 `--ld-snp` calculations. Instructions for preparing these files are provided below. Due to the potential for very large output files, we only support pre-calculated LD relative to one (or a few) LD reference variants; this means that this feature requires some advance knowledge of interesting regions in order to be useful. If the user views any region that is not near a pre-provided reference variant, they will see grey dots indicating the absence of LD information. We have intentionally restricted the demo so that this limitation is clear.

Preparing genotype files: harmonizing ID formats

LocusZoom typically calculates LD relative to a variant identified by an EPACTS-format specifier (chrom:pos_ref/alt). However, genotype VCF files have no single standard for how variants are identified, which can make it hard to match the requested variant to the actual data. Some files are not even internally consistent, which makes it hard to write copy-and-paste commands that work across many files. Your file can be transformed to match the assumptions of this tutorial using bcftools and the command below:
bcftools annotate -Oz --set-id '%CHROM\:%POS\_%REF\/%ALT' original_vcf.gz > vcfname_epacts_id_format.gz
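
One way to confirm the result is to count records whose ID column does not equal chrom:pos_ref/alt as built from the other columns. This is a rough sketch using awk on uncompressed VCF text; the function name and the sample records are made up for illustration.

```shell
# Read VCF text on stdin and print how many records have an ID (column 3)
# that differs from CHROM:POS_REF/ALT. Header lines (starting with "#") are skipped.
count_id_mismatches() {
    awk -F'\t' '
        /^#/ { next }
        $3 != ($1 ":" $2 "_" $4 "/" $5) { bad++ }
        END { print bad + 0 }
    '
}

# Illustrative sample: one harmonized record and one that still carries an rsID.
workdir=$(mktemp -d)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '22\t37470224\t22:37470224_T/C\tT\tC\t.\t.\t.\n'
    printf '22\t37370297\trs0000001\tT\tC\t.\t.\t.\n'
} > "$workdir/sample.vcf"

mismatches=$(count_id_mismatches < "$workdir/sample.vcf")
echo "records with unexpected IDs: $mismatches"

# On a real file, something like:
#   zcat < vcfname_epacts_id_format.gz | count_id_mismatches
```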
In some rare cases (such as 1000G phase 3), data preparation errors may result in duplicate entries for the same variant. This can break PLINK. A command such as the one below can be used to find these duplicates:
zcat < vcfname_epacts_id_format.gz | cut -f3 | sort | uniq -d
They can then be removed using the following command (check the output carefully before using, because reasons for duplicate lines vary widely):

bcftools norm -Oz --rm-dup all vcfname_epacts_id_format.gz > vcfname_epacts_id_format_rmdups.gz
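
The duplicate check can be tried on a small uncompressed sample first. The sketch below wraps the same cut, sort, and uniq pipeline in a hypothetical helper. It also skips VCF header lines, which is an assumption added here because header lines have no ID column. The records are invented for illustration.

```shell
# Read VCF text on stdin and print each ID (column 3) that occurs more than once.
list_duplicate_ids() {
    grep -v '^#' | cut -f3 | sort | uniq -d
}

# Illustrative sample in which one variant appears twice.
workdir=$(mktemp -d)
{
    printf '#CHROM\tPOS\tID\tREF\tALT\n'
    printf '22\t37370297\t22:37370297_T/C\tT\tC\n'
    printf '22\t37470224\t22:37470224_T/C\tT\tC\n'
    printf '22\t37470224\t22:37470224_T/C\tT\tC\n'
} > "$workdir/sample.vcf"

duplicates=$(list_duplicate_ids < "$workdir/sample.vcf")
echo "duplicate IDs: $duplicates"

# On a real file, before and after removing duplicates:
#   zcat < vcfname_epacts_id_format.gz | list_duplicate_ids
#   zcat < vcfname_epacts_id_format_rmdups.gz | list_duplicate_ids
```

An empty result for the second file would suggest that the duplicates are gone.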

Calculating LD relative to reference variants

The command below will calculate LD relative to (several) variants in a 500 kb region centered around each reference variant.
plink-1.9 --r2 --vcf vcfname_epacts_id_format.gz --ld-window 499999 --ld-window-kb 500 --ld-window-r2 0.0 --ld-snp-list mysnplist.txt
This command assumes the presence of a file named mysnplist.txt, which lists the reference variants, one ID per row, in the same chrom:pos_ref/alt format as the VCF ID column (for example, 22:37470224_T/C).
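
A minimal sketch of creating and checking such a list, assuming one variant ID per row in chrom:pos_ref/alt format. The two IDs are reused from the sample LD row on this page purely as an illustration, and the pattern used for checking is an assumption about what a well-formed ID looks like.

```shell
# Write a small list of reference variants, one ID per row.
workdir=$(mktemp -d)
snplist="$workdir/mysnplist.txt"
printf '%s\n' '22:37470224_T/C' '22:37370297_T/C' > "$snplist"

# Count rows that do not look like chrom:pos_ref/alt (assumed pattern).
# grep -c exits non-zero when the count is 0, hence the "|| true".
malformed=$(grep -Evc '^[^:[:space:]]+:[0-9]+_[ACGTN]+/[ACGTN,*]+$' "$snplist" || true)

echo "variants listed: $(wc -l < "$snplist" | tr -d ' ')"
echo "malformed rows: $malformed"
```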

Preparing the LD output file for use with LocusZoom.js

As of this writing, this tutorial relies on a "list of SNPs" feature that requires PLINK 1.9.x and is not yet available in newer versions. For historical reasons, PLINK's default output format is not compatible with tabix. It can be transformed into a format readable by LocusZoom via the following sequence of commands.
cat plink.ld | tail -n+2 | sed 's/^[[:space:]]*//g' | sed 's/[[:space:]]*$//g' | tr -s ' ' '\t' | sort -k4,4 -k5,5n | bgzip > && tabix -s4 -b5 -e5
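
To see what the text-processing part of this pipeline does, it can be run on a tiny sample without bgzip or tabix. The sketch below wraps those steps in a hypothetical helper, plink_ld_to_tsv. The sample rows imitate PLINK's space-padded layout and are illustrative, and the output filename in the final comment (mystudy.ld.tab.gz) is a made-up placeholder.

```shell
# Drop the header row, trim leading/trailing whitespace, convert runs of spaces
# to single tabs, then sort by the SNP_B chromosome (col 4) and position (col 5).
plink_ld_to_tsv() {
    tail -n +2 "$1" \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        | tr -s ' ' '\t' \
        | sort -k4,4 -k5,5n
}

# Illustrative stand-in for plink.ld, with space padding as in PLINK output.
workdir=$(mktemp -d)
printf '%s\n' \
    ' CHR_A         BP_A           SNP_A  CHR_B         BP_B           SNP_B           R2 ' \
    '    22     37470224 22:37470224_T/C     22     37470224 22:37470224_T/C            1 ' \
    '    22     37470224 22:37470224_T/C     22     37370297 22:37370297_T/C  0.000178517 ' \
    > "$workdir/plink.ld"

plink_ld_to_tsv "$workdir/plink.ld" > "$workdir/plink.ld.tsv"
cat "$workdir/plink.ld.tsv"

# With bgzip and tabix installed, the same output could be compressed and indexed:
#   plink_ld_to_tsv plink.ld | bgzip > mystudy.ld.tab.gz && tabix -s4 -b5 -e5 mystudy.ld.tab.gz
```

The tabix options point at columns 4 and 5, so the index is built on the position of the second variant in each pair rather than on the reference variant.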

I don't use PLINK; how should my file be formatted?

A custom LD file should be tab-delimited, with one row per pair of variants. Each row specifies a reference variant ("SNP_A") and the LD of one other variant ("SNP_B") relative to it. A row will look like the following example, taken from actual PLINK output:
22	37470224	22:37470224_T/C	22	37370297	22:37370297_T/C	0.000178517
Note: pre-calculated LD files can easily become very large. We recommend only outputting LD relative to a few target reference variants at a time.
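
A file produced by another tool can be sanity-checked before compressing and indexing it. The sketch below assumes the seven-column layout of the sample row above (chromosome, position, and ID for each variant, then the LD value) and assumes LD values are plain numbers between 0 and 1. The helper name and the deliberately broken rows are illustrative.

```shell
# Print the number of rows that break the assumed layout:
# 7 tab-separated columns, integer positions in columns 2 and 5,
# and a numeric LD value between 0 and 1 in column 7.
count_bad_ld_rows() {
    awk -F'\t' '
        NF != 7                                    { bad++; next }
        $2 !~ /^[0-9]+$/ || $5 !~ /^[0-9]+$/       { bad++; next }
        $7 !~ /^[0-9.eE+-]+$/ || $7 < 0 || $7 > 1  { bad++; next }
        END { print bad + 0 }
    ' "$1"
}

workdir=$(mktemp -d)
ldfile="$workdir/custom_ld.tsv"
# One valid row, one row with a missing column, one row with LD out of range.
printf '22\t37470224\t22:37470224_T/C\t22\t37370297\t22:37370297_T/C\t0.000178517\n' > "$ldfile"
printf '22\t37470224\t22:37470224_T/C\t22\t37370297\t0.5\n' >> "$ldfile"
printf '22\t37470224\t22:37470224_T/C\t22\t37380000\t22:37380000_A/G\t1.5\n' >> "$ldfile"

bad_rows=$(count_bad_ld_rows "$ldfile")
echo "rows failing the check: $bad_rows"
```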