LocusZoom.js is able to fetch data directly from tabix files. This is very helpful if you want to render a plot from your own data, without transforming the files into an intermediate format (like creating JSON files or loading the data into an API first). This is particularly important for very large datasets, because it allows interactive region-based queries, without having to transfer the entire dataset to the user's web browser first. Only the index file is loaded, plus the data needed for a particular region of interest.
LocusZoom provides an extension to help parse tabix data files for several common data formats. Additional parsers and index types may be supported in the future. Working directly with tabix files is a fast way to get started with LocusZoom, because it does not require building or maintaining a server backend. It is helpful for sharing a small number of datasets, or for groups that don't want to maintain their own infrastructure. Typically, larger data sharing portals will not use this approach: as the number of datasets grows, they will want to run their own web server to support features like searching, complex queries, and harmonizing many datasets into a single standard format. Tools like PheWeb are a popular way to handle this sort of use case.
A key feature of LocusZoom is that each track is independent. This means it is very straightforward to define layouts in which some tracks come from a tabix file (like a BED track), while others are fetched from a remote web server that handles standard well-known datasets (like genes or 1000G LD).
In this demo, tabix loaders (and built-in file format parsers) are used for association, LD, and BED track data. Genes and recombination rate are fetched from the UM PortalDev API. View the source code of this page for details.
Below are some tips on formatting your data files. If you are using a static file storage provider like Amazon S3 or Google Cloud Storage, note that you may need to configure some additional request headers before Tabix will work properly.
There is no single standard for GWAS summary statistics. As such, the LocusZoom.js parser exposes many sets of options for how to read the data. The general instructions from my.locuszoom.org are a useful starting point, since the same basic parsing logic is shared across multiple tools.
BED files are a standard format with 3-12 columns of data. The default LocusZoom panel layout for a BED track will use chromosome, position, line name, and (optionally) color to draw the plot. Score will be shown as a tooltip field (if present); it may have a different meaning and scale depending on the contents of your BED file. As with any LocusZoom track, custom layouts can be created to render data in different ways, or to use more or fewer columns based on the data of interest.
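Before indexing, it can help to confirm that every row falls within the 3-12 column range described above. The following is a minimal sketch, not part of LocusZoom or tabix: the function name `check_bed_columns` is hypothetical, and it assumes a tab-delimited file with no header rows.

```shell
# Hypothetical helper: report any row whose tab-delimited column count
# falls outside the 3-12 range allowed for BED files.
# Assumes the file has no header rows. Exits non-zero if any row is flagged.
check_bed_columns() {
    awk -F'\t' '
        BEGIN { bad = 0 }
        NF < 3 || NF > 12 {
            printf "line %d: expected 3-12 columns, found %d\n", NR, NF
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

For example, `check_bed_columns input.bed` prints nothing and exits with status 0 when every row has an acceptable column count.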
The following command is helpful for preparing BED files for use in the browser:
$ sort -k1,1 -k2,2n input.bed | bgzip > input-sorted.bed.gz && tabix -p bed input-sorted.bed.gz
Some BED files have one or more header rows; in this case, add `--skip-lines N` to the tabix command (where N is the number of header rows).
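Note that a plain `sort` treats header rows like any other line, so they can end up in the middle of the sorted output. One way to avoid this is sketched below, assuming the first N lines of the file are the headers. The function name `sort_bed_keep_header` is hypothetical and is not part of any existing tool.

```shell
# Hypothetical helper: sort BED records by chromosome, then by start position,
# while leaving the first N header lines untouched at the top of the output.
# Usage: sort_bed_keep_header INPUT.bed N_HEADER_LINES
sort_bed_keep_header() {
    local input="$1"
    local n_headers="${2:-0}"
    # Emit the header lines unchanged, if there are any
    if [ "$n_headers" -gt 0 ]; then
        head -n "$n_headers" "$input"
    fi
    # Sort the remaining lines: column 1 as text, column 2 numerically
    tail -n "+$((n_headers + 1))" "$input" | sort -k1,1 -k2,2n
}
```

The output could then be piped into `bgzip` and indexed with `tabix -p bed --skip-lines N`, following the commands shown earlier.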
Linkage Disequilibrium is an important tool for interpreting LocusZoom.js plots. In order to support viewing any region, most LocusZoom.js deployments use the Michigan LD server to calculate region-based LD relative to a particular reference variant, based on the well-known 1000G reference panel.
We recognize that the 1000G reference panel (and its sub-populations) is not suited to all cases, especially for studies with ancestry-specific results or large numbers of rare variants not represented in a public panel. For many groups, setting up a private LD Server instance is not an option. As a fallback, we support parsing a file format derived from PLINK 1.9 `--ld-snp` calculations. Instructions for preparing these files are provided below. Due to the potential for very large output files, we only support pre-calculated LD relative to one (or a few) LD reference variants; this means that this feature requires some advance knowledge of interesting regions in order to be useful. If the user views any region that is not near a pre-provided reference variant, they will see grey dots indicating the absence of LD information. We have intentionally restricted the demo so that this limitation is clear.
bcftools annotate -Oz --set-id '%CHROM\:%POS\_%REF\/%ALT' original_vcf.gz > vcfname_epacts_id_format.gz
zcat < vcfname_epacts_id_format.gz | cut -f3 | sort | uniq -d
bcftools norm -Oz --rm-dup all vcfname_epacts_id_format.gz > vcfname_epacts_id_format_rmdups.gz
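The duplicate check above sends VCF header lines through the same pipeline as the variant records. A slightly more defensive variant is sketched here: it drops lines starting with `#` before extracting the ID column. The function name `list_duplicate_ids` is hypothetical, and the sketch assumes uncompressed VCF text arrives on standard input.

```shell
# Hypothetical helper: read VCF text on stdin and print each variant ID
# (column 3) that occurs more than once. Header lines (starting with "#")
# are skipped so they cannot be mistaken for records.
list_duplicate_ids() {
    grep -v '^#' | cut -f3 | sort | uniq -d
}
```

It could be used as `zcat < vcfname_epacts_id_format.gz | list_duplicate_ids`; empty output means no duplicated IDs were found.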
plink-1.9 --r2 --vcf vcfname_epacts_id_format.gz --ld-window 499999 --ld-window-kb 500 --ld-window-r2 0.0 --ld-snp-list mysnplist.txt
16:53842908_G/A 16:53797908_C/G 16:53809247_G/A
cat plink.ld | tail -n+2 | sed 's/^[[:space:]]*//g' | sed 's/[[:space:]]*$//g' | tr -s ' ' '\t' | sort -k4,4 -k5,5n | bgzip > plink.ld.tab.gz && tabix -s4 -b5 -e5 plink.ld.tab.gz
CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
22 37470224 22:37470224_T/C 22 37370297 22:37370297_T/C 0.000178517
Note: pre-calculated LD files can easily become very large. We recommend only outputting LD relative to a few target reference variants at a time.
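To inspect the converted rows before compressing and indexing them, the text-processing part of the conversion command can be run on its own. The sketch below wraps those steps in a function and adds a simple column-count check. Both function names (`plink_ld_to_tsv`, `check_ld_columns`) are hypothetical, and the check assumes the seven-column layout shown above.

```shell
# Hypothetical helper: read a PLINK .ld report on stdin and write
# header-less, tab-delimited rows sorted by CHR_B, then BP_B.
# Steps: drop the header row, trim leading/trailing padding,
# turn each run of spaces into a single tab, then sort.
plink_ld_to_tsv() {
    tail -n +2 |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' |
        tr -s ' ' '\t' |
        sort -k4,4 -k5,5n
}

# Hypothetical check: succeed only if every row on stdin has exactly
# 7 tab-delimited fields (CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2).
check_ld_columns() {
    awk -F'\t' 'BEGIN { bad = 0 } NF != 7 { bad = 1 } END { exit bad }'
}
```

For example, `plink_ld_to_tsv < plink.ld | check_ld_columns` exits with status 0 when every converted row has seven fields.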