Annotated JSON file format overview

Annotated JSON file is result of annotation pipeline and is the principal source of data for a dataset. Usually it is archived by gzip or bzip2 because of its large size.

The file consists of text lines, each line is JSON record. The first record represents dataset metadata, next lines represent variants of dataset. Usually variant data is long: kilobytes and more. Variants are sorted by chromosome/position, and this order defines order of variants in dataset viewing regime.

Here is not complete description of annotated JSON format, we discuss here general structure of it and features that are essential for Anfisa functionality.

Metadata fields

Metadata is represented in JSON as dictionary, and the following fields are essential for proper setup of dataset in Anfisa:

  • "record_type", always "metadata"

  • "case", name of case

  • "data_schema", kind of data schema, determines variant of logic for preparation and configuration of dataset, in the current version only "CASE" is supported

  • "modes", list of modes of data, the the current version of the system only two modes are supported, and determine which base reference is used in dataset: either "hg38" or "hg19"

  • "samples", dictionary representing information about samples in family: identifier, name, sex, if affected, and relations between samples

  • "proband", identifier of proband in "samples" structure (proband must be always affected)

  • "versions", dictionary representing versions of sources used in annotation process

  • "cohorts", list, usually empty; in case if cohorts present, defines distribution of samples to cohorts

Variant data structure

Data for a variant is represented in JSON as dictionary with the following top blocks:

  • "record_type", always "metadata"

  • "_view", dictionary representing data used for "accurate presentation part" of viewing regime

    "_view" dictionary is container of dictionaries, one per "accurate" aspects

  • "_filtration", dictionary representing (most part of) data used for filtration

    "_filtration" dictionary is container of data items, each one of these items forms content of unit property fields

  • "__data", dictionary representing "technical presentation part" of viewing regime

    "__Data" dictionary is container of fields represented in "_main" technical aspect, but some fields in the dictionary can form separated technical aspects (usually multi-column ones and "VCF")

Thus, the blocks "_view" and "__data" are prepared for functionality of viewing regime, and the functionality of filtration is covered preferably by block "_filtration", but can use other blocks of data.

Location path notation in JSON variant annotation

To locate data inside JSON structure the following path notation is used:

  • any full path notation starts with /, it means that location points to root of JSON annotation

  • next blocks in notation are read from left to right and means move of location inside blocks of data

  • block <identifier>/ means move of location to identified field inside dictionary

  • block [] means that current point of location is a list, and moves location to elements of this list

  • relative path notation is equal to join of base path, symbol '/' and relative path, usually relative path is simple and contains only <identifier>

Examples of full paths:

"/_view/bioinformatics/called_by[]"

"/__data/transcript_consequences[]"

"/_filters/dist_from_exon_worst"

Fixed paths

The current version of the system assumes that information for zygosity and transcripts correspondently are located in annotated JSON record by the following paths:

/__data/zygosity

/_view/transcripts

All other paths in use are subject of configuration.

Modifications in annotated JSON record

On creation of dataset the JSON annotated variant record is being modifying by the following logic features of preparation process:

  • The following fields are created (processed from deeper information inside JSON record) and are put in top level dictionary and are used in the system for navigation purposes:

    • "_label" - visualization label of variant, visible in list of variants

    • "_color" - color indication of variant significance

    • "_key" - unique identifier of variant (used in tagging mechanism support for identification of records)

    • "_rand" - random integer, used for random variant sampling (auxiliary viewing regime)

  • Configuration of filtration schema API logic allows to configure:

See also

Code configuration layer overview

Administration aspects overview