Working with JSON Lines data reports

NCBI data packages contain a set of report files with metadata about the requested records, serialized as JSON Lines. Here are a few different ways to interact with those files.

Different report files are included depending on the type of download. A gene download includes only a gene data report, whereas a genome download includes both a genome assembly data report and a genome sequence data report. The report schemas document the available fields, with their types and descriptions, and provide examples.

In all cases, the metadata is serialized as JSON Lines (pronounced “jason-lines”): one complete JSON record per line. This format enables stream processing, which is essential for very large data reports that may not fit into memory.
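To make the streaming point concrete, here is a minimal Python sketch. It is an illustration rather than part of the NCBI tools: the function name `iter_records` and the two sample records are assumptions made up for this example. It parses one line at a time, so only the current record is held in memory.

```python
import io
import json
from typing import IO, Iterator


def iter_records(handle: IO[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


# Hypothetical stand-in for a data report; real reports carry many more fields.
sample = io.StringIO(
    '{"geneId": "2", "symbol": "A2M"}\n'
    '{"geneId": "0", "symbol": "GENE_X"}\n'
)
symbols = [record["symbol"] for record in iter_records(sample)]
print(symbols)
```

The same function works unchanged on an open file handle, because it never asks for more than the next line.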

Fortunately, good tools are available for working with JSON Lines, including dataformat and jq.

Before we go further, we need to address the difficulty of converting a hierarchical object into rows and columns. Any nested, repeated object can be converted into a single row or a set of rows in many possible ways. Imagine you have 3 gene records, each with 4 transcripts. Do you want to flatten this into 3 rows, with all transcript data delimited by a ;? Or do you want 12 rows, one for each transcript, with the relevant gene information repeated on each? Neither is right or wrong; it depends on what you will do next with the flattened representation of the data.
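The two choices can be sketched in a few lines of Python. Everything here is hypothetical: the records, gene symbols, and transcript names are made up, and the function names are only for illustration.

```python
# Hypothetical data: 3 genes, each with 4 transcripts (all names are made up).
records = [
    {
        "geneId": str(g),
        "symbol": f"GENE{g}",
        "transcripts": [{"name": f"variant {t}"} for t in range(1, 5)],
    }
    for g in range(1, 4)
]


def one_row_per_gene(records):
    """Flatten to one row per gene; transcript names are joined with ';'."""
    rows = []
    for rec in records:
        names = ";".join(t["name"] for t in rec.get("transcripts", []))
        rows.append((rec["geneId"], rec["symbol"], names))
    return rows


def one_row_per_transcript(records):
    """Flatten to one row per transcript; gene fields repeat on every row."""
    rows = []
    for rec in records:
        for t in rec.get("transcripts", []):
            rows.append((rec["geneId"], rec["symbol"], t["name"]))
    return rows


wide = one_row_per_gene(records)
long = one_row_per_transcript(records)
print(len(wide), len(long))
```

The first form keeps one line per gene but needs a second parsing step to split the transcript names; the second is easier to filter or join on transcripts but repeats the gene columns.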

With that warning in mind, let’s look at a few common questions.

How do I convert a JSON Lines data report to a tabular report?

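There are two recommended approaches: use the dataformat tool, or extract the data report and process it with jq. Both start from the same data package.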

First, retrieve a package of gene data for a set of NCBI Gene IDs.

Command

datasets download gene gene-id 1,2,3,9,10,11,12,13,14,15,16,17 --filename example_gene_data_package.zip
unzip -Z1 example_gene_data_package.zip

Output

Downloading: example_gene_data_package.zip    31.9kB done
README.md
ncbi_dataset/data/gene.fna
ncbi_dataset/data/rna.fna
ncbi_dataset/data/protein.faa
ncbi_dataset/data/data_report.jsonl
ncbi_dataset/data/data_table.tsv
ncbi_dataset/data/dataset_catalog.json

Now that you have a data package containing a gene data report in JSON Lines format, use the dataformat tool to transform it.

Command

dataformat tsv gene --fields gene-id,symbol,transcript-name --package example_gene_data_package.zip | head --lines=10

Output

NCBI GeneID	Symbol	Transcript Transcript Name
2	A2M	transcript variant 2
2	A2M	transcript variant X1
2	A2M	transcript variant 4
2	A2M	transcript variant 1
2	A2M	transcript variant 3
...

Next, unzip just the data report file and process it with `jq`:

Command

unzip -qc example_gene_data_package.zip ncbi_dataset/data/data_report.jsonl | jq -r '"\(.geneId)\t\(.symbol)\t\(.transcripts[]?.name)"' | head --lines=10

Output

2	A2M	transcript variant 2
2	A2M	transcript variant X1
2	A2M	transcript variant 4
2	A2M	transcript variant 1
2	A2M	transcript variant 3
...
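
If jq is not available, roughly the same table can be produced with Python's standard library. The sketch below is not part of the NCBI tools: the function name `report_to_tsv_rows` is made up, and the field names (geneId, symbol, transcripts, name) are taken from the jq expression above. Like that expression, it emits one row per transcript and nothing for records without transcripts; unlike jq, a missing value prints as None rather than null.

```python
import io
import json
import zipfile

# Path of the report inside the package, as shown in the unzip listing.
REPORT_PATH = "ncbi_dataset/data/data_report.jsonl"


def report_to_tsv_rows(package_path, report_path=REPORT_PATH):
    """Yield 'geneId<TAB>symbol<TAB>transcript name', one line per transcript."""
    with zipfile.ZipFile(package_path) as package:
        # Stream the report straight out of the zip, one line at a time.
        with package.open(report_path) as raw:
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                if not line.strip():
                    continue
                record = json.loads(line)
                for transcript in record.get("transcripts", []):
                    yield "\t".join(
                        str(value)
                        for value in (
                            record.get("geneId"),
                            record.get("symbol"),
                            transcript.get("name"),
                        )
                    )


# Usage, assuming the package downloaded earlier is in the current directory:
# for row in report_to_tsv_rows("example_gene_data_package.zip"):
#     print(row)
```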

How do I filter the JSON Lines data report?

To filter for records containing a specific value, you may be able to use grep, since each record occupies a single line of the file. Keep in mind that grep matches the text anywhere on the line, not just in one field.
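
When a plain text match is too broad, a field-aware filter is one alternative. The following Python sketch is an illustration under assumptions: the function name `filter_records` is made up, and the two sample records (including the description field values) are hypothetical. It keeps only records whose named field equals the requested value.

```python
import io
import json


def filter_records(handle, field, value):
    """Yield records from a JSON Lines stream whose top-level field equals value."""
    for line in handle:
        if line.strip():
            record = json.loads(line)
            if record.get(field) == value:
                yield record


# Hypothetical report: the second record mentions "A2M" only in free text,
# so a text search for A2M would match both lines.
sample = io.StringIO(
    '{"geneId": "2", "symbol": "A2M", "description": "alpha-2-macroglobulin"}\n'
    '{"geneId": "0", "symbol": "GENE_X", "description": "made-up gene similar to A2M"}\n'
)
matches = list(filter_records(sample, "symbol", "A2M"))
print([m["geneId"] for m in matches])
```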

How do I view a JSON Lines data report without using the command line?

If you prefer not to use the command line to view a JSON Lines data report, you may want to use the Dadroit JSON Viewer.

Open the JSON Lines data report to show it as a collapsed tree. Click the plus symbol (+) to expand any node and view its contents.

In this example, you can see the genomic range (location on the genome) for the human alpha-2-macroglobulin gene.

[Figure: Dadroit JSON Viewer showing gene data report]

Generated November 19, 2021