Download large genome data packages

Use the command line to get large NCBI Datasets Genome Data Packages

If you want to download data for more than 1000 genomes, or if the genome data package exceeds 15 GB, you’ll need to use the datasets command-line tool.

Use the datasets command-line tool to download a large NCBI Datasets Genome Data Package as a dehydrated zip archive that contains only metadata and the location of the data on NCBI servers.
You can get the data in three steps:

  1. Download the dehydrated zip archive.
  2. Unzip the downloaded zip archive.
  3. Rehydrate the extracted directory to retrieve the data.
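
The three steps can be chained in a small script. The following is a minimal sketch, not an official NCBI script: the helper names `run` and `fetch_genome_package` and the `DRY_RUN` variable are assumptions made for illustration, while the `datasets` and `unzip` commands are the ones used in the steps on this page. With `DRY_RUN=1` the commands are only printed, which makes it possible to check them before starting a large download.

```shell
#!/usr/bin/env bash
# Sketch: download, unzip, and rehydrate in one function.
# run(), fetch_genome_package(), and DRY_RUN are hypothetical names.

# Print the command when DRY_RUN=1; otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: fetch_genome_package ACCESSION ZIP_FILE OUTPUT_DIR
fetch_genome_package() {
  local accession="$1" zip_file="$2" out_dir="$3"
  # 1. Download the dehydrated zip archive.
  run datasets download genome accession "$accession" --dehydrated --filename "$zip_file" || return 1
  # 2. Unzip the archive into the output directory.
  run unzip "$zip_file" -d "$out_dir" || return 1
  # 3. Rehydrate the extracted directory to retrieve the data.
  run datasets rehydrate --directory "$out_dir/"
}
```

For example, `DRY_RUN=1 fetch_genome_package GCF_000001405.39 human_GRCh38_dataset.zip my_human_dataset` prints the three commands without contacting NCBI servers.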

1. Download

Download a dehydrated data package (< 5 KB) for the human GRCh38 RefSeq genome using the datasets command-line tool.

datasets download genome accession GCF_000001405.39 --dehydrated --filename human_GRCh38_dataset.zip

2. Unzip

Unzip the dehydrated zip archive to a directory, for example my_human_dataset:

unzip human_GRCh38_dataset.zip -d my_human_dataset

The output will look like this:

Archive:  human_GRCh38_dataset.zip
  inflating: my_human_dataset/README.md
  inflating: my_human_dataset/ncbi_dataset/data/GCF_000001405.39/assembly_data_report.jsonl
  inflating: my_human_dataset/ncbi_dataset/data/dataset_catalog.json
  inflating: my_human_dataset/ncbi_dataset/fetch.txt
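
Before rehydrating, it can be useful to confirm that the extracted directory has the layout shown in the unzip output. The following is a rough sketch: `check_dehydrated_dir` is a hypothetical helper name, and the list of expected files is taken only from the example output above, so other packages may differ.

```shell
# check_dehydrated_dir is a hypothetical helper, not part of the datasets tool.
# Usage: check_dehydrated_dir DIR
# Returns 0 if the files seen in the example unzip output are present, 1 otherwise.
check_dehydrated_dir() {
  local dir="$1" missing=0 f
  for f in README.md ncbi_dataset/fetch.txt ncbi_dataset/data/dataset_catalog.json; do
    if [ ! -f "$dir/$f" ]; then
      # Report each missing file on stderr.
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_dehydrated_dir my_human_dataset && echo "ready to rehydrate"` prints the message only when all three files exist.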

3. Rehydrate

Run the rehydrate command to get the full genome data package, including genome sequences and annotation:

datasets rehydrate --directory my_human_dataset/

A progress meter shows how many files will be retrieved and how many have been completed. When rehydration finishes, the output looks like this:

Found 43 files for rehydration
Completed 43 of 43 [================================================] 100%
Downloading: my_human_dataset/ncbi_dataset/data/GCF_000001405.39/chr6.fna    173MB done
Downloading: my_human_dataset/ncbi_dataset/data/GCF_000001405.39/chr5.fna    184MB done
Downloading: my_human_dataset/ncbi_dataset/data/GCF_000001405.39/chrX.fna    158MB done
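
As a quick sanity check after rehydration, you can count the sequence files that were written to disk. This is a minimal sketch: `count_fasta_files` is a hypothetical helper name, and it assumes sequence files end in `.fna` and sit under `ncbi_dataset/data`, as in the example output above. The count will generally be lower than the total reported by the progress meter, because that total may include files other than sequences.

```shell
# count_fasta_files is a hypothetical helper, not part of the datasets tool.
# Usage: count_fasta_files DIR
# Prints the number of .fna files under DIR/ncbi_dataset/data.
count_fasta_files() {
  find "$1/ncbi_dataset/data" -type f -name '*.fna' | wc -l | tr -d ' '
}
```

For example, `count_fasta_files my_human_dataset` prints the number of `.fna` files in the rehydrated package.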
Generated September 24, 2021