protein

Download a SARS-CoV-2 protein dataset by protein name

Name

datasets download virus protein - Download a SARS-CoV-2 protein dataset by protein name

Synopsis

datasets download virus protein <protein_name ...> [flags]

Description

Download a SARS-CoV-2 protein data package by protein name. A SARS-CoV-2 protein
data package includes coding (CDS) and protein sequences, annotation, and a detailed
data report. Data packages are downloaded as a zip file.

The default SARS-CoV-2 protein data package includes the following files:

  • cds.fna (nucleotide coding sequences)

  • protein.faa (protein sequences)

  • data_report.jsonl (data report with viral metadata)

  • dataset_catalog.json (a list of files and file types included in the data package)
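
As a rough illustration of working with the files listed above, the following Python sketch checks a downloaded zip file for the four file names and reads data_report.jsonl one record at a time. The function names are hypothetical, and because the layout inside the archive is not described here, members are matched by base name only. Which files are actually present may depend on the --include option.

```python
import json
import zipfile
from pathlib import PurePosixPath

# File names copied from the list above. Their directory inside the archive is
# an unknown here, so members are compared by base name only.
EXPECTED_FILES = {"cds.fna", "protein.faa", "data_report.jsonl", "dataset_catalog.json"}


def missing_package_files(zip_path):
    """Return the expected file names that are absent from the archive, sorted."""
    with zipfile.ZipFile(zip_path) as archive:
        present = {PurePosixPath(name).name for name in archive.namelist()}
    return sorted(EXPECTED_FILES - present)


def iter_data_report(zip_path):
    """Yield one dict per non-empty line of data_report.jsonl (JSON Lines)."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if PurePosixPath(name).name != "data_report.jsonl":
                continue
            with archive.open(name) as handle:
                for raw in handle:
                    line = raw.decode("utf-8").strip()
                    if line:
                        yield json.loads(line)
            return  # only the first matching member is read


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        print(path, "missing:", missing_package_files(path) or "nothing")
        print(path, "records:", sum(1 for _ in iter_data_report(path)))
```

Reading the report line by line avoids extracting the archive and keeps memory use low for large packages. The records are returned as plain dictionaries because the report's field names are not documented in this page.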

Allowed protein names are: ORF1ab, ORF1a, nsp1, nsp2, nsp3, nsp4, nsp5, nsp6, nsp7, nsp8, nsp9, nsp10, rdrp, nsp11, nsp13, nsp14, nsp15, nsp16, S, ORF3a, E, M, ORF6, ORF7a, ORF7b, ORF8, N, ORF10
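
A script that wraps this command might check names against the allowed list before calling it. The Python sketch below is one way to do that. The helper name is hypothetical, and it assumes names must match the list exactly, including letter case, which this page does not state.

```python
# Names copied verbatim from the allowed list above.
ALLOWED_PROTEIN_NAMES = (
    "ORF1ab", "ORF1a", "nsp1", "nsp2", "nsp3", "nsp4", "nsp5", "nsp6",
    "nsp7", "nsp8", "nsp9", "nsp10", "rdrp", "nsp11", "nsp13", "nsp14",
    "nsp15", "nsp16", "S", "ORF3a", "E", "M", "ORF6", "ORF7a", "ORF7b",
    "ORF8", "N", "ORF10",
)


def unknown_protein_names(names):
    """Return the names not found in the allowed list, in input order.

    Assumption: comparison is exact and case-sensitive.
    """
    allowed = set(ALLOWED_PROTEIN_NAMES)
    return [name for name in names if name not in allowed]


if __name__ == "__main__":
    print(unknown_protein_names(["S", "rdrp", "spike"]))
```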

Examples

  datasets download virus protein S --host dog --filename SARS2-spike-dog.zip
  datasets download virus protein rdrp --refseq --filename SARS2-rdrp-refseq.zip
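
When the command is driven from a script, it can help to assemble the argument list programmatically. The Python sketch below rebuilds the two examples above using only flags documented on this page. The function name and its parameters are hypothetical, and the relative order of flags is an arbitrary choice.

```python
import shlex


def build_protein_download(protein_names, filename=None, host=None, refseq=False):
    """Build an argument list for `datasets download virus protein`.

    Only --host, --refseq and --filename are handled in this sketch.
    """
    if not protein_names:
        raise ValueError("at least one protein name is required")
    argv = ["datasets", "download", "virus", "protein", *protein_names]
    if host:
        argv += ["--host", host]
    if refseq:
        argv.append("--refseq")
    if filename:
        argv += ["--filename", filename]
    return argv


if __name__ == "__main__":
    spike = build_protein_download(["S"], host="dog", filename="SARS2-spike-dog.zip")
    rdrp = build_protein_download(["rdrp"], refseq=True, filename="SARS2-rdrp-refseq.zip")
    # shlex.join renders each list as a shell-safe command line for inspection.
    print(shlex.join(spike))
    print(shlex.join(rdrp))
```

The resulting list can be passed to subprocess.run(argv, check=True), assuming the datasets executable is on PATH. Passing a list rather than a single string avoids shell quoting problems with file names.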

Options

      --annotated                 Limit to annotated genomes
      --api-key string            Specify an NCBI API key
      --complete-only             Limit to complete genomes
      --debug                     Emit debugging info
      --filename string           Specify a custom file name for the downloaded data package (default "ncbi_dataset.zip")
      --geo-location string       Limit to coronavirus genomes isolated from a specified geographic location (continent, country or U.S. state)
      --help                      Print detailed help about a datasets command
      --host string               Limit to virus genomes isolated from a specified host species
      --include string(,string)   Specify virus genome sequence types to download
                                    * cds:        nucleotide coding sequences
                                    * protein:    amino acid sequences
                                    * annotation: annotation report
                                    * biosample:  biosample report
                                    * none:       no sequence data, only primary data report
                                       (default [protein])
      --no-progressbar            Hide progress bar
      --refseq                    Limit to RefSeq coronavirus genomes
      --released-after string     Limit to coronavirus genomes released on or after a specified date (MM/DD/YYYY)
      --updated-after string      Limit to coronavirus genomes updated on or after a specified date (MM/DD/YYYY)
      --version                   Print version of datasets
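
Two of the options above take structured values: --include accepts one or more sequence types, and the date filters expect MM/DD/YYYY. The Python sketch below formats both. The helper names are hypothetical, and joining multiple --include values with commas is an assumption read from the "string(,string)" notation.

```python
from datetime import date

# Values listed under --include above.
INCLUDE_CHOICES = ("cds", "protein", "annotation", "biosample", "none")
DATE_FLAGS = ("--released-after", "--updated-after")


def include_flag(kinds):
    """Return ["--include", "a,b,..."] after checking each value.

    Assumption: multiple values are joined with commas and no spaces.
    """
    unknown = [kind for kind in kinds if kind not in INCLUDE_CHOICES]
    if unknown:
        raise ValueError(f"unsupported --include values: {unknown}")
    return ["--include", ",".join(kinds)]


def date_flag(flag, day):
    """Return [flag, "MM/DD/YYYY"] for one of the two date filters."""
    if flag not in DATE_FLAGS:
        raise ValueError(f"not a date filter: {flag}")
    return [flag, day.strftime("%m/%d/%Y")]


if __name__ == "__main__":
    print(include_flag(["cds", "protein"]))
    print(date_flag("--released-after", date(2021, 3, 5)))
```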

Generated May 13, 2024