Module: ncbi.datasets.package.dataset

Representations of a downloaded NCBI Datasets Package.

NCBI Datasets provides data as zip archives for Genome, Gene, Pathogen and Virus resources. Each class in this module exposes the package's dataset catalog, which helps determine the file contents programmatically.


To get started, download a package, then create a generic Dataset wrapper from its path and dataset type:

>>> from ncbi.datasets.package.dataset import get_dataset_from_file
>>> package = get_dataset_from_file(path_to_file, dataset_type)
>>> for report in package.get_data_reports():
...     print(report)  # do something with the protobuf report object
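Because the report methods documented on this page return iterators, a large package can be sampled without reading every report. The following is a minimal sketch, assuming only that the package object exposes `get_data_reports()` as documented here; `peek_reports` is a hypothetical helper name and not part of the library.

```python
from itertools import islice
from typing import Any, List


def peek_reports(package: Any, limit: int = 5) -> List[Any]:
    """Return at most `limit` reports from the package's report iterator.

    `package` is assumed to be any object with a get_data_reports() method,
    such as the Dataset subclasses documented on this page.
    """
    # islice stops pulling from the iterator once `limit` items are consumed.
    return list(islice(package.get_data_reports(), limit))
```

For example, `peek_reports(package, limit=3)` would return a list holding the first three report objects, or fewer if the package contains fewer.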

ncbi.datasets.package.dataset.get_dataset_from_file(zip_file_or_directory: str, dataset_type: str) → ncbi.datasets.package.dataset.Dataset

Create a Dataset-derived object of type ‘dataset_type’ and return it.

Returns: A subclass of the class ‘Dataset’ as specified by the caller.

class ncbi.datasets.package.dataset.Dataset(zipfile_or_directory: str)

Bases: object

Base class to extract files from datasets package

Functions to extract files from a datasets package based on file names and types in the package's catalog file

is_zipped() → bool

Return True if the dataset is stored in a zip file

get_file_root_dir() → str

Return the data directory within the dataset (e.g. ncbi_dataset/data)

get_catalog() → Dict[str, Any]

Return the datasets file catalog as a dictionary

get_file_names_by_type(file_type: str) → List[str]

Return names of all files of type ‘file_type’, e.g. ‘PROTEIN_FASTA’

get_files_by_type(file_type: str) → Iterator[Tuple[str, str]]

Return contents of all files of type ‘file_type’ along with their names

get_file_handles_by_type(file_type: str) → Iterator[Tuple[TextIO, str]]

Return file handles for all files of type ‘file_type’ along with their names
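File handles are useful when a file is too large to hold in memory as a single string. The sketch below counts FASTA header lines per file by reading each handle line by line. It assumes only the documented method and its `(handle, name)` tuple order; `count_fasta_records` is a hypothetical helper name, and whether the caller is expected to close the handles is not stated here, so the sketch leaves them untouched.

```python
from typing import Any, Dict


def count_fasta_records(dataset: Any, file_type: str = "PROTEIN_FASTA") -> Dict[str, int]:
    """Count FASTA header lines (lines starting with '>') in each file.

    `dataset` is assumed to expose get_file_handles_by_type() as documented,
    yielding (handle, name) tuples.
    """
    counts: Dict[str, int] = {}
    for handle, name in dataset.get_file_handles_by_type(file_type):
        # Iterating over a text handle reads one line at a time.
        counts[name] = sum(1 for line in handle if line.startswith(">"))
    return counts
```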

get_file_types() → List[str]

Return all file types available in the current dataset
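Combining `get_file_types()` with `get_file_names_by_type()` gives a quick inventory of a package. This is a minimal sketch that relies only on those two documented methods; `file_inventory` is a hypothetical helper name, not part of the library.

```python
from typing import Any, Dict, List


def file_inventory(dataset: Any) -> Dict[str, List[str]]:
    """Map each file type in the dataset to the names of its files.

    `dataset` is assumed to expose get_file_types() and
    get_file_names_by_type() as documented on this page.
    """
    return {
        file_type: list(dataset.get_file_names_by_type(file_type))
        for file_type in dataset.get_file_types()
    }
```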

get_file_content(file_name: str) → str

Return full text of file ‘file_name’

get_file_handle(file_name: str) → TextIO

Get handle of file using name within dataset directory

Parameters: file_name – Name of file within the data directory, e.g. if the full datasets path is ncbi_dataset/data/GCF_000001405.39/chrX.fna, file_name should be GCF_000001405.39/chrX.fna

Returns: Handle to the specified file
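The `file_name` convention can be illustrated with a small path helper that turns a full package path into the relative name these methods expect. This is a sketch only: `relative_file_name` is a hypothetical function, and the default root is the example value given for `get_file_root_dir()`; in practice the value returned by that method could be passed instead.

```python
def relative_file_name(full_path: str, root_dir: str = "ncbi_dataset/data") -> str:
    """Strip the data-directory prefix from a full path inside a package."""
    # Normalise the root so that it ends with exactly one slash.
    prefix = root_dir.rstrip("/") + "/"
    if not full_path.startswith(prefix):
        raise ValueError(f"{full_path!r} is not under {root_dir!r}")
    return full_path[len(prefix):]
```

With the path from the description above, `relative_file_name("ncbi_dataset/data/GCF_000001405.39/chrX.fna")` returns `"GCF_000001405.39/chrX.fna"`.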

stream_reports(file_type: str, protobuf_report_type: Any) → Any

Retrieve report records defined via protobuf schema from jsonl files.

Parameters:

  • file_type – The type of file from the dataset catalog, e.g. ‘DATA_REPORT’ or ‘SEQUENCE_REPORT’.

  • protobuf_report_type – Schema, defined using gRPC protobuf, for the current dataset and file type.

Yields a set of protobuf objects for the dataset and file type.
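For readers unfamiliar with jsonl (JSON Lines), each line of such a file is one complete JSON document. The sketch below shows only that file-format layer, decoding each line into a plain dictionary with the standard library. It is not the library's implementation: `stream_reports` additionally maps each record onto the given protobuf type, which this sketch does not attempt, and `iter_jsonl_records` is a hypothetical name.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_jsonl_records(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one decoded JSON object per non-blank line.

    `lines` can be any iterable of strings, including an open text handle.
    """
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines, e.g. a trailing newline at end of file
            yield json.loads(line)
```

Because the function is a generator, records are decoded one at a time as the caller iterates, which mirrors the streaming behaviour described above.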

class ncbi.datasets.package.dataset.AssemblyDataset(zipfile_or_directory: str)

Bases: ncbi.datasets.package.dataset.Dataset

Retrieve Assembly reports

Methods to read Assembly and Assembly Sequence reports

get_data_reports() → Iterator[ncbi.datasets.v1.reports.assembly_pb2.AssemblyDataReport]

Retrieve assembly reports


Yields a set of AssemblyDataReport protobuf objects

get_sequence_reports() → Iterator[ncbi.datasets.v1.reports.assembly_sequence_info_pb2.SequenceInfo]

Retrieve assembly sequence reports


Yields a set of Assembly SequenceInfo protobuf objects

class ncbi.datasets.package.dataset.GeneDataset(zipfile_or_directory: str)

Bases: ncbi.datasets.package.dataset.Dataset

Retrieve Gene reports

Methods to read Gene reports

get_data_reports() → Iterator[ncbi.datasets.v1.reports.gene_pb2.GeneDescriptor]

Retrieve a gene report object


Yields a set of GeneDescriptor protobuf objects

class ncbi.datasets.package.dataset.VirusDataset(zipfile_or_directory: str)

Bases: ncbi.datasets.package.dataset.Dataset

Retrieve Virus reports

Methods to read Virus reports

get_data_reports() → Iterator[ncbi.datasets.v1.reports.virus_pb2.VirusAssembly]

Retrieve virus assembly objects


Yields a set of virus assembly report protobuf objects

class ncbi.datasets.package.dataset.MicrobiggeDataset(zipfile_or_directory: str)

Bases: ncbi.datasets.package.dataset.Dataset

Retrieve MicroBigge pathogen reports

Methods to read MicroBigge reports

get_data_reports() → Iterator[ncbi.datasets.v1.reports.microbigge_pb2.MicroBiggeReport]

Retrieve MicroBigge data report objects


Yields a set of MicroBigge report protobuf objects

Generated October 22, 2021