Working with BAM Files

Step 1: Introduction

BAM files can be opened from remote locations (ftp, http) and from local computers. To view a BAM file, an index file must be present in the same directory as the BAM file. The index file must be named by appending “.bai” to the BAM file name. If there is no index file, you can use SAMTools to create one (please download SAMTools from http://samtools.sourceforge.net and install it locally).
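The naming rule above can be checked before opening a file. The following is a minimal Python sketch, not part of Genome Workbench: the helper names are hypothetical, and the command assumes a SAMTools executable called samtools that accepts the index subcommand. Note that SAMTools can only index a coordinate-sorted BAM file.

```python
from pathlib import Path


def expected_index_path(bam_path):
    """Return the index path described above: the BAM file name plus '.bai'."""
    bam = Path(bam_path)
    return bam.with_name(bam.name + ".bai")


def has_index(bam_path):
    """True when an index file sits in the same directory as the BAM file."""
    return expected_index_path(bam_path).is_file()


def index_command(bam_path, samtools="samtools"):
    """Build (but do not run) the SAMTools command that writes <bam>.bai."""
    return [samtools, "index", str(bam_path)]
```

If has_index returns False, the list from index_command can be passed to subprocess.run to create the index.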

BAM data that is aligned to an assembly can be viewed as run accessions from the SRA database. To find aligned data, search the SRA database with a query that includes the parameter "aligned data"[Properties], for example: ((("mus musculus"[Organism]) AND BALB/c) AND "lymph") AND "rna seq"[Strategy] AND "aligned data"[Properties].
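The same query can also be sent to the SRA database programmatically. The sketch below only builds an NCBI E-utilities esearch URL and does not send a request; the function name and the retmax default are illustrative assumptions, not something this tutorial prescribes.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def sra_search_url(term, retmax=20):
    """Build an esearch URL for the SRA database (no request is sent)."""
    params = {"db": "sra", "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


# The example query from the paragraph above, with straight quotes throughout.
query = (
    '((("mus musculus"[Organism]) AND BALB/c) AND "lymph") '
    'AND "rna seq"[Strategy] AND "aligned data"[Properties]'
)
print(sra_search_url(query))
```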

This tutorial will take you through several scenarios to view BAM files in Genome Workbench:

  • A sorted BAM file with index file
  • An unsorted BAM file with no index file (requires SAMTools)
  • SRR13020989 – SRA run accession for data originated from GenBank

Since BAM files can be VERY large, they are not loaded in full into the Genome Workbench project the way other types of data are; instead, they are accessed externally. Example files for this tutorial can be downloaded here (note that the download is large, ~356MB):

BAM Test Files

To see an example of a BAM file from a remote (ftp) location, see the BAM haplotype filtering tutorial. For more information on BAM files, see the BAM file FAQ section and the Import BAMs video tutorial. For more information on SRR run accessions of GenBank data (and ERR/DRR run accessions for data originating from EMBL-EBI/DDBJ, respectively), see the SRA knowledge base page. See this video for information about viewing BAM files on the web GDV browser (note, this data will not be secure).

If you want to display de-novo BAM files that are aligned to novel (non-NCBI) genomes, you will first need to import a FASTA file with the sequences referred to in the BAM file and then import the BAM file into the same project. For more information, please refer to the Displaying de-novo BAM files tutorial.

Step 2: BAM file with index file

From the File menu choose Open and select BAM/CSRA files from the left side.

Open BAM dialog

Click the button on the right that says Add BAM/CSRA file. Navigate to the BAM Test Files folder you downloaded, select scenario1_with_index, select the file mapt.NA12156.altex.bam, and click Open. Click Next three times (skip the mapping dialog, since this data has already been mapped) and then click Finish.

A 'New Project' now appears in the Project Tree View. Double-click the NT_ accession to open the Open View dialog. Select the Graphical Sequence View; the graphical view opens in a new tab (if the record has been updated you might see a warning message; click the OK button to close it). You can optionally open the NC_ accession to see the BAM file mapped to the whole chromosome.

If you do not see the alignments in the graphical view tab, you will have to turn them on by clicking on the Context Menu (see figure below) and choosing Alignments. You might also select Graphs to see the coverage graph of this data listed among the graph tracks as well. Another way to find the track you just uploaded is to open the Configure Tracks dialog using the Gear icon and search by the track name/partial name (mapt.NA12156.altex) among All Tracks.

GSV with BAM alignment and Context menu

Configure tracks dialog

Step 3: Viewing BAM Data

In the graphical view, in the alignment tracks section, you should have a track titled "mapt.NA12156.altex, Coverage graph – log 2 scaled". All the standard Genome Workbench navigation tools are available for panning and zooming (see Basic Operation tutorial).
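To make the idea of a log 2 scaled coverage graph concrete, here is a small Python sketch that derives per-base read depth from alignment intervals and then applies a log2 scale. It is an illustration only: the half-open interval convention and the log2(depth + 1) formula are assumptions, and Genome Workbench may compute and scale its graph differently.

```python
import math


def coverage(intervals, region_start, region_end):
    """Per-base read depth over [region_start, region_end) from half-open (start, end) reads."""
    length = region_end - region_start
    delta = [0] * (length + 1)
    for start, end in intervals:
        # Clip each read to the region and record where depth rises and falls.
        lo = max(start, region_start) - region_start
        hi = min(end, region_end) - region_start
        if lo < hi:
            delta[lo] += 1
            delta[hi] -= 1
    depth, running = [], 0
    for change in delta[:length]:
        running += change
        depth.append(running)
    return depth


def log2_scaled(depth):
    """Assumed scaling: log2(depth + 1), so uncovered positions map to 0."""
    return [math.log2(d + 1) for d in depth]
```

For two reads covering positions 0-2 and 2-4, coverage([(0, 3), (2, 5)], 0, 6) gives a depth of 2 only at position 2, where the reads overlap.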

GSV with BAM coverage graph

Double clicking on the coverage graph track will open the Graph Rendering Options dialog where the rendering style, graph scale, color, etc. can be adjusted.

Graph rendering options dialog

If you zoom in far enough, the coverage graph for the alignment track turns into a pileup graph and individual alignment features become visible. (Note: the graph shown among the Graph tracks always remains a coverage graph, so it is not very informative at high zoom levels.)

GSV with BAM pile-up graph

Mouse over the track name to make track settings located at the right of the track visible. You can adjust these settings if you wish. If you zoom in to the sequence level, you will see reads aligned to the anchor sequence with insertions and mismatches highlighted.

GSV with BAM zoomed to sequence

Hovering the mouse over an individual alignment feature opens a tooltip with details about the alignment, including the CIGAR string, percent identity, and coverage.
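For readers unfamiliar with CIGAR strings, the following Python sketch shows how one can be read. It follows the operation letters defined in the SAM format and reports how many reference bases and how many read bases an alignment covers; the function names are illustrative and unrelated to Genome Workbench.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference
CONSUMES_QUERY = set("MIS=X")      # operations that advance along the read


def parse_cigar(cigar):
    """Split a CIGAR string such as '10M2I5M' into (length, operation) pairs."""
    ops = CIGAR_OP.findall(cigar)
    if "".join(n + op for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in ops]


def spans(cigar):
    """Return (reference_span, read_length) implied by the CIGAR string."""
    ops = parse_cigar(cigar)
    ref = sum(n for n, op in ops if op in CONSUMES_REFERENCE)
    qry = sum(n for n, op in ops if op in CONSUMES_QUERY)
    return ref, qry
```

For example, spans("10M2I5M1D3M") describes a 20-base read that covers 19 reference bases: the 2-base insertion adds to the read only, and the 1-base deletion adds to the reference only.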

GSV tooltip for alignment

Step 4: BAM file with no index file

This exercise requires the use of SAMTools, a freely available package for working with BAM files. Download and expand the package and put it in a convenient folder/directory.
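In this scenario Genome Workbench calls SAMTools for you, so no command line work is needed. For reference only, a manual equivalent might look like the Python sketch below. It assumes the SAMTools 1.x command syntax (sort -o, then index) and an executable named samtools on your PATH; older SAMTools releases use a different sort syntax, and the ".sorted.bam" output name is an arbitrary choice.

```python
import shutil
import subprocess
from pathlib import Path


def sort_and_index_commands(bam_path, samtools="samtools"):
    """Build two commands: coordinate-sort the BAM, then index the sorted copy."""
    bam = Path(bam_path)
    sorted_bam = bam.with_name(bam.stem + ".sorted.bam")
    return [
        [samtools, "sort", "-o", str(sorted_bam), str(bam)],
        [samtools, "index", str(sorted_bam)],
    ]


def sort_and_index(bam_path):
    """Run both commands; raises if samtools is missing or a step fails."""
    exe = shutil.which("samtools")
    if exe is None:
        raise FileNotFoundError("samtools executable not found on PATH")
    for cmd in sort_and_index_commands(bam_path, exe):
        subprocess.run(cmd, check=True)
```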

The steps are then similar to scenario 1. From the File menu, choose Open and select BAM files from the left side of the dialog. Click the button on the right that says Add a BAM file. Navigate to the BAM Test Files folder you downloaded, select scenario2_no_index_unsorted_need_id_mapping and the file GSM409307_UCSD.H3K4me1.bam, and click Next. You will see the dialog shown below, in which Genome Workbench asks where to find the SAMTools executable.

Open BAM samtools option

Navigate to the SAMTools executable on your computer, click Open, and then click Next. A new dialog appears asking about mapping the file to sequences.

In order to view the BAM file, the project must contain the sequences (e.g. accessions or chromosomes/scaffolds) that are referred to in the BAM file. Genome Workbench automatically finds sequences from NCBI that are referenced by GenBank or RefSeq accessions. If the BAM file uses a different style of sequence identifiers, the map assembly function allows Genome Workbench to convert them into NCBI assembly identifiers.

This example requires mapping, since the reference accessions in the BAM file are not typical GenBank/RefSeq accessions. (Note: if you want to check which sequences are referenced in the BAM file, click the Next button in the mapping dialog to see them, and then use the Back button to return to the mapping dialog.)
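The referenced sequence names can also be inspected outside Genome Workbench. The Python sketch below reads them straight from the BAM header using only the standard library. It relies on two facts from the SAM/BAM specification: a BAM file is a series of gzip-compatible blocks, and its header stores the magic bytes, the header text, and a list of reference names and lengths. The function name is hypothetical.

```python
import gzip
import struct


def read_bam_references(path):
    """Return (header_text, [(name, length), ...]) from a BAM file header."""
    # BGZF, the BAM container, is a series of gzip members, so gzip can read it.
    with gzip.open(path, "rb") as fh:
        if fh.read(4) != b"BAM\x01":
            raise ValueError("not a BAM file")
        (l_text,) = struct.unpack("<I", fh.read(4))
        text = fh.read(l_text).rstrip(b"\x00").decode("utf-8", "replace")
        (n_ref,) = struct.unpack("<I", fh.read(4))
        refs = []
        for _ in range(n_ref):
            (l_name,) = struct.unpack("<I", fh.read(4))
            # Names are stored NUL-terminated; l_name includes the terminator.
            name = fh.read(l_name).rstrip(b"\x00").decode("ascii")
            (l_ref,) = struct.unpack("<I", fh.read(4))
            refs.append((name, l_ref))
    return text, refs
```

Names such as chr1 and chr2 in the returned list indicate UCSC-style identifiers, which is the case that needs mapping. If the header text contains an @HD line, its SO field reports the declared sort order.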

Open BAM accessions in bam file

In our test file, the chromosomes are named chr1, chr2, etc. (the UCSC style of sequence identifiers), so we need to map them to the corresponding GenBank/RefSeq accessions for the particular assembly.
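Conceptually, this mapping is a lookup table from UCSC-style names to accessions. The Python sketch below builds such a table from the text of a tab-delimited assembly report. It assumes the report has a commented header line with columns named RefSeq-Accn and UCSC-style-name and uses "na" for missing values. That layout is an assumption about the input file, and Genome Workbench may obtain its mapping in a different way.

```python
def ucsc_to_refseq(report_text):
    """Map UCSC-style names (chr1, ...) to RefSeq accessions from assembly-report text."""
    header = None
    mapping = {}
    for line in report_text.splitlines():
        if line.startswith("#"):
            # The column header is assumed to be a comment line naming both columns.
            fields = line.lstrip("# ").split("\t")
            if "RefSeq-Accn" in fields and "UCSC-style-name" in fields:
                header = fields
            continue
        if header is None or not line.strip():
            continue
        row = dict(zip(header, line.split("\t")))
        ucsc, refseq = row.get("UCSC-style-name"), row.get("RefSeq-Accn")
        if ucsc and refseq and ucsc != "na" and refseq != "na":
            mapping[ucsc] = refseq
    return mapping
```

Sequences without a RefSeq accession are left out of the table, so a lookup for them fails visibly instead of returning a placeholder.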

Add a checkmark to the Use Mapping check box, click on the Find Assembly button, and in the Select Assembly dialog type hg18. Then click on the Find Assembly button, select the RefSeq radio button, select NCBI36 (hg18) in the table, and click on the OK button.

Open BAM select assembly dialog

See that mapping information was added to the Open BAM dialog.

Open BAM mapping

Click Next and see RefSeq accessions instead of chr1, chr2, etc:

Open BAM mapped accessions

All accessions are selected by default, but you can unselect/select any of them and only the selected accessions will be added to the new project. Click Next and Finish.

SAMTools can take a couple of minutes to process this data. You can follow the progress in the task view window.

Task view for BAM loading

Once it is finished, a new project with BAM data will be created in the Project Tree View.

Open any molecule in the project in the Graphical Sequence View and see the BAM alignment track among the Alignments tracks. All the standard Genome Workbench track settings and navigation tools are available (see the Basic Operation tutorial and scenario 1/step 3 of this tutorial). If you zoom in far enough, you will see the pileup graph and individual reads aligned to the anchored reference sequence.

BAM loaded in project and GSV

Step 5: BAM data for SRA run accessions

Run accessions from the SRA database can be visualized in Genome Workbench if they are aligned to a GenBank or RefSeq sequence. To find such data, search the NCBI SRA page with a query such as ("SARS-CoV-2"[Organism]) AND "aligned data"[Properties].

You can load an SRA run accession (SRR, ERR, or DRR) into Genome Workbench. As an example, we selected experiment SRX9471388, run SRR13020989.

Paste the SRA run accession into the Open BAM dialog:

Open SRR accession

Click Next and see that Genome Workbench has detected that the BAM file references MN908947 (a GenBank accession).

Open SRR referenced sequence covid2

Click Next and Finish. See that a new Genome Workbench project has been created. Open the Graphical View and see the BAM alignment track. Zoom in/out, zoom to the sequence level, and adjust track settings if desired.

GSV with SRR alignment

Step 6: Export BAM file as a table

From the Graphical Sequence Viewer, zoom to the desired location and select a range of interest.

GSV with SRR region selected

Right-click the selected range and click the Export Data option in the context menu.

Context menu with Export Data option

An alignment export dialog opens. Note that BAM files are stored as alignments, so you need to select “Alignment Table File” in the list on the left. Select the desired location in the main section and name the target file. If you need to change the default export location, use the folder button. Click the Next button.

Export Alignment table dialog

In the next screen, select fields from the alignment file to export.

Export Alignment table Select fields

Click the Finish button. Your file will be exported.

The exported file can be opened in a spreadsheet program like Excel for further use.
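The exported table can also be read by a script. The Python sketch below assumes the export is a tab-delimited text file whose first row holds the column names. That is an assumption about the file layout, so check the file and change the delimiter if yours differs. The column names depend on the fields you selected during export, so none are hard-coded here.

```python
import csv


def read_alignment_table(path, delimiter="\t"):
    """Load the exported table as a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh, delimiter=delimiter))


def column_values(rows, column):
    """Collect one column, skipping rows where it is missing or empty."""
    return [row[column] for row in rows if row.get(column)]
```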

Exported Alignment table view

Step 7: Finished!

Congratulations! You now know how to open and manipulate several different flavors of BAM files in Genome Workbench.

Current Version is 3.7.1 (released October 13, 2021)

Last updated: 2021-02-18T18:29:39Z