
sam2lca uses two different types of databases:

  • a taxonomy database, to infer the Lowest Common Ancestor (LCA) and retrieve the names and lineage associated with a taxonomic identifier (TAXID)

  • an acc2tax, or accession to TAXID, database, to match each sequence accession to a taxonomic identifier

For each of these databases, sam2lca offers different possibilities.

Taxonomy databases


If you’re not sure what to use, stick with the default (ncbi).


If you have bacterial and/or archaeal DNA sequencing data, you can alternatively use the GTDB taxonomy, which is more phylogenetically consistent than the NCBI taxonomy (see the GTDB article: 10.1093/nar/gkab776).

To use the GTDB database with sam2lca, use:

--taxonomy gtdb --acc2tax gtdb_r207

This will work if you align your sequencing data against the gtdb_genomes_reps genomes.

As of 20/04/2022, only the latest GTDB release (r207) is available. For other (past or future) releases, please have a look at gtdb_to_taxdump and the custom taxonomy instructions below, or open an issue on the sam2lca GitHub repository.


You can use your own taxonomy database by providing the following files:

  • names.dmp

  • nodes.dmp

  • merged.dmp

For example:

sam2lca update-db --taxonomy my_custom_db_name --taxo_names names.dmp --taxo_nodes nodes.dmp --taxo_merged merged.dmp

Make sure that the taxonomic IDs match those in the accession2taxid file that you’re using!
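The consistency check above can be sketched in Python before building the database. This is a minimal, hedged example and not part of sam2lca: the function names are hypothetical, it assumes the taxonomy files follow the NCBI taxdump layout (fields separated by "|", TAXID in the first field), and it assumes the accession2taxid file uses the NCBI header columns accession, accession.version, taxid and gi. Treating old IDs from merged.dmp as known is also an assumption.

```python
import csv
import gzip


def load_known_taxids(nodes_path, merged_path=None):
    """Collect every TAXID in nodes.dmp, plus old IDs listed in merged.dmp."""
    known = set()
    for path in (nodes_path, merged_path):
        if path is None:
            continue
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                # Assumption: NCBI taxdump layout, first "|"-separated field is the TAXID.
                first_field = line.split("|", 1)[0].strip()
                if first_field:
                    known.add(int(first_field))
    return known


def find_unknown_taxids(acc2taxid_path, known_taxids):
    """Return {accession.version: taxid} for TAXIDs absent from the taxonomy."""
    opener = gzip.open if str(acc2taxid_path).endswith(".gz") else open
    unknown = {}
    with opener(acc2taxid_path, "rt", encoding="utf-8") as handle:
        # Assumption: tab-separated file with the NCBI accession2taxid header line.
        for row in csv.DictReader(handle, delimiter="\t"):
            taxid = int(row["taxid"])
            if taxid not in known_taxids:
                unknown[row["accession.version"]] = taxid
    return unknown
```

An empty result from find_unknown_taxids suggests that every accession maps to a TAXID present in the custom taxonomy; any entry it returns is worth fixing before running sam2lca update-db.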

acc2tax - accession to TAXID databases

Nucleotide databases

Protein databases

  • prot for protein sequences, made of:

    • prot : protein sequence records which have GI identifiers

    • pdb : protein sequence records from the Protein Data Bank

Test database

  • test : local database to test sam2lca

Custom database

With sam2lca, you can provide a custom database to map accession numbers to TAXIDs.

To do so, pass a JSON file to the sam2lca update-db subcommand with the --acc2tax_json flag.

For example:

sam2lca update-db --acc2tax plant_markers --acc2tax_json acc2tax.json

This JSON file should be formatted as below:

    {
        "mapfiles": {
            "[name_of_mapping]": [
                "path/url_to_compressed_accession2taxid.gz file"
            ]
        },
        "mapmd5": {
            "[name_of_mapping]": [
                "path/url_to_compressed_accession2taxid.gz md5sumfile"
            ]
        },
        "map_db": {
            "[name_of_mapping]": "Name of custom.db"
        }
    }

An example JSON file (the default acc2tax.json) can be found here: acc2tax.json
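If you prefer to generate this file rather than write it by hand, the structure can be produced with Python's json module. The sketch below is illustrative only: build_acc2tax_config is a hypothetical helper, the file paths and the database name are placeholders, and the check that each mapping file has one matching md5 file is an assumption rather than a documented sam2lca requirement.

```python
import json


def build_acc2tax_config(name, map_files, md5_files, db_name):
    """Return a dict with the three top-level keys of the acc2tax JSON file."""
    # Assumption: one md5 file per mapping file, listed in the same order.
    if len(map_files) != len(md5_files):
        raise ValueError("each mapping file needs a matching md5 file")
    return {
        "mapfiles": {name: list(map_files)},
        "mapmd5": {name: list(md5_files)},
        "map_db": {name: db_name},
    }


# Placeholder values for illustration; replace them with your own paths or URLs.
config = build_acc2tax_config(
    name="plant_markers",
    map_files=["path/to/plant_markers.accession2taxid.gz"],
    md5_files=["path/to/plant_markers.accession2taxid.gz.md5"],
    db_name="plant_markers.db",
)

# Serialize with indentation; save this text as acc2tax.json.
config_text = json.dumps(config, indent=4)
print(config_text)
```

The mapping name used as the key (plant_markers here) is the value that would then be passed to --acc2tax.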

Working offline

To set up the sam2lca databases, you need an internet connection. If you’re working on a system without internet access, you can set up the databases on another computer that has it, and then transfer the sam2lca database directory to the offline system.
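One way to move the database directory is to pack it into a single archive on the connected computer and unpack it on the offline one. The following Python sketch uses only the standard library; pack_database and unpack_database are hypothetical helper names, and the location of the database directory is an assumption (check where your installation stores it, commonly a directory such as ~/.sam2lca, before running anything like this).

```python
import shutil
from pathlib import Path


def pack_database(db_dir, archive_base):
    """Bundle a database directory into a .tar.gz archive and return its path."""
    db_dir = Path(db_dir).expanduser()
    if not db_dir.is_dir():
        raise FileNotFoundError(f"database directory not found: {db_dir}")
    # Archive the directory itself (not only its contents) so that
    # unpacking recreates it under the same name.
    return shutil.make_archive(
        str(archive_base),
        "gztar",
        root_dir=str(db_dir.parent),
        base_dir=db_dir.name,
    )


def unpack_database(archive_path, target_parent):
    """Extract the archive on the offline machine under target_parent."""
    target_parent = Path(target_parent).expanduser()
    target_parent.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive_path), extract_dir=str(target_parent))
    return target_parent
```

For example, pack_database("~/.sam2lca", "sam2lca_db_backup") on the connected computer would produce sam2lca_db_backup.tar.gz, which can be copied over and restored with unpack_database("sam2lca_db_backup.tar.gz", "~") on the offline system, assuming the same directory layout is wanted on both machines.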