Databases

sam2lca uses two different types of databases:

  • a taxonomy database to infer the Lowest Common Ancestor (LCA) and retrieve the names and lineage associated with a taxonomic identifier (TAXID)

  • an acc2tax (accession to TAXID) database, to match each sequence accession to a taxonomic identifier

For each of these databases, sam2lca offers different possibilities.
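To illustrate how the two databases fit together, here is a minimal sketch, not sam2lca's actual implementation. It assumes a toy acc2tax mapping and a toy taxonomy stored as a child-to-parent dictionary; all accessions, TAXIDs, and function names are made up for the example.

```python
# Toy data, invented for illustration; not real sam2lca database content.
# acc2tax database: sequence accession -> TAXID
acc2tax = {"ACC_1": 11, "ACC_2": 12, "ACC_3": 21}
# taxonomy database: TAXID -> parent TAXID (the root, 1, is its own parent)
parents = {1: 1, 10: 1, 20: 1, 11: 10, 12: 10, 21: 20}


def lineage(taxid, parents):
    """Return the path from taxid up to the root, starting with taxid."""
    path = [taxid]
    while parents[taxid] != taxid:
        taxid = parents[taxid]
        path.append(taxid)
    return path


def lowest_common_ancestor(taxids, parents):
    """Return the deepest TAXID shared by the lineages of all taxids.

    Expects a non-empty list of TAXIDs that are all present in parents.
    """
    common = set(lineage(taxids[0], parents))
    for taxid in taxids[1:]:
        common &= set(lineage(taxid, parents))
    # Walking up from the first TAXID, the first shared node is the LCA.
    for node in lineage(taxids[0], parents):
        if node in common:
            return node


def read_lca(accessions):
    """Map the accessions a read aligned to onto TAXIDs, then take their LCA."""
    return lowest_common_ancestor([acc2tax[a] for a in accessions], parents)
```

With this toy data, a read aligned to ACC_1 and ACC_2 is assigned TAXID 10 (their shared parent), while a read aligned to ACC_1 and ACC_3 falls back to the root, TAXID 1.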

Taxonomy databases

ncbi

If you’re not sure what to use, stick with the default (ncbi).

gtdb

If you have bacterial and/or archaeal DNA sequencing data, you can alternatively choose to use the GTDB taxonomy, which is more phylogenetically consistent than the NCBI taxonomy (see the GTDB article here: 10.1093/nar/gkab776).

To use the GTDB database with sam2lca, use:

--taxonomy gtdb --acc2tax gtdb_r207

This will work if you align your sequencing data against the gtdb_genomes_reps genomes.

As of 20/04/2022, only the latest GTDB release (r207) is available. For other (past or future) releases, please have a look at gtdb_to_taxdump and see the custom section below, or open an issue on the sam2lca GitHub repository.

custom

You can use your own taxonomy database by providing the following files:

  • names.dmp

  • nodes.dmp

  • merged.dmp

For example:

sam2lca update-db --taxonomy my_custom_db_name --taxo_names names.dmp --taxo_nodes nodes.dmp --taxo_merged merged.dmp

Make sure that the taxonomic IDs match those in the accession2taxid database that you’re using!
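One way to check this before building the database is sketched below. It is not part of sam2lca. It assumes that the .dmp files follow the NCBI taxdump layout (fields separated by "|", TAXID in the first column) and that the accession2taxid file is gzip-compressed, tab-separated, starts with a header line, and holds the TAXID in its third column, as in the NCBI accession2taxid files. The function names are hypothetical.

```python
import gzip


def read_dmp_ids(path):
    """Collect the integer IDs found in the first column of an NCBI-style .dmp file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            first_field = line.split("|")[0].strip()
            if first_field:  # skip blank lines
                ids.add(int(first_field))
    return ids


def missing_taxids(acc2taxid_gz, nodes_dmp, merged_dmp=None):
    """Return the TAXIDs used in the accession2taxid file but absent from the taxonomy."""
    known = read_dmp_ids(nodes_dmp)  # current TAXIDs
    if merged_dmp:
        known |= read_dmp_ids(merged_dmp)  # old TAXIDs that were merged into others
    missing = set()
    with gzip.open(acc2taxid_gz, "rt") as handle:
        next(handle, None)  # assumption: the first line is a header
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip malformed or empty lines
            taxid = int(fields[2])  # assumption: TAXID is the third column
            if taxid not in known:
                missing.add(taxid)
    return missing
```

An empty result from missing_taxids("my.accession2taxid.gz", "nodes.dmp", "merged.dmp") (file names are placeholders) suggests that every accession points to a TAXID the custom taxonomy knows about.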

acc2tax - accession to TAXID databases

Nucleotide databases

Protein databases

  • prot for protein sequences, made of:

    • prot : protein sequence records which have GI identifiers

    • pdb : protein sequence records from the Protein Data Bank

Test database

  • test : local database to test sam2lca

Custom database

With sam2lca, you can provide a custom database to map accession numbers to TAXIDs.

To do so, pass a JSON file to the sam2lca update-db subcommand with the --acc2tax_json flag.

For example:

sam2lca update-db --acc2tax plant_markers --acc2tax_json acc2tax.json

This JSON file should be formatted as below:

{
    "mapfiles": {
        "[name_of_mapping]": [
            "path/url_to_compressed_accession2taxid.gz file"
        ]
    },
    "mapmd5": {
        "[name_of_mapping]": [
            "path/url_to_compressed_accession2taxid.gz md5sumfile"
        ]
    },
    "map_db": {
        "[name_of_mapping]": "Name of custom.db"
    }
}

An example JSON file (the default acc2tax.json) can be found here: acc2tax.json
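If you prefer to generate this file rather than write it by hand, the sketch below builds the three keys shown above for a single mapping. It is only an illustration. The helper names are hypothetical, and the md5 file layout ("<hexdigest>  <filename>", as written by the md5sum tool) is an assumption that should be checked against what sam2lca expects.

```python
import hashlib
import json
from pathlib import Path


def write_md5_file(gz_path):
    """Write '<hexdigest>  <filename>' next to gz_path and return the new file's path."""
    gz_path = Path(gz_path)
    digest = hashlib.md5()
    with open(gz_path, "rb") as handle:
        # Read in 1 MiB chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    md5_path = gz_path.with_name(gz_path.name + ".md5")
    # Assumption: md5sum-style content; adapt if another layout is required.
    md5_path.write_text(f"{digest.hexdigest()}  {gz_path.name}\n")
    return md5_path


def build_acc2tax_config(name, gz_paths, db_name):
    """Build the dictionary with the mapfiles, mapmd5 and map_db keys shown above."""
    gz_paths = [Path(p) for p in gz_paths]
    return {
        "mapfiles": {name: [str(p) for p in gz_paths]},
        "mapmd5": {name: [str(write_md5_file(p)) for p in gz_paths]},
        "map_db": {name: db_name},
    }


def write_acc2tax_json(config, out_path="acc2tax.json"):
    """Serialize the configuration to a JSON file."""
    Path(out_path).write_text(json.dumps(config, indent=4) + "\n")
```

For instance, build_acc2tax_config("plant_markers", ["plant_markers.accession2taxid.gz"], "plant_markers.db") followed by write_acc2tax_json(...) would produce a file usable with the update-db command above; the .gz and .db file names here are placeholders.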

Working offline

To set up the sam2lca databases, you will need an internet connection. If you’re working on a system without internet access, you can set up the databases on another computer that has it, and then transfer the sam2lca database directory to the offline system.
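The transfer itself can be done with any file-copy tool. As one possible approach, the sketch below packs the database directory into a single archive on the connected computer and unpacks it on the offline one. The function names and all paths are placeholders: replace them with the actual location of your sam2lca database directory.

```python
import shutil
from pathlib import Path


def pack_db(db_dir, archive_base):
    """Pack db_dir into '<archive_base>.tar.gz' and return the archive path."""
    db_dir = Path(db_dir).expanduser()
    return shutil.make_archive(
        str(archive_base),
        "gztar",
        root_dir=db_dir.parent,  # archive paths are relative to the parent directory
        base_dir=db_dir.name,    # so the archive holds one top-level directory
    )


def unpack_db(archive, target_parent):
    """Unpack the archive so the database directory is recreated inside target_parent."""
    target_parent = Path(target_parent).expanduser()
    target_parent.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive), str(target_parent))


# On the computer with internet access (placeholder paths):
#   pack_db("/path/to/sam2lca_db", "/tmp/sam2lca_db_backup")
# After copying the archive to the offline system:
#   unpack_db("/tmp/sam2lca_db_backup.tar.gz", "/path/to")
```

Only unpack archives that you created yourself, since extracting an archive writes whatever files it contains to disk.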