Databases

sam2lca uses two different types of databases:

  • a taxonomy database to infer the Lowest Common Ancestor (LCA) and retrieve the names and lineage associated with a taxonomic identifier (TAXID)

  • an acc2tax (accession to TAXID) database, to map each sequence accession to a taxonomic identifier
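To make the role of each database concrete, here is a minimal sketch of how an accession-to-TAXID mapping and a taxonomy tree combine to give an LCA. It is an illustration only, not sam2lca's implementation: the accessions are hypothetical, the lineage is shortened, and the function names are made up for this example.

```python
# Toy illustration only: hypothetical accessions and a shortened lineage.
# This is not sam2lca's implementation.

# acc2tax database: sequence accession -> TAXID
acc2tax = {"ACC_A": 562, "ACC_B": 622}  # hypothetical accessions

# taxonomy database: TAXID -> parent TAXID (the root points to itself)
parent = {
    1: 1,      # root
    543: 1,    # family level (intermediate ranks omitted for brevity)
    561: 543,  # genus of ACC_A
    562: 561,  # species of ACC_A
    620: 543,  # genus of ACC_B
    622: 620,  # species of ACC_B
}


def lineage(taxid, parent):
    """Return the path from taxid up to the root, taxid first."""
    path = [taxid]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path


def lowest_common_ancestor(taxids, parent):
    """Return the deepest TAXID shared by the lineages of all taxids."""
    taxids = list(taxids)
    common = set(lineage(taxids[0], parent))
    for taxid in taxids[1:]:
        common &= set(lineage(taxid, parent))
    # Walking up from the first taxid, the first shared node is the LCA.
    for node in lineage(taxids[0], parent):
        if node in common:
            return node


# A read aligned to both accessions is assigned to their common ancestor.
hits = [acc2tax[acc] for acc in ("ACC_A", "ACC_B")]
print(lowest_common_ancestor(hits, parent))  # prints 543
```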

For each of these databases, sam2lca offers different possibilities.

Taxonomy databases

ncbi

If you’re not sure what to use, stick with the default (ncbi).

gtdb

If you have bacterial and/or archaeal DNA sequencing data, you can alternatively use the GTDB taxonomy, which is more phylogenetically consistent than the NCBI taxonomy (see the GTDB article here: 10.1093/nar/gkab776).

To use the GTDB database with sam2lca, use:

--taxonomy gtdb --acc2tax gtdb_r207

This will work if you align your sequencing data against the gtdb_genomes_reps genomes.

As of 20/04/2022, only the latest GTDB release (r207) is available. For other (past or future) releases, have a look at gtdb_to_taxdump and the custom section below, or open an issue on the sam2lca GitHub repository.

custom

You can use your own taxonomy database by providing the following files:

  • names.dmp

  • nodes.dmp

  • merged.dmp

For example:

sam2lca update-db --taxonomy my_custom_db_name --taxo_names names.dmp --taxo_nodes nodes.dmp --taxo_merged merged.dmp

Make sure that the taxonomic IDs match those in the accession2taxid files that you’re using!
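One way to check this before building the database is sketched below. It assumes that nodes.dmp follows the NCBI taxdump layout (TAXID in the first "|"-separated column) and that the accession2taxid file is gzip-compressed, tab-separated, starts with a header line, and has the TAXID in its third column. The function names are hypothetical and not part of sam2lca.

```python
import gzip


def read_node_taxids(nodes_path):
    """Collect the TAXIDs (first column) of a taxdump-style nodes.dmp file."""
    taxids = set()
    with open(nodes_path) as handle:
        for line in handle:
            if line.strip():
                taxids.add(int(line.split("|")[0].strip()))
    return taxids


def missing_taxids(acc2taxid_path, node_taxids):
    """Return TAXIDs used in the accession2taxid file but absent from nodes.dmp."""
    missing = set()
    with gzip.open(acc2taxid_path, "rt") as handle:
        next(handle, None)  # skip the header line (assumed to be present)
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # ignore malformed or empty lines
            taxid = int(fields[2])
            if taxid not in node_taxids:
                missing.add(taxid)
    return missing
```

For example, `missing_taxids("my.accession2taxid.gz", read_node_taxids("nodes.dmp"))` (hypothetical file names) should return an empty set when the two files are consistent.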

acc2tax - accession to TAXID databases

Nucleotide databases

  • nucl for nucleotide/DNA sequences, made of:

    • nucl_wgs : nucleotide sequence records of type WGS or TSA

    • nucl_gb : nucleotide sequence records that are not WGS or TSA

  • plant_markers for plant identification based on plant-specific markers, made of:

    • angiosperms353 : Angiosperms353 marker data extracted from treeoflife.kew.org, with sequence headers reformatted as follows:

      Original fasta header

      >5821 Gene_Name:REV7 Species:Cyperus_laevigatus Repository:INSDC Sequence_ID:ERR3650073
      

      Reformatted fasta header

      >5821_Cyperus_laevigatus Gene_Name:REV7 Repository:INSDC Sequence_ID:ERR3650073
      

      This reformatting is necessary to ensure the uniqueness of sequence identifiers. The fasta file with reformatted headers (dumped from treeoflife.kew.org on October 21st, 2021) is available for download here: angiosperms353_markers.fa.gz

    • ITS : ITS plant marker data extracted from the planITS project. The ITS database is available here: ITS.fa.gz

    • rbcL : rbcL plant marker data extracted from 10.3732/apps.1600110, using the version updated on 09.07.2021, shared by the authors here. Fasta headers were rewritten to ensure the uniqueness of sequence identifiers, and the database is available here: rbcl.fa.gz

      Original fasta header

      >123456 Grabowskia glauca
      

      Reformatted fasta header

      >rbcL_0_Grabowskia_glauca
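The header rewrites shown above can be sketched in Python as follows. This is not the script used to build the published fasta files: the function names are hypothetical, and treating the number in the rbcL identifier as a running index supplied by the caller is an assumption drawn from the single example above.

```python
import re


def reformat_angiosperms353_header(header):
    """Move the species name into the sequence identifier.

    '>ID Gene_Name:X Species:Genus_species ...' becomes
    '>ID_Genus_species Gene_Name:X ...'
    """
    match = re.match(r">(\S+)\s+(.*)", header.strip())
    if match is None:
        raise ValueError(f"unexpected header: {header!r}")
    seq_id, description = match.groups()
    fields = description.split()
    species = [f.split(":", 1)[1] for f in fields if f.startswith("Species:")]
    if not species:
        raise ValueError(f"no Species field in header: {header!r}")
    others = [f for f in fields if not f.startswith("Species:")]
    return f">{seq_id}_{species[0]} " + " ".join(others)


def reformat_rbcl_header(header, index):
    """Build '>rbcL_<index>_<Genus_species>' from '>ID Genus species'.

    Assumption: <index> is a running counter chosen by the caller.
    """
    _, _, name = header.strip().lstrip(">").partition(" ")
    return f">rbcL_{index}_" + "_".join(name.split())


print(reformat_angiosperms353_header(
    ">5821 Gene_Name:REV7 Species:Cyperus_laevigatus "
    "Repository:INSDC Sequence_ID:ERR3650073"
))
print(reformat_rbcl_header(">123456 Grabowskia glauca", 0))
```

Whatever scheme is used, the point is that the first whitespace-delimited token of each header ends up unique across the file, since that token is what an aligner typically reports as the reference name.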
      

Protein databases

  • prot for protein sequences, made of:

    • prot : protein sequence records which have GI identifiers

    • pdb : protein sequence records from the Protein Data Bank

Test database

  • test : local database to test sam2lca

Custom database

With sam2lca, you can provide a custom database to map accession numbers to TAXIDs.

To do so, sam2lca can accept a JSON file, with the --acc2tax_json flag in the sam2lca update-db subcommand in combination with --acc2tax custom.

For example:

sam2lca update-db --acc2tax custom --acc2tax_json acc2tax.json

This JSON file should be formatted as below:

{
    "mapfiles": {
        "[name_of_mapping]": [
            "path/url_to_compressed_accession2taxid.gz file"
        ]
    },
    "mapmd5": {
        "[name_of_mapping]": [
            "path/url_to_compressed_accession2taxid.gz md5sumfile"
        ]
    },
    "map_db": {
        "[name_of_mapping]": "Name of custom.db"
    }
}

An example JSON file can be found here: map_config.json
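If you prefer to generate this file programmatically, the sketch below builds the same three-key structure with Python's standard library. The mapping name, file paths, and database name are placeholders, and the helper names are hypothetical. Writing the checksum in the `md5sum` output style (hash, two spaces, file name) is an assumption about what sam2lca expects, so compare it against the example file before relying on it.

```python
import hashlib
import json
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_file(path, md5_path):
    """Write an md5sum-style checksum file (assumed format) for path."""
    with open(md5_path, "w") as handle:
        handle.write(f"{md5_of_file(path)}  {os.path.basename(path)}\n")


def build_acc2tax_config(name, map_paths, md5_paths, db_name):
    """Assemble the dictionary layout shown in the JSON template above."""
    return {
        "mapfiles": {name: list(map_paths)},
        "mapmd5": {name: list(md5_paths)},
        "map_db": {name: db_name},
    }


# Placeholder values: replace them with your own mapping name and files.
config = build_acc2tax_config(
    name="my_mapping",
    map_paths=["my.accession2taxid.gz"],
    md5_paths=["my.accession2taxid.gz.md5"],
    db_name="my_custom.db",
)
print(json.dumps(config, indent=4))
```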