Pankegg_make_db.py
Pankegg’s data parser that constructs the SQL database that is later used in the web application.
Input file format
To create the SQL database, you will need a CSV file listing all your samples and their corresponding input directories or files.
The expected CSV format uses the following header:
Sample name,Annotation_dir,classification_dir,Checkm2_dir
Each subsequent line represents a sample, and should look like one of the following examples:
- For Sourmash classification:
SAMPLE1,/Path/to/SAMPLE1/bin_annotation/*.annotations.tsv,/Path/to/SAMPLE1/sourmash/*,/Path/to/SAMPLE1/checkm2_dir/quality_report.tsv
SAMPLE2,/Path/to/SAMPLE2/bin_annotation/*.annotations.tsv,/Path/to/SAMPLE2/sourmash/*,/Path/to/SAMPLE2/checkm2_dir/quality_report.tsv- For GTDB-TK classification (when using the
--gtdbtkflag):
SAMPLE1,/Path/to/SAMPLE1/bin_annotation/*.annotations.tsv,/Path/to/SAMPLE1/gtdb_results/*.summary.tsv,/Path/to/SAMPLE1/checkm2_dir/quality_report.tsv
SAMPLE2,/Path/to/SAMPLE2/bin_annotation/*.annotations.tsv,/Path/to/SAMPLE2/gtdb_results/*.summary.tsv,/Path/to/SAMPLE2/checkm2_dir/quality_report.tsvCreate the input file
You can create this input CSV file automatically with a bash loop, depending on how your data are organized:
If your results are grouped by sample:
Dir/
├── SAMPLEID1/
│ ├── bin_annotation/
│ ├── sourmash/
│ ├── gtdb_results/
│ └── checkm2_dir/
└── SAMPLEID2/
├── bin_annotation/
├── sourmash/
├── gtdb_results/
└── checkm2_dir/Run the following script to include Sourmash annotation:
echo "Sample name,Annotation_dir,classification_dir,Checkm2_dir" > samples.csv
for sample in Dir/*; do
name=$(basename "$sample");
ann_path="$sample/bin_annotation/*.annotations.tsv";
class_path="$sample/sourmash/*";
checkm2_path="$sample/checkm2_dir/quality_report.tsv";
echo "$name,$ann_path,$class_path,$checkm2_path" >> samples.csv;
doneRun the following script to include GTDBTK annotation:
echo "Sample name,Annotation_dir,classification_dir,Checkm2_dir" > samples.csv
for sample in Dir/*; do
name=$(basename "$sample");
ann_path="$sample/bin_annotation/*.annotations.tsv";
class_path="$sample/gtdb_results/*.summary.tsv";
checkm2_path="$sample/checkm2_dir/quality_report.tsv";
echo "$name,$ann_path,$class_path,$checkm2_path" >> samples.csv;
doneIf your results are grouped by tool, then sample:
Dir/
├── bin_annotation/
│ ├── SAMPLEID1/
│ ├── SAMPLEID2/
│ └── ...
├── sourmash/
│ ├── SAMPLEID1/
│ ├── SAMPLEID2/
│ └── ...
├── gtdb_results/
│ ├── SAMPLEID1/
│ ├── SAMPLEID2/
│ └── ...
└── checkm2_dir/
├── SAMPLEID1/
├── SAMPLEID2/
└── ...Run the following script to include Sourmash annotation:
echo "Sample name,Annotation_dir,classification_dir,Checkm2_dir" > samples.csv
for sample in Dir/bin_annotation/*; do
name=$(basename "$sample");
ann_path="Dir/bin_annotation/$name/*.annotations.tsv";
class_path="Dir/sourmash/$name/*";
checkm2_path="Dir/checkm2_dir/$name/quality_report.tsv";
echo "$name,$ann_path,$class_path,$checkm2_path" >> samples.csv;
doneRun the following script to include GTDBTK annotation:
echo "Sample name,Annotation_dir,classification_dir,Checkm2_dir" > samples.csv
for sample in Dir/bin_annotation/*; do
name=$(basename "$sample");
ann_path="Dir/bin_annotation/$name/*.annotations.tsv";
class_path="Dir/gtdb_results/$name/*.summary.tsv";
checkm2_path="Dir/checkm2_dir/$name/quality_report.tsv";
echo "$name,$ann_path,$class_path,$checkm2_path" >> samples.csv;
doneParameters Make DB
-i, --input
Path to the CSV file listing all samples and their input files/directories (required).-o, --output
Name of the output database file (without the extension). The default ispankegg. The.dbextension is automatically added to the output name.--output_dir
Directory the database will be written to. The default is./db_output.--gtdbtk
Use this flag if your classification files were generated with GTDB-TK instead of Sourmash.
Output Make DB
The output is an SQLite database (*.db) that can be opened with tools like sqlite3, but is best used with pankegg_app.py for interactive browsing.
The database contains the following tables:
taxonomybinmapkeggbin_map_keggbin_mapmap_keggbin_extrabin_extra_keggsample
Each table stores specific information related to bins, pathways, taxonomy, and annotation results for easy querying and visualization.
Run with Testdata
To verify your installation and familiarize yourself with Pankegg, you can run a test using provided data. Download the example archive, unzip it, and generate test databases using the included CSV files:
Download the test data archive from OSF:
wget https://osf.io/download/5v3zc/ -O pankegg_test_data.zipOR
curl -L -o pankegg_test_data.zip https://osf.io/download/5v3zc/Unzip the archive:
This will create a directory called pankegg_test_data.
unzip pankegg_test_data.zipCreate a test database for Sourmash classification:
python pankegg_make_db.py -i pankegg_test_data/sourmash_example.csv -o test_sourmash --output_dir pankegg_test_dataCreate a test database for GTDB-TK classification:
python pankegg_make_db.py -i pankegg_test_data/gtdbtk_example.csv -o test_gtdbtk --output_dir pankegg_test_data --gtdbtkAfter running these commands, you should find newly created test_sourmash.db and test_gtdbtk.db inside the pankegg_test_data directory. This is in addition to the already existing sourmash_example.db and gtdbtk_example.db files in the same directory.
The existing and newly generated respective databases should be identical, so the validity of pankegg_make_db.py can be tested by comparing the two files.
Troubleshooting
For more details or troubleshooting, please consult the Reporting Bugs & Contributing section.