Genome Annotation

Prokaryotic

Lecture

After you have de novo assembled your genome sequencing reads into contigs, it is useful to know what genomic features are on those contigs. The process of identifying and labelling those features is called genome annotation.

Prokka is a software tool that quickly annotates bacterial, archaeal and viral genomes, generating standard output files in GenBank, EMBL and GFF formats. It is a "wrapper": it brings together several pieces of software (from various authors), and so avoids "re-inventing the wheel".

Prokka finds and annotates features (both protein-coding regions and RNA genes, i.e. tRNA and rRNA) present on a sequence. It annotates protein-coding regions in two steps: first, protein-coding regions on the genome are identified using Prodigal; second, the function of each encoded protein is predicted by similarity to proteins in one of many protein or protein domain databases. More information about Prokka can be found here.
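As a rough illustration of the annotation step described above, a minimal Prokka run might look like the sketch below. The options --outdir, --prefix and --kingdom are standard Prokka options. The output directory name (prokka_out) and the prefix (m_genitalium) are placeholders chosen here. The input file name is borrowed from the eggNOG command later on this page, on the assumption that it is the same assembly. The linked tutorial may use different settings.

```shell
#!/bin/sh
# Sketch of a basic Prokka run. File and directory names are assumptions.
genome="m_genetalium_improved.fasta"   # assembled contigs (assumed input)
outdir="prokka_out"                    # hypothetical output directory
prefix="m_genitalium"                  # hypothetical prefix for output files

# Only run if Prokka is installed and the input file exists.
if command -v prokka >/dev/null 2>&1 && [ -f "$genome" ]; then
    prokka --outdir "$outdir" --prefix "$prefix" --kingdom Bacteria "$genome"
    prokka_status="ran"
else
    echo "prokka or $genome not found; skipping annotation" >&2
    prokka_status="skipped"
fi
```

If the run succeeds, the output files should appear in prokka_out, each named with the chosen prefix.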

Generating a prokkaryote annotation

Prokka Tutorial Here

The GFF and GBK files contain all of the information about the annotated features, in two different formats.
The .txt file contains a summary of the number of features annotated.
The .faa file contains the protein sequences of the genes annotated.
The .ffn file contains the nucleotide sequences of the genes annotated.
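To get a quick feel for these files from the command line, something like the following sketch may help. It assumes the GFF file is tab-separated GFF3 with the feature type in column 3 and, as Prokka writes it, a ##FASTA section at the end holding the contig sequences. The function names are made up for this example.

```shell
# Count annotated features by type (column 3) in a GFF3 file.
count_gff_features() {
    # Stop at the ##FASTA section, skip other '#' lines, tally the type column.
    awk -F '\t' '
        /^##FASTA/ { exit }
        /^#/       { next }
        NF >= 9    { n[$3]++ }
        END        { for (t in n) print t "\t" n[t] }
    ' "$1" | sort
}

# Count sequences in a FASTA file (.faa or .ffn): one '>' header per record.
count_fasta_records() {
    grep -c '^>' "$1"
}
```

Calling count_gff_features on the .gff file prints one line per feature type, and the totals can be compared with the summary in the .txt file. Calling count_fasta_records on the .faa file gives the number of predicted proteins.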

Alternate ending

Artemis is a graphical Java program to browse annotated genomes. Download it here and install it on your local computer.

Copy the .gff file produced by Prokka to your computer and open it with Artemis.

You will be overwhelmed and/or confused at first, and possibly permanently. Here are some tips:

There are 3 panels: feature map (top), sequence (middle), feature list (bottom)
Right-click on the bottom panel and select Show products
Zooming is done via the vertical scroll bars in the two top panels

Refining the prokkaryote annotation

Apollo Tutorial Here

Apollo practice

Bonus eggNOG annotation

Let's annotate the same genome with eggNOG-mapper:

emapper.py -m diamond --itype genome -i m_genetalium_improved.fasta -o annotation_eggnog_genome

Now let's annotate the protein sequences:

emapper.py -m diamond --itype proteins -i uniprot_mycoplasma_reviewed.faa -o annotation_eggnog_proteins

What do you see in the results? What other parameter could we tweak?
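One way to start exploring the results is to summarise the annotation tables from the command line. The sketch below assumes the usual eggNOG-mapper output layout: a tab-separated file named after the -o prefix with the suffix .emapper.annotations, comment lines starting with ##, and a single header line starting with #. The function names are invented for this example, and column names such as COG_category depend on the eggNOG-mapper version, so check the header of your own file first.

```shell
# Count annotated queries: every line that is neither blank nor a '#' line.
count_annotated_queries() {
    awk '!/^#/ && NF { n++ } END { print n + 0 }' "$1"
}

# Tally the values of one column, selected by its name in the header line.
tally_column() {
    awk -F '\t' -v col="$2" '
        /^##/  { next }                          # skip comment lines
        /^#/   { sub(/^#/, "")                   # header line: find the column
                 for (i = 1; i <= NF; i++) if ($i == col) c = i
                 next }
        c && NF { n[$c]++ }                      # data line: count the value
        END    { for (v in n) print v "\t" n[v] }
    ' "$1" | sort
}
```

For example, running count_annotated_queries on annotation_eggnog_genome.emapper.annotations and on annotation_eggnog_proteins.emapper.annotations shows how many sequences received an annotation in each run. Running tally_column on either file with COG_category as the second argument gives a rough functional breakdown, assuming that column is present.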