
What Running GALES Taught Me About Genome Annotation (and Patience)

April 2025 (725 Words, 5 Minutes)

genome annotation CWL GALES microbial genomics FASTA

“Science has taught me that everything is more complicated than we first assume, and that being able to derive happiness from discovery is a recipe for a beautiful life.”
— Hope Jahren, Lab Girl


From FASTA to Function

As part of my applied bioinformatics coursework, I used GALES, a prokaryotic genome annotation pipeline built on CWL (Common Workflow Language), to annotate a microbial genome starting from a raw FASTA file. This was my first run with genomic data, and it’s nice to finally produce results!

Where FASTA Files Come From

In a real lab setting, a FASTA file like the one I used might come from a DNA sequencing experiment. For example, a scientist might extract DNA from a bacterial sample, then sequence it using a platform like an Illumina MiSeq or NextSeq. The machine generates millions of short reads, which are then assembled into longer contiguous sequences (contigs) using an assembly tool.

The FASTA file I used was provided as part of the course and represents a prokaryotic genome — but you could easily try this out with something like E. coli by downloading it from the NCBI Genome database. Look for “Complete Genome” under assembly level and download the FASTA format to try it yourself.
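Before handing a FASTA file to any pipeline, it can help to check what is actually in it. Here is a minimal sketch that counts records and total bases. The file name and the tiny example sequences are made up for illustration, so swap in the path to your own download.

```shell
# Hypothetical tiny FASTA file so the sketch runs on its own;
# replace it with the path to a real genome download.
workdir=$(mktemp -d)
cat > "$workdir/example.fasta" <<'EOF'
>contig_1 example record
ATGCATGCAT
GCATGC
>contig_2 example record
TTGACA
EOF

# Count records: every FASTA record starts with a '>' header line.
count_records() {
    grep -c '^>' "$1"
}

# Sum the length of all non-header lines to get the total number of bases.
total_bases() {
    awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}

echo "records: $(count_records "$workdir/example.fasta")"
echo "bases:   $(total_bases "$workdir/example.fasta")"
```

For a complete bacterial genome you would expect a small number of records (the chromosome plus any plasmids), while a draft assembly may have many contigs.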

What GALES Does

GALES takes a prokaryotic genome sequence and automates the steps of:

Predicting coding sequences (CDS) using Prodigal, a fast and widely used tool for identifying protein-coding genes in microbial genomes
Translating DNA to protein sequences
Annotating predicted proteins using RAPSearch2, a faster alternative to BLAST that aligns protein sequences against reference databases
Using a curated protein reference database like SwissProt, which is part of the UniProt collection and provides high-quality, manually reviewed protein annotations
Outputting a .gff3 file and supporting files like .faa, .fna, and .blast reports

All of this is combined using CWL (Common Workflow Language), a standardized format for describing data analysis pipelines that helps make them reproducible and shareable across systems.

Terminal Blur

The first time I ran the pipeline, a blizzard of information flashed across my screen — way too fast for my brain to keep up. I realized partway through that I had forgotten to redirect the output to a file, so instead of saving it neatly, I just watched valuable results scroll past at terminal-speed.
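One general way to avoid losing output like that is to send everything through tee, which prints to the screen and writes a log at the same time. This is only a sketch: noisy_pipeline is a made-up stand-in function, not the real GALES command.

```shell
# Stand-in for a chatty pipeline run (hypothetical function, not GALES itself).
noisy_pipeline() {
    echo "step 1: predicting genes"
    echo "warning: something worth reading later" >&2
    echo "step 2: done"
}

log=$(mktemp)

# 2>&1 folds stderr into stdout; tee shows everything on screen
# and also writes a copy to the log file.
noisy_pipeline 2>&1 | tee "$log"

# Afterwards, search the saved log instead of scrolling back in the terminal.
grep -n "warning" "$log"
```

With a real command, you would replace noisy_pipeline with whatever you are running and point tee at a file name you will remember.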

# Correct way to save output:
blastp -query input.faa -db dbs/swissprot -out results.blast -outfmt 6

# Mini test to validate the command structure
head -n 20 input.faa > test_input.faa
blastp -query test_input.faa -db dbs/swissprot -out test_output.blast -outfmt 6

Running a small test like this first gives me a chance to spot mistakes early without wasting time or compute.
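One caveat with the mini test: head -n 20 counts lines, not sequences, so the last protein in the test file can get cut off partway when sequences wrap across several lines. A possible alternative is to take the first N complete records with awk. The function name first_records and the example proteins below are hypothetical.

```shell
# Print the first N complete FASTA records from a file.
# Usage: first_records N file.faa
first_records() {
    awk -v n="$1" '/^>/ { seen++ } seen > n { exit } { print }' "$2"
}

# Hypothetical three-protein file so the sketch runs on its own.
workdir=$(mktemp -d)
cat > "$workdir/input.faa" <<'EOF'
>protein_1
MKTAYIAKQR
QISFVKSHFS
>protein_2
MSDNELLK
>protein_3
MAAGTRLVVL
EOF

# Keep only the first two records, each with its full sequence.
first_records 2 "$workdir/input.faa" > "$workdir/test_input.faa"
grep -c '^>' "$workdir/test_input.faa"
```

The resulting test_input.faa can then go to the same blastp command as before.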

Running the Pipeline on My Genome

To run GALES, I used a virtual environment in WSL (Windows Subsystem for Linux). Docker was installed, but I ran GALES itself locally for flexibility.

GALES generated the expected output files, including:

prodigal.annotation.gff3
prodigal.annotation.faa
attributor.annotation.gff3
attributor.annotation.fna

The attributor.annotation.gff3 file was my main deliverable — a structured file listing gene positions and predicted functions.
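After a run, a quick check that each of those files exists and is not empty can catch a silently failed step. A minimal sketch follows. The output directory here is a temporary folder filled with placeholder files for illustration, so it is not where GALES actually writes its results.

```shell
# Report which expected outputs are present and non-empty in a directory.
# Returns non-zero if any file is missing or empty.
check_outputs() {
    dir=$1
    status=0
    for name in prodigal.annotation.gff3 prodigal.annotation.faa \
                attributor.annotation.gff3 attributor.annotation.fna; do
        if [ -s "$dir/$name" ]; then
            echo "ok      $name"
        else
            echo "MISSING $name"
            status=1
        fi
    done
    return "$status"
}

# Hypothetical output folder with one file deliberately left out.
outdir=$(mktemp -d)
for name in prodigal.annotation.gff3 prodigal.annotation.faa attributor.annotation.gff3; do
    echo "placeholder" > "$outdir/$name"
done

check_outputs "$outdir" || echo "at least one output needs a second look"
```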

📊 Quick Stats

Genes predicted: 4,338
Output file size: ~2.3 MB (GFF3)
Pipeline runtime: ~10–15 minutes on a local machine with 16 threads
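A count like the one above can be pulled straight from a GFF3 file by tallying rows of a given feature type (the third column). This is a sketch on a tiny made-up file. Which feature type best matches "genes predicted" in real GALES output is an assumption you would want to check against your own file.

```shell
# Hypothetical four-feature GFF3 file; printf keeps the tab separators explicit.
workdir=$(mktemp -d)
gff="$workdir/example.gff3"
{
    printf '##gff-version 3\n'
    printf 'contig_1\tprodigal\tgene\t10\t400\t.\t+\t.\tID=gene_1\n'
    printf 'contig_1\tprodigal\tCDS\t10\t400\t.\t+\t0\tID=cds_1;Parent=gene_1\n'
    printf 'contig_1\tprodigal\tgene\t900\t1500\t.\t-\t.\tID=gene_2\n'
    printf 'contig_1\tprodigal\tCDS\t900\t1500\t.\t-\t0\tID=cds_2;Parent=gene_2\n'
} > "$gff"

# Count non-comment rows whose feature type (column 3) matches.
# Usage: count_features file.gff3 TYPE
count_features() {
    awk -F '\t' -v type="$2" '!/^#/ && $3 == type { n++ } END { print n + 0 }' "$1"
}

echo "genes: $(count_features "$gff" gene)"
echo "CDS:   $(count_features "$gff" CDS)"
```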

What a GFF3 Annotation Looks Like

Here’s a small sample from the huge attributor.annotation.gff3 file I produced — this is where predicted genes, coding sequences, and annotations are recorded line by line. I like to open this kind of file in Sublime Text, which is my go-to text editor.

[Image: gff3 file sample]

Each row represents a feature (like a predicted gene or protein), and the columns include:

The sequence ID (e.g., contig name)
The source of the prediction (e.g., prodigal)
The feature type (e.g., CDS = coding sequence)
Start and end positions on the genome
Strand (+ or -)
Annotation info, like a predicted product name
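To make the column layout concrete, here is a small sketch that prints those fields with labels. GFF3 rows have nine tab-separated columns. The two not listed above are score (column 6) and phase (column 8), which this sketch skips. The example row is made up, and the product_name attribute key is an assumption rather than something confirmed from the GALES output.

```shell
# Hypothetical one-feature GFF3 file; printf keeps the tab separators explicit.
workdir=$(mktemp -d)
gff="$workdir/example.gff3"
{
    printf '##gff-version 3\n'
    printf 'contig_1\tprodigal\tCDS\t10\t400\t.\t+\t0\tID=cds_1;product_name=hypothetical protein\n'
} > "$gff"

# Print selected GFF3 columns with labels, skipping comment lines.
# Columns: 1 seqid, 2 source, 3 type, 4 start, 5 end, 7 strand, 9 attributes.
describe_features() {
    awk -F '\t' '!/^#/ && NF == 9 {
        printf "seqid=%s source=%s type=%s start=%s end=%s strand=%s attributes=%s\n",
               $1, $2, $3, $4, $5, $7, $9
    }' "$1"
}

describe_features "$gff"
```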

You’ll notice terms like “hypothetical protein”. That’s bioinformatics for “pretty sure this is a gene, but we don’t know what it does yet.”
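If you are curious how much of a genome falls into that bucket, one rough approach is to count how many features of a given type mention the phrase in the attributes column. This is a sketch on made-up rows. Which feature type carries the product name in real GALES output, and the product_name key itself, are assumptions.

```shell
# Hypothetical GFF3 file with three CDS features, one of them hypothetical.
workdir=$(mktemp -d)
gff="$workdir/example.gff3"
{
    printf '##gff-version 3\n'
    printf 'contig_1\tprodigal\tCDS\t10\t400\t.\t+\t0\tID=cds_1;product_name=DNA polymerase III subunit beta\n'
    printf 'contig_1\tprodigal\tCDS\t900\t1500\t.\t-\t0\tID=cds_2;product_name=hypothetical protein\n'
    printf 'contig_1\tprodigal\tCDS\t2000\t2600\t.\t+\t0\tID=cds_3;product_name=recombinase A\n'
} > "$gff"

# Count features of one type, and how many of them mention
# "hypothetical protein" in the attributes column (column 9).
# Usage: hypothetical_summary file.gff3 TYPE
hypothetical_summary() {
    awk -F '\t' -v type="$2" '
        !/^#/ && $3 == type {
            total++
            if ($9 ~ /hypothetical protein/) hypo++
        }
        END { printf "%d of %d %s features are hypothetical\n", hypo, total, type }
    ' "$1"
}

hypothetical_summary "$gff" CDS
```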

Cheetah Got out of Hand 🐆

There was supposed to be a nice little HTML file with a dashboard (generated by the cheetah visualizer) so I could review my results more easily. In the end, it gave me too much trouble — and I decided to save that part for another time.

Honestly, wrestling with CWL inputs taught me a lot more than I expected. And I still walked away with an annotated genome and a deeper understanding of how pipelines work, which was the whole point.

Overall Reflections + What’s Next

I’m really excited to have taken this first deep dive into understanding how a bioinformatics pipeline works. GALES helped me connect the dots between raw sequencing data and meaningful biological insight. Even though some parts (like the Cheetah visualization) didn’t go as planned, the process of troubleshooting and figuring things out gave me a much better understanding of what’s actually happening.

Lately, I’ve been thinking about how cool it would be to collect my own environmental sample and send it to a lab for sequencing. I’m starting to brainstorm a little side project, and we’ll see if I can pull it off soon!