
What Running GALES Taught Me About Genome Annotation (and Patience)

genome annotation CWL GALES microbial genomics FASTA

“Science has taught me that everything is more complicated than we first assume, and that being able to derive happiness from discovery is a recipe for a beautiful life.”
Hope Jahren, Lab Girl


From FASTA to Function

As part of my applied bioinformatics coursework, I used GALES, a prokaryotic genome annotation pipeline built on CWL (Common Workflow Language), to annotate a microbial genome starting from a raw FASTA file. This was my first run with genomic data, and it was nice to finally produce results!


Where FASTA Files Come From

In a real lab setting, a FASTA file like the one I used might come from a DNA sequencing experiment. For example, a scientist might extract DNA from a bacterial sample, then sequence it using a platform like an Illumina MiSeq or NextSeq. The machine generates millions of short reads, which are then assembled into longer contiguous sequences (contigs) using an assembly tool.

The FASTA file I used was provided as part of the course and represents a prokaryotic genome — but you could easily try this out with something like E. coli by downloading it from the NCBI Genome database. Look for “Complete Genome” under assembly level and download the FASTA format to try it yourself.
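If you do download a genome to try this, a quick sanity check before annotating is to count the records (contigs) and the total sequence length. Here is a minimal sketch using awk. The name `fasta_stats` is a hypothetical helper for illustration, not part of GALES, and it assumes a plain (uncompressed) FASTA file.

```shell
# Sketch: summarize a FASTA file (record count and total sequence length).
# fasta_stats is a hypothetical helper name, not part of GALES.
# Usage: fasta_stats genome.fasta
fasta_stats() {
  awk '
    /^>/ { records++; next }     # header lines start with ">"
    {
      gsub(/[ \t\r]/, "")        # drop stray whitespace and Windows line endings
      bases += length($0)        # count sequence characters on this line
    }
    END { printf "records=%d length=%d\n", records, bases }
  ' "$1"
}
```

A "Complete Genome" assembly should report only a handful of records (a chromosome plus any plasmids), while a draft assembly can report hundreds of contigs.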


What GALES Does

GALES takes a prokaryotic genome sequence and automates the annotation steps, from predicting where the genes are to assigning them predicted functions.

All of this is combined using CWL (Common Workflow Language), a standardized format for describing data analysis pipelines that helps make them reproducible and shareable across systems.


Terminal Blur

The first time I ran the pipeline, a blizzard of information flashed across my screen — way too fast for my brain to keep up. I realized partway through that I had forgotten to redirect the output to a file, so instead of saving it neatly, I just watched valuable results scroll past at terminal-speed.
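One way to avoid losing a run like this is to send everything the command prints to a log file while still watching it scroll by. The sketch below wraps any command with `tee`. The name `run_logged` and the log file name in the usage comment are hypothetical, chosen only for illustration.

```shell
# Sketch: run a command, show its output live, and keep a copy in a log file.
# run_logged is a hypothetical wrapper name.
# Usage: run_logged blast_run.log blastp -query input.faa -db dbs/swissprot -outfmt 6
run_logged() {
  log_file="$1"   # first argument: where to save the output
  shift           # the remaining arguments are the command to run
  # 2>&1 merges error messages into normal output so both reach the log;
  # tee prints to the terminal and writes the same text to the file.
  "$@" 2>&1 | tee "$log_file"
}
```

Note that the exit status here comes from `tee`, not from the wrapped command. In bash, running `set -o pipefail` first makes the pipeline report a failure of the command itself.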

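# Correct way to save output: -out writes the hits to a file instead of the terminal,
# and -outfmt 6 selects tab-separated (tabular) output.
blastp -query input.faa -db dbs/swissprot -out results.blast -outfmt 6

# Mini test to validate the command structure on a small input first.
# Note: head cuts by line count, so the last sequence in test_input.faa may be truncated.
head -n 20 input.faa > test_input.faa
blastp -query test_input.faa -db dbs/swissprot -out test_output.blast -outfmt 6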

Running a small test first gives me a chance to spot mistakes early without wasting time or compute.
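Because `head -n 20` counts lines rather than sequences, the test file can end partway through a protein. If whole records matter for the test, a small awk filter can keep the first N complete records instead. This is a sketch under that assumption, and `fasta_head` is a hypothetical helper name.

```shell
# Sketch: print the first N complete FASTA records.
# fasta_head is a hypothetical helper name.
# Usage: fasta_head 5 input.faa > test_input.faa
fasta_head() {
  awk -v n="$1" '
    /^>/ { seen++ }       # each header line starts a new record
    seen > n { exit }     # stop as soon as record n+1 begins
    { print }             # otherwise pass the line through unchanged
  ' "$2"
}
```

The resulting file works as the `-query` input in the mini test, with every sequence intact.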

Running the Pipeline on My Genome

To run GALES, I used a virtual environment in WSL with Docker installed (though GALES itself was run locally for flexibility).

GALES generated the expected output files. The attributor.annotation.gff3 file was my main deliverable: a structured file listing gene positions and predicted functions.


📊 Quick Stats

What a GFF3 Annotation Looks Like

Here’s a small sample from the huge attributor.annotation.gff3 file I produced — this is where predicted genes, coding sequences, and annotations are recorded line by line. I like to open this kind of file in Sublime Text, which is my go-to text editor.

[Image: GFF3 file sample]

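Each row represents a feature (like a predicted gene or protein). In the GFF3 format, each row has nine tab-separated columns: sequence ID, source, feature type, start, end, score, strand, phase, and attributes.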

You’ll notice terms like “hypothetical protein” — that’s bioinformatics for: pretty sure this is a gene, but we don’t know what yet.
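Since GFF3 is plain tab-separated text, a short awk script can summarize a file this size without opening it in an editor. The sketch below counts rows per feature type and counts CDS rows whose attributes mention "hypothetical protein". The name `gff3_summary` is hypothetical, and it is an assumption that the product text sits on CDS rows in this file; if it sits on gene rows instead, the type check needs adjusting.

```shell
# Sketch: count features by type and count hypothetical-protein CDS rows.
# gff3_summary is a hypothetical helper name.
# Usage: gff3_summary attributor.annotation.gff3
gff3_summary() {
  awk -F'\t' '
    /^##FASTA/ { exit }          # optional embedded sequence section: stop reading
    /^#/ || NF < 9 { next }      # skip comments, directives, and malformed rows
    { types[$3]++ }              # column 3 is the feature type
    # Assumption: the predicted product is recorded in column 9 of CDS rows.
    $3 == "CDS" && tolower($9) ~ /hypothetical protein/ { hypo++ }
    END {
      for (t in types) printf "%s\t%d\n", t, types[t]
      printf "hypothetical_CDS\t%d\n", hypo
    }
  ' "$1" | sort
}
```

Each output line is a label and a count separated by a tab, so the result can be pasted straight into a spreadsheet.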

Cheetah Got out of Hand 🐆

There was supposed to be a nice little HTML file with a dashboard (generated by the Cheetah visualizer) so I could review my results more easily. In the end, it gave me too much trouble, and I decided to save that part for another time.

Honestly, wrestling with CWL inputs taught me a lot more than I expected. And I still walked away with an annotated genome and a deeper understanding of how pipelines work, which was the whole point.

Overall Reflections + What’s Next

I’m really excited to have taken this first deep dive into understanding how a bioinformatics pipeline works. GALES helped me connect the dots between raw sequencing data and meaningful biological insight. Even though some parts (like the Cheetah visualization) didn’t go as planned, the process of troubleshooting and figuring things out gave me a much better understanding of what’s actually happening.

Lately, I’ve been thinking how cool it would be to collect my own environmental sample and send it to a lab for sequencing. I’m starting to brainstorm a little side project — we’ll see if I can pull it off soon!

© 2025 Jennifer Slotnick   •  Theme  Moonwalk