FAIRy Notes #1: preflight before you submit
December 2025 (672 Words, 4 Minutes)
“My philosophy on technology is that people shouldn’t have to understand it in order to be able to use it.” — Radia Perlman
Most dataset submission pain isn’t hard, it’s just annoying. It’s usually not that the science is wrong. It’s that the same dataset has to be shaped into a slightly different format for each receiver: a repository, a curator, a collaborator, a tool. And when you’re already deep in research mode, that “just put it in the right format” work can feel like needless friction.
I kept coming back to the same thought, so I decided to build a small metadata manipulation tool. The idea is this:
Run a tiny preflight locally before you submit.
Not to make your data perfect — just to catch the preventable problems early and make fixes less miserable.
What I mean by “preflight”
A preflight is a quick pass over your dataset (usually the metadata table) that produces:
- Pass / Warn / Fail results (so you know what’s blocking vs. what’s “fix when you can”)
- A human-readable report you can share (or keep as a record)
- Remediation hints (what broke + what to do next)
- Optionally: a small “bundle” of artifacts (reports, manifests, provenance notes) so you can reproduce what you checked later
Example preflight output: what’s blocking submission, why it matters, and how to fix it.
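To make that concrete, here’s one way a single finding could be modeled. This is a minimal sketch with hypothetical names (`Finding`, `Severity`), not FAIRy’s actual data model:

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    PASS = "pass"   # nothing to do
    WARN = "warn"   # fix when you can
    FAIL = "fail"   # blocks submission

@dataclass
class Finding:
    rule_id: str        # which check fired, e.g. "id-unique"
    severity: Severity
    location: str       # where it happened, in terms the user's spreadsheet uses
    message: str        # what broke
    hint: str           # what to do next
```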
I’m building FAIRy around this model: validation is only useful if it helps you fix things.
The 5 boring issues that cause the most chaos
These show up across domains — genomics, biodiversity, and every random CSV someone exported from a database.
1) IDs that aren’t stable
If an id column changes between exports, it breaks joins, annotations, deduping, and even your own “what did I fix last time?” notes.
Preflight check: ID is present, unique, and doesn’t contain obvious formatting weirdness (like spaces).
Fix: choose a stable key strategy, or generate one once and persist it.
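A minimal version of that check might look like this. The helper is hypothetical and assumes a plain list of ID strings, one per data row:

```python
import re

def check_ids(ids: list[str]) -> list[str]:
    """Flag missing, duplicate, or oddly formatted IDs. A sketch, not FAIRy's rule."""
    problems, seen = [], set()
    for row, raw in enumerate(ids, start=2):  # start=2: row 1 is the header in Excel/Sheets
        if not raw.strip():
            problems.append(f"row {row}: ID is missing")
            continue
        if re.search(r"\s", raw):
            problems.append(f"row {row}: ID {raw!r} contains whitespace")
        if raw in seen:
            problems.append(f"row {row}: duplicate ID {raw!r}")
        seen.add(raw)
    return problems
```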
2) Dates that are “valid” but wrong
A date can parse fine and still be suspicious (defaults like 2025-01-01, weird placeholders, or mixed formats).
Preflight check: parseable date and optional heuristics (e.g., warn on implausible ranges or on obvious defaults).
Fix: normalize to one format and document what the field means (collection date vs upload date vs observation date).
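Here’s a sketch of that kind of heuristic, assuming ISO 8601 dates and a made-up list of suspicious defaults (tune both to your data):

```python
from datetime import date

SUSPICIOUS_DEFAULTS = {date(2025, 1, 1), date(1970, 1, 1), date(1900, 1, 1)}

def check_collection_date(value: str) -> str | None:
    """Return a warning string, or None if the date looks fine. Heuristics only."""
    try:
        d = date.fromisoformat(value)   # require one normalized format (ISO 8601)
    except ValueError:
        return f"unparseable or non-ISO date: {value!r}"
    if d in SUSPICIOUS_DEFAULTS:
        return f"{value} parses fine, but looks like a placeholder or default"
    if d > date.today():
        return f"{value} is in the future; implausible for a collection date"
    return None
```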
3) URLs that aren’t URLs
Classic: www.example.com without a scheme, or “URL-looking strings” that are actually notes.
Preflight check: if a field is supposed to be a URL, require a real URL shape (http:// or https://) and flag the rest.
Fix: add the scheme, or move the text into a notes field.
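A shape check like that is a few lines with the standard library. A sketch, not FAIRy’s actual rule:

```python
from urllib.parse import urlparse

def looks_like_url(value: str) -> bool:
    """True only for real URL shapes: an expected scheme plus a host."""
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

looks_like_url("www.example.com")          # False: no scheme, flag it
looks_like_url("https://www.example.com")  # True
```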
4) Numeric fields that drift out of reality
Latitude and longitude swapped, out-of-range coordinates, negative mass, impossible dimensions.
Preflight check: type and range checks (and warn if something looks swapped).
Fix: normalize units, confirm coordinate order, add constraints closer to the source system.
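The swap heuristic can be as simple as: if the pair is invalid as-is but valid when exchanged, warn. A sketch:

```python
def check_coordinates(lat: float, lon: float) -> str | None:
    """Range-check lat/lon and warn when they look swapped. Heuristic sketch."""
    if -90 <= lat <= 90 and -180 <= lon <= 180:
        return None
    # If swapping would make both values valid, they were probably entered backwards.
    if -90 <= lon <= 90 and -180 <= lat <= 180:
        return f"({lat}, {lon}) is out of range, but valid if swapped: possible lat/lon swap"
    return f"({lat}, {lon}) is out of range"
```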
5) “Row number” confusion
One of the most annoying little bugs: reported row numbers that don’t line up because of headers (or 0- vs 1-based indexing).
Preflight check: this one is a UX rule rather than a data rule. Make error locations match what users see in Excel/Sheets.
Fix: always be explicit about whether row numbers include headers.
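The fix is mostly arithmetic. A hypothetical helper, assuming a 0-based dataframe index and a single header row:

```python
def display_row(df_index: int, header_rows: int = 1) -> int:
    """Map a 0-based dataframe index to the 1-based row a user sees in Excel/Sheets."""
    return df_index + header_rows + 1

display_row(0)  # 2 -> the first data row, because row 1 is the header
```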
How someone else could use this idea (even without FAIRy)
If you’re a…
- Researcher: run preflight before you upload, so you don’t learn about basic issues from a rejection email.
- Curator/reviewer: share a preflight checklist (or rulepack) so feedback is consistent and less bespoke.
- Tool builder: treat rulepacks like “lint rules” for datasets — something the community can version and improve.
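To make “rulepack” concrete: it could be nothing more than versioned data naming checks, fields, and severities. Every name below is made up for illustration, not FAIRy’s actual format:

```python
# A rulepack is just versioned data: which checks run, on which fields, at what severity.
RULEPACK = {
    "name": "example-metadata-rules",
    "version": "0.1.0",
    "rules": [
        {"id": "id-unique",    "field": "sample_id",       "severity": "fail"},
        {"id": "date-iso",     "field": "collection_date", "severity": "fail"},
        {"id": "date-default", "field": "collection_date", "severity": "warn"},
        {"id": "url-scheme",   "field": "source_url",      "severity": "warn"},
        {"id": "coord-range",  "field": "latitude",        "severity": "fail"},
    ],
}
```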
The vision
I want dataset submission to feel more like running tests on code: quick, local, and obvious about what to fix. The point is cutting down preventable back-and-forth so curators and researchers spend their time on the work they’re actually trained to do. FAIRy runs locally, applies a shared set of requirements (“rulepacks”), and produces something you can act on: a clear fix list. Over time, I want these rulepacks to be reusable and kept in step with community standards.
Where FAIRy fits
FAIRy is my attempt to make this workflow easy: run a command locally, get a report that tells you what’s blocking, what’s risky, and what to fix next — without uploading raw data anywhere by default.
This is still early and evolving, but the direction is pretty stable: turn validation into remediation.
Follow Along
I’m sharing these FAIRy Notes as I build a local-first “preflight” for research datasets — the boring checks that save time later.
If you want to check it out:
- FAIRy (code + issues): github.com/yuummmer/fairy-core
- More on the project (and the institutional side) lives at: datadabra.com
And if you’re working with messy metadata (genomics, biodiversity, anything tabular) and have a “this always breaks” example, I’d love to hear it. You can reach me on LinkedIn or GitHub.