
FAIRy Notes #1: preflight before you submit

“My philosophy on technology is that people shouldn’t have to understand it in order to be able to use it.”
- Radia Perlman

Most dataset submission pain isn’t hard; it’s just annoying. It’s usually not that the science is wrong. It’s that the same dataset has to be shaped into a slightly different format for each receiver: a repository, a curator, a collaborator, a tool. And when you’re already deep in research mode, that “just put it in the right format” work can feel like needless friction.

I kept coming back to the same idea, so I decided to build a small metadata manipulation tool. The idea is this:

Run a tiny preflight locally before you submit.
Not to make your data perfect — just to catch the preventable problems early and make fixes less miserable.

What I mean by “preflight”

A preflight is a quick pass over your dataset (usually the metadata table) that produces:

[Figure: FAIRy preflight report showing blocking FAILs and a WARN with fix guidance.]

Example preflight output: what’s blocking submission, why it matters, and how to fix it.

I’m building FAIRy around this model: validation is only useful if it helps you fix things.
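To make "validation that helps you fix things" concrete, here is a rough sketch of what a single report entry could carry. This is not FAIRy's actual format: the `Finding` class, its field names, and `render_fix_list` are all hypothetical, chosen only to show a finding that pairs the problem with a reason and a fix.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """One entry in a preflight report (hypothetical structure)."""
    severity: str  # "FAIL" blocks submission, "WARN" is risky but allowed
    row: int       # row number as the user sees it in a spreadsheet
    field: str     # column the finding refers to
    why: str       # why this matters
    fix: str       # what to do about it


def render_fix_list(findings):
    """Return a plain-text fix list with blocking FAILs listed first."""
    order = {"FAIL": 0, "WARN": 1}
    ranked = sorted(findings, key=lambda f: (order.get(f.severity, 2), f.row))
    return "\n".join(
        f"[{f.severity}] row {f.row}, {f.field}: {f.why} Fix: {f.fix}"
        for f in ranked
    )


findings = [
    Finding("WARN", 2, "collection_date", "Date looks like a default.",
            "Confirm the real collection date."),
    Finding("FAIL", 5, "source_url", "Value has no http(s) scheme.",
            "Add https:// or move the text to a notes field."),
]
print(render_fix_list(findings))
```

The design choice worth noting is that `why` and `fix` are required fields, so a rule cannot report a problem without also saying what to do next.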

The 5 boring issues that cause most chaos

These show up across domains — genomics, biodiversity, and every random CSV someone exported from a database.

1) IDs that aren’t stable

If an id column changes between exports, it breaks joins, annotations, deduping, and even your own “what did I fix last time?” notes.

Preflight check: ID is present, unique, and doesn’t contain obvious formatting weirdness (like spaces).

Fix: choose a stable key strategy, or generate one once and persist it.
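A minimal sketch of the ID check described above, assuming rows arrive as dictionaries with a column literally named `id`. The function name `check_ids` and the message wording are illustrative, not part of FAIRy.

```python
from collections import Counter


def check_ids(rows, id_field="id"):
    """Return (row_index, message) pairs for missing, malformed, or duplicate IDs.

    row_index is the 0-based position in `rows`; translate it to a
    spreadsheet row number before showing it to a user.
    """
    values = [(row.get(id_field) or "") for row in rows]
    counts = Counter(v for v in values if v.strip())
    problems = []
    for index, value in enumerate(values):
        if not value.strip():
            problems.append((index, "missing id"))
        elif any(ch.isspace() for ch in value):
            problems.append((index, f"id {value!r} contains whitespace"))
        elif counts[value] > 1:
            problems.append((index, f"duplicate id {value!r}"))
    return problems


rows = [{"id": "S1"}, {"id": "S 2"}, {"id": "S1"}, {"id": ""}]
for index, message in check_ids(rows):
    print(index, message)
```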

2) Dates that are “valid” but wrong

A date can parse fine and still be suspicious (defaults like 2025-01-01, weird placeholders, or mixed formats).

Preflight check: parseable date and optional heuristics (e.g., warn on implausible ranges or on obvious defaults).

Fix: normalize to one format and document what the field means (collection date vs upload date vs observation date).
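Here is one way the date check could look, as a sketch under stated assumptions: the target format is ISO 8601 (`YYYY-MM-DD`), and the heuristics (future dates, dates before an arbitrary `earliest` cutoff, and January 1 as a likely placeholder) are my guesses at reasonable defaults rather than rules taken from any standard. `check_date` is a hypothetical name.

```python
from datetime import date


def check_date(text, earliest=date(1800, 1, 1), today=None):
    """Return (level, message), where level is "OK", "WARN", or "FAIL"."""
    today = today or date.today()
    try:
        parsed = date.fromisoformat(text.strip())
    except ValueError:
        return ("FAIL", "not an ISO 8601 date (YYYY-MM-DD)")
    if parsed > today:
        return ("WARN", "date is in the future")
    if parsed < earliest:
        return ("WARN", f"date is earlier than {earliest.isoformat()}")
    if parsed.month == 1 and parsed.day == 1:
        # Assumption: January 1 is a common export default, so ask for a second look.
        return ("WARN", "January 1 is a common placeholder; confirm it is real")
    return ("OK", "")


for value in ["2024-05-17", "2025-01-01", "03/04/2025"]:
    print(value, check_date(value, today=date(2025, 6, 1)))
```

Passing `today` explicitly keeps the check reproducible, which matters if the same report is regenerated later.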

3) URLs that aren’t URLs

Classic: www.example.com without a scheme, or “URL-looking strings” that are actually notes.

Preflight check: if a field is supposed to be a URL, require a real URL shape (http:// or https://) and flag the rest.

Fix: add the scheme, or move the text into a notes field.
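The URL check can be sketched with Python's standard `urllib.parse.urlsplit`. The function name `check_url` and the messages are illustrative; the rule itself (require an `http` or `https` scheme and a host, and treat whitespace as a sign the value is really a note) follows the description above.

```python
from urllib.parse import urlsplit


def check_url(text):
    """Return (level, message) for a value that is supposed to be a URL."""
    value = text.strip()
    if any(ch.isspace() for ch in value):
        return ("FAIL", "contains whitespace; looks like a note, not a URL")
    try:
        parts = urlsplit(value)
    except ValueError:
        return ("FAIL", "could not be parsed as a URL")
    if parts.scheme not in ("http", "https"):
        return ("FAIL", "missing http:// or https:// scheme")
    if not parts.netloc:
        return ("FAIL", "missing host name")
    return ("OK", "")


for value in ["https://example.com/data", "www.example.com", "see lab notebook"]:
    print(value, check_url(value))
```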

4) Numeric fields that drift out of reality

Latitude and longitude swapped, out-of-range coordinates, negative mass, impossible dimensions.

Preflight check: type and range checks (and warn if something looks swapped).

Fix: normalize units, confirm coordinate order, add constraints closer to the source system.
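For coordinates, the range check and the "looks swapped" hint might be sketched like this. It assumes decimal degrees, with latitude in [-90, 90] and longitude in [-180, 180]; `check_coordinates` is a hypothetical name.

```python
def check_coordinates(lat, lon):
    """Return (level, message) for a latitude/longitude pair in decimal degrees."""
    try:
        lat, lon = float(lat), float(lon)
    except (TypeError, ValueError):
        return ("FAIL", "latitude and longitude must be numeric")
    lat_ok = -90 <= lat <= 90
    lon_ok = -180 <= lon <= 180
    if lat_ok and lon_ok:
        return ("OK", "")
    # If the pair becomes valid once the values are exchanged, say so in the message.
    if -90 <= lon <= 90 and -180 <= lat <= 180:
        return ("FAIL", "out of range; latitude and longitude look swapped")
    return ("FAIL", "coordinates out of range")


print(check_coordinates("-33.9", "151.2"))  # valid pair
print(check_coordinates("151.2", "-33.9"))  # same place with columns exchanged
print(check_coordinates("n/a", "151.2"))    # not numeric
```

A swap can only be detected when it pushes latitude out of range, so pairs where both values are within ±90 will still pass this check.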

5) “Row number” confusion

One of the most annoying little bugs: reporting row numbers that don’t line up because of headers (or 0 vs 1 indexing).

Preflight check: this is a UX rule rather than a data rule. Make error locations match what users see in Excel/Sheets.

Fix: always be explicit about whether row numbers include headers.
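A small sketch of spreadsheet-aligned row numbering, assuming a CSV with exactly one header row. It uses `csv.reader` rather than `csv.DictReader` because `DictReader` silently skips blank lines, which would shift every later row number. The function name `rows_with_sheet_numbers` is illustrative.

```python
import csv
import io


def rows_with_sheet_numbers(csv_text):
    """Yield (sheet_row, record) pairs, where sheet_row matches Excel/Sheets.

    The header is row 1, so the first data record is row 2. Blank lines
    are skipped as records but still counted, because a spreadsheet shows
    them as empty rows.
    """
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    header = next(reader, None)
    if header is None:
        return
    for sheet_row, values in enumerate(reader, start=2):
        if not values:
            continue
        yield sheet_row, dict(zip(header, values))


text = "id,url\nS1,https://example.com\n\nS2,www.example.com\n"
for sheet_row, record in rows_with_sheet_numbers(text):
    print(f"row {sheet_row} (header is row 1): {record}")
```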

How someone else could use this idea (even without FAIRy)

If you’re a…

The vision

I want dataset submission to feel more like running tests on code: quick, local, and obvious about what to fix. The point is cutting down preventable back-and-forth so curators and researchers spend their time on the work they’re actually trained to do. FAIRy runs locally, applies a shared set of requirements (“rulepacks”), and produces something you can act on: a clear fix list. Over time, I want these rulepacks to be reusable and updated to community standards.

Where FAIRy fits

FAIRy is my attempt to make this workflow easy: run a command locally, get a report that tells you what’s blocking, what’s risky, and what to fix next — without uploading raw data anywhere by default.

This is still early and evolving, but the direction is pretty stable: turn validation into remediation.

Follow Along

I’m sharing these FAIRy Notes as I build a local-first “preflight” for research datasets — the boring checks that save time later.

If you want to check it out:

And if you’re working with messy metadata (genomics, biodiversity, anything tabular) and have a “this always breaks” example, I’d love to hear it. You can reach me on LinkedIn or GitHub.

© 2025 Jennifer Slotnick