Whale ID
Cetacean re-identification from underwater photographs.
Origin
Dominica, Caribbean Sea, 2024. Sperm whale family units in the waters off the western coast have been studied longitudinally for decades; the same individuals have been photographed by the same researchers across multiple seasons. The bottleneck was never sighting effort; it was the manual matching that comes afterwards. A trained eye takes roughly twenty minutes per photograph to confirm whether the animal in the frame has been seen before. At population scale, with multi-decade datasets, that cost made true longitudinal analysis uneconomic.
Problem
A sperm whale lives sixty to seventy years. Each photograph, in theory, is a longitudinal data point on an individual that will outlive most of its observers. The promise of population-scale, generational behavioral science depends on whether re-identification is effectively a constant-time operation per photograph. Traditionally it is not.
Approach
Embedding model trained on the Dominica field corpus, plus similarity search over the population. Each new photograph is reduced to a vector and compared against everything previously catalogued; matches surface in milliseconds. The model isn't the contribution — the model is well-understood machinery. The contribution is conservation infrastructure: a re-identification step that fades into the background of a longitudinal study.
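The compare-against-the-catalogue step can be illustrated with a minimal sketch. This is not the project's actual pipeline: the three-dimensional vectors, the whale identifiers, the `top_matches` helper, and the 0.8 threshold are all hypothetical, and cosine similarity is assumed as the comparison measure because the text does not name one.

```python
import math
from typing import Dict, List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embeddings of equal length."""
    a_n, b_n = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def top_matches(
    query: Sequence[float],
    catalogue: Dict[str, Sequence[float]],
    k: int = 3,
    threshold: float = 0.8,  # hypothetical cut-off, not a value from the project
) -> List[Tuple[str, float]]:
    """Return up to k catalogued individuals whose similarity meets the threshold."""
    scored = [(whale_id, cosine_similarity(query, emb))
              for whale_id, emb in catalogue.items()]
    scored = [pair for pair in scored if pair[1] >= threshold]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy catalogue: real embeddings would have hundreds of dimensions.
catalogue = {
    "whale_A": [0.9, 0.1, 0.0],
    "whale_B": [0.0, 1.0, 0.2],
    "whale_C": [0.1, 0.0, 1.0],
}

new_photo = [0.85, 0.15, 0.05]      # embedding of a newly taken photograph
matches = top_matches(new_photo, catalogue)

unseen = [0.5, 0.5, 0.5]            # resembles no catalogued individual closely
no_matches = top_matches(unseen, catalogue)
```

Here `matches` contains only `whale_A`, and `no_matches` is empty, which a curator could read as a candidate new individual. The scan is linear in catalogue size. For a population of hundreds or thousands of individuals that is typically fast enough; a much larger catalogue would likely call for an approximate nearest-neighbour index instead.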
Methodology
Field-collected corpus only — no synthetic augmentation, no scraped imagery, no crowdsourcing. Photographs were taken from the boat over multiple seasons under variable light, sea state, and behavioral context. The model has to be robust to those conditions because they are the conditions of the field. A clean corpus would be a different problem.
Selected milestones
- First cross-population match at 91% confidence
- Pipeline handles partial-fluke and adverse-light photographs in production
- Co-authored paper accepted; expected publication 2026
Collaborators
Dominica Sperm Whale Project field researchers; co-author on the forthcoming paper.
Open questions
- How the embedding handles individuals first observed as juveniles vs. as adults
- Whether sequence-aware embedding (multi-frame) outperforms single-image embedding for tail-light cases
- What the right unit of comparison is when family units travel together
Ask me about
- How the embedding handles partial-fluke photographs and bad light
- Why field-collected: what crowdsourcing would have lost
- What 91% confidence means in re-identification ground truth
- How the pipeline integrates with longitudinal behavioral analysis