Spent a session building real tooling to parse the AI's reasoning text; the previous version relied on eyeball-counting, fine for a first pass but too unreliable for any downstream claim. Before drawing conclusions about what the AI is doing, we need to know our measurements aren't artifacts of how we measure.
Two consecutive prototype runs gave different answers: was that AI flakiness, or news arriving mid-day? Set up a back-to-back same-input experiment to disentangle the two. The first thing you need to know about a decision-support model is what kind of noise it produces.
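A minimal sketch of the same-input experiment. The model functions here are hypothetical stand-ins, not the real tooling: the point is only the shape of the test, run the identical prompt N times back-to-back (so no new information can arrive between calls) and measure how often the answers agree. Any disagreement in that window is model noise; disagreement beyond that baseline across the day can then be attributed to changing inputs.

```python
import random
from collections import Counter

def agreement_rate(outputs):
    """Fraction of runs matching the most common output.
    1.0 means the model was deterministic on this input."""
    counts = Counter(outputs)
    return counts.most_common(1)[0][1] / len(outputs)

def same_input_trial(model, prompt, n=10):
    """Run the identical prompt n times back-to-back.
    Any disagreement here is model noise, not news arriving."""
    return [model(prompt) for _ in range(n)]

# Hypothetical stand-ins for a real model call (assumptions):
def noisy_model(prompt, _rng=random.Random(0)):
    return _rng.choice(["buy", "hold"])

def deterministic_model(prompt):
    return "hold"

if __name__ == "__main__":
    det = same_input_trial(deterministic_model, "same prompt")
    noisy = same_input_trial(noisy_model, "same prompt")
    print("deterministic:", agreement_rate(det))
    print("noisy:", agreement_rate(noisy))
```

If the back-to-back agreement rate is already well below 1.0, mid-day flips tell you little on their own; you have to compare them against this noise floor.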