LLMs are improving — but vigilance isn't optional
A colleague caught what I missed. Here's what that revealed about trusting agentic workflows.
Two of my colleagues messaged me in a Slack group chat, quoting a very obvious inconsistency in a high-level design (HLD) document I had prepared a month earlier. The document was consistent overall, but one flow the LLM had written contained a glaring error: it attributed a specific piece of business logic to a service that was not actually responsible for it. The moment I read the quoted paragraph, I knew it was wrong.
When I wrote that HLD, I was experimenting with a new skill-based workflow that generated the document using Copilot with the Claude Sonnet model.
I had tested this agentic skill on earlier HLDs and found the results satisfactory. I trusted the outcome more than I should have, and somehow the error slipped through the review cycle.
I should have validated the document line by line myself instead of trusting the output wholesale, or my evaluation of the skill should have been rigorous enough to catch issues like this.
I believe context window size, context quality, and how the agent is programmed all matter a great deal and can make a measurable difference in output quality. Sonnet on Copilot may behave differently today, but at the time it did a poor job even after I instructed it to read the low-level design and source code to work out the as-is design of the system.
In the end, a human flagged the issue, which is a relief, but it's an unsettling situation that makes you perceive these LLMs as unreliable.

