In 2004, while working on a PhD at the University of Sydney, I presented a paper at a local conference on teaching computers to understand words. Not generate them. Not reason with them. Just figure out what a single word meant in a given sentence.
The word "bank," for instance, has ten different senses in WordNet. A financial institution. A sloping piece of land. A supply held in reserve. A container for coins. A flight manoeuvre. Our system's job was to look at the surrounding context and decide which meaning applied. That was the entire task. And it was considered a serious research problem.
We built a three-tiered feature extraction architecture. We used the Conexor parser to break text into parts of speech and syntactic links. We constructed domain vectors from the WordNet Domains database, a resource maintained by an institute in northern Italy that had painstakingly annotated tens of thousands of word senses with domain labels like ECONOMY, GEOGRAPHY, ARCHITECTURE. We fed all of this into Support Vector Machines through the Weka machine learning framework, tuning polynomial kernels with varying complexity and exponent values to find the optimal configuration.
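For readers who never touched that era of NLP, the final stage of such a pipeline can be sketched in modern terms. This is an illustrative reconstruction only: scikit-learn stands in for Weka, and a toy bag-of-words feature set stands in for the parser and domain-vector stages described above.

```python
# Illustrative sketch only. scikit-learn stands in for Weka, and the
# bag-of-words features stand in for the parser and domain-vector stages.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# A tiny hand-made training set: context sentences labelled with a sense.
contexts = [
    "she deposited her salary at the bank on Friday",
    "the bank approved the loan application",
    "we had a picnic on the bank of the river",
    "fish were rising near the grassy bank of the stream",
]
senses = ["FINANCE", "FINANCE", "GEOGRAPHY", "GEOGRAPHY"]

# Context features fed to an SVM with a polynomial kernel, the kind of
# configuration the 2004 system tuned by hand (varying degree and C).
clf = make_pipeline(CountVectorizer(), SVC(kernel="poly", degree=2, C=1.0))
clf.fit(contexts, senses)

prediction = clf.predict(["he opened a savings account at the bank"])
print(prediction[0])
```

Each sense of each target word needed labelled training contexts like these, which is one reason the approach never scaled beyond narrow benchmarks.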
The best F-score we achieved on the Senseval 2 benchmark was 0.56.
An F-score of 0.56. Roughly, getting it right just over half the time, on a task that amounted to: given a word in a sentence, pick the right dictionary definition.
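Strictly speaking, the F-score is not raw accuracy: it is the harmonic mean of precision and recall. A quick illustration of the formula, with hypothetical numbers that are not from the paper:

```python
# F1 score: the harmonic mean of precision and recall.
# The input numbers here are hypothetical, purely to show the formula.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# A system that is fairly precise on the senses it commits to, but
# misses many, still lands in the mid-fifties:
print(round(f1(0.60, 0.52), 2))  # 0.56
```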
I am writing this article with the help of an AI that read that same paper in its entirety, understood the methodology, identified the key results, and is now helping me construct an argument about what those results mean in historical context. It did not need a domain database. It did not need a parser. It did not need kernel tuning. It just read.
Twenty-Two Years of Everything Else
To appreciate what happened in AI, you need to compare it with what happened everywhere else in the same period.
In 2004, the best-selling car in Australia was the Holden Commodore. It had a V6 engine, a CD player, and a four-speed automatic transmission. By 2026, we have electric vehicles with over-the-air software updates and lane-keeping assist. Genuinely impressive. But the fundamental operation is recognisable. You get in, you drive, the car goes where you point it. The improvement is incremental, compounding over two decades of engineering refinement.
Computing made a bigger leap. In 2004, a decent laptop had 512 megabytes of RAM and a 30-gigabyte hard drive. The iPhone did not exist. Google had just gone public. Facebook was a month old. By 2026, the phone in your pocket has more processing power than anything that existed in a data centre in 2004. But even this followed a predicted trajectory. Moore's Law gave us a rough roadmap. We knew this was coming, even if the specifics surprised us.
Flight actually went backwards in some respects. The Concorde made its final flight in 2003. In 2026, there is still no commercial supersonic replacement. The planes we fly are quieter, more fuel-efficient, and have better entertainment systems. But a passenger experience comparison between 2004 and 2026 would struggle to identify a qualitative shift. You still sit in a metal tube for the same number of hours.
And then there is AI.
The Phase Change
In 2004, the state of the art in natural language processing was what I described above. Hand-built feature extraction pipelines. Curated domain databases. Statistical classifiers that needed extensive parameter tuning. Benchmarks where 56 per cent accuracy was a publishable result. The entire field was focused on narrow, well-defined tasks because general language understanding was not remotely on the horizon.
By 2025, GPT-4 achieves over 82 per cent accuracy on equivalent word sense disambiguation benchmarks. Without any task-specific training. Without domain databases. Without feature engineering. It simply understands language well enough that word sense disambiguation is a trivial subtask of its general capability. Leading models now perform on par with the best purpose-built WSD systems that took years of dedicated research to construct.
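To make the contrast concrete: the modern approach needs no pipeline at all, because disambiguation reduces to asking. A sketch of how such a zero-shot query might be assembled. The sense glosses are abridged from WordNet, and the model call itself is omitted; this is a hypothetical prompt format, not any particular benchmark's protocol.

```python
# Sketch of zero-shot WSD as prompting. The glosses are abridged WordNet
# senses; no model API is called here, only the prompt is constructed.
SENSES = {
    "bank%1": "a financial institution that accepts deposits",
    "bank%2": "sloping land beside a body of water",
    "bank%3": "a supply or stock held in reserve",
}

def wsd_prompt(word: str, sentence: str, senses: dict[str, str]) -> str:
    options = "\n".join(f"- {key}: {gloss}" for key, gloss in senses.items())
    return (
        f"Which sense of '{word}' is used in this sentence?\n"
        f"Sentence: {sentence}\n"
        f"Senses:\n{options}\n"
        "Answer with the sense key only."
    )

prompt = wsd_prompt("bank", "We moored the boat at the bank.", SENSES)
print(prompt)
```

Everything the 2004 system needed a parser, a domain database, and a tuned classifier for is now carried implicitly in the model's general grasp of language.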
But framing this as "accuracy went from 56 to 82 per cent" misses the point entirely. That is like saying the Wright Brothers' achievement was a distance improvement over walking.
The 2004 system could disambiguate words. That was all it could do. It could not read. It could not write. It could not reason about what it read. It could not explain its reasoning. It could not take the output of its disambiguation and do anything useful with it. Every single capability had to be engineered separately, by separate teams, with separate architectures.
The 2026 system reads, writes, reasons, codes, translates, summarises, analyses data, generates hypotheses, and carries on extended conversations about all of the above. Word sense disambiguation is not a feature. It is not even a task. It is an emergent property of something far more general.
This is not an incremental improvement. Cars got faster. Planes got more efficient. Computers got smaller. AI underwent a phase change. It went from a collection of narrow tools to something that looks, increasingly, like general-purpose cognition.
What I Did Not See Coming
I want to be honest about this. In 2004, I did not predict any of it. Nobody in my research group did. The path from feature vectors and SVMs to transformer-based large language models was not visible, even to the people working in the field. We thought we were building towards better feature extractors, better domain databases, better classifiers. We thought the path to language understanding ran through more and more sophisticated versions of exactly what we were building.
We were wrong about the architecture. We were wrong about the data requirements. We were wrong about what "understanding" would look like when a machine did it. The deep learning revolution, the scaling laws, the emergence of capabilities at scale: none of this was anticipated by the mainstream NLP community of 2004.
And I think that is a more important point than any accuracy comparison. The technological leaps in cars, in computing, in aviation were extensions of known paradigms. Faster, smaller, lighter, more efficient. The AI leap was a paradigm replacement. The entire approach changed. The entire architecture changed. The results are qualitatively different from anything that existed before.
From Researcher to Practitioner
I left academia for medicine, and then left medicine for software engineering. The thread connecting all three is the same curiosity about systems and how they work. In 2004, I was trying to understand how machines could learn what words mean. In 2026, I am building fitness technology powered by AI that understands language better than any system I could have imagined.
My 2004 paper sits in the ACL Anthology, a digital library of computational linguistics research. It has been cited a few times. It made a modest contribution to a field that was about to be completely transformed. I am proud of the work, and I am simultaneously aware that every technique in that paper, every approach, every architectural decision, has been rendered obsolete not by better versions of those techniques, but by something entirely different.
Twenty-two years. From hand-crafted feature vectors and 56 per cent accuracy to fluent, multi-domain artificial intelligence. No other technology made that jump. Not even close.
Dr David Bell is a specialist anaesthetist (retired), software engineer, and founder of Align AI Fitness, based in NSW, Australia. His 2004 paper on word sense disambiguation is available at aclanthology.org/U04-1003.pdf.