AI Behaving Badly: Lies, Bad Coding, Hacking, and Missing Oversight

Li Nguyen

AI keeps asking for trust while supplying fresh reasons to keep one hand on the wallet and the other on the fire alarm.


Artificial intelligence had another rough week, and the problems did not come from a single corner. Google’s AI Overviews are still getting facts wrong at a rate that should make any search user twitch. Bluesky users are turning “vibe coding” into a public insult every time the site hiccups. Critics are taking fresh swings at the wider AI industry’s mix of hype, weak accountability, and messianic leadership. And Anthropic is holding back the Claude Mythos Preview from a public launch because the model appears strong enough to help find serious software vulnerabilities at scale.

Put those stories together, and a harsher pattern emerges. AI is not only making mistakes. AI companies are forcing the public to wrestle with a more irritating question: what happens when the tools get more capable faster than the institutions around them get more honest, careful, or competent? The answer, at least this week, is wrong answers, sloppy blame games, bigger cyber risk, and a growing appetite for oversight.

What’s Happening & Why This Matters

Google’s AI Overviews Are Still Too Wrong

The most accessible example of AI behaving badly comes from the search box.

Recent analysis cited by Ars Technica says Google AI Overviews answered factual questions correctly about 90% of the time. That number sounds decent until the other side of the coin hits the table. A 10% miss rate at Google scale means millions of wrong answers can spill out daily, often as the first thing a user sees before clicking a source.
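To make that scale concrete, here is a back-of-envelope sketch. The daily query volume and the share of searches that trigger an AI Overview are illustrative assumptions, not reported figures; only the roughly 10% miss rate comes from the coverage above.

```python
# Back-of-envelope estimate of daily wrong answers at search scale.
# daily_searches and overview_share are illustrative assumptions;
# only the miss rate traces back to the benchmark discussed above.
daily_searches = 8_500_000_000   # assumed total Google searches per day
overview_share = 0.15            # assumed fraction that trigger an AI Overview
miss_rate = 0.10                 # ~10% factual miss rate from the reported test

wrong_per_day = daily_searches * overview_share * miss_rate
print(f"{wrong_per_day:,.0f} potentially wrong AI Overview answers per day")
# ~127,500,000 under these assumptions -- "millions" is an understatement
```

Even if the assumed inputs are off by a large factor, the daily error count stays in the millions, which is the structural point.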


That is not a tiny flaw. That is industrialized overconfidence.

The test, conducted with help from startup Oumi using the SimpleQA benchmark, found that AI Overviews improved from about 85% accuracy under earlier Gemini-era conditions to roughly 91% after the Gemini 3 update. Fine. Improvement is real. Wrong answers still keep flowing. Even a model grounded in web results can still confidently choose the wrong answer, flatten contradictions, or fabricate certainty where no certainty exists.
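For readers unfamiliar with SimpleQA-style scoring, the accuracy figure is simply the share of short factual questions graded correct. Here is a minimal sketch with toy data and a deliberately naive exact-match grader; the actual benchmark grades answers more flexibly than string comparison.

```python
# Minimal SimpleQA-style accuracy calculation over (gold, predicted) pairs.
# The data and the exact-match grader are simplified stand-ins; the real
# benchmark tolerates variations in phrasing when judging correctness.
answers = [
    ("Canberra", "Canberra"),         # correct
    ("1969", "1969"),                 # correct
    ("Marie Curie", "Pierre Curie"),  # wrong
]

correct = sum(gold.strip().lower() == pred.strip().lower() for gold, pred in answers)
accuracy = correct / len(answers)
print(f"accuracy = {accuracy:.0%}")  # 67% on this toy set; the article cites ~85% -> ~91%
```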

Google disputed the benchmark’s value and said the test does not reflect how people really search. Fair defense. Users still see the core issue every day. AI Overviews keep presenting summaries with an authoritative tone that outstrips the reliability underneath.

That gap between confidence and truth is not merely annoying. Search is infrastructure. People use Google for health questions, finance questions, legal questions, travel decisions, schoolwork, and plain everyday fact-checking. A system that gets one answer in ten wrong at internet scale is not cute. A system like that is a liability dressed as convenience.

“Vibe Coding”: The Public’s New Favourite Slur

The second story sounds lighter. It is not actually light.

Bluesky had intermittent service problems, and users instantly began blaming “vibe coding” — a catch-all insult for sloppy AI-assisted software work. Bluesky itself blamed an upstream service provider, not some AI-generated coding disaster. The user reaction still told a story worth paying attention to.

People are starting to treat AI coding tools like a universal suspect. A site goes down. A feature breaks. A workflow is clumsy. The public reflex increasingly sounds like: some developer let the machine cook.


That reaction can be unfair. Plenty of outages have nothing to do with AI-generated code. Plenty of developers use coding assistants responsibly. Yet public suspicion did not fall from the sky. The suspicion stemmed from repeated stories of thin reviews, hallucinated functions, brittle AI-written code, and executives pitching speed while quietly downgrading craftsmanship.

The Bluesky pile-on indicates a cultural shift. AI-assisted coding is no longer receiving a novelty glow from many users. AI-assisted coding is starting to carry reputational debt. Once users begin to assume the machine wrote the bug, every failure takes on extra stigma.

That can create a weird side effect. Even disciplined teams may start hiding their AI use because the public has begun treating the phrase “vibe coding” as both a technical insult and a moral one.

The sharper truth is rougher. AI coding tools are not just writing code. AI coding tools are writing a new trust problem for software teams.

An Industry-wide Leadership Problem

In another Ars Technica essay asking “What the heck is wrong with our AI overlords?”, Sam Altman serves as the doorway into a wider industry critique, and the behaviour it describes is familiar. According to the essay, too many AI leaders keep talking like science-fiction evangelists while the rest of the world is asked to absorb the risk, confusion, and social cost.


That argument is not only about one executive. The point is wider. Large AI companies keep cycling through the same rhetorical pattern: humanity is on the edge of dazzling abundance, the machines will raise productivity, work will improve, everything will get cheaper, and society only needs a little patience while the same firms keep collecting power.

That sales language is wearing thin.

Users are being asked to trust systems that hallucinate. Workers are being asked to trust companies that speak warmly about shared prosperity while automating jobs where possible. Regulators are being asked to trust executives who routinely present their own products as both urgently necessary and so powerful they deserve special treatment. That mixture has started to smell less like vision and more like managed self-interest.

A lot of the public frustration around AI is not really about math or model weights. A lot of the frustration is about posture. People can forgive a rough tool faster than they can forgive an industry that sounds smug while shipping rough tools.

That is why the leadership critique has its place. The trust problem is not only technical. The trust problem is cultural, managerial, and political.

Cybersecurity’s Future Is Getting Darker

If the Google and Bluesky stories show everyday AI friction, Anthropic’s Claude Mythos Preview shows the darker end of the capability curve.

Anthropic launched Project Glasswing, a cybersecurity initiative that gives select partners access to Mythos Preview for defensive work. Reuters reported that launch partners include major firms such as Amazon, Microsoft, Apple, Google, Nvidia, CrowdStrike, and Palo Alto Networks. Anthropic said the model has found thousands of major vulnerabilities in operating systems, browsers, and other software. The company is offering up to $100 million in usage credits and $4 million in donations to open-source security groups.


Anthropic’s own technical post goes further. The company says Mythos performs strongly across the board but is “strikingly capable” at computer security tasks, which is why the model is not being released. Anthropic wants industry and government to prepare before models in this capability class are easier to access.

That is a huge admission.

The company is effectively saying: we built something strong enough that a full public launch is reckless. That is not a normal product release problem. That is the sort of sentence that should make lawmakers, enterprise security teams, and rival labs sit upright.

The tension is obvious. A model that can help defenders identify old, dangerous software flaws can also help attackers. Anthropic is trying to stay on the defensive side of that line. The line itself is getting thinner.

Oversight Is Not Optional

Put all four stories side by side, and a common theme emerges. Oversight is no longer some abstract policy wish from nervous academics. Oversight is starting to sound like basic maintenance for a messy industry.

Google’s search summaries show how easily generative AI can scale errors. The Bluesky reaction shows public trust eroding around AI-generated software. The leadership criticism shows that charisma and market power are no substitute for accountability. The Mythos story shows that frontier models are drifting into domains where misuse can produce real-world security consequences at frightening speed.

That combination should end the fantasy that the industry can self-soothe with blog posts and voluntary guardrails.


Different parts of AI need different kinds of supervision. Search products need higher accuracy, stronger discipline, and clearer user signalling. Coding tools need a stronger culture of review and less hype around raw speed. Frontier models with severe cyber implications need controlled access, partner review, government coordination, and tougher release logic.

None of that guarantees safety. All of that beats vibes.

The ugly lesson here is one the industry hates hearing. Capability growth is not a moral achievement by itself. Without discipline, capability growth is a scaling function for bad judgment.

AI Is Acting More Human in the Worst Ways

A final irony hangs over the whole week.


AI companies have spent years promising tools that reason better, write better, search better, code better, and protect systems better. Yet much of the most visible behaviour is still painfully human. The machines bluff. The companies overpromise. The public sometimes jumps to blame the wrong culprit. The leadership class spins a glossy story while trying to keep control of the stage.

In other words, AI is acting more human than advertised, but not in charming ways.

That does not mean the technology is useless. The Anthropic story alone shows the upside can be huge. A model that finds critical software flaws before criminals or hostile states do could save enormous amounts of pain. The problem is that upside and danger are arriving together.

That joint arrival changes the public bargain. People will tolerate impressive tools. People will not tolerate endless excuses forever. The companies selling AI as inevitable are going to discover that trust still needs to be earned the slow way.

TF Summary: What’s Next

This week’s AI behaving badly theme stretched across search, software, cybersecurity, and leadership. Google AI Overviews are still wrong often enough to make large-scale factual errors a structural problem. Vibe coding has hardened into a public insult because users increasingly suspect AI-generated software whenever products stumble. The wider AI industry is facing fresh criticism for grandiose rhetoric and weak accountability. At the sharpest edge, Anthropic is restricting access to the Claude Mythos Preview because the model appears powerful enough to uncover high-severity vulnerabilities at scale.

MY FORECAST: The public will grow less patient with AI theatre and more demanding about reliability, review, and release discipline. Search products will face stronger scrutiny. AI-assisted coding will carry more stigma unless teams prove quality. Frontier labs will face mounting pressure to justify why some models deserve restricted access at all. The next chapter will not hinge on who ships the cleverest demo. The next chapter will hinge on who can convince the world that cleverness has not outrun control.


