A new large language model (LLM) has apparently taken the performance crown from OpenAI’s GPT-4o about a month after its release: the new Claude 3.5 Sonnet chatbot and LLM from rival AI firm Anthropic, released today, bests all others in the world on key third-party benchmark tests, according to the company. And it does so while being faster and cheaper than prior Claude 3 models.
But it’s one thing to drop a new model and claim dominance, and quite another for users to actually experience and leverage the performance gains (Google Gemini family, I’m looking at you: supposedly better than OpenAI’s prior flagship GPT-4 on some metrics, but who is really using you?).
Anthropic’s latest release of Claude 3.5 Sonnet doesn’t seem to have this problem. Many AI influencers and power users have taken to the web in the few hours since its release to share their largely positive impressions about Anthropic’s new model, and show off what the new, “most intelligent” LLM in the world is able to accomplish.
Advancing coding skills and product creation
As enterprise AI influencer and expert Allie K. Miller wrote on X, Claude 3.5 Sonnet was able to create an entire playable game for her based on just a screenshot, in less than half a minute:
This is wild.
In just 25 seconds, Claude 3.5 Sonnet coded a fully functional Mancala web app for me
I only provided ONE screenshot of the game’s instructions.
It did the rest:
– Coded the entire game
– Previewed it so I could test
– Provided rules of play
pic.twitter.com/WLweZUGt5C
Allie K. Miller (@alliekmiller), June 20, 2024
Similarly, the informative and timely X account @TestingCatalog News showed how the newly launched “Artifacts” playground, which debuted alongside Claude 3.5 Sonnet (quite literally, since it shows a view of interactive outputs beside the chatbot interface), can execute the code for a real, working web form that Claude 3.5 Sonnet built.
Claude 3.5 just generated React jsx code with a simple contact form and managed to run it in the Artifacts playground
pic.twitter.com/KREZaArObw
TestingCatalog News (@testingcatalog), June 20, 2024
It was even able to recreate imagery from the seminal 1995 movie Hackers:
Claude 3.5 Sonnet is the first model to recreate the 3D scene “Data flow” from the movie Hackers on the first try. Great job, Anthropic!
https://t.co/zXJh6CNNxY pic.twitter.com/sXJDF9XLmS
Denis Shiryaev (@literallydenis), June 20, 2024
Pietro Schirano, founder of enterprise AI image generation startup EverArt, wrote on X that combining Claude 3.5 Sonnet with another tool, Maestro, showed “sparks of AGI?”
Claude 3.5 Sonnet + Maestro = Sparks of AGI?
I asked to make a Mario clone using just geometric shapes, and the wildest part is that it gave the character animations as well, and the shapes seem like novel concepts.
It took 3 minutes. Look at the game!
pic.twitter.com/YVQYp7m5Ed
Pietro Schirano (@skirano), June 20, 2024
Anthropic staffers go to bat for Claude 3.5 Sonnet
Though obviously biased, Anthropic developer relations team leader Alex Albert posted a thread on X highlighting how Claude 3.5 Sonnet is “starting to get really good at coding and autonomously fixing pull requests” and even went so far as to state: “It’s becoming clear that in a year’s time, a large percentage of code will be written by LLMs.”
Claude is starting to get really good at coding and autonomously fixing pull requests. It’s becoming clear that in a year’s time, a large percentage of code will be written by LLMs.
Let me show you what I mean:
Similarly, Anthropic technical staffer Maggie Vo posted on X that Claude 3.5 Sonnet can now do “half my job…and I couldn’t be happier.”
half my job is doable by 3.5 Sonnet now and I couldn’t be happier
https://t.co/pqlN7P8qbC
Maggie Vo (@YoMaggieVo), June 20, 2024
Putting pressure on OpenAI
Others observed that now that Claude 3.5 Sonnet has eclipsed GPT-4o from OpenAI and is available at similar pricing, the latter company is under renewed pressure to continue making the case for its models as the right choice.
University of Pennsylvania Wharton School professor and AI booster Ethan Mollick compared the Artifacts feature to a “simpler version of Code Interpreter” from OpenAI’s GPT-4.
Been using the new Claude 3.5 model as a tester and now that it is out, I can say it is very very impressive, and the “artifacts” that it generates are like a simpler version of Code Interpreter
This is a real-time video of me creating a playable game and editing it with Claude
pic.twitter.com/bWqw8F8CdH
Ethan Mollick (@emollick), June 20, 2024
X user @kimmonismus went even further, saying OpenAI will “sleep through AGI” or artificial general intelligence, the company’s stated goal of an AI model that outperforms humans in most economically valuable work. They blasted the company for announcing additional features with GPT-4o that have yet to ship, including new voice modalities.
Hey, @OpenAI. You sleep through AGI. While you make promises all the time (“Patience Jimmy, it will be worth the wait”) and announce without delivering (“GPT-4o-Voice within weeks”) the competition manages to deliver without making big announcements beforehand! Take a leaf out of…
https://t.co/o6ROsZwDRG
Chubby♨️ (@kimmonismus), June 20, 2024
Still not human level
Despite the lofty praise around X, others noted that Claude 3.5 Sonnet still struggled with some of the seemingly basic cognitive tasks that humans can perform with relative ease, such as playing tic-tac-toe.
Frontier models like GPT-4o (and now Claude 3.5 Sonnet) may be at the level of a “Smart High Schooler” in some respects, but they still struggle on basic tasks like tic-tac-toe. There was hope that native multimodal training would help but that hasn’t been the case.
pic.twitter.com/1iDq0DCL4Q
Noam Brown (@polynoamial), June 20, 2024
Similarly, tech journalist Timothy B. Lee, known by his handle @binarybits on X, noted that it “still makes goofy errors sometimes,” posting a screenshot in which he asked it a simple math word problem: “Which is worth more: 100 pennies or three quarters?” The model initially answered three quarters, even though 100 pennies ($1.00) are worth more than three quarters (75 cents).
Source: venturebeat.com