Common pitfalls when building generative AI applications

As we’re still in the early days of building applications with foundation models, it’s normal to make mistakes. This is a quick note with examples of some of the most common pitfalls that I’ve seen, both from public case studies and from my personal experience.

Because these pitfalls are common, if you’ve worked on any AI product, you’ve probably seen them before.

1. Use generative AI when you don't need generative AI

Every time there’s a new technology, I can hear the collective sigh of senior engineers everywhere: “Not everything is a nail.” Generative AI isn’t an exception — its seemingly limitless capabilities only exacerbate the tendency to use generative AI for everything.

A team pitched me the idea of using generative AI to optimize energy consumption. They fed a household’s list of energy-intensive activities and hourly electricity prices into an LLM, then asked it to create a schedule to minimize energy costs. Their experiments showed that this could help reduce a household’s electricity bill by 30%. Free money. Why wouldn’t anyone want to use their app?

I asked: “How does it compare to simply scheduling the most energy-intensive activities when electricity is cheapest? Say, doing your laundry and charging your car after 10pm?”

They said they would try it later and let me know. They never followed up, but they abandoned this app soon after. I suspect that this greedy scheduling can be quite effective. Even if it’s not, there are other much cheaper and more reliable optimization solutions than generative AI, like linear programming.
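
Here is a minimal sketch of the greedy baseline I had in mind: sort the activities by how much energy they use and assign them to the cheapest hours first. The activities and hourly prices are made up for illustration, and I assume one activity per one-hour slot.

# Greedy baseline: put the most energy-hungry activities into the cheapest hours.
# The activities and prices are made-up examples; one activity per one-hour slot.
activities = {    # activity -> energy use in kWh
    "charge_car": 30.0,
    "laundry": 2.0,
    "dishwasher": 1.5,
}
hourly_price = {  # hour of day -> price in $/kWh
    **{h: 0.40 for h in range(7, 22)},                             # daytime hours
    **{h: 0.12 for h in list(range(22, 24)) + list(range(0, 7))},  # off-peak hours
}

cheapest_hours = sorted(hourly_price, key=hourly_price.get)
schedule = {}
for activity, _ in sorted(activities.items(), key=lambda kv: -kv[1]):
    schedule[activity] = cheapest_hours.pop(0)

cost = sum(activities[a] * hourly_price[h] for a, h in schedule.items())
print(schedule, f"estimated cost: ${cost:.2f}")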

I’ve seen this scenario over and over again. A big company wants to use generative AI to detect anomalies in network traffic. Another wants to predict upcoming customer call volume. A hospital wants to detect whether a patient is malnourished (really not recommended).

It can often be beneficial to explore a new approach to get a sense of what’s possible, as long as you’re aware that your goal isn’t to solve a problem but to test a solution. “We solve the problem” and “We use generative AI” are two very different headlines, and unfortunately, so many people would rather have the latter.

2. Confuse 'bad product' with 'bad AI'

At the other end of the spectrum, many teams dismiss gen AI as a valid solution for their problems because they tried it and their users hated it, even though other teams have successfully used gen AI for similar use cases. I was able to look into two of these teams, and in both cases the issue wasn’t with the AI but with the product.

Many people have told me that the technical aspects of their AI applications are straightforward. The hard part is user experience (UX). What should the product interface look like? How to seamlessly integrate the product into the user workflow? How to incorporate human-in-the-loop?

UX has always been challenging, but it’s even more so with generative AI. While we know that generative AI is changing how we read, write, learn, teach, work, entertain, etc., we don’t quite know how yet. What will the future of reading/learning/working be like?

Here are some simple examples that show how counter-intuitive user needs can be, and why they require rigorous user studies.

  1. My friend works on an application that summarizes meeting transcripts. Initially, her team focused on getting the right summary length. Would users prefer 3-sentence summaries or 5-sentence summaries?

    However, it turned out that their users didn’t care about the actual summary. They only wanted action items specific to them from each meeting.

  2. When LinkedIn developed a chatbot for skill fit assessment, they discovered that users didn’t want correct responses. Users wanted helpful responses.

    For example, if a user asks a bot whether they’re a fit for a job and the bot responds with: “You’re a terrible fit,” this response might be correct but not very helpful to the user. Users want tips on what the gaps are and what they can do to close the gaps.

  3. Intuit built a chatbot to help users answer tax questions. Initially, they got lukewarm feedback — users didn’t find the bot useful. After investigation, they found out that users actually hated typing. Facing a blank chat window, users didn’t know what the bot could do or what to type.

    So, for each interaction, Intuit added a few suggested questions for users to click on. This reduced the friction for users to use the bot and gradually built users’ trust. The feedback from users then became much more positive.
    (Shared by Nhung Ho, VP of AI at Intuit, during our panel at Grace Hopper.)

Because everyone uses the same models nowadays, the AI components of AI products are similar, and the differentiation lies in the product.

3. Start too complex

Examples of this pitfall:

  1. Use an agentic framework when direct API calls work.
  2. Agonize over which vector database to use when a simple term-based retrieval solution that doesn’t require a vector database works (see the sketch after this list).
  3. Insist on finetuning when prompting works.
  4. Use semantic caching.
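
As an illustration of the second example above, here is a minimal sketch of a term-based retrieval baseline using BM25 via the rank_bm25 package. The documents and query are made up, and a real system would tokenize more carefully.

# Term-based retrieval with BM25: no vector database required.
# Requires: pip install rank-bm25
from rank_bm25 import BM25Okapi

docs = [
    "How to reset your password",
    "Update billing information and payment methods",
    "Troubleshooting login problems",
]
bm25 = BM25Okapi([doc.lower().split() for doc in docs])

query = "how do i reset my password".split()
scores = bm25.get_scores(query)

# Print documents ranked by BM25 score, best match first.
for doc, score in sorted(zip(docs, scores), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {doc}")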

Given so many shiny new technologies, it’s tempting to jump straight into using them. However, incorporating external tools too early can cause 2 problems:

  1. Abstract away critical details, making it hard to understand and debug your system.
  2. Introduce unnecessary bugs.

Tool developers can make mistakes. For example, I often find typos in default prompts when reviewing a framework’s codebase. If the framework you use updates its prompt without telling you, your application’s behaviors might change and you might not know why.

In the long run, abstractions are good. But abstractions need to incorporate best practices and be tested over time. Since we’re still in the early days of AI engineering and best practices are still evolving, we should be more vigilant when adopting any abstraction.

4. Over-index on early success

  1. It took LinkedIn 1 month to achieve 80% of the experience they wanted, and an additional 4 months to surpass 95%. The initial success made them grossly underestimate how challenging it is to improve the product, especially around hallucinations. They found it discouraging how difficult it was to achieve each subsequent 1% gain.

  2. A startup that develops AI sales assistants for ecommerce told me that getting from 0 to 80% took as long as from 80% to 90%. The challenges they faced:
    • Accuracy/latency tradeoff: more planning/self-correction = more nodes = higher latency
    • Tool calling: hard for agents to differentiate similar tools
    • Hard for tonal requests (e.g. "talk like a luxury brand concierge") in the system prompt to be perfectly obeyed
    • Hard for the agent to completely understand customers’ intent
    • Hard to create a specific set of unit tests because the combination of queries is basically infinite

    Thanks Jason Tjahjono for sharing this.

  3. In the paper UltraChat, Ding et al. (2023) shared that “the journey from 0 to 60 is easy, whereas progressing from 60 to 100 becomes exceedingly challenging.”

This is perhaps one of the first painful lessons anyone who builds an AI product quickly learns: it’s easy to build a demo, but hard to build a product. Beyond the issues already mentioned (hallucinations, latency, the latency/accuracy tradeoff, tool use, prompting, testing, …), teams also run into issues such as:

  1. Reliability of API providers. A team told me that 10% of their API calls timed out. A product’s behavior can also change because the underlying model changes.
  2. Compliance, e.g. around AI output copyrights, data access/sharing, user privacy, security risks from retrieval/caching systems, and ambiguity around training data lineage.
  3. Safety, e.g. bad actors abusing your product, or your product generating insensitive or offensive responses.

When planning a product’s milestones and resources, make sure to take into account these potential roadblocks. A friend calls this “being cautiously optimistic”. However, remember that many cool demos don’t lead to wonderful products.

5. Forgo human evaluation

To automatically evaluate AI applications, many teams opt for the AI-as-a-judge (also called LLM-as-a-judge) approach — using AI models to evaluate AI outputs. A common pitfall is forgoing human evaluation to rely entirely on AI judges.

While AI judges can be very useful, they aren’t deterministic. The quality of a judge depends on the underlying judge’s model, the judge’s prompt, and the use case. If the AI judge isn’t developed properly, it can give misleading evaluations about your application’s performance. AI judges must be evaluated and iterated over time, just like all other AI applications.
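
For concreteness, here is a minimal sketch of the AI-as-a-judge pattern, where one model grades another model’s output against a rubric. It uses the OpenAI Python client; the rubric, scale, and model choice are illustrative, not recommendations.

# Minimal AI-as-a-judge sketch. The rubric, 1-5 scale, and model choice
# are illustrative only; a real judge needs its own evaluation and iteration.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an answer to a user question.
Score it from 1 (useless) to 5 (excellent) for helpfulness and accuracy.
Reply with only the integer score."""

def judge(question: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Question: {question}\n\nAnswer: {answer}"},
        ],
    )
    return int(response.choices[0].message.content.strip())

print(judge("How do I reset my password?", "Click 'Forgot password' on the login page."))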

The teams with the best products I’ve seen all have human evaluation to supplement their automated evaluation. Every day, they have human experts evaluate a subset of their application’s outputs, which can be anywhere from 30 to 1,000 examples.

Daily manual evaluation serves 3 purposes:

  1. Correlate human judgments with AI judgments (a quick sanity check follows this list). If the score by human evaluators is decreasing but the score by AI judges is increasing, you might want to investigate your AI judges.
  2. Gain a better understanding of how users use your application, which can give you ideas to improve your application.
  3. Detect patterns and changes in users’ behaviors that automated data exploration might miss, using your knowledge of current events.
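
For the first point, a quick way to sanity-check an AI judge is to compute a rank correlation between its scores and human scores on the same sample of outputs. A minimal sketch, assuming you already have paired scores and scipy installed:

# Compare AI-judge scores with human scores on the same outputs using
# Spearman rank correlation. The paired scores below are hypothetical.
from scipy.stats import spearmanr

human_scores = [5, 4, 4, 2, 3, 5, 1, 4]   # 1-5 scale from human experts
judge_scores = [4, 4, 5, 2, 2, 5, 2, 3]   # 1-5 scale from the AI judge

corr, p_value = spearmanr(human_scores, judge_scores)
print(f"Spearman correlation: {corr:.2f} (p={p_value:.3f})")

# A low or falling correlation is a signal to revisit the judge's prompt
# or underlying model rather than trusting its trend.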

The reliability of human evaluation also depends on well-crafted annotation guidelines. These guidelines can help improve the model’s instructions (if a human has a hard time following the instructions, the model will, too), and they can be reused to create finetuning data later if you choose to finetune.

In every project I’ve worked on, staring at data for just 15 minutes usually gives me some insight that could save me hours of headaches. Greg Brockman tweeted: “Manual inspection of data has probably the highest value-to-prestige ratio of any activity in machine learning.”

6. Crowdsource use cases

This is a mistake I saw in the early days when enterprises were in a frenzy to adopt generative AI. Unable to come up with a strategy for which use cases to focus on, many tech executives crowdsourced ideas from the whole company. “We hire smart people. Let them tell us what to do.” They then tried to implement these ideas one by one.

And that’s how we ended up with a million text-to-SQL models, a million Slack bots, and a billion code plugins.

While it’s indeed a good idea to listen to the smart people that you hire, individuals might be biased toward the problems that immediately affect their day-to-day work instead of problems that might bring the highest returns on investment. Without an overall strategy that considers the big picture, it’s easy to get sidetracked into a series of small, low-impact applications and come to the wrong conclusion that gen AI has no ROI.

Summary

In short, here are the common AI engineering pitfalls:

  1. Use generative AI when you don’t need generative AI
    Gen AI isn’t a one-size-fits-all solution to all problems. Many problems don’t even need AI.

  2. Confuse ‘bad product’ with ‘bad AI’
    For many AI products, AI is the easy part; the product is the hard part.

  3. Start too complex
    While fancy new frameworks and finetuning can be useful for many projects, they shouldn’t be your first course of action.

  4. Over-index on early success
    Initial success can be misleading. Going from demo-ready to production-ready can take much longer than getting to the first demo.

  5. Forgo human evaluation
    AI judges should be validated and correlated with systematic human evaluation.

  6. Crowdsource use cases
    Have a big-picture strategy to maximize return on investment.


Busy-ness and leverage

When I made breakfast this morning, I didn’t begin by making the blender. Someone else, a team with more skills, resources and scale, built the blender. I simply bought it.

That seems obvious–no one expects a from-scratch baker to make their own baking powder.

And yet, our projects are rarely fine-tuned around leverage.

Begin with this question: “What are you hiring yourself to do?”

Are you making that choice because your labor is cheap and convenient, or because it’s the place of maximum leverage? It’s often easier to be busy than it is to be productive.

Busy is a morally superior distraction. Busy gets us off the hook. Busy is a great place to hide.

On the other hand, productive can be scary. When you’re buying someone else’s skill and time, you’re making a different sort of commitment.

Your job might not be to do your job. Your job might be to make the decisions and commitments needed to lead other people who do your (former) job.

The calculation is simple: If the commercial project is worth doing, what’s the most direct, cheapest and fastest way to get it done well?

There’s nothing wrong with hiring yourself to do things you enjoy. And it’s imperative that when you embrace leverage to get projects done, you produce work you’re proud of–shipping junk, at scale, is not the point.

But my guess is that most of us settle for a pattern of leverage that we’re used to, a pace that we’ve become accustomed to, a day filled with tasks we think we’re good at. I’ve talked to people all over the world–entrepreneurs, freelancers, employees and bosses–and most of them are sure that they’re leveraging just the right amount. Even though it’s different for everyone…

The make or buy choice is one we face all day, every day, and rarely consider.

If you’re serious about the project, it’s time to give yourself a promotion, and to hire yourself to do work that’s yours and yours alone to contribute. It’s almost certain that there’s someone cheaper, faster and yes, better at the other work than you are.

On our best days, what we actually make is decisions.

You might need to invest some time and energy to get the skills you need to find this leverage. To be smart about the tools you use and the people you hire. That’s an investment worth making.

Find the resources you need, and figure out how to work with them. Then hire someone else to make a blender.


What I did in 2024

It's time for my annual self review. In last year's review I said I wanted to improve my site:

  1. fix broken links
  2. organize with tags
  3. improve search
  4. post to my site instead of to social media
  5. move project tracking to my own site

I didn't have any specific goals for writing articles or topics to learn. So what did I do? The biggest thing is that I'm blogging more than in recent years:

Number of blog posts per year (plot from Observable Plot)

Site management

Changes to Twitter and Reddit in recent years have made me think about how I share knowledge. When I share something, I want it to be readable by everyone, forever. I don't want it to be readable only to "members" or "subscribers", like Quora or Medium. I had posted to some of these sites because they were open. But they're sometimes closed now, requiring a login to view what I posted.

My web site has been up for 30 years. The Lindy Effect suggests that what I post to my own site will last longer than what I post to Google+, FriendFeed, MySpace, Reddit, or Twitter. I don't expect Mastodon, Threads, or Bluesky to be up forever either. The article Don't Build Your Castle in Other People's Kingdoms recommends I focus on my own site. But while my own site is easy to post to, my blog hosted by Blogger is not.

I want to make blogging easier for me. I looked at my options for blogging software, and concluded that my web site already supports many of the things I need for a blog. So I decided to write my own blogging software. How hard could it be? Famous last words, right? It's foolish in the same way as "write a game, not a game engine".

But it actually went pretty well! I only had to support the features needed for my own blog, not for everyone's blogs. I didn't need it to scale. I could reuse the existing features I have built for my web site. There are still some features I want to add, but I think I got 80% of what I wanted in <200 lines of Python.
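
To give a sense of how little a minimal blog generator needs, here is a purely illustrative sketch (not the code I actually use): it turns a folder of plain-text posts into HTML pages plus an index.

# Purely illustrative: a tiny static blog generator, not my actual code.
# Each posts/*.txt file is one post; its first line is the title.
from pathlib import Path
import html

POSTS = Path("posts")
OUT = Path("blog")
OUT.mkdir(exist_ok=True)

index_links = []
for post in sorted(POSTS.glob("*.txt"), reverse=True):
    title, _, body = post.read_text().partition("\n")
    page = OUT / (post.stem + ".html")
    paragraphs = "".join(
        f"<p>{html.escape(p)}</p>\n" for p in body.split("\n\n") if p.strip()
    )
    page.write_text(f"<h1>{html.escape(title)}</h1>\n{paragraphs}")
    index_links.append(f'<li><a href="{page.name}">{html.escape(title)}</a></li>')

(OUT / "index.html").write_text("<ul>\n" + "\n".join(index_links) + "\n</ul>\n")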

I made it easier to post to my blog, and I posted a lot more this year than in the previous few years. I'm happy about this.

New pages

I sometimes pair a "theory" page with an "implementation" page. The A* theory page describes the algorithms and the A* implementation page describes how to implement them. The Hexagons theory page describes the math and algorithms and the Hexagons implementation page describes how to implement them.

Last year, I studied mouse+touch drag events in the browser and then wrote up a theory page with my recommendations for how to handle the browser events. I claimed that the way I structured the code led to a lot of flexibility in how to handle UI state. This year I made an implementation page with lots of runnable examples showing that flexibility. I show basic dragging, constraints, snapping, svg vs div vs canvas, handles, scrubbable numbers, drawing strokes, painting areas, sharing state, resizing, and Vue components. I show the code for each example, and also link to a runnable CodePen and JSFiddle.

Concepts and implementation pages

I'm very happy with that page, and I wrote a blog post about it.

I also wanted to write a reference page about Bresenham's Line Drawing Algorithm. This page failed. I had started in 2023 with an interactive page that lets you run different implementations of the algorithm, to see how they don't match up. But I realized this year that my motivation for writing that page was anger, not curiosity. My goal was to show that all the implementations were a mess.

But anger isn’t a good motivator for me. I don’t end up with a good result.

I put the project on hold to let my anger dissipate. Then I started over, wanting to learn it out of curiosity. I re-read the original paper. I read lots of implementations. I took out my interactive visualizations of brokenness. I changed my focus to the properties I might want in a line drawing algorithm.

But I lost motivation again. I asked myself: why am I doing this? and I didn’t have a good answer. There are so many things I want to explore, and this topic doesn’t feel like it’s that interesting in the grand scheme of things. So I put it on hold again.

Updates to pages

I treat my main site like a personal wiki. I publish new pages and also improve old pages. I treat my blog differently. I post new pages, but almost never update the existing posts. This year on the main site I made many small updates:

  • Wrote up what I currently understand about "flow field" pathfinding
  • Rewrote parts of a page about differential heuristics, but still quite unhappy and thinking about more rewrites
  • Simplified the implementation of animations in the hexagon guide, when switching from pointy-top to flat-top and back
  • Added more animation modes to my animated mapgen4. This is a fun page you can just stare at for a while.
  • Fixed a long-standing bug in A* diagrams - a reader alerted me to mouse positions not quite lining up with tiles, and I discovered that functions like getBoundingClientRect() include the border and padding of an element.
  • Added a demo of combining distance fields to my page about multiple start points for pathfinding.
  • Updated my two tutorials on how to make interactive tutorials (1 and 2) to be more consistent, point to each other, and say why you might want one or the other.
  • Updated my "hello world" opengl+emscripten code with font rendering and other fixes
  • Continued working on version 3 of my dual-mesh library. I don't plan to make it a standalone project on GitHub until I have used it in a new project, but you can browse the copy of the library inside mapgen4.
  • Made my hexagon guide printable and also savable for offline use using the browser's "Save As" feature.
  • Improved typography across my site, including some features that Safari and Firefox support but Chrome still doesn't.
  • Reduced my use of CDNs after the polyfill.io supply chain attack. I continue to use CDNs for example code that I expect readers to copy/paste.
  • Switched from yarn to pnpm. I liked yarn 1 but never followed it to yarn 2 or yarn 3, and decided it was time to move away from it.
  • Made my pages internally linkable, so you can link to a specific section instead of the whole page.
  • Used Ruffle's Flash emulator to restore some of the Flash diagrams and demos on my site. When I tried it a few years ago, it couldn't handle most of my swf files, but now it does, hooray!

I didn't remember all of these. I looked through my blog, my notes, and version control history. Here's the git command to go through all my project folders and print out commits from 2024:

for git in $(find . -name .git)
do 
    dir=$(dirname "$git")
    cd "$dir"
    echo ___ "$dir"
    git --no-pager log --since=2024-01-01 --pretty=format:"%as %s%d%n"
    cd - >/dev/null
done

Learning

Curved and stretched map labels

I decided that I should be focusing more on learning new things for myself, instead of learning things to write a tutorial. The main theme this year was maps:

  • I made a list of topics related to labels on maps. These were all potential projects.
  • I ended up spending a lot of time on basic font rendering. What a rabbit hole! Most of the blog posts in 2024 are about font rendering.
  • I did some small projects using square, triangle, hexagon tiles.
  • I experimented with generating map features and integrating them into an existing map. For example, instead of generating a map and detecting peninsulas, I might want to say "there will be a peninsula here" so that I can guarantee that one exists, and what size it is.
  • I tried my hand at gradient descent for solving the parameter dragging problem. In my interactive diagrams, I might have some internal state s that maps into a draggable "handle" on the diagram. We can represent this as a function pos(s₁) returning position p₁. When the handle is moved to a new location p₂, I want to figure out what state s₂ will have pos(s₂) closest to p₂. Gradient descent seems like a reasonable approach to this problem (a rough sketch follows this list). However, trying to learn it made me realize it's more complicated than it seems, and my math skills are weak.
  • I wanted to create a starter project for rot.js with Kenney tiles. I was hoping to use this for something, but then never did.
  • While learning about font rendering, I also got to learn about graphics, antialiasing, sRGB vs linear RGB, gamma correction, WebGL2. This was a rabbit hole in a rabbit hole in a rabbit hole in a rabbit hole…
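
Here is a rough sketch of the gradient descent idea for the parameter-dragging item above. It uses a numerical gradient, and pos() is a made-up example that maps a scalar state (an angle) to a point on a circle:

# Rough sketch: find the state s whose handle position pos(s) is closest to
# the dragged position p2, by gradient descent on the squared distance.
# pos() is a made-up example: the state is an angle, the handle is on a circle.
import math

def pos(s):
    return (100 * math.cos(s), 100 * math.sin(s))

def loss(s, target):
    x, y = pos(s)
    return (x - target[0]) ** 2 + (y - target[1]) ** 2

def solve(s, target, lr=1e-5, eps=1e-4, steps=2000):
    for _ in range(steps):
        # Numerical (central difference) gradient of the loss with respect to s.
        grad = (loss(s + eps, target) - loss(s - eps, target)) / (2 * eps)
        s -= lr * grad
    return s

p2 = (0.0, 100.0)        # drag the handle toward the top of the circle
s2 = solve(0.5, p2)
print(s2, pos(s2))       # s2 should end up near pi/2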

But secondarily, I got interested in programming language implementation.

At the beginning of the year I was following my one-week timeboxing strategy. I've found it's good to prevent me from falling into rabbit holes. But my non-work life took priority, and I ended up relaxing my one-week limits for the rest of the year. I also fell into lots of rabbit holes. I am planning to resume timeboxing next year.

Next year

I want to continue learning lots of new things for myself instead of learning them for writing tutorials. The main theme for 2025 will probably be text:

  • name generators
  • large language models
  • programming languages
  • procedurally generating code

I also want to continue working on maps. It has been six years since I finished mapgen4, and I am starting to collect ideas for new map projects. I won't do all of these, but I have lots to choose from:

  • towns, nations, cultures, factions, languages
  • roads, trading routes
  • farms, oil, gold, ore
  • valleys, mountain ranges, lakes, peninsulas, plateaus
  • rivers, coral reefs, caves, chasms, fjords, lagoons
  • forests, trees, snow, waterfalls, swamps, marshes
  • soil and rock types
  • groundwater
  • atmospheric circulation
  • ocean currents
  • tectonic plates
  • animal and plant types
  • named areas
  • icons, stylized drawing
  • update the graphics code in mapgen4

I don't plan to make a full map generator (but who knows!). Instead, I want to learn techniques and write quick&dirty prototype code. I also plan to continue enhancing my web site structure and build process, including navigation, link checking, project management, bookmarks, more blog features, and maybe sidenotes. Although text and maps are the main themes, I have many more project ideas that I might work on. Happy 2025 everyone!


The thought that counts

Well, maybe not.

In 2024, worldwide gift card sales will pass a trillion dollars for the first time.

It’s a good grift.

Surveys show that the buyer spends about 21% less per gift than they do when they actually buy something, while the recipients of the gift find themselves spending 61% more than the value of the card when they actually redeem it for money. Most of all, the retailer comes out ahead–far fewer returns, lots of never redeemed cards, better cash flow and new customer accounts when people do show up to eventually buy.

In the current system, the recipient loses. They get a smaller gift, they often spend more money than the gift was for, they’re stuck with the store the giver chose (which is the only thing they actually chose) and there’s very little in the way of thoughtfulness or connection involved.

In essence, holidays become a circle of people, handing the same wad of cash around, except instead of ending up with the cash, they then spend even more money when they go shopping tomorrow.

Every cultural occasion and holiday has been commercialized by retailers in search of more. And the insatiable desire to consume is contagious, and gift giving is inherently viral, since you need to have someone to give the gift to. As a result, we’ve built a system that’s expensive and not particularly good at what it sets out to do.

Given the size and profitability of the cards, I’m surprised that they’re not a much better experience.

What might a better process look like?

  • Go to the online store, find an item you think a friend would like. Instead of ordering it, choose GIFT CARD.
  • The store asks you if you’d like to purchase a charitable donation add on as well.
  • Now, the site produces a unique digital gift card, with a picture of the item and a link to redeem it. The QR code it generates also includes a thank you from the charity.
  • Your friend simply has to scan the lovely page you printed out (or emailed them) to go to the redeem page. Once there, they can choose to get the item you carefully picked out, choose something else or easily get cash back.
  • And so, they get delighted three times: When they get the thoughtful card. When they go to the site and discover they can get the cash back. And when the item arrives in the post and they unwrap it.

Now the thought really does count. This is a low hassle, high delight way to show someone you were thinking of them. If stores used their persuasive powers, it could also raise billions for worthy causes along the way.

Either that, or you could give cash and save everyone a lot of trouble.


o1 tops new aider polyglot leaderboard


OpenAI’s new o1 model with “high” reasoning effort gets the top score on the new aider polyglot leaderboard, significantly ahead of other top LLMs. The new polyglot benchmark was designed to be much more challenging than aider’s old code editing benchmark. This more clearly distinguishes the performance of today’s strongest coding models and leaves headroom for future LLMs.

The polyglot benchmark

Like aider’s original code editing benchmark, the new polyglot benchmark is based on Exercism coding exercises.

The new polyglot benchmark:

  • Contains coding problems in C++, Go, Java, JavaScript, Python and Rust. The old benchmark was solely based on Python exercises.
  • Focuses on the most difficult 225 exercises out of the 697 that Exercism provides for those languages. The old benchmark simply included all 133 Python exercises, regardless of difficulty.

Motivation and goals

Aider’s original code editing benchmark was saturating as the top scores approached and then surpassed 80%. Sonnet’s score of 84.2% was based on solving 112 of the 133 exercises, leaving only 21 unsolved exercises. New champions were advancing the top score by solving just 1-2 more problems than the previous record. This made it hard to clearly measure the difference in code editing skill between these top models.

Part of the problem is that many of the original 133 Python problems are very easy and provide little challenge to today’s frontier LLMs. Models as old as GPT 3.5 Turbo were able to solve half of the 133 problems. Such easy problems simply inflate the benchmark scores of modern LLMs without providing any data about which models are better or worse.

The main goal for a new benchmark was to re-calibrate the scale so that today’s top coding LLMs would occupy a wide range of scores between about 5% and 50%. This should leave headroom for future LLMs and make it possible to more clearly compare the relative performance of top models.

Designing the polyglot benchmark

The new benchmark:

  • Tests LLMs with more coding languages, to increase diversity and source a larger pool of problems.
  • Includes just the most challenging coding problems and excludes easy problems that are solvable by most of today’s top coding LLMs.
  • Includes more total coding problems, to enable more granularity of comparison.

The new benchmark is based on Exercism coding problems from 6 of the most popular programming languages:

  • C++
  • Go
  • Java
  • JavaScript
  • Python
  • Rust

Exercism provides a total of 697 coding problems in those 6 languages. A set of 7 of today’s top coding models each attempted all 697 of the Exercism problems:

  • Sonnet
  • Haiku
  • o1 Mini
  • DeepSeek
  • GPT-4o
  • Qwen 32B Coder Instruct
  • GPT-4o Mini

Depending on the difficulty of the problems, a different number of solutions were found by the collection of 7 models:

Solutions found   Number of problems   Cumulative number of problems
0                 66                   66
1                 61                   127
2                 50                   177
3                 48                   225
4                 53                   278
5                 71                   349
6                 90                   439
7                 258                  697

In the table above, you can see that 258 of the problems were solved by all 7 LLMs. These problems are far too easy, and wouldn’t be good choices for the new benchmark. Instead, we need hard problems like the 66 that none of the 7 models were able to solve.

The new benchmark uses the 225 problems that were solved by 3 or fewer models. This achieves a balance between hard and moderate problems, and provides a large but not excessive total pool of problems. It also represents a good diversity of coding languages:

Language     Problems
C++          26
Go           39
Java         47
JavaScript   49
Python       34
Rust         30
Total        225
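
The selection rule described above is easy to express in code. Here is a sketch, where solved_by is a hypothetical mapping from problem id to the set of models that solved it (the real data lives in the benchmark repo):

# Keep the Exercism problems that 3 or fewer of the 7 reference models solved.
# solved_by is a hypothetical mapping; the real data is in the benchmark repo.
solved_by = {
    "python/anagram": {"Sonnet", "Haiku", "o1 Mini", "DeepSeek", "GPT-4o",
                       "Qwen 32B Coder Instruct", "GPT-4o Mini"},
    "rust/forth": set(),
    "java/zipper": {"Sonnet", "o1 Mini"},
    # ... 697 problems in total
}

benchmark_problems = [p for p, models in solved_by.items() if len(models) <= 3]
print(f"{len(benchmark_problems)} problems selected for the benchmark")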

o1

OpenAI’s new o1 model established a very strong top score of 62% on the new benchmark. This still leaves 86 problems of headroom for future models to solve. Given the incredible pace of recent advancements, it will be interesting to see how long it will take for this new benchmark to saturate.

Benchmark problems

The 225 coding problems are available in the aider polyglot benchmark repo on GitHub.

Results

Model                        Percent completed correctly   Percent using correct edit format   Command                                                                 Edit format
o1-2024-12-17                61.7%                         91.5%                               aider --model openrouter/openai/o1                                      diff
claude-3-5-sonnet-20241022   45.3%                         100.0%                              aider --model claude-3-5-sonnet-20241022                                diff
claude-3-5-haiku-20241022    28.0%                         91.1%                               aider --model claude-3-5-haiku-20241022                                 diff
deepseek-chat                17.8%                         92.9%                               aider --model deepseek/deepseek-chat                                    diff
gpt-4o-2024-11-20            15.1%                         96.0%                               aider --model gpt-4o-2024-11-20                                         diff
Qwen2.5-Coder-32B-Instruct   8.0%                          71.6%                               aider --model openai/Qwen/Qwen2.5-Coder-32B-Instruct  # via hyperbolic  diff
gpt-4o-mini-2024-07-18       3.6%                          100.0%                              aider --model gpt-4o-mini-2024-07-18                                    whole

December in LLMs has been a lot

I had big plans for December: for one thing, I was hoping to get to an actual RC of Datasette 1.0, in preparation for a full release in January. Instead, I've found myself distracted by a constant barrage of new LLM releases.

On December 4th Amazon introduced the Amazon Nova family of multi-modal models - clearly priced to compete with the excellent and inexpensive Gemini 1.5 series from Google. I got those working with LLM via a new llm-bedrock plugin.

The next big release was Llama 3.3 70B-Instruct, on December 6th. Meta claimed that this 70B model was comparable in quality to their much larger 405B model, and those claims seem to hold up.

I wrote about how I can now run a GPT-4 class model on my laptop - the same laptop that was running a GPT-3 class model just 20 months ago.

Llama 3.3 70B has started showing up from API providers now, including super-fast hosted versions from both Groq (276 tokens/second) and Cerebras (a quite frankly absurd 2,200 tokens/second). If you haven't tried Val Town's Cerebras Coder demo you really should.

I think the huge gains in model efficiency are one of the defining stories of LLMs in 2024. It's not just the local models that have benefited: the price of proprietary hosted LLMs has dropped through the floor, a result of both competition between vendors and the increasing efficiency of the models themselves.

Last year the running joke was that every time Google put out a new Gemini release OpenAI would ship something more impressive that same day to undermine them.

The tides have turned! This month Google shipped three updates that took the wind out of OpenAI's sails.

The first was Gemini 2.0 Flash on the 11th of December, the first release in Google's Gemini 2.0 series. The streaming support was particularly impressive, with https://aistudio.google.com/live demonstrating streaming audio and webcam communication with the multi-modal LLM a full day before OpenAI released their own streaming camera/audio features in an update to ChatGPT.

Then this morning Google shipped Gemini 2.0 Flash "Thinking mode", their version of the inference scaling technique pioneered by OpenAI's o1. I did not expect Gemini to ship a version of that before 2024 had even ended.

OpenAI have one day left in their 12 Days of OpenAI event. Previous highlights have included the full o1 model (an upgrade from o1-preview) and o1-pro, Sora (later upstaged a week later by Google's Veo 2), Canvas (with a confusing second way to run Python), Advanced Voice with video streaming and Santa and a very cool new WebRTC streaming API, ChatGPT Projects (pretty much a direct lift of the similar Claude feature) and the 1-800-CHATGPT phone line.

Tomorrow is the last day. I'm not going to try to predict what they'll launch, but I imagine it will be something notable to close out the year.

Blog entries

Releases

TILs

Tags: google, ai, weeknotes, openai, generative-ai, chatgpt, llms, gemini, o1
