
Understanding Reasoning LLMs


This article describes the four main approaches to building reasoning models, or how we can enhance LLMs with reasoning capabilities. I hope this provides valuable insights and helps you navigate the rapidly evolving literature and hype surrounding this topic.

In 2024, the LLM field saw increasing specialization. Beyond pre-training and fine-tuning, we witnessed the rise of specialized applications, from RAGs to code assistants. I expect this trend to accelerate in 2025, with an even greater emphasis on domain- and application-specific optimizations (i.e., "specializations").

Stages 1-3 are the common steps to developing LLMs. Stage 4 specializes LLMs for specific use cases.

The development of reasoning models is one of these specializations. This means we refine LLMs to excel at complex tasks that are best solved with intermediate steps, such as puzzles, advanced math, and coding challenges. However, this specialization does not replace other LLM applications, because transforming an LLM into a reasoning model also introduces certain drawbacks, which I will discuss later.

To give you a brief glimpse of what's covered below, in this article, I will:

  1. Explain the meaning of "reasoning model"

  2. Discuss the advantages and disadvantages of reasoning models

  3. Outline the methodology behind DeepSeek R1

  4. Describe the four main approaches to building and improving reasoning models

  5. Share thoughts on the LLM landscape following the DeepSeek V3 and R1 releases

  6. Provide tips for developing reasoning models on a tight budget

I hope you find this article useful as AI continues its rapid development this year!

How do we define "reasoning model"?

If you work in AI (or machine learning in general), you are probably familiar with vague and hotly debated definitions. The term "reasoning models" is no exception. Eventually, someone will define it formally in a paper, only for it to be redefined in the next, and so on.

In this article, I define "reasoning" as the process of answering questions that require complex, multi-step generation with intermediate steps. For example, factual question-answering like "What is the capital of France?" does not involve reasoning. In contrast, a question like "If a train is moving at 60 mph and travels for 3 hours, how far does it go?" requires some simple reasoning: recognizing the relationship between distance, speed, and time before arriving at the answer.
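
For completeness, the single intermediate step that question calls for is just the distance formula:

    \text{distance} = \text{speed} \times \text{time} = 60\,\text{mph} \times 3\,\text{h} = 180\,\text{miles}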

A regular LLM may only provide a short answer (as shown on the left), whereas reasoning models typically include intermediate steps that reveal part of the thought process. (Note that many LLMs that have not been specifically developed for reasoning tasks can also provide intermediate reasoning steps in their answers.)

Most modern LLMs are capable of basic reasoning and can answer questions like, "If a train is moving at 60 mph and travels for 3 hours, how far does it go?" So, today, when we refer to reasoning models, we typically mean LLMs that excel at more complex reasoning tasks, such as solving puzzles, riddles, and mathematical proofs.

Additionally, most LLMs branded as reasoning models today include a "thought" or "thinking" process as part of their response. Whether and how an LLM actually "thinks" is a separate discussion.

Intermediate steps in reasoning models can appear in two ways. First, they may be explicitly included in the response, as shown in the previous figure. Second, some reasoning LLMs, such as OpenAI's o1, run multiple iterations with intermediate steps that are not shown to the user.

"Reasoning" is used at two different levels: 1) processing the input and generating via multiple intermediate steps and 2) providing some sort of reasoning as part of the response to the user.

When should we use reasoning models?

Now that we have defined reasoning models, we can move on to the more interesting part: how to build and improve LLMs for reasoning tasks. However, before diving into the technical details, it is important to consider when reasoning models are actually needed.

When do we need a reasoning model? Reasoning models are designed to be good at complex tasks such as solving puzzles, advanced math problems, and challenging coding tasks. However, they are not necessary for simpler tasks like summarization, translation, or knowledge-based question answering. In fact, using reasoning models for everything can be inefficient and expensive. For instance, reasoning models are typically more expensive to use, more verbose, and sometimes more prone to errors due to "overthinking." As always, the simple rule applies: use the right tool (or type of LLM) for the task.

The key strengths and limitations of reasoning models are summarized in the figure below.

The key strengths and weaknesses of reasoning models.

A brief look at the DeepSeek training pipeline

Before discussing four main approaches to building and improving reasoning models in the next section, I want to briefly outline the DeepSeek R1 pipeline, as described in the DeepSeek R1 technical report. This report serves as both an interesting case study and a blueprint for developing reasoning LLMs.

Note that DeepSeek did not release a single R1 reasoning model but instead introduced three distinct variants: DeepSeek-R1-Zero, DeepSeek-R1, and DeepSeek-R1-Distill.

Based on the descriptions in the technical report, I have summarized the development process of these models in the diagram below.

Development process of DeepSeek's three different reasoning models that are discussed in the DeepSeek R1 technical report.

Next, let's briefly go over the process shown in the diagram above. More details will be covered in the next section, where we discuss the four main approaches to building and improving reasoning models.

(1) DeepSeek-R1-Zero: This model is based on the 671B pre-trained DeepSeek-V3 base model released in December 2024. The research team trained it using reinforcement learning (RL) with two types of rewards. This approach is referred to as "cold start" training because it did not include a supervised fine-tuning (SFT) step, which is typically part of reinforcement learning with human feedback (RLHF).

(2) DeepSeek-R1: This is DeepSeek's flagship reasoning model, built upon DeepSeek-R1-Zero. The team further refined it with additional SFT stages and further RL training, improving upon the "cold-started" R1-Zero model.

(3) DeepSeek-R1-Distill: Using the SFT data generated in the previous steps, the DeepSeek team fine-tuned Qwen and Llama models to enhance their reasoning abilities. While not distillation in the traditional sense, this process involved training smaller models (Llama 8B and 70B, and Qwen 1.5B–32B) on outputs from the larger DeepSeek-R1 671B model.

The 4 main ways to build and improve reasoning models

In this section, I will outline the key techniques currently used to enhance the reasoning capabilities of LLMs and to build specialized reasoning models such as DeepSeek-R1, OpenAI's o1 & o3, and others.

Note: The exact workings of o1 and o3 remain unknown outside of OpenAI. However, they are rumored to leverage a combination of both inference and training techniques.

1) Inference-time scaling

One way to improve an LLM's reasoning capabilities (or any capability in general) is inference-time scaling. This term can have multiple meanings, but in this context, it refers to increasing computational resources during inference to improve output quality.

A rough analogy is how humans tend to generate better responses when given more time to think through complex problems. Similarly, we can apply techniques that encourage the LLM to "think" more while generating an answer. (Although, whether LLMs actually "think" is a different discussion.)

One straightforward approach to inference-time scaling is clever prompt engineering. A classic example is chain-of-thought (CoT) prompting, where phrases like "think step by step" are included in the input prompt. This encourages the model to generate intermediate reasoning steps rather than jumping directly to the final answer, which can often (but not always) lead to more accurate results on more complex problems. (Note that it doesn't make sense to employ this strategy for simpler knowledge-based questions, like "What is the capital of France?", which, again, is a good rule of thumb for deciding whether a reasoning model makes sense for a given input query.)
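
As a minimal illustration (a sketch, not the exact prompt template from the paper referenced below), zero-shot CoT prompting amounts to appending the trigger phrase to the question before sending it to any instruction-tuned LLM:

    def build_cot_prompt(question: str) -> str:
        # Zero-shot chain-of-thought: append a trigger phrase that nudges the
        # model to produce intermediate reasoning steps before the final answer.
        return f"{question}\n\nLet's think step by step."

    prompt = build_cot_prompt(
        "If a train is moving at 60 mph and travels for 3 hours, how far does it go?"
    )
    print(prompt)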

An example of classic CoT prompting from the 2022 Large Language Models are Zero-Shot Reasoners paper (https://arxiv.org/abs/2205.11916).

The aforementioned CoT approach can be seen as inference-time scaling because it makes inference more expensive by generating more output tokens.

Another approach to inference-time scaling is the use of voting and search strategies. One simple example is majority voting: we have the LLM generate multiple answers and select the final answer by majority vote. Similarly, we can use beam search and other search algorithms to generate better responses.
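
Here is a minimal sketch of majority voting (often called self-consistency). The generate function is assumed to be a user-supplied wrapper around any LLM API, sampled at a non-zero temperature so that repeated calls produce different completions:

    from collections import Counter

    def majority_vote(question: str, generate, n_samples: int = 8) -> str:
        # Sample several completions and keep the most common final answer.
        answers = []
        for _ in range(n_samples):
            completion = generate(question)
            # Naive answer extraction: take the last line of the completion.
            # Real systems use stricter parsing (e.g., "The answer is ...").
            answers.append(completion.strip().splitlines()[-1])
        return Counter(answers).most_common(1)[0][0]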

I highly recommend the Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters paper that I described in my previous Noteworthy AI Research Papers of 2024 (Part Two) article (https://magazine.sebastianraschka.com/p/ai-research-papers-2024-part-2) for more details on these different strategies.

Different search-based methods rely on a process-reward-based model to select the best answer. Annotated figure from the LLM Test-Time Compute paper, https://arxiv.org/abs/2408.03314

The DeepSeek R1 technical report states that its models do not use inference-time scaling. However, this technique is often implemented at the application layer on top of the LLM, so it is possible that DeepSeek applies it within their app.

I suspect that OpenAI's o1 and o3 models use inference-time scaling, which would explain why they are relatively expensive compared to models like GPT-4o. In addition to inference-time scaling, o1 and o3 were likely trained using RL pipelines similar to those used for DeepSeek R1. More on reinforcement learning in the next two sections below.

2) Pure reinforcement learning (RL)

One of my personal highlights from the DeepSeek R1 paper is their discovery that reasoning emerges as a behavior from pure reinforcement learning (RL). Let's explore what this means in more detail.

As outlined earlier, DeepSeek developed three types of R1 models. The first, DeepSeek-R1-Zero, was built on top of the DeepSeek-V3 base model, a standard pre-trained LLM they released in December 2024. Unlike typical RL pipelines, where supervised fine-tuning (SFT) is applied before RL, DeepSeek-R1-Zero was trained exclusively with reinforcement learning without an initial SFT stage as highlighted in the diagram below.

The development process of DeepSeek-R1-Zero model.

Still, this RL process is similar to the commonly used RLHF approach, which is typically applied to preference-tune LLMs. (I covered RLHF in more detail in my article, LLM Training: RLHF and Its Alternatives.) However, as mentioned above, the key difference in DeepSeek-R1-Zero is that they skipped the supervised fine-tuning (SFT) stage for instruction tuning. This is why they refer to it as "pure" RL. (Although, RL in the context of LLMs differs significantly from traditional RL, which is a topic for another time.)

For rewards, instead of using a reward model trained on human preferences, they employed two types of rewards: an accuracy reward and a format reward (a toy sketch of both checks follows the list below).

  • The accuracy reward uses the LeetCode compiler to verify coding answers and a deterministic system to evaluate mathematical responses.

  • The format reward relies on an LLM judge to ensure responses follow the expected format, such as placing reasoning steps inside <think> tags.
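
To make these reward signals more concrete, here is a toy, rule-based sketch of both checks. This is only an illustration: as noted above, the actual pipeline verified code with a compiler and used an LLM judge for the format reward.

    import re

    def accuracy_reward(response: str, reference_answer: str) -> float:
        # Toy deterministic accuracy check for math-style questions:
        # reward 1.0 if the extracted final answer matches the reference.
        match = re.search(r"\\boxed\{([^}]*)\}", response)
        predicted = match.group(1).strip() if match else response.strip().splitlines()[-1]
        return 1.0 if predicted == reference_answer.strip() else 0.0

    def format_reward(response: str) -> float:
        # Toy format check: reasoning should be wrapped in <think>...</think> tags.
        return 1.0 if re.search(r"<think>.+?</think>", response, re.DOTALL) else 0.0

    response = "<think>distance = 60 * 3 = 180</think> The answer is \\boxed{180}"
    print(accuracy_reward(response, "180"), format_reward(response))  # 1.0 1.0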

Surprisingly, this approach was enough for the LLM to develop basic reasoning skills. The researchers observed an "Aha!" moment, where the model began generating reasoning traces as part of its responses despite not being explicitly trained to do so, as shown in the figure below.

A figure from the DeepSeek R1 technical report (https://arxiv.org/abs/2501.12948) showing the emergence of the "Aha" moment.

While R1-Zero is not a top-performing reasoning model, it does demonstrate reasoning capabilities by generating intermediate "thinking" steps, as shown in the figure above. This confirms that it is possible to develop a reasoning model using pure RL, and the DeepSeek team was the first to demonstrate (or at least publish) this approach.


3) Supervised finetuning and reinforcement learning (SFT + RL)

Next, let's look at the development of DeepSeek-R1, DeepSeek's flagship reasoning model, which serves as a blueprint for building reasoning models. This model builds on DeepSeek-R1-Zero by incorporating additional supervised fine-tuning (SFT) and reinforcement learning (RL) stages to improve its reasoning performance.

Note that it is actually common to include an SFT stage before RL, as seen in the standard RLHF pipeline. OpenAI's o1 was likely developed using a similar approach.

The development process of DeepSeek-R1 model.

As shown in the diagram above, the DeepSeek team used DeepSeek-R1-Zero to generate what they call "cold-start" SFT data. The term "cold start" refers to the fact that this data was produced by DeepSeek-R1-Zero, which itself had not been trained on any supervised fine-tuning (SFT) data.

Using this cold-start SFT data, DeepSeek then trained the model via instruction fine-tuning, followed by another reinforcement learning (RL) stage. This RL stage retained the same accuracy and format rewards used in DeepSeek-R1-Zero’s RL process. However, they added a consistency reward to prevent language mixing, which occurs when the model switches between multiple languages within a response.
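
The report does not spell out how the consistency reward is computed. One simple heuristic, purely my own illustration and not DeepSeek's implementation, is to reward responses whose alphabetic characters stay predominantly within a single script:

    import unicodedata
    from collections import Counter

    def language_consistency_reward(response: str) -> float:
        # Illustrative heuristic (not DeepSeek's actual reward): measure what
        # fraction of alphabetic characters belongs to the dominant script, so
        # responses mixing e.g. Latin and CJK text receive a lower reward.
        scripts = []
        for ch in response:
            if ch.isalpha():
                name = unicodedata.name(ch, "")
                if any(tag in name for tag in ("CJK", "HIRAGANA", "KATAKANA")):
                    scripts.append("cjk")
                elif "CYRILLIC" in name:
                    scripts.append("cyrillic")
                else:
                    scripts.append("latin")
        if not scripts:
            return 1.0
        return Counter(scripts).most_common(1)[0][1] / len(scripts)

    print(language_consistency_reward("The answer is 180 miles."))  # 1.0
    print(language_consistency_reward("The answer 是 180 英里."))    # 0.75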

The RL stage was followed by another round of SFT data collection. In this phase, the most recent model checkpoint was used to generate 600K Chain-of-Thought (CoT) SFT examples, while an additional 200K knowledge-based SFT examples were created using the DeepSeek-V3 base model.

These 600K + 200K SFT samples were then used for another round of RL. In this stage, they again used rule-based methods for accuracy rewards on math and coding questions, while human preference labels were used for other question types.

The final model, DeepSeek-R1, has a noticeable performance boost over DeepSeek-R1-Zero thanks to the additional SFT and RL stages, as shown in the table below.

Benchmark comparison of OpenAI o1 and DeepSeek R1 models. Annotated figure from the DeepSeek-R1 technical report (https://arxiv.org/abs/2501.12948).

4) Pure supervised finetuning (SFT) and distillation

So far, we have covered three key approaches to building and improving reasoning models:

1. Inference-time scaling, a technique that improves reasoning capabilities without training or otherwise modifying the underlying model.

2. Pure reinforcement learning (RL) as in DeepSeek-R1-Zero, which showed that reasoning can emerge as a learned behavior without supervised fine-tuning.

3. Supervised fine-tuning (SFT) plus RL, which led to DeepSeek-R1, DeepSeek’s flagship reasoning model.

So, what’s left? Model "distillation."

Surprisingly, DeepSeek also released smaller models trained via a process they call distillation. However, in the context of LLMs, distillation does not necessarily follow the classical knowledge distillation approach used in deep learning. Traditionally, in knowledge distillation (as briefly described in Chapter 6 of my Machine Learning Q and AI book), a smaller student model is trained on both the logits of a larger teacher model and a target dataset.

Instead, here distillation refers to instruction fine-tuning smaller LLMs, such as Llama 8B and 70B and Qwen 2.5 models (0.5B to 32B), on an SFT dataset generated by larger LLMs. Specifically, these larger LLMs are DeepSeek-V3 and an intermediate checkpoint of DeepSeek-R1. In fact, the SFT data used for this distillation process is the same dataset that was used to train DeepSeek-R1, as described in the previous section.
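
As a rough sketch of what this looks like in practice (model names below are placeholders, not the actual checkpoints), "distillation" here means generating responses with a large teacher model and then running ordinary supervised fine-tuning of a small student on those prompt-response pairs; no teacher logits are involved:

    # Sketch of "distillation" as plain SFT on teacher-generated data.
    # Model names are placeholders, not the actual DeepSeek/Qwen/Llama checkpoints.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    teacher_name = "large-reasoning-teacher"   # hypothetical
    student_name = "small-student-base"        # hypothetical

    # Step 1: use the teacher to generate reasoning traces for a set of prompts.
    teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
    teacher = AutoModelForCausalLM.from_pretrained(teacher_name)

    def generate_trace(prompt: str) -> str:
        inputs = teacher_tok(prompt, return_tensors="pt")
        output_ids = teacher.generate(**inputs, max_new_tokens=512)
        return teacher_tok.decode(output_ids[0], skip_special_tokens=True)

    prompts = ["If a train moves at 60 mph for 3 hours, how far does it go?"]
    sft_dataset = [{"prompt": p, "response": generate_trace(p)} for p in prompts]

    # Step 2: fine-tune the student (student_name) on these (prompt, response)
    # pairs with the standard next-token prediction loss, i.e. ordinary SFT.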

To clarify this process, I have highlighted the distillation portion in the diagram below.

The development process of DeepSeek-R1-Distill models.

Why did they develop these distilled models? In my opinion, there are two key reasons:

1. Smaller models are more efficient. This means they are cheaper to run, and they can also run on lower-end hardware, which makes them especially interesting for many researchers and tinkerers like me.

2. A case study in pure SFT. These distilled models serve as an interesting benchmark, showing how far pure supervised fine-tuning (SFT) can take a model without reinforcement learning.

The table below compares the performance of these distilled models against other popular models, as well as DeepSeek-R1-Zero and DeepSeek-R1.

Benchmark comparison of distilled versus non-distilled models. Annotated figure from the DeepSeek-R1 technical report (https://arxiv.org/abs/2501.12948).

As we can see, the distilled models are noticeably weaker than DeepSeek-R1, but they are surprisingly strong relative to DeepSeek-R1-Zero, despite being orders of magnitude smaller. It's also interesting to note how well these models perform compared to o1 mini (I suspect o1-mini itself might be a similarly distilled version of o1).

Before wrapping up this section with a conclusion, there’s one more interesting comparison worth mentioning. The DeepSeek team tested whether the emergent reasoning behavior seen in DeepSeek-R1-Zero could also appear in smaller models. To investigate this, they applied the same pure RL approach from DeepSeek-R1-Zero directly to Qwen-32B.

The results of this experiment are summarized in the table below, where QwQ-32B-Preview serves as a reference reasoning model based on Qwen 2.5 32B developed by the Qwen team (I think the training details were never disclosed). This comparison provides some additional insights into whether pure RL alone can induce reasoning capabilities in models much smaller than DeepSeek-R1-Zero.

Benchmark comparison of distillation and RL on a smaller 32B model. Annotated figure from the DeepSeek-R1 technical report (https://arxiv.org/abs/2501.12948).

Interestingly, the results suggest that distillation is far more effective than pure RL for smaller models. This aligns with the idea that RL alone may not be sufficient to induce strong reasoning abilities in models of this scale, whereas SFT on high-quality reasoning data can be a more effective strategy when working with small models.

For completeness, it would have been useful to see additional comparisons in the table:

1. Qwen-32B trained with SFT + RL, similar to how DeepSeek-R1 was developed. This would help determine how much improvement can be made, compared to pure RL and pure SFT, when RL is combined with SFT.

2. DeepSeek-V3 trained with pure SFT, similar to how the distilled models were created. This would allow for a direct comparison to see how effective RL + SFT is over pure SFT.


Conclusion

In this section, we explored four different strategies for building and improving reasoning models:

1. Inference-time scaling requires no additional training but increases inference costs, making large-scale deployment more expensive as the number of users or the query volume grows. Still, it remains a no-brainer for improving the performance of already strong models. I strongly suspect that o1 leverages inference-time scaling, which helps explain why it is more expensive on a per-token basis compared to DeepSeek-R1.

2. Pure RL is interesting for research purposes because it provides insights into reasoning as an emergent behavior. However, in practical model development, RL + SFT is the preferred approach as it leads to stronger reasoning models. I strongly suspect that o1 was trained using RL + SFT as well. More precisely, I believe o1 starts from a weaker, smaller base model than DeepSeek-R1 but compensates with RL + SFT and inference-time scaling.

3. As mentioned above, RL + SFT is the key approach for building high-performance reasoning models. DeepSeek-R1 is a nice blueprint showing how this can be done.

4. Distillation is an attractive approach, especially for creating smaller, more efficient models. However, the limitation is that distillation does not drive innovation or produce the next generation of reasoning models. For instance, distillation always depends on an existing, stronger model to generate the supervised fine-tuning (SFT) data.

One interesting development I expect to see next is the combination of RL + SFT (approach 3) with inference-time scaling (approach 1). This is likely what OpenAI o1 is doing, except it's probably based on a weaker base model than DeepSeek-R1, which explains why DeepSeek-R1 performs so well while remaining relatively cheap at inference time.

Thoughts about DeepSeek R1

In recent weeks, many people have asked for my thoughts on the DeepSeek-R1 models. In short, I think they are an awesome achievement. As a research engineer, I particularly appreciate the detailed technical report, which provides insights into their methodology that I can learn from.

One of the most fascinating takeaways is how reasoning emerged as a behavior from pure RL. And it's impressive that DeepSeek has released their models under a permissive MIT license, which has even fewer restrictions than Meta's Llama models.

How does it compare to o1?

Is DeepSeek-R1 better than o1? I’d say it’s roughly in the same ballpark. However, what stands out is that DeepSeek-R1 is more efficient at inference time. This suggests that DeepSeek likely invested more heavily in the training process, while OpenAI may have relied more on inference-time scaling for o1.

That said, it's difficult to compare o1 and DeepSeek-R1 directly because OpenAI has not disclosed much about o1. For instance, we don’t know:

  • Is o1 also a Mixture of Experts (MoE)?

  • How large is o1?

  • Could o1 just be a slightly refined version of GPT-4o with minimal RL + SFT and only extensive inference-time scaling?

Without knowing these details, a direct comparison remains an apples-to-oranges comparison.

The cost of training DeepSeek-R1

Another point of discussion has been the cost of developing DeepSeek-R1. Some have mentioned a ~$6 million training cost, but they likely conflated DeepSeek-V3 (the base model released in December last year) and DeepSeek-R1.

The $6 million estimate is based on an assumed $2 per GPU hour and the number of GPU hours required for the final training run of DeepSeek-V3, which was originally discussed back in December 2024.

However, the DeepSeek team has never disclosed the exact GPU hours or development cost for R1, so any cost estimates remain pure speculation.

Either way, ultimately, DeepSeek-R1 is a major milestone in open-weight reasoning models, and its efficiency at inference time makes it an interesting alternative to OpenAI’s o1.

Developing reasoning models on a limited budget

Developing a DeepSeek-R1-level reasoning model likely requires hundreds of thousands to millions of dollars, even when starting with an open-weight base model like DeepSeek-V3. This can feel discouraging for researchers or engineers working with limited budgets.

The good news: Distillation can go a long way

Fortunately, model distillation offers a more cost-effective alternative. The DeepSeek team demonstrated this with their R1-distilled models, which achieve surprisingly strong reasoning performance despite being significantly smaller than DeepSeek-R1. However, even this approach isn’t entirely cheap. Their distillation process used 800K SFT samples, which requires substantial compute.

Interestingly, just a few days before DeepSeek-R1 was released, I came across an article about Sky-T1, a fascinating project where a small team trained an open-weight 32B model using only 17K SFT samples. The total cost? Just $450, which is less than the registration fee for most AI conferences.

This example highlights that while large-scale training remains expensive, smaller, targeted fine-tuning efforts can still yield impressive results at a fraction of the cost.

Figure from the "Sky-T1: Train your own O1 preview model within $450" article, https://novasky-ai.github.io/posts/sky-t1/

According to their benchmarks, Sky-T1 performs roughly on par with o1, which is impressive given its low training cost.

Pure RL on a budget: TinyZero

While Sky-T1 focused on model distillation, I also came across some interesting work in the "pure RL" space. One notable example is TinyZero, a 3B parameter model that replicates the DeepSeek-R1-Zero approach (side note: it costs less than $30 to train).

Surprisingly, even at just 3B parameters, TinyZero exhibits some emergent self-verification abilities, which supports the idea that reasoning can emerge through pure RL, even in small models.

The TinyZero repository mentions that a research report is still work in progress, and I’ll definitely be keeping an eye out for further details.

A figure from the TinyZero repository (https://github.com/Jiayi-Pan/TinyZero) showing that the model is capable of self-verification. (It would have been interesting to see the response of the base model in comparison.)

The two projects mentioned above demonstrate that interesting work on reasoning models is possible even with limited budgets. While both approaches replicate methods from DeepSeek-R1, one focusing on pure RL (TinyZero) and the other on pure SFT (Sky-T1), it would be fascinating to explore how these ideas can be extended further.

Beyond Traditional SFT: Journey Learning

One particularly interesting approach I came across last year is described in the paper O1 Replication Journey: A Strategic Progress Report – Part 1. Despite its title, the paper does not actually replicate o1. Instead, it introduces a different way to improve the distillation (pure SFT) process.

The key idea in the paper is "journey learning" as an alternative to "shortcut learning."

  • Shortcut learning refers to the traditional approach in instruction fine-tuning, where models are trained using only correct solution paths.

  • Journey learning, on the other hand, also includes incorrect solution paths, allowing the model to learn from mistakes.

This approach is loosely related to the self-verification abilities observed in TinyZero's pure RL training, but it focuses on improving the model entirely through SFT. By exposing the model to incorrect reasoning paths and their corrections, journey learning may also reinforce self-correction abilities, potentially making reasoning models more reliable.
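
To make the contrast concrete, here is a hypothetical sketch of what the two kinds of SFT targets might look like; the field names and wording are my own, not taken from the paper:

    # Hypothetical data layout contrasting the two SFT styles discussed above.
    # Field names are illustrative, not taken from the O1 Replication Journey paper.

    shortcut_example = {
        "question": "What is 17 * 24?",
        "target": "17 * 24 = 408. The answer is 408.",  # only the correct path
    }

    journey_example = {
        "question": "What is 17 * 24?",
        "target": (
            "Attempt: 17 * 24 = 17 * 20 + 17 * 4 = 340 + 64 = 404.\n"
            "Check: 17 * 4 is 68, not 64, so that step was wrong.\n"
            "Correction: 340 + 68 = 408.\n"
            "The answer is 408."
        ),  # a wrong step plus its correction are part of the training target
    }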

Journey learning, as opposed to traditional shortcut learning, includes wrong solution paths in the SFT data. Annotated figure from the O1 Replication Journey: A Strategic Progress Report – Part 1 paper (https://arxiv.org/abs/2410.18982).

This could be an exciting direction for future work, particularly for low-budget reasoning model development, where RL-based approaches may be computationally impractical.

Anyway, a lot of interesting work is currently happening on the reasoning-model front, and I'm sure we will see much more exciting progress in the upcoming months!


This magazine is a personal passion project. For those who wish to support me, please consider purchasing a copy of my Build a Large Language Model (From Scratch) book. (I am confident that you'll get a lot out of this book, as it explains how LLMs work at a level of detail not found anywhere else.)

Build a Large Language Model (From Scratch) now available on Amazon

If you read the book and have a few minutes to spare, I'd really appreciate a brief review. It helps us authors a lot!

Your support means a great deal! Thank you!


“Can’t complain” (but it might be worth considering)


Complaining is a cultural phenomenon, but it’s particularly prevalent in societies with a consumer culture (the customer is always right) and those where comfort is coming to be expected.

Given all the complaining we do (about the weather, leadership, products, service and various ailments), it’s worth taking a moment to think about why we complain.

The obvious one might not be the main one.

The obvious reason to complain is to make a change happen.

If that’s the goal, though, we ought to focus those complaints where they’ll do the most good, and be prepared to do the work to have an impact. Organize the others, take consistent and persistent action, and market the complaint in a format and with a focus that will lead to action.

Most of the time, though, I’m not sure that’s what we’re really after.

Here are some others:

  1. to let off steam
  2. to signal group affiliation
  3. to create hope that things might get better
  4. to increase one’s status by selfishly demanding more
  5. to gain affiliation by complaining on behalf of someone else
  6. to gain status by demanding more for others who can’t speak up
  7. to validate our feelings by seeking acknowledgment from others that their grievance is legitimate
  8. to preemptively lower expectations or manage blame
  9. to conceal our fear or embarrassment
  10. to avoid responsibility by pointing to someone else
  11. to establish dominance or control in a situation
  12. to bond with others through shared experiences of dissatisfaction

Not on the list, because it belies almost all of these: “Whining in the face of imperfection often ruins what you’ve already got.”

Whining is the evil cousin of complaining. Whining purports to exist to make things better, but it never does.

James Murphy of LCD Soundsystem said, “The best way to complain is to make things.”

And perhaps we can extend that to: “The best way to complain is to make things better.”


The Screen Disease: Why Your Digital Life Is Killing Your Human Experience

Last week, I discovered something terrifying about myself. I decided to track my actual screen time – not just my phone, but every single digital interface in my life. The results were… well, disturbing. Out of my 16 waking hours on weekdays, I spent 14 of them staring at some form of screen. Phone: 5 […]



Download video: https://www.youtube.com/embed/itYXNALrVss

Open-R1: a fully open reproduction of DeepSeek-R1


Memo to the future


The experience of the now is often more vivid than a distant memory. As a result, we can make decisions in the future without enough regard for how we felt the last time we were in a similar situation.

Here’s a simple hack that can inform your decisions…

You know someone who recently got the flu. Perhaps they were sick in bed for weeks, or even needed medical attention… Write down what happened (and how it made you feel) and put it in your calendar for September 16th. That way, nine months from now, when you’re thinking of getting a flu shot, the reminder will be right there for you.

Did you leave work an hour early to spend time with friends instead? Take some pictures and add that reminder to your calendar for two months from now, a useful way to get out of your daily work rut.

One more: the next time that cold and rain doesn’t keep you from an outdoor walk, drop yourself a note for next week, reminding yourself of how good it was to get up and get out.

It’s not a diary you put on the shelf. It’s a diary entry you send to yourself in the future.

The future unfolds, with or without us, but that doesn’t mean we can’t bend it in a useful direction.


Common pitfalls when building generative AI applications


As we’re still in the early days of building applications with foundation models, it’s normal to make mistakes. This is a quick note with examples of some of the most common pitfalls that I’ve seen, both from public case studies and from my personal experience.

Because these pitfalls are common, if you’ve worked on any AI product, you’ve probably seen them before.

1. Use generative AI when you don't need generative AI

Every time there’s a new technology, I can hear the collective sigh of senior engineers everywhere: “Not everything is a nail.” Generative AI isn’t an exception — its seemingly limitless capabilities only exacerbate the tendency to use generative AI for everything.

A team pitched me the idea of using generative AI to optimize energy consumption. They fed a household’s list of energy-intensive activities and hourly electricity prices into an LLM, then asked it to create a schedule to minimize energy costs. Their experiments showed that this could help reduce a household’s electricity bill by 30%. Free money. Why wouldn’t anyone want to use their app?

I asked: “How does it compare to simply scheduling the most energy-intensive activities when electricity is cheapest? Say, doing your laundry and charging your car after 10pm?”

They said they would try it later and let me know. They never followed up, but they abandoned this app soon after. I suspect that this greedy scheduling can be quite effective. Even if it’s not, there are other much cheaper and more reliable optimization solutions than generative AI, like linear programming.
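
For reference, the greedy baseline in question fits in a few lines; the prices and activities below are made-up illustrative values, and the sketch simplistically assigns one activity per hour:

    # Toy greedy baseline: put the most energy-intensive activities into the
    # cheapest hours. Prices and activities are made-up illustrative values.
    hourly_price = {18: 0.30, 19: 0.32, 22: 0.10, 23: 0.09, 0: 0.08}  # $/kWh
    activities = [("charge EV", 11.0), ("laundry", 2.0), ("dishwasher", 1.5)]  # kWh

    cheap_hours = sorted(hourly_price, key=hourly_price.get)    # cheapest first
    by_load = sorted(activities, key=lambda a: -a[1])           # biggest load first
    schedule = {name: hour for (name, _), hour in zip(by_load, cheap_hours)}

    cost = sum(kwh * hourly_price[schedule[name]] for name, kwh in activities)
    print(schedule)       # {'charge EV': 0, 'laundry': 23, 'dishwasher': 22}
    print(f"${cost:.2f}")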

I’ve seen this scenario over and over again. A big company wants to use generative AI to detect anomalies in network traffic. Another wants to predict upcoming customer call volume. A hospital wants to detect whether a patient is malnourished (really not recommended).

It can often be beneficial to explore a new approach to get a sense of what’s possible, as long as you’re aware that your goal isn’t to solve a problem but to test a solution. “We solve the problem” and “We use generative AI” are two very different headlines, and unfortunately, so many people would rather have the latter.

2. Confuse 'bad product' with 'bad AI'

At the other end of the spectrum, many teams dismiss gen AI as a valid solution for their problems because they tried it out and their users hated it. However, other teams successfully used gen AI for similar use cases. I could only look into two of these teams. In both cases, the issue wasn’t with AI, but with product.

Many people have told me that the technical aspects of their AI applications are straightforward. The hard part is user experience (UX). What should the product interface look like? How to seamlessly integrate the product into the user workflow? How to incorporate human-in-the-loop?

UX has always been challenging, but it’s even more so with generative AI. While we know that generative AI is changing how we read, write, learn, teach, work, entertain, etc., we don’t quite know how yet. What will the future of reading/learning/working be like?

Here are some simple examples showing how what users want can be counter-intuitive and requires rigorous user studies.

  1. My friend works on an application that summarizes meeting transcripts. Initially, her team focused on getting the right summary length. Would users prefer 3-sentence summaries or 5-sentence summaries?

    However, it turned out that their users didn’t care about the actual summary. They only wanted action items specific to them from each meeting.

  2. When LinkedIn developed a chatbot for skill fit assessment, they discovered that users didn’t want correct responses. Users wanted helpful responses.

    For example, if a user asks a bot whether they’re a fit for a job and the bot responds with: “You’re a terrible fit,” this response might be correct but not very helpful to the user. Users want tips on what the gaps are and what they can do to close the gaps.

  3. Intuit built a chatbot to help users answer tax questions. Initially, they got lukewarm feedback — users didn’t find the bot useful. After investigation, they found out that users actually hated typing. Facing a blank chatbot, users didn’t know what the bot could do and what to type.

    So, for each interaction, Intuit added a few suggested questions for users to click on. This reduced the friction for users to use the bot and gradually built users’ trust. The feedback from users then became much more positive.
    (Shared by Nhung Ho, VP of AI at Intuit, during our panel at Grace Hopper.)

Because everyone uses the same models nowadays, the AI components of AI products are similar, and the differentiation is product.

3. Start too complex

Examples of this pitfall:

  1. Use an agentic framework when direct API calls work (a minimal example of a direct call follows this list).
  2. Agonize over what vector database to use when a simple term-based retrieval solution (that doesn’t require a vectordb) works.
  3. Insist on finetuning when prompting works.
  4. Use semantic caching.
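
For reference, the "direct API call" end of that spectrum is often just a few lines. The snippet below uses the OpenAI Python SDK as one example; any provider's client looks similar, and the model name is only an example:

    # A direct API call: often all that's needed before reaching for an
    # agentic framework. Shown with the OpenAI Python SDK as one example.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": "Summarize this support ticket in one sentence: ..."},
        ],
    )
    print(response.choices[0].message.content)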

Given so many shiny new technologies, it’s tempting to jump straight into using them. However, incorporating external tools too early can cause 2 problems:

  1. Abstract away critical details, making it hard to understand and debug your system.
  2. Introduce unnecessary bugs.

Tool developers can make mistakes. For example, I often find typos in default prompts when reviewing a framework’s codebase. If the framework you use updates its prompt without telling you, your application’s behaviors might change and you might not know why.

In the long run, abstractions are good, but they need to incorporate best practices and be tested over time. Since we're still in the early days of AI engineering and best practices are still evolving, we should be more vigilant when adopting any abstraction.

4. Over-index on early success

  1. It took LinkedIn 1 month to achieve 80% of the experience they wanted, and an additional 4 months to surpass 95%. The initial success made them grossly underestimate how challenging it is to improve the product, especially around hallucinations. They found it discouraging how difficult it was to achieve each subsequent 1% gain.

  2. A startup that develops AI sales assistants for ecommerce told me that getting from 0 to 80% took as long as from 80% to 90%. The challenges they faced:
    • Accuracy/latency tradeoff: more planning/self-correction = more nodes = higher latency
    • Tool calling: hard for agents to differentiate similar tools
    • Hard for tonal requests (e.g. "talk like a luxury brand concierge") in the system prompt to be perfectly obeyed
    • Hard for the agent to completely understand customers’ intent
    • Hard to create a specific set of unit tests because the combination of queries is basically infinite

    Thanks Jason Tjahjono for sharing this.

  3. In the paper UltraChat, Ding et al. (2023) shared that “the journey from 0 to 60 is easy, whereas progressing from 60 to 100 becomes exceedingly challenging.”

This is perhaps one of the first painful lessons anyone who has built an AI product quickly learns. It's easy to build a demo, but hard to build a product. Beyond the issues mentioned above (hallucinations, latency, the latency/accuracy tradeoff, tool use, prompting, testing, …), teams also run into problems such as:

  1. Reliability from the API providers. A team told me that 10% of their API calls timed out. Or a product's behavior changes because the underlying model changes.
  2. Compliance, e.g. around AI output copyrights, data access/sharing, user privacy, security risks from retrieval/caching systems, and ambiguity around training data lineage.
  3. Safety, e.g. bad actors abusing your product, or your product generating insensitive or offensive responses.

When planning a product’s milestones and resources, make sure to take into account these potential roadblocks. A friend calls this “being cautiously optimistic”. However, remember that many cool demos don’t lead to wonderful products.

5. Forgo human evaluation

To automatically evaluate AI applications, many teams opt for the AI-as-a-judge (also called LLM-as-a-judge) approach — using AI models to evaluate AI outputs. A common pitfall is forgoing human evaluation to rely entirely on AI judges.

While AI judges can be very useful, they aren’t deterministic. The quality of a judge depends on the underlying judge’s model, the judge’s prompt, and the use case. If the AI judge isn’t developed properly, it can give misleading evaluations about your application’s performance. AI judges must be evaluated and iterated over time, just like all other AI applications.


The teams with the best products I’ve seen all have human evaluation to supplement their automated evaluation. Every day, they have human experts evaluate a subset of their application’s outputs, which can be anywhere from 30 to 1,000 examples.

Daily manual evaluation serves 3 purposes:

  1. Correlate human judgments with AI judgments. If the score by human evaluators is decreasing but the score by AI judges is increasing, you might want to investigate your AI judges (a minimal sketch of this check follows the list).
  2. Gain a better understanding of how users use your application, which can give you ideas to improve your application.
  3. Detect patterns and changes in users’ behaviors, using your knowledge of current events, that automated data exploration might miss.
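
Here is a minimal sketch of that correlation check, assuming you log one human score and one AI-judge score per sampled example; the score lists below are placeholders:

    # Compare human and AI-judge scores on the same daily sample.
    # The score lists are placeholders; in practice they come from your eval logs.
    from statistics import correlation  # Pearson correlation, Python 3.10+

    human_scores = [4, 5, 3, 2, 5, 4, 1, 3]   # expert ratings (1-5)
    judge_scores = [4, 4, 3, 3, 5, 5, 2, 2]   # AI-judge ratings (1-5)

    r = correlation(human_scores, judge_scores)
    print(f"Pearson r between human and AI-judge scores: {r:.2f}")
    # A falling correlation over time is a signal to re-examine the judge
    # (its underlying model and its prompt) rather than trusting its trend.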

The reliability of human evaluation also depends on well-crafted annotation guidelines. These annotation guidelines can help improve the model’s instructions (if a human has a hard time following the instructions, the model will, too). They can also be reused to create finetuning data later if you choose to finetune.

In every project I’ve worked on, staring at data for just 15 minutes usually gives me some insight that could save me hours of headaches. Greg Brockman tweeted: “Manual inspection of data has probably the highest value-to-prestige ratio of any activity in machine learning.”

6. Crowdsource use cases

This is a mistake I saw in the early days when enterprises were in a frenzy to adopt generative AI. Unable to come up with a strategy for which use cases to focus on, many tech executives crowdsourced ideas from the whole company. “We hire smart people. Let them tell us what to do.” They then tried to implement these ideas one by one.

And that’s how we ended up with a million text-to-SQL models, a million Slack bots, and a billion code plugins.

While it’s indeed a good idea to listen to the smart people that you hire, individuals might be biased toward the problems that immediately affect their day-to-day work instead of problems that might bring the highest returns on investment. Without an overall strategy that considers the big picture, it’s easy to get sidetracked into a series of small, low-impact applications and come to the wrong conclusion that gen AI has no ROI.

Summary

In short, here are the common AI engineering pitfalls:

  1. Use generative AI when you don’t need generative AI
    Gen AI isn’t a one-size-fits-all solution to all problems. Many problems don’t even need AI.

  2. Confuse ‘bad product’ with ‘bad AI’
    For many AI products, AI is the easy part; the product is the hard part.

  3. Start too complex
    While fancy new frameworks and finetuning can be useful for many projects, they shouldn’t be your first course of action.

  4. Over-index on early success
    Initial success can be misleading. Going from demo-ready to production-ready can take much longer than getting to the first demo.

  5. Forgo human evaluation
    AI judges should be validated and correlated with systematic human evaluation.

  6. Crowdsource use cases
    Have a big-picture strategy to maximize return on investment.
