2025-04-08

Sam Reed

Positive Reinforcement

OpenAI's shift to specialized model optimization changes the future of AI agent development


A little more than a month ago, I wrote a piece about OpenAI’s Deep Research Agent. In that essay, I noted that when announcing Deep Research Agent, OpenAI wrote that the product was “powered by a version of the upcoming OpenAI o3 model that’s optimized for web browsing and data analysis.”

This sentence piqued my interest, which led to the following comment in my essay:

“What we can glean from this is that…maybe OpenAI feels like it no longer needs to push the frontiers of its flagship models to achieve artificial general intelligence and is instead shifting to building infrastructure to more easily connect their models with the outside world.”

Shortly thereafter, an interview (released the same day as my post!) that Sequoia Capital conducted with some of the researchers behind Deep Research Agent both confirmed and expanded upon my hypothesis.

Though the video doesn’t provide an overwhelming amount of detail, the researchers do give us several interesting nuggets:

· [18:42 – 19:16] In machine learning, you get what you optimize for. Therefore, optimizing a model for a specific task (via reinforcement learning) is where the best agents are going to come from.

· [19:24 – 20:00] One of the hidden keys to success with Deep Research Agent was tuning it on high-quality datasets.

· [23:48 – 25:12] The “recipe” that was used to build Deep Research Agent will scale to a large number of use cases, to the point where achieving artificial general intelligence is now an “operational problem.”

In short, I was right (share this post with a friend who likes high-quality insights) that the team at OpenAI feels like their current suite of models is good enough to get us to AGI. I was also right (stack ‘em up!) that the Deep Research Agent was a test of this hypothesis. My assumption about exactly what they did was slightly off (who cares), but either way, it seems like a big shift is happening within the walls of our favorite American AI lab.

Is it all ov3r?

To grok what the OpenAI researchers are saying in this video, it’s important to first remember that OpenAI’s major customer-facing products (leaving out the biggest one of all: ChatGPT) are a suite of AI models that the company has created (trained) over time. You can see a list of all the models here. OpenAI has spent billions of dollars training these state-of-the-art models, which they make available to application developers like me for a fee.

In this video, the OpenAI team clearly states (not necessarily a bad thing; I’m just noting that they didn’t mince words) that these “core” models (i.e., the ones any of us can use in our apps, often referred to as “foundation” models) are not likely to be good enough primitives on which to build the best AI agents of the future. Instead, the team suggests that it will be necessary to take these state-of-the-art models and mold them, using a technique called “reinforcement learning,” for more specific tasks, such as researching the internet (as we see with Deep Research Agent).
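To make that concrete, here’s a toy sketch of the loop the researchers are describing: sample behavior from a model, score it with a task-specific reward, and nudge the model toward whatever the reward favors. Everything below is illustrative (a four-action stand-in “policy,” a made-up reward function), not OpenAI’s actual pipeline, but it shows the “you get what you optimize for” dynamic in miniature:

```python
# A minimal REINFORCE-style loop: the policy drifts toward whatever
# behavior the reward function pays for. All values are toy numbers.
import torch

torch.manual_seed(0)

# Stand-in "policy": a distribution over 4 actions (imagine candidate
# research strategies an agent could pick).
logits = torch.zeros(4, requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=0.1)

def task_reward(action: int) -> float:
    # Hypothetical reward: we only pay for action 2
    # (say, "cite sources found via web search").
    return 1.0 if action == 2 else 0.0

for _ in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    # REINFORCE: raise the log-probability of actions in proportion
    # to the reward they earned.
    loss = -dist.log_prob(action) * task_reward(action.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=0))  # probability mass piles onto action 2
```

Swap the four-action toy for a full language model and the hand-written reward for graded research tasks, and you have the rough shape of what “optimized for web browsing and data analysis” means.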

A metaphor might help here. Imagine a young athlete, about to enter high school, who shows promise in three different sports. On one path forward, the young athlete stays balanced, continuing to play all three sports but capping their potential in each. On another path, the young athlete decides to specialize, committing to year-round focus on one sport and sacrificing the others, but with the potential to reach a higher level of skill than would otherwise have been possible.

In this metaphor, Deep Research Agent falls into the latter category. It sounds like OpenAI took a core model (the three-sport athlete) and decided to specialize it, training it on a dataset (which they mention a few times throughout the video) curated to make it really good at web-based research. And, as stated in the video, we should expect to see them follow this recipe more often in the future, because they believe it’s how you make agents that reliably work.

This has pretty significant implications for anyone who is building products, especially highly autonomous AI agents, on the assumption that those products will get better as core, general-purpose models improve. The OpenAI team very plainly called BS on this, stating that fine-tuning the models using reinforcement learning looks like the clear path to making genuinely useful agents.

What is really interesting about this is that OpenAI doesn’t actually allow anyone outside the company to fine-tune its latest and greatest models. They do allow fine-tuning of some of the older models (gpt-4o was so ’24), but no one else can build something using the “recipe” that the OpenAI researchers detailed in this video, because the o3 reasoning model isn’t available for fine-tuning outside of OpenAI.
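For contrast, here’s roughly what the fine-tuning OpenAI does expose looks like through their Python SDK. Note that this is plain supervised fine-tuning, not the reinforcement-learning recipe from the video; the dataset file name below is hypothetical, and the model snapshot is just one that currently supports fine-tuning:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a JSONL file where each line is a {"messages": [...]} chat example.
training_file = client.files.create(
    file=open("research_examples.jsonl", "rb"),  # hypothetical dataset
    purpose="fine-tune",
)

# Start a supervised fine-tuning job on a model that allows it.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",  # fine-tunable; o3 is not
)

print(job.id, job.status)
```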

This sounds like a good business strategy! If I had a world-class AI research team that had developed a technique for making genuinely useful autonomous agents, I probably wouldn’t be in a hurry to expose the secret sauce to the world either. All I’m saying is that you ought to pay close attention if you’re operating under the assumption that your agent is just one great OpenAI model away from being production-ready.

The good news is that 1) this might just be a short-term delay, and 2) people in the open-source community have started making some interesting discoveries lately. I’m looking forward to digging into those with you next week.

See you then!
