2025-04-01
Sam Reed
Look Alive! (Part 2)
On AI agents and washing dishes

Soap Opera
Let’s do some time travel.
Unfortunately, I can’t provide a definitive year (historians are still locked in debate), but we’re going back to exactly one year before the household dishwasher was invented.
…
You’re in your suburban home the day after your birthday party. The party was a hit—the real talk of the cul-de-sac—but what goes up must come down, so now you’re faced with the aftermath: a mountain of dirty dishes.
You get to work. As the afternoon rolls on and fatigue sets in, you start to think there must be a better way. Cleaning dishes is so repetitive, so mindless, so menial, no person should be subject to such tedium. Tasks like this should be relegated to some sort of non-intelligent thing, some sort of dishwashing machine.
Aha! A dishwashing machine! What a wonderful idea. One for every kitchen. You’ll make millions!
Splash, scrub.
Now for your design.
Okay, dishes come in many shapes, sizes and materials, ranging from flat, saucer-like ceramics to concave glass cups. It’s clear that the machine will need to be flexible to accommodate the many dish types that it will encounter.
Let’s write it down. Requirement 1: the machine must have finger-like dexterity.
Right. The next thing to consider is that dishes can’t get clean without soap. Soap comes in bottles, so the machine will need the ability to get soap out of the bottle.
Got it – requirement 2: the machine will need the ability to squeeze things.
Also what good is soap if you can’t wash it off? Cleaning is just as much about removing soap as it is about applying it, so we’ll need to add water somehow.
Makes sense – requirement 3: the machine will need the ability to turn on a water source.
Okay, so we need dexterity, the ability to squeeze, and the ability to turn a faucet. Yes! It’s so obvious: we need to build a mechanical pair of human hands.
…
It would have been very hard to picture the modern dishwasher before it existed. The natural thought would have been to find a way to make mechanical versions of the existing process. Sometimes I wonder if the same thing is happening today, but in reverse: we’ve created something new, labeled it “Human,” and now we’re stuck trying to attach it to our sinks.
Recap
Last week, I talked about how many AI industry leaders have recently aligned around the term “AI agents.” I posed a potential problem with this: lots of people are excited to innovate with large language models, so when a group of industry leaders assigns a very metaphorical, very graspable, very human label to a still unproven outcome of this technology, it can set off a self-reinforcing cycle of startups trying to build AI agent companies. That is not the same thing as “AI systems with agency” being the natural next step for LLMs.
It feels silly to have to say this as the founder of an applied LLM startup, but I genuinely do believe in the usefulness of this technology. My qualm with the term is just that I think it might be leading talented people to the wrong conclusions. Look no further than the excitement around computer-using agents as an example.
If you haven’t seen computer-using agents yet, the idea is very simple: they are a type of AI agent designed to autonomously pilot a computer. You basically give one a task, such as “Can you order me a couple of burritos on DoorDash and finance them with Klarna?”, then lean back in your chair and watch your computer get to work.
The way that this works under the hood is actually simpler than you may think (obviously this is speculation as I don’t work at OpenAI or Anthropic). I’ll describe it below.
When web application companies build new features for their apps, an essential step before deploying to the live system is to test those features thoroughly, so they can be sure the updated code won’t break anything in production.
This is true of all software, but because web apps depend so heavily on typing and clicking within a web browser, web testing tools like Playwright, which let you control a browser programmatically (simulating mouse clicks, typing, and so on), have emerged over time in the web dev space.
Because multimodal language models can take images as inputs, you can basically put 1) screenshots of a website and 2) web browser automation tools together and get something that “Sees” and “Clicks” and as a result can navigate the internet on your behalf.
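To make that a little more concrete, here’s a rough sketch of what that see-and-click loop might look like. This is only my guess at the general shape (not how OpenAI or Anthropic actually build it): Playwright drives the browser, and a placeholder ask_model function stands in for whatever multimodal model call decides the next action.

```python
# A minimal sketch of a "computer-using agent" loop, assuming Playwright is
# installed (pip install playwright && playwright install chromium) and that
# ask_model() is a hypothetical function wrapping a multimodal LLM call.
from playwright.sync_api import sync_playwright


def ask_model(screenshot: bytes, goal: str) -> dict:
    """Placeholder: send the screenshot and the goal to a multimodal model
    and get back an action like {"type": "click", "x": 120, "y": 340}."""
    raise NotImplementedError("Swap in your model provider of choice.")


def run_agent(goal: str, start_url: str, max_steps: int = 20) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(start_url)

        for _ in range(max_steps):
            action = ask_model(page.screenshot(), goal)  # "Sees"

            if action["type"] == "click":                # "Clicks"
                page.mouse.click(action["x"], action["y"])
            elif action["type"] == "type":
                page.keyboard.type(action["text"])
            elif action["type"] == "done":
                break

        browser.close()
```

That’s the whole trick: screenshot in, action out, repeat.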
This is extremely clever and might stick. I can’t say with confidence that this won’t work in the long term, especially when it comes to navigating legacy UI systems that haven’t changed in decades.
However, I can’t help but think that this is an extremely indirect way to put LLMs to use, and that it’s exciting only because it shows “Agency” (watch it make decisions!), not because UI-using agents are about to unlock a tremendous amount of value.
Data-Based
When your web browser goes to a website, it (typically) makes an HTTP request to a remote server and gets a response back. That response can be many things, a common one being a big string of HTML, a markup language that tells your web browser what it should display to the user. That is a big part of your browser’s job: converting HTML, which is basically a description (in code) of what a web page should look like, into the web page that a user sees.
If we zoom out a bit on this process, we can see that all that is happening is an exchange of data: you type in a URL, the browser sends a request to that URL, and data comes back. Now, because we humans do better with things that resemble our physical spaces, we need this data to be transformed (by wrapping it in HTML and CSS) into something with a nice, physical layout, which has led to an entire industry of UI & UX designers who specialize in creating human-navigable web pages.
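You can watch that exchange happen without a browser at all. A few lines of Python (using the requests library, though any HTTP client would do) make the same request your browser makes; the only thing missing is the rendering step:

```python
# The same exchange your browser performs, minus the rendering.
# Assumes the requests library is installed (pip install requests).
import requests

response = requests.get("https://example.com")
print(response.status_code)   # e.g. 200
print(response.text[:200])    # the start of a big string of HTML
```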
Large language models don’t have the same constraints. As their popularity with programmers shows, they’re perfectly happy working directly with code and data, with no need for graphical user interfaces. Why make LLMs navigate systems as if they experience the world the way we do? Is it justified, or is it just to show agency?
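Here’s the kind of interface a program, or an LLM given tool access, could consume directly instead: an endpoint that returns structured data, with no layout, no screenshots, and no clicking. The endpoint and fields below are made up purely for illustration:

```python
# A made-up food-delivery API, for illustration only. The point is the shape
# of the interaction: a request for data, and structured data in return.
import requests

response = requests.get(
    "https://api.example-burrito-service.com/v1/menu",  # hypothetical endpoint
    params={"restaurant": "burrito-place", "limit": 5},
)
for item in response.json():                            # plain JSON, no pixels
    print(item["name"], item["price"])
```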
Consider the dishwasher.
See you next week!