What are AI agents, really?
From https://x.com/Suhail/status/1714019609224044654
In the last post, I wrote about AI agents as product managers but realized that as it's a very new term, it might not yet be clear to everyone what it means. Let's establish the term here.
"AI Agents" are causing a lot of hum and buzz in the AI ecosystem. AutoGPT is the fastest-growing GitHub repository ever in terms of stars, having crossed the 150k stars by now.
Additionally, it seems like almost daily, new agent projects are being created. See XAgent, AutoGPT, BabyAGI, AutoGen, …
But what are AI agents?
AI agents, or intelligent agents, are computer programs or systems designed to perceive their environment, make decisions, and take actions autonomously in order to achieve goals. They can be simple or complex; examples include a thermostat, a human being, or any other system that meets this definition. AI agents are often described schematically as an abstract functional system, similar to a computer program. They can be autonomous, meaning they are designed to function without human intervention.
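To make that definition concrete, here is a minimal perceive-decide-act loop for the thermostat example. The class and method names are purely illustrative, not any standard API:

```python
# A thermostat as a minimal agent: it perceives the room temperature,
# decides whether to heat, and acts on that decision.

class Thermostat:
    def __init__(self, target_temp: float):
        self.target_temp = target_temp
        self.heater_on = False

    def perceive(self, room_temp: float) -> float:
        # The environment is the room; perception is just a sensor reading.
        return room_temp

    def decide(self, room_temp: float) -> bool:
        # Decision rule: heat whenever we are below the target.
        return room_temp < self.target_temp

    def act(self, should_heat: bool) -> None:
        self.heater_on = should_heat

    def step(self, room_temp: float) -> bool:
        # One full perceive -> decide -> act cycle.
        self.act(self.decide(self.perceive(room_temp)))
        return self.heater_on
```

Calling `Thermostat(21.0).step(18.5)` turns the heater on; a reading above the target turns it back off. The agent runs without human intervention, which is exactly the autonomy the definition describes.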
So why not just call them assistants? Thinking of an agent as an assistant is a pretty good approximation and will probably suffice for most people. However, agents can also act outside of a human assistant use-case, where the agent might just be part of a larger system.
For that, let's take a look at Langchain’s definition of an agent:
The core idea of agents is to use an LLM to choose a sequence of actions to take. In chains, a sequence of actions is hardcoded (in code). In agents, a language model is used as a reasoning engine to determine which actions to take and in which order.
What even is a chain? Let's say we have many company contracts stored as PDFs in Google Drive. Whenever we want to look something up in these contracts, doing it manually is very cumbersome. If we had a search engine that could search all those documents with natural language, that would be amazing... This use-case we've just described is the "Hello World" application of LLMs and a typical starting point for many people. How would we implement such a system? With a program that executes a (chained) sequence of steps:
1. Read all PDFs from the folder.
2. Parse each PDF and extract its text (libraries for this already exist in all major languages).
3. Chunk each PDF document into smaller pieces of, say, about 200 words. We usually chunk by tokens, but for simplicity we can assume that a word is roughly one token. A token is just a word in the vocabulary of the language model.
4. There are language models specialized in turning text into vectors, also called embeddings. We take a text chunk and tokenize it, meaning we turn the text into a sequence of numbers. We feed those numbers into the embedding model, which gives us back another bunch of numbers: the embedding vector. That vector usually has 512-1536 dimensions.
5. We store the chunks together with their vectors in a database that is optimized for fast similarity search over these vectors.
6. We're done building our index; now we can ask a question and answer it.
7. Let's say the question is, "What's the PTO policy in the contracts from last year?". We now repeat step 4 for this question - we get the embedding vector for the input question. With that vector, we ask the database for the nearest neighbors, i.e. the stored vectors with the smallest distance to it in the vector space. Usually, the Euclidean distance is used for this.
8. We take the text chunks of those neighbors and concatenate them. Together with the original query, we send an input of this form to the LLM:
This is the context:
===========================
all our text chunks from the pdfs, with potentially thousands of tokens
===========================
Based on the context above, answer the query:
What's the PTO policy in the contracts from last year?
9. In other words, we're gathering a bunch of context that we "stuff into" the LLM prompt, and we hope that, based on this context, the LLM can give us a great answer.
This is called RAG: Retrieval-Augmented Generation. Libraries like LlamaIndex and LangChain exist to make this pattern accessible.
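As a toy illustration of the steps above, the following sketch replaces the embedding model with a simple word-count vector and the vector database with a brute-force nearest-neighbor search, so it runs without any external services. All names, documents, and the embedding scheme here are made up for the example; a real system would call an actual embedding model and a vector database:

```python
import math
import re

DIM = 256   # real embedding vectors have roughly 512-1536 dimensions
VOCAB: dict[str, int] = {}  # toy stand-in: one dimension per distinct word

def embed(text: str) -> list[float]:
    # Stand-in for step 4: a real system would call an embedding model.
    # Here we count word occurrences and normalize to unit length, so
    # Euclidean distance roughly reflects word overlap.
    vec = [0.0] * DIM
    for tok in re.findall(r"[a-z0-9]+", text.lower()):
        vec[VOCAB.setdefault(tok, len(VOCAB)) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def euclidean(a: list[float], b: list[float]) -> float:
    # Step 7: distance between two points in the embedding space.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chunk(text: str, size: int = 200) -> list[str]:
    # Step 3: split each document into pieces of roughly `size` words.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

class VectorIndex:
    # Step 5: store (chunk, vector) pairs; a real vector database would
    # index these for fast approximate nearest-neighbor search.
    def __init__(self) -> None:
        self.entries: list[tuple[str, list[float]]] = []

    def add(self, document: str) -> None:
        for piece in chunk(document):
            self.entries.append((piece, embed(piece)))

    def query(self, question: str, k: int = 2) -> list[str]:
        # Steps 7-8: embed the question, return the k nearest chunks.
        q = embed(question)
        ranked = sorted(self.entries, key=lambda e: euclidean(q, e[1]))
        return [text for text, _ in ranked[:k]]

# Steps 1-2 (reading and parsing PDFs) are replaced by inline strings here.
index = VectorIndex()
index.add("Employees accrue 25 days of PTO per year under the 2022 contract.")
index.add("The office cafeteria serves lunch between noon and 2 pm.")

question = "What's the PTO policy in the contracts from last year?"
context = "\n".join(index.query(question, k=1))
prompt = (
    "This is the context:\n===========================\n"
    f"{context}\n===========================\n"
    f"Based on the context above, answer the query:\n{question}"
)
# Step 9: `prompt` would now be sent to the LLM.
```

Note that the retrieval step returns the PTO chunk rather than the cafeteria chunk purely because of word overlap; real embeddings capture semantic similarity, so a question that shares no words with the answer can still retrieve the right chunk.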
Cool. How do agents now come into play?
Let's say we're not just gathering PDFs but trying to extract data from a relational database such as MySQL.
And let's say that when we try that, we can't correctly extract the data from the database.
But what's the issue?
Is it the database connection?
Has the table we're looking at maybe been renamed? Has the column been renamed?
While we could have a chain trying to get data from a database with a predefined order of steps, this problem is better solved with an agent. An agent would be a system in which we let the LLM decide which steps to take to extract the data, based on the latest feedback it gets from its environment - in this case, the database.
The agent might look at the database error and realize that the table has been renamed. It tries another query and succeeds. In other words, we can use an LLM to debug a flow. So an LLM is basically the "brain" of an agent.
There are many more examples of agents, but as this post has already grown longer than anticipated, I'll leave it at that for now. I hope this helped a bit in understanding the different concepts.
Coming back to the question - are agents a fad? Is it hype?
In any new discipline, new terms need to be established. I think it's normal that not everyone has the exact definition for an agent. Establishing a clear term will take us a while, but I'm sure we'll get there.
However, I also believe that, yes, there is a lot of hype right now, and it's essential to take a step back and reflect on the problems we want to solve. Agents are a cool "what" - but what's the "why"?