The Agent-API Bottleneck
Jan 23, 2024
LLM-based agents promise the next step-change in the usefulness of generative AI on the road to AGI (Artificial General Intelligence). Instead of giving an AI a task such as “summarise this text”, you could give it a goal, e.g. “publish a novel about a time-travelling detective”, and it could manage the whole process from research through writing, publishing and even marketing with minimal human intervention.
At a high level an agent will make this possible by:
Reasoning about a problem and planning the steps needed to solve it
Interacting with the external environment (using tools such as APIs), resulting in real-world outcomes (an ebook being uploaded to Amazon)
It’s likely that point 1 will require one of the following: a new learning paradigm applied to a pre-trained LLM (perhaps an RLHF variant designed specifically for agents), a new pre-training procedure that does away with autoregressive token prediction, or an entirely new architecture (selective state space models are making waves). OpenAI’s rumoured Q* model is reportedly an example of progress in this area, able to reason about maths problems.
Point 2, however, is currently being tackled by the open-source and startup community. The prevailing approach involves training an LLM to interface with traditional APIs and to understand when those APIs might be appropriate - tools such as Zapier are relevant here. This will certainly unlock quick wins with legacy systems; however, we see the predefined, static nature of APIs as a bottleneck to an agent’s usefulness.
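To make that bottleneck concrete, here is a minimal sketch (plain Python, not tied to any particular framework) of the prevailing tool-calling pattern: the LLM is handed a fixed schema for a traditional API and must emit a call that fits it exactly. The tool name `get_product_price`, its schema and the dispatcher are hypothetical, but the shape mirrors function-calling setups such as OpenAI’s or Zapier’s actions.

```python
import json

# The "tool" is a predefined, static API endpoint wrapped in a fixed schema.
# In practice this schema would be placed in the model's system prompt.
TOOL_SCHEMA = {
    "name": "get_product_price",
    "description": "Look up the price of a product by its exact ID.",
    "parameters": {"product_id": "string"},
}

def get_product_price(product_id: str) -> dict:
    # Stand-in for a real GET request to a legacy e-commerce API.
    return {"product_id": product_id, "price_gbp": 49.99}

def run_tool_call(llm_output: str) -> dict:
    """Parse the model's JSON tool call and dispatch it to the matching function."""
    call = json.loads(llm_output)
    if call["name"] == "get_product_price":
        return get_product_price(**call["arguments"])
    raise ValueError(f"Unknown tool: {call['name']}")

# The LLM is constrained to produce exactly this shape of call:
print(run_tool_call('{"name": "get_product_price", "arguments": {"product_id": "SKU-123"}}'))
```

Anything the schema designer did not anticipate - a fuzzy query, an extra attribute, a request the endpoint never imagined - simply cannot be expressed, no matter how capable the model is.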
This leaves us with a paradox: LLMs possess vast potential and adaptability, yet they are constrained by the rigidity of their tools and APIs.
The alternative: Go native
We think the way to overcome this is to focus on integrating AI-native tools instead of trying to retrofit AI onto existing systems. To do this, the output space of your LLM and the input space of your tool should be the same - natural language. What’s more, this also opens up the potential for the LLM to learn new ways to use the tool over time through self-experimentation or structured prompting.
A widely used example of this is Retrieval-Augmented Generation (RAG). A traditional API approach might involve a “GET” request to retrieve specific information; the RAG approach instead combines LLMs with vector databases and embedding models, letting the AI interface directly with the ground-truth data (a collection of PDFs, for example) in natural language, without having to go through a constrictive API first. One could imagine “POST” requests being replaced by disposable code written and executed in the same environment as the API.
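As an illustration, here is a minimal RAG sketch under a few assumptions: the open-source sentence-transformers library stands in for the embedding model, an in-memory NumPy array stands in for the vector database, and `answer_with_llm` is a hypothetical placeholder for whichever LLM you call at the end.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Ground-truth data the agent interfaces with directly,
# e.g. text extracted from a collection of PDFs.
documents = [
    "Our returns policy allows refunds within 30 days of purchase.",
    "The linen midi dress is available in sage green and off-white.",
    "Shipping to the UK takes 2-4 working days.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Embed the natural-language query and return the k closest documents."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q  # cosine similarity, since vectors are normalised
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def answer_with_llm(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    # Here you would call your LLM of choice with `prompt`;
    # the point is that retrieval and generation both speak natural language.
    return prompt

print(answer_with_llm("Do you have the linen dress in green?"))
```

The key point is that the query, the stored data and the LLM’s output all live in the same natural-language space, so nothing has to be squeezed through a predefined endpoint.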
But this idea goes far deeper than simply searching through text. With the right multimodal models (such as CLIP or Gemini) you can extend an LLM to have deep semantic knowledge of any data, and we think these models will form the “tools” of the future, rather than infinitely extending APIs to meet the demands of LLMs.
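To show what such a tool might look like, here is a hedged sketch using OpenAI’s CLIP via the Hugging Face transformers library: product images and a free-form text query are embedded into the same space, so the search needs no predefined fields. The image paths and the query are placeholders for illustration, not our actual pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder product images and a free-form query.
product_images = [Image.open(p) for p in ["dress_1.jpg", "dress_2.jpg", "dress_3.jpg"]]
query = "a flowing mint-green summer dress"

inputs = processor(text=[query], images=product_images, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Cosine similarity between the query and every product image.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T).squeeze(-1)
best = scores.argmax().item()
print(f"Best match: product {best} (similarity {scores[best]:.3f})")
```

The “API” here is just an embedding space: add a new kind of query and nothing about the tool needs to change.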
At Moonsift, our AI agent has state-of-the-art product understanding, built from product images as well as descriptions. This allows our users to search for characteristics that were never explicitly mentioned (or perhaps even thought of) by the retailer.
For example, “A dress that looks like a Mojito”
A real-life example from Moonsift’s shopping copilot, currently in beta testing and launching in 2024.
Our initial copilot allows users to converse with products and refine their choices, providing a wealth of data to train our agent to become more autonomous. We envision Moonsift’s agents reaching a stage where they know your taste better than anyone: a world where someone who doesn’t know what gift to get you can simply ask your agent what to buy for a particular occasion. Or perhaps, if you’re a stylish individual, imagine renting out your agent to others to shop for them.