Imagine you're operating an application that uses LLMs. If you're successful, the LLM bill can climb very quickly.
Some of the sources for a high LLM bill are:
Repeated regression tests: Even though the prompts didn't change, a naive implementation sends the same requests to the LLM over and over.
Prompt tuning: You might want to try out new prompts against existing test data, potentially running hundreds of tests per iteration.
Complex workflows: Most chatbot applications don't just talk to the LLM directly, but also call external APIs and external data sources to fulfill the query.
Maxing out the context: Depending on the application, you might need to use the context window fully. With GPT-4 (non-turbo), passing in 30,000 tokens (which might just be one long document) costs $1.80 at the 32k-context model's $0.06 per 1K input tokens.
A naive approach to this is a traditional key-value cache: you take the input you'd send to OpenAI, be it a single completion prompt or a whole conversation up to a given point, and do an exact string match against it. This helps especially for regression tests, which run the same prompts over and over, so it both speeds the tests up and saves costs.
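As a minimal sketch (the hashing scheme and in-memory dictionary are illustrative assumptions, not a specific library), such an exact-match cache could look like this:

```python
import hashlib
import json

# Toy exact-match cache: the key is a hash of the full request payload.
exact_cache: dict[str, str] = {}

def cache_key(messages: list[dict], model: str = "gpt-4") -> str:
    # Serialize deterministically so the same request always produces the same key.
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_completion(messages: list[dict], call_llm) -> str:
    key = cache_key(messages)
    if key in exact_cache:
        return exact_cache[key]    # cache hit: no API call, no cost
    response = call_llm(messages)  # cache miss: pay for the tokens once
    exact_cache[key] = response
    return response
```

However, what if we have the following prompts: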
"How to get started with Effect"
"Effect getting started"
"Getting started with Effect"
They all denote the exact same intent. I ran them all through one of the state-of-the-art embedding models, bge-large-v1.5, and got these results:
Cosine Similarity between 1 and 1: 1
Cosine Similarity between 1 and 2: 0.9178054403912372
Cosine Similarity between 1 and 3: 0.981772531149283
Cosine Similarity between 2 and 2: 1
Cosine Similarity between 2 and 3: 0.9321289492871373
Cosine Similarity between 3 and 3: 1
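For reference, here's a minimal sketch of how such pairwise similarities can be computed, assuming the sentence-transformers library and the public BAAI/bge-large-en-v1.5 checkpoint (the exact setup behind the numbers above may differ):

```python
from itertools import combinations_with_replacement

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

prompts = [
    "How to get started with Effect",
    "Effect getting started",
    "Getting started with Effect",
]
# For the second experiment below, prefix each prompt with "The intent of saying ".
embeddings = model.encode(prompts, normalize_embeddings=True)

for i, j in combinations_with_replacement(range(len(prompts)), 2):
    # With normalized embeddings, the dot product equals the cosine similarity.
    similarity = float(np.dot(embeddings[i], embeddings[j]))
    print(f"Cosine Similarity between {i + 1} and {j + 1}: {similarity}")
```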
Although it’s the exact same intent, it’s surprising to see the gap in cosine similarity between 1 and 2. I was wondering how we could “nudge” the embeddings to be more similar, since what we care about is whether they share the same intent. Instead of embedding the prompts as they are, I took the following variations:
The intent of saying "How to get started with Effect"
The intent of saying "Effect getting started"
The intent of saying "Getting started with Effect"
which yields these results:
Similarity between 1 and 1: 1
Similarity between 1 and 2: 0.9351850391263539
Similarity between 1 and 3: 0.9798117351520749
Similarity between 2 and 2: 1
Similarity between 2 and 3: 0.9662336980291407
Similarity between 3 and 3: 1
While the similarity between 1 & 3 is now slightly lower, 1 & 2 have moved closer together. I can also imagine specialized embedding models fine-tuned for this specific use case of semantic caching.
Instead of simply doing a K/V lookup, we can first get the embedding for the input and do a lookup against a vector store. If the similarity to an existing entry is above a certain predefined threshold, we return the stored result: a cache hit.
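Here is a minimal sketch of that lookup, using an in-memory list of (embedding, response) pairs as a stand-in for a real vector store; the threshold, helper names, and model choice are illustrative assumptions:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
SIMILARITY_THRESHOLD = 0.95  # made-up value; tune it for your application

# Toy "vector store": a list of (embedding, cached response) pairs.
semantic_cache: list[tuple[np.ndarray, str]] = []

def lookup(prompt: str) -> str | None:
    """Return a cached response if a semantically similar prompt was seen before."""
    query = model.encode(prompt, normalize_embeddings=True)
    for embedding, response in semantic_cache:
        if float(np.dot(query, embedding)) >= SIMILARITY_THRESHOLD:
            return response  # similarity above threshold: cache hit
    return None  # cache miss: call the LLM and store the result

def store(prompt: str, response: str) -> None:
    embedding = model.encode(prompt, normalize_embeddings=True)
    semantic_cache.append((embedding, response))
```

In production you'd swap the list for a proper vector database with approximate nearest-neighbour search, but the flow stays the same: embed, look up, compare against the threshold.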
As we see more applications of the Voyager approach, such a fuzzy or semantic cache could also serve as a way to look up skills: have we seen this situation before, just with minor differences? If so, let's reach for the tools we already created in our tool library to deal with it.
Love it!
I’ve been thinking about this a lot: error handling, testing, caching, observability, … are going to be required in AI applications and we’ve already (kinda) figured it out for the web, so how can we transfer knowledge?
The new shift in Software 2.0* is uncertainty, and that brings a whole new set of considerations.
So many interesting challenges to work on!
*: https://karpathy.medium.com/software-2-0-a64152b37c35