<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Tim Suchanek]]></title><description><![CDATA[Writing about life, engineering, and AI.]]></description><link>https://suchanek.co</link><image><url>https://suchanek.co/img/substack.png</url><title>Tim Suchanek</title><link>https://suchanek.co</link></image><generator>Substack</generator><lastBuildDate>Thu, 23 Apr 2026 12:22:48 GMT</lastBuildDate><atom:link href="https://suchanek.co/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Tim]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[tim279@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[tim279@substack.com]]></itunes:email><itunes:name><![CDATA[Tim Suchanek]]></itunes:name></itunes:owner><itunes:author><![CDATA[Tim Suchanek]]></itunes:author><googleplay:owner><![CDATA[tim279@substack.com]]></googleplay:owner><googleplay:email><![CDATA[tim279@substack.com]]></googleplay:email><googleplay:author><![CDATA[Tim Suchanek]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Building on weekends]]></title><description><![CDATA[&#8220;What the smartest people do on the weekends is what everyone else will do during the week in ten years.&#8221;]]></description><link>https://suchanek.co/p/building-on-weekends</link><guid isPermaLink="false">https://suchanek.co/p/building-on-weekends</guid><dc:creator><![CDATA[Tim Suchanek]]></dc:creator><pubDate>Mon, 13 Nov 2023 02:47:45 GMT</pubDate><content:encoded><![CDATA[<blockquote><p>&#8220;What the smartest people do on the weekends is what everyone else will do during the week in ten years.&#8221;</p></blockquote><p>&#8212; <a 
href="https://twitter.com/cdixon">Chris Dixon</a></p><p>A good friend of mine, Marc Mengler, just shared this fantastic quote with me. It blew me away because it rings so true. A variation of it is &#8220;What people in SF build on the weekends will be used by everyone else during the week.&#8221; This holds for many developer tools, hacking projects, and startup ideas in general.</p><p>Given that <a href="https://www.nytimes.com/2023/06/07/technology/ai-san-francisco-tech-industry.html">SF is so back</a> thanks to the developments in AI, I&#8217;m curious what people here, and curious minds around the world, will hack on next that will influence the lives of pretty much everyone else.</p><p>This is a short and sweet post, and it will be the last of my daily posts for a while. This is post number 48. Writing daily has been an amazing experience; I&#8217;ll write about what I learned another day.</p><p>However, I&#8217;ll take a break from daily writing, as I&#8217;ve decided to work on a deep implementation effort that will need my full attention. 
I&#8217;ll share soon what it is about.</p><p>No worries, I&#8217;ll get back to writing, just not now.</p><p>Thanks for following my posts so far, and see you soon!</p>]]></content:encoded></item><item><title><![CDATA[Unpacking Thoughts]]></title><description><![CDATA[While it can be dangerous to anthropomorphize LLMs too much, as we don&#8217;t want to end up like in the movie Her, LLMs might have more similarities with us than we think.]]></description><link>https://suchanek.co/p/unpacking-thoughts</link><guid isPermaLink="false">https://suchanek.co/p/unpacking-thoughts</guid><dc:creator><![CDATA[Tim Suchanek]]></dc:creator><pubDate>Sat, 11 Nov 2023 16:55:49 GMT</pubDate><content:encoded><![CDATA[<p>While it can be dangerous to anthropomorphize LLMs too much, as we don&#8217;t want to end up like in the movie <a href="https://en.wikipedia.org/wiki/Her_(film)">Her</a>, LLMs might have more similarities with us than we think.</p><p>In working with LLMs, there is a powerful technique called &#8220;<a href="https://www.promptingguide.ai/techniques/cot">chain-of-thought prompting</a>&#8221;. The idea is that instead of telling the LLM to do something for us, we give it an objective and have it &#8220;talk out loud&#8221; about how it would reason toward it, breaking the problem down into a chain of smaller logical steps. It turns out that prompting an LLM with this technique improves its reasoning capabilities. Using this technique alone, <a href="https://arxiv.org/abs/2201.11903">researchers were able to reach state-of-the-art </a>performance with LLMs in various assessments of arithmetic, commonsense, and symbolic reasoning tasks. In other words, we&#8217;re forcing the LLM to use more tokens to &#8220;think&#8221;. It&#8217;s, in a sense, speaking out loud while trying to solve a task.</p><p>And the same approach works for us humans. 
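</p><p>To make the technique concrete, here is a minimal sketch of the difference between direct prompting and chain-of-thought prompting. The helper names are mine, not from any library; a real setup would send these strings to an LLM client.</p>

```python
# A minimal sketch of chain-of-thought prompting (hypothetical helper names).
# The only change is in the prompt itself: we ask the model to write out its
# intermediate reasoning steps, spending more tokens to "think".

def build_plain_prompt(question: str) -> str:
    # Direct prompting: ask for the answer immediately.
    return f"Q: {question}\nA:"

def build_cot_prompt(question: str) -> str:
    # Chain-of-thought prompting: elicit the reasoning chain first.
    return (
        f"Q: {question}\n"
        "A: Let's think step by step, writing out each intermediate "
        "conclusion before stating the final answer."
    )

question = "If I have 3 apples and buy 2 more, how many do I have?"
plain = build_plain_prompt(question)
cot = build_cot_prompt(question)
```

<p>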
There are several approaches where using more words to describe something helps us think:</p><ul><li><p><a href="https://www.nytimes.com/2018/10/25/style/journaling-benefits.html">Journaling helps</a> us deal with difficult emotions, but also clarify our thinking</p></li><li><p><a href="https://en.wikipedia.org/wiki/Rubber_duck_debugging">Rubber Duck Debugging</a> is a technique used in software development: by describing a problem out loud to a rubber duck, you gain distance from it and break it down into its essential parts, which enables you to solve it</p></li><li><p>Psychotherapy &#8212; any form of cognitive therapy involves the patient talking a lot, which helps to uncover trauma and hidden thought patterns, and to gain perspective</p></li><li><p>Just talking to friends can be very powerful, and hearing back what they think restates the situation from a different point of view</p></li></ul><p>Quoting Paul Graham:</p><blockquote><p>Once you publish something, the convention is that whatever you wrote was what you thought before you wrote it. These were your ideas, and now you've expressed them. But you know this isn't true. You know that putting your ideas into words changed them. And not just the ideas you published. 
Presumably there were others that turned out to be too broken to fix, and those you discarded instead.</p></blockquote><p>From <a href="http://www.paulgraham.com/words.html">http://www.paulgraham.com/words.html</a></p><p>Talking thoughts out loud can be a powerful tool, not just to increase our own reasoning power, but also that of LLMs.</p>]]></content:encoded></item><item><title><![CDATA[Semantic Caching]]></title><description><![CDATA[Imagine you're operating an application using LLMs.]]></description><link>https://suchanek.co/p/semantic-caching</link><guid isPermaLink="false">https://suchanek.co/p/semantic-caching</guid><dc:creator><![CDATA[Tim Suchanek]]></dc:creator><pubDate>Fri, 10 Nov 2023 15:23:45 GMT</pubDate><content:encoded><![CDATA[<p>Imagine you're operating an application using LLMs. If you're successful, the LLM bill can grow very quickly.</p><p>Some of the sources of a high LLM bill are:</p><ol><li><p><strong>Repeated regression tests</strong>. Although prompts didn't change, a naive implementation would send the same requests over and over to the LLM.</p></li><li><p><strong>Prompt tuning</strong>: You might want to try out new prompts against existing test data, potentially running hundreds of tests per iteration.</p></li><li><p><strong>Complex workflows</strong>: Most chatbot applications don't just talk directly to the LLM, but involve external APIs and external data sources to fulfill the query.</p></li><li><p><strong>Maxing out the context</strong>: Depending on the application, you might need to use the context window fully. With GPT-4 (non-turbo), passing in 30,000 tokens (which might just be one long document) costs $1.80.</p></li></ol><p>A naive approach to solving this is a traditional key-value cache. You take the input that you'd send to OpenAI, be it a completion or a whole conversation up to a given point, and do an exact string match against it. 
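</p><p>A minimal sketch of such an exact-match cache, keyed on a hash of the full request payload (the <code>call_llm</code> function here is a hypothetical stand-in for a real client):</p>

```python
# An exact-match (key-value) LLM cache: identical requests hit the cache,
# while any difference in wording misses. `call_llm` is a hypothetical
# stand-in for a real LLM client function.
import hashlib
import json

_cache: dict[str, str] = {}

def cache_key(model: str, messages: list) -> str:
    # Hash the exact request payload; sort keys for a stable serialization.
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(model: str, messages: list, call_llm) -> str:
    key = cache_key(model, messages)
    if key not in _cache:
        _cache[key] = call_llm(model, messages)  # only pay on a miss
    return _cache[key]
```

<p>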
This especially helps with regression tests, which run the same prompts over and over; the cache both speeds the tests up and saves costs. However, what if we have the following prompts:</p><ol><li><p>"How to get started with Effect"</p></li><li><p>"Effect getting started"</p></li><li><p>"Getting started with Effect"</p></li></ol><p>They all denote the exact same intent. I ran them all through one of the state-of-the-art embedding models, <a href="https://huggingface.co/BAAI/bge-large-en-v1.5">bge-large-en-v1.5</a>, and got these results:</p><p><code>Cosine Similarity between 1 and 1: 1</code></p><p><code>Cosine Similarity between 1 and 2: 0.9178054403912372</code></p><p><code>Cosine Similarity between 1 and 3: 0.981772531149283</code></p><p><code>Cosine Similarity between 2 and 2: 1</code></p><p><code>Cosine Similarity between 2 and 3: 0.9321289492871373</code></p><p><code>Cosine Similarity between 3 and 3: 1</code></p><p>Although the intent is exactly the same, it&#8217;s surprising to see the gap in cosine similarity between 1 and 2. I wondered how we could &#8220;nudge&#8221; the embeddings to be more similar, since what we care about is whether they share the same intent. Instead of directly embedding the words as they are, I took the following variations:</p><ol><li><p>The intent of saying "How to get started with Effect"</p></li><li><p>The intent of saying "Effect getting started"</p></li><li><p>The intent of saying "Getting started with Effect"</p></li></ol><p>which yields these results:</p><p><code>Similarity between 1 and 1: 1</code></p><p><code>Similarity between 1 and 2: 0.9351850391263539</code></p><p><code>Similarity between 1 and 3: 0.9798117351520749</code></p><p><code>Similarity between 2 and 2: 1</code></p><p><code>Similarity between 2 and 3: 0.9662336980291407</code></p><p><code>Similarity between 3 and 3: 1</code></p><p>While the similarity between 1 &amp; 3 is slightly lower, 1 &amp; 2 have moved closer together. 
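</p><p>To make the idea concrete, here is a toy semantic cache: a lookup counts as a hit when the cosine similarity to a stored entry exceeds a threshold. The <code>embed</code> function is a stand-in for a real embedding model such as bge-large-en-v1.5.</p>

```python
# A toy semantic cache: nearest-neighbor lookup over stored embeddings,
# with a similarity threshold deciding what counts as a cache hit.
# `embed` is assumed to be provided (a real embedding model in practice).
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached result) pairs

    def put(self, prompt, result):
        self.entries.append((self.embed(prompt), result))

    def get(self, prompt):
        query = self.embed(prompt)
        best = max(
            self.entries,
            key=lambda entry: cosine_similarity(query, entry[0]),
            default=None,
        )
        if best and cosine_similarity(query, best[0]) >= self.threshold:
            return best[1]  # hit: same intent, possibly different wording
        return None  # miss: fall through to the actual LLM call
```

<p>The threshold is the interesting knob here: set it too low and you serve wrong answers, set it too high and you degrade back to an exact-match cache.</p><p>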
I can also imagine specialized embeddings fine-tuned for this specific use-case of semantic caching.</p><p>Instead of simply doing a K/V lookup, we can first get the embedding for the input and do a lookup against a vector store. If the similarity is above a certain predefined threshold, we return the result &#8212; we have a cache hit.</p><p>As we see more applications of the <a href="https://suchanek.co/p/one-of-the-most-powerful-concepts">Voyager</a> approach, such a fuzzy or semantic cache can also be used as a way to look up skills. Is this a situation we've seen before, just with a very small difference? Then let's take the tools we already created out of our tool library to deal with this situation.</p>]]></content:encoded></item><item><title><![CDATA[Framing]]></title><description><![CDATA[In old-school economics, it was assumed that humans are rational actors, acting based on selfish motives, and that by understanding this, we would be able to predict economic outcomes.]]></description><link>https://suchanek.co/p/framing</link><guid isPermaLink="false">https://suchanek.co/p/framing</guid><dc:creator><![CDATA[Tim Suchanek]]></dc:creator><pubDate>Thu, 09 Nov 2023 15:53:51 GMT</pubDate><content:encoded><![CDATA[<p>In old-school economics, it was assumed that humans are rational actors, acting based on selfish motives, and that by understanding this, we would be able to predict economic outcomes. However, this mentality baffled <a href="https://en.wikipedia.org/wiki/Daniel_Kahneman">Daniel Kahneman</a>, who together with <a href="https://en.wikipedia.org/wiki/Amos_Tversky">Amos Tversky</a> researched this topic. 
Kahneman knew from his years in psychology that, in his words,</p><blockquote><p>&#8220;[I]t is self-evident that people are neither fully rational nor completely selfish, and that their tastes are anything but stable.&#8221;</p></blockquote><p>Through decades of research with Tversky, Kahneman demonstrated that all humans suffer from cognitive biases&#8212;unconscious, irrational brain processes that distort how we see the world.</p><p>Kahneman and Tversky discovered more than 150 of these biases.</p><p>The <strong>Framing Effect</strong> is one of them: people respond differently to the same choice depending on how it is framed (people place greater value on moving from 90 percent to 100 percent&#8212;high probability to certainty&#8212;than from 45 percent to 55 percent, even though both are ten-percentage-point gains).</p><p>This made me think of the latest framing around OpenAI&#8217;s announced <strong><a href="https://techcrunch.com/2023/11/06/app-store-for-ai-build-your-own-gpt-and-sell-it-on-openais-gpt-store/">GPT Store</a>,</strong> the app store where you can build, publish, and monetize your own chatbots. The part that I found particularly interesting was the framing of &#8220;Revenue Share&#8221;. That&#8217;s a very interesting framing, almost a euphemism for nothing other than &#8220;platform fees.&#8221;</p><p>Apple charges 30% of all revenue coming from iOS apps in fees. It is, after all, a fee &#8212; people pay money to use the creation of someone who sat down and spent time creating it. Don&#8217;t get me wrong - without Apple, and in this case without OpenAI, it wouldn&#8217;t be possible for anyone to even create something like this. They make the platform. It&#8217;s interesting to see how the framing here focuses on the positive part - &#8220;Revenue&#8221; - yes, people want revenue, and currently, being in SF, the epicenter of AI developments, I can feel the gold rush atmosphere. 
There&#8217;s almost a fever going on (with, for sure, some cognitive biases at play), with people frantically trying to keep up with OpenAI and the opportunities that open up. And the little word &#8220;share&#8221;, which denotes the existence of a fee, sits so innocently on the side. The messaging seems to work. People are excited, and in the last 48 hours, authors have created chatbots to chat with their books, some are building personal assistants, and others are replacing their hand-crafted RAG systems with a custom GPT. Let&#8217;s see how much revenue OpenAI will share, but I wouldn&#8217;t be surprised if it&#8217;s 70%, as with Apple.</p>]]></content:encoded></item><item><title><![CDATA[Closing the loop]]></title><description><![CDATA[I recently asked a friend working at OpenAI what kind of startup he'd found if he were to build one today.]]></description><link>https://suchanek.co/p/closing-the-loop</link><guid isPermaLink="false">https://suchanek.co/p/closing-the-loop</guid><dc:creator><![CDATA[Tim Suchanek]]></dc:creator><pubDate>Wed, 08 Nov 2023 15:18:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!60nm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d272ad-ec4b-44b7-8c81-0347a9ba436f_1882x990.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I recently asked a friend working at OpenAI what kind of startup he'd found if he were to build one today. His answer was simple: I'd solve a specific problem where I can collect training data from users, with which I can fine-tune a model on top of OpenAI, which increases the quality of the product, which brings in more users, which gets you more training data, which lets you further fine-tune the model, and so on.</p><p>Midjourney is a staggering example of this. Its quality is still way superior to DALL-E 3's. How is that possible? 
One thing is that they're focused but also able to constantly increase the quality by getting user feedback.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!60nm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d272ad-ec4b-44b7-8c81-0347a9ba436f_1882x990.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!60nm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d272ad-ec4b-44b7-8c81-0347a9ba436f_1882x990.jpeg 424w, https://substackcdn.com/image/fetch/$s_!60nm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d272ad-ec4b-44b7-8c81-0347a9ba436f_1882x990.jpeg 848w, https://substackcdn.com/image/fetch/$s_!60nm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d272ad-ec4b-44b7-8c81-0347a9ba436f_1882x990.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!60nm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d272ad-ec4b-44b7-8c81-0347a9ba436f_1882x990.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!60nm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d272ad-ec4b-44b7-8c81-0347a9ba436f_1882x990.jpeg" width="1456" height="766" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b1d272ad-ec4b-44b7-8c81-0347a9ba436f_1882x990.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:766,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:171486,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!60nm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d272ad-ec4b-44b7-8c81-0347a9ba436f_1882x990.jpeg 424w, https://substackcdn.com/image/fetch/$s_!60nm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d272ad-ec4b-44b7-8c81-0347a9ba436f_1882x990.jpeg 848w, https://substackcdn.com/image/fetch/$s_!60nm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d272ad-ec4b-44b7-8c81-0347a9ba436f_1882x990.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!60nm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1d272ad-ec4b-44b7-8c81-0347a9ba436f_1882x990.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><a href="https://twitter.com/DrJimFan/status/1643279641065713665">https://twitter.com/DrJimFan/status/1643279641065713665</a></p><p>By choosing which picture to upscale, you give Midjourney feedback on which option you prefer, from which they can further train their models.</p><p>What would such a loop look like for LLM-based applications? 
What's the journey from starting with a prompt to a closed loop that can even tune itself?</p><ol><li><p>Create a prompt using one of the prompt playgrounds out there, such as <a href="https://baserun.ai/">baserun.ai</a>'s, and test it with a few test cases you believe users will have</p></li><li><p>Deploy the prompt to production and get the first users</p></li><li><p>With the first users providing actual input, collect that input and turn it into evaluation data</p></li><li><p>With this evaluation data, you can now do a couple of things:</p><ol><li><p>Further improve the prompt</p></li><li><p>Fine-tune a model</p></li><li><p>Adjust your <a href="https://suchanek.co/p/rag-an-old-idea-in-a-new-coat">RAG</a> approach (reordering, etc.)</p></li></ol></li><li><p>Deploy the changes</p></li><li><p>Get more user feedback, which adds more evaluation data</p></li><li><p>Profit</p></li></ol><p>This is one of the ways to build a moat. By constantly improving on a hard problem over a few iteration cycles, you can end up miles ahead of where you started. What if there were a tool that could do this tuning for you automatically, even deploying the changes? 
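</p><p>As a sketch, with entirely hypothetical names, the feedback-collection half of this loop might look like the following: production traffic plus user feedback accumulates into an evaluation set, against which candidate prompts can be scored.</p>

```python
# A sketch of the closing-the-loop idea; all names are hypothetical.
# User feedback on production outputs becomes evaluation data, which is
# then used to score the next prompt (or fine-tune) iteration.
from dataclasses import dataclass, field

@dataclass
class EvalExample:
    user_input: str
    model_output: str
    thumbs_up: bool  # the user-feedback signal

@dataclass
class FeedbackLoop:
    eval_set: list = field(default_factory=list)

    def record(self, user_input, model_output, thumbs_up):
        # Turn real production traffic into evaluation data.
        self.eval_set.append(EvalExample(user_input, model_output, thumbs_up))

    def score(self, run_prompt):
        # Score a candidate prompt against the liked examples. The metric
        # here is a stand-in: does the candidate reproduce the outputs
        # users gave a thumbs-up to?
        liked = [e for e in self.eval_set if e.thumbs_up]
        if not liked:
            return 0.0
        hits = sum(1 for e in liked if run_prompt(e.user_input) == e.model_output)
        return hits / len(liked)
```

<p>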
We'll soon see this tooling, and I'm excited to see applications improve and become more useful for us humans!</p>]]></content:encoded></item><item><title><![CDATA[One of the most powerful concepts in AI agents: Voyager]]></title><description><![CDATA[Not even half a year ago, a revolutionary paper was released in the field of autonomous agents research that, in my opinion, hasn&#8217;t gotten enough attention given the magnitude of impact it will have.]]></description><link>https://suchanek.co/p/one-of-the-most-powerful-concepts</link><guid isPermaLink="false">https://suchanek.co/p/one-of-the-most-powerful-concepts</guid><dc:creator><![CDATA[Tim Suchanek]]></dc:creator><pubDate>Tue, 07 Nov 2023 14:46:41 GMT</pubDate><content:encoded><![CDATA[<p>Not even half a year ago, a revolutionary paper was released in the field of autonomous agents research that, in my opinion, hasn&#8217;t gotten enough attention given the magnitude of impact it will have. I&#8217;m talking about <strong><a href="https://arxiv.org/pdf/2305.16291.pdf">VOYAGER: An Open-Ended Embodied Agent with Large Language Models</a>.</strong></p><p>The authors from NVIDIA, Caltech, UT Austin, Stanford, UW Madison, including <a href="https://jimfan.me/">Jim Fan</a>, have built an agent that can teach itself to play Minecraft. How does it work?</p><ol><li><p>Explore the world, and whenever you find yourself in a new situation, reason about the situation and come up with an idea of what tool to build or use.</p></li><li><p>Try to use or build the tool. At this point, they have an iterative algorithm, which tries over and over, with the agent self-verifying if the tool was successfully created or the action successfully performed. Example for this: <em>Reasoning: Since you have a wooden pickaxe and some stones, it would be beneficial to upgrade your pickaxe to a stone pickaxe for better efficiency. 
Task: Craft 1 stone pickaxe.</em></p></li><li><p>Once the agent verifies that it successfully performed the task, it will save it in a skill library.</p></li><li><p>The next time the agent faces a similar situation, it can take the skill from the library instead of having to create it again.</p></li></ol><p>It uses GPT-4 as its reasoning engine. With this simple mechanism, Voyager was able to beat the state of the art by orders of magnitude. It allows the agent to explore the world in a self-supervised fashion without the need for human intervention.</p><p>Now the obvious question is - where else can one apply the Voyager approach? Robotics, Healthcare, Education, Environmental Tracking, Smart Home assistants - the possibilities are endless.</p><p>For a good reason, this paper won the <a href="https://neurips.cc/virtual/2022/awards_detail">NeurIPS Outstanding Paper</a> Award.</p>]]></content:encoded></item><item><title><![CDATA[OpenAI's T+1 expansions]]></title><description><![CDATA[Any product idea that you&#8217;ll ever see is always coming from a T+1 expansion.]]></description><link>https://suchanek.co/p/openais-t1-expansions</link><guid isPermaLink="false">https://suchanek.co/p/openais-t1-expansions</guid><dc:creator><![CDATA[Tim Suchanek]]></dc:creator><pubDate>Tue, 07 Nov 2023 03:28:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fMjN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ccfda1c-e432-47e9-a226-60b5984cfc08_1050x673.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Any product idea that you&#8217;ll ever see is always coming from a T+1 expansion. Given the world today, including its problems and solutions that already exist, what can a better solution look like? We humans always combine existing things. Nobody can just "skip" a step. 
However, when you iterate quickly, you can make impressive progress in a short amount of time, progress that is often hard to follow from the outside and looks like a big jump.</p><p>So it is with OpenAI&#8217;s progress &#8212; it&#8217;s a chain of T+1 expansions, and by being incredibly fast in bringing one expansion after another, a few years in, the territory that OpenAI covers is vast.</p><p>I imagine this like a game of Warcraft or Settlers. You have a fog of war: areas that are dark, which you can't see and therefore haven't conquered yet.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fMjN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ccfda1c-e432-47e9-a226-60b5984cfc08_1050x673.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fMjN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ccfda1c-e432-47e9-a226-60b5984cfc08_1050x673.png 424w, https://substackcdn.com/image/fetch/$s_!fMjN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ccfda1c-e432-47e9-a226-60b5984cfc08_1050x673.png 848w, https://substackcdn.com/image/fetch/$s_!fMjN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ccfda1c-e432-47e9-a226-60b5984cfc08_1050x673.png 1272w, https://substackcdn.com/image/fetch/$s_!fMjN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ccfda1c-e432-47e9-a226-60b5984cfc08_1050x673.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!fMjN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ccfda1c-e432-47e9-a226-60b5984cfc08_1050x673.png" width="1050" height="673" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ccfda1c-e432-47e9-a226-60b5984cfc08_1050x673.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:673,&quot;width&quot;:1050,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:804607,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fMjN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ccfda1c-e432-47e9-a226-60b5984cfc08_1050x673.png 424w, https://substackcdn.com/image/fetch/$s_!fMjN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ccfda1c-e432-47e9-a226-60b5984cfc08_1050x673.png 848w, https://substackcdn.com/image/fetch/$s_!fMjN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ccfda1c-e432-47e9-a226-60b5984cfc08_1050x673.png 1272w, https://substackcdn.com/image/fetch/$s_!fMjN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ccfda1c-e432-47e9-a226-60b5984cfc08_1050x673.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Now, let's put ourselves into the shoes of a gen-AI founder who, just after GPT-4 came out, looked at OpenAI's offerings and tried to find an opportunity, in other words, dark spots on the map that OpenAI doesn&#8217;t control yet. Here are some of the obvious places that were still "dark spots" on OpenAI's map, and how OpenAI just expanded into them today:</p><ol><li><p><strong>Customized models</strong> for enterprises that can run on-prem. <a href="https://en.wikipedia.org/wiki/Emad_Mostaque">Emad Mostaque</a>, CEO of StabilityAI, observed this and wanted to collaborate with companies all around the world, counter-positioning OpenAI with decentralized models. This is now covered by custom models you can train with OpenAI for $2-3M. 
Yes, it's a lot of money, but for the value these models will provide, OpenAI understands that this price is probably even cheap.</p></li><li><p><strong>Saving costs:</strong> Finding a way to get the same quality of service for less money. <a href="https://openpipe.ai/">OpenPipe</a> is one example - training smaller models based on GPT-4 responses to save costs. Some use-cases here are now less urgent, as GPT-4 Turbo has a 3x cost reduction.</p></li><li><p><strong>Chat with your data:</strong> OpenAI just released a whole SDK that makes it easy to chat with your data. You can now upload any file (they're supporting most MIME types), which OpenAI will turn into a useful chatbot for you. You can connect any internal functions and call them with the same assistant. While this is far from an autonomous agent, it covers the most popular use-case of LangChain and LlamaIndex: talking to your documents.</p></li><li><p><strong>Well-structured output:</strong> Getting reliable JSON output from OpenAI was hard until recently, and you needed to use threatening language like "I'll take a life" - this is not necessary anymore, and we can now use GPT-4 Turbo reliably as an API.</p></li><li><p><strong>Long context:</strong> I used Claude 2 quite a bit to chat with larger PDFs such as books. This is no longer needed, thanks to Assistants but also GPT-4 Turbo's 128k context in general.</p></li></ol><p>Given that OpenAI has now expanded further in all these directions, what will come next? What is another T+1 step from here? It's clear that if a startup you're building today is within the T+1 reach of OpenAI's next iteration cycle, there's significant risk involved.</p><p>So, what are some ways to be protected against that? How can you build a startup today that's not prone to being "eaten" or "killed", as people like to say on <s>Twitter</s> X, by the next round of OpenAI announcements?</p><p>I'll write about that in the coming days. 
For now, let's just acknowledge the speed at which OpenAI is shipping. This is truly inspiring.</p>]]></content:encoded></item><item><title><![CDATA[Inevitable AI developments]]></title><description><![CDATA[Three of the most impactful inevitable trends shaping the future are cognifying, tracking, and questioning.]]></description><link>https://suchanek.co/p/inevitable-ai-developments</link><guid isPermaLink="false">https://suchanek.co/p/inevitable-ai-developments</guid><dc:creator><![CDATA[Tim Suchanek]]></dc:creator><pubDate>Mon, 06 Nov 2023 04:39:52 GMT</pubDate><content:encoded><![CDATA[<p>Three of the most impactful inevitable trends shaping the future are <strong>cognifying, tracking, and questioning</strong>.</p><p>Cognifying refers to making previously "dumb" objects and processes smarter through technology like AI. As more objects and systems become cognified, they will be able to operate with less human input and oversight. While this could lead to greater efficiency, it also raises concerns around transparency and control. How can we ensure cognified systems act ethically if we don't fully understand how they make decisions?</p><p>Tracking is also increasing rapidly with new technologies. While customized services and experiences can benefit individuals, invasive tracking threatens privacy. There is a difficult balance between utilizing data to improve lives versus respecting personal boundaries. The onus should be on companies and governments to collect only essential data and be fully transparent about how it is used.</p><p>Finally, questioning has become ubiquitous due to the overwhelming amount of contradictory information online. While healthy skepticism is positive, pervasive questioning of facts and expertise threatens to undermine institutions and spread misinformation. 
We must find ways to separate truth from fiction through critical thinking and responsible gatekeeping.</p><p>Overall, we must thoughtfully guide these technological changes to create an equitable and enlightened future society. With care, AI and related advances can enhance our lives without dehumanizing us in the process. As outlined in Kevin Kelly's book <a href="https://www.amazon.com/Inevitable-Understanding-Technological-Forces-Future/dp/0143110373">The Inevitable</a>, these trends will shape the future whether we want them to or not, so it is up to us to direct them wisely.</p>]]></content:encoded></item><item><title><![CDATA[The potential of GPT4-V]]></title><description><![CDATA[As we head toward the OpenAI Dev Day on Monday, people are curious about what they&#8217;ll release.]]></description><link>https://suchanek.co/p/the-potential-of-gpt4-v</link><guid isPermaLink="false">https://suchanek.co/p/the-potential-of-gpt4-v</guid><dc:creator><![CDATA[Tim Suchanek]]></dc:creator><pubDate>Sat, 04 Nov 2023 15:46:57 GMT</pubDate><content:encoded><![CDATA[<p>As we head toward the <a href="https://devday.openai.com/">OpenAI Dev Day</a> on Monday, people are curious about what they&#8217;ll release. Some of the common predictions I&#8217;ve seen so far:</p><ol><li><p>GPT4-Vision API</p></li><li><p>Stateful API</p></li><li><p>GPT4 fine-tuning</p></li><li><p>GPT4 turbo</p></li><li><p>Dalle3 API</p></li><li><p>Code interpreter API</p></li><li><p>Agent framework</p></li></ol><p>I see GPT4-Vision API as the most likely option right now, as they already demoed the API at the AI Engineer Summit. While at the summit, OpenAI had a neat little demo around blog post generation &#8212; what can it unlock in the real world?</p><p>Some ideas on how GPT4-V could be used for good:</p><ol><li><p><strong>Build your own rewind.ai:</strong> <a href="https://www.rewind.ai/">Rewind AI&#8217;s</a> interface to gather context about your life is through vision. 
It takes regular screenshots and analyzes the data. Through that, Rewind becomes very useful over time, letting you ask questions about anything you saw. While you can build this on your own with OCR, OCR wouldn&#8217;t allow you to ask semantic questions about an image, such as &#8220;What can we see here?&#8221; or &#8220;What&#8217;s important in this screenshot given my context?&#8221;. While typical RAG applications vectorize text from websites, APIs, and databases and do similarity searches on it, it&#8217;ll be hard to build a fully comprehensive AI agent without tapping into the most general API of them all: vision. There&#8217;s more to building your own rewind.ai, but the components to do so are being commoditized as we speak.</p></li><li><p><strong>Improved browser automation</strong>: Ultimately, browser engines render a website to be consumed by a human, not a machine. While we can read HTML code with machines and try to make sense of it, at some point it&#8217;ll always hit limitations, such as z-index calculations. Being able to use both text and images will allow folks to build powerful scrapers and browser automation moving forward.</p></li><li><p><strong>Generative UI:</strong> When asked to implement a UI from a screenshot, GPT4-V is already surprisingly good at it. While the results might not be production-ready yet, they show us the potential of the technology. Instead of starting at 0 as with <a href="https://v0.dev/">v0.dev</a>, what if you could provide your favorite website designs as a mood board to start with and let the AI do the rest?</p></li><li><p><strong>Turn images into APIs:</strong> Given an image of groceries, GPT-4V was able to return perfect JSON containing a list of items, including their calories. With GPT4-V, we&#8217;ll be able to turn any image or video into an API that can be programmatically accessed and used as a building block for a new frontier of apps. 
It&#8217;s obvious that LlamaIndex and LangChain will also add support for this.</p></li></ol><p>The potential of GPT4-V is endless. However, as with any new technology, it&#8217;s not yet perfect. A recent study found that <a href="https://arxiv.org/pdf/2310.16809.pdf">OCR models still beat GPT4V</a>, and a <a href="https://evren.ninja/gpt4-vision-prompt-injection.html">certain level of prompt injections</a> is possible through image text. It&#8217;s also clear that it&#8217;s not affordable to run GPT4-V over millions of images. A blockbuster has about 300k-600k frames. Turning a whole movie into an API might still be a bit far out. However, we also know that costs will always decrease and models will get faster.</p><p>What can one build today that&#8217;s just not yet possible but will be possible with this technology soon?</p>]]></content:encoded></item><item><title><![CDATA[Signs of shortsightedness]]></title><description><![CDATA[In our busy lives, we often find ourselves caught up in the &#8220;now,&#8221; losing sight of the bigger picture.]]></description><link>https://suchanek.co/p/signs-of-shortsightedness</link><guid isPermaLink="false">https://suchanek.co/p/signs-of-shortsightedness</guid><dc:creator><![CDATA[Tim Suchanek]]></dc:creator><pubDate>Sat, 04 Nov 2023 04:22:04 GMT</pubDate><content:encoded><![CDATA[<p>In our busy lives, we often find ourselves caught up in the <strong>&#8220;</strong>now<strong>,&#8221;</strong> losing sight of the bigger picture. This myopia can cause us to stray from our long-term goals.</p><p>Here are five signs of shortsightedness and how to correct them.</p><ol><li><p><strong>Unintended Consequences</strong>: Acting hastily can lead to unforeseen outcomes. Solution? Take a moment to think through the potential fallout before deciding. 
Assign a skeptic in your group to play devil&#8217;s advocate, analyzing the possible consequences of a decision.</p></li><li><p><strong>Tactical Hell</strong>: Getting caught up in minor battles can distract you from your major goals. If you find yourself stuck in petty squabbles, step back, calm your ego, and refocus on your long-term objectives. Remember, actions speak louder than words.</p></li><li><p><strong>Ticker Tape Fever</strong>: In the age of real-time updates, we often react impulsively. Slow down, take a breath, and give yourself time to process information. Like Abraham Lincoln, exercise patience to see the bigger picture and make better decisions.</p></li><li><p><strong>Lost in Trivia</strong>: Overloading with too much information can muddle your thinking. Establish a hierarchy of priorities and focus on what truly matters. Delegate lesser tasks and avoid drowning in trivial details.</p></li><li><p><strong>Fear of Missing Out (FOMO)</strong>: The anxiety of not keeping up with trends or others' activities can lead to rushed, ill-considered decisions. Combat this by reminding yourself of your personal goals and values. It's okay to march to the beat of your own drum, prioritizing what's crucial for your long-term vision over fleeting trends.</p></li></ol><p>By recognizing and addressing short-sighted thinking, you'll foster a clearer, broader perspective that will steer you closer to your long-term ambitions. 
With a balanced view and patient approach, you'll navigate through life's challenges more adeptly, making decisions that resonate well into the future.</p><p>Some of these come from <a href="https://www.amazon.com/Laws-Human-Nature-Robert-Greene/dp/0525428143">Laws of Human Nature</a>, which I highly recommend reading.</p>]]></content:encoded></item><item><title><![CDATA[Do the best ideas win?]]></title><description><![CDATA[How can it be that although Stoicism is becoming more popular today, Christianity has been the predominant belief system for so long?]]></description><link>https://suchanek.co/p/do-the-best-ideas-win</link><guid isPermaLink="false">https://suchanek.co/p/do-the-best-ideas-win</guid><dc:creator><![CDATA[Tim Suchanek]]></dc:creator><pubDate>Fri, 03 Nov 2023 04:41:27 GMT</pubDate><content:encoded><![CDATA[<p>How can it be that although Stoicism is <a href="https://www.elle.com/uk/life-and-culture/culture/a42993347/stoicism-keeping-calm/">becoming more popular today</a>, Christianity has been the predominant belief system for so long? Wouldn't the best ideas always prevail?</p><p>Obviously not. There are a few reasons why Christianity gained dominance over Stoicism historically, even though Stoic ideas are resonating more in the modern era:</p><ol><li><p>Christianity gained an early boost by piggybacking on the infrastructure of the vast Roman Empire. Once Constantine converted, Christianity could spread rapidly through imperial decrees and resources. This early infrastructure advantage was critical.</p></li><li><p>Christianity appealed more to the heart than the mind. Its focus on Christ's narrative and the promise of personal salvation had greater emotional draw than Stoicism's rational emphasis on virtue and discipline.</p></li><li><p>Christianity aligned better with human psychology. 
Its black and white moral code and threat of divine punishment played upon innate cognitive biases in a way the nuanced ethics of Stoicism did not.</p></li></ol><p>In short, the dominance of Christianity historically does not necessarily mean it offered the "best ideas." It spread for institutional, emotional, and psychological reasons. The revival of Stoicism today shows that when ideas are considered on merit, the old Greek philosophy still has tremendous wisdom to offer the modern world.</p>]]></content:encoded></item><item><title><![CDATA[Confirmation Bias]]></title><description><![CDATA[Confirmation bias is like a pair of tinted glasses that colors the way we perceive the world around us.]]></description><link>https://suchanek.co/p/confirmation-bias</link><guid isPermaLink="false">https://suchanek.co/p/confirmation-bias</guid><dc:creator><![CDATA[Tim Suchanek]]></dc:creator><pubDate>Thu, 02 Nov 2023 03:39:50 GMT</pubDate><content:encoded><![CDATA[<p>Confirmation bias is like a pair of tinted glasses that colors the way we perceive the world around us. It's a psychological shortcut that often leads us astray, prioritizing information that aligns with our existing beliefs while ignoring the rest. This cognitive glitch can warp our understanding and decision-making. Here are three instances to paint a clearer picture:</p><ol><li><p><strong>The Cold Fusion Controversy (1989)</strong></p></li></ol><p>When Martin Fleischmann and Stanley Pons announced their discovery of cold fusion, it was like a ray of hope for a clean energy future. However, this hope made some overlook the inconsistencies in their data. Many, swayed by confirmation bias, claimed to replicate their results, choosing to ignore counter-evidence. 
The bubble burst when further rigorous testing failed to reproduce the same results, extinguishing the initial spark of excitement.</p><ol start="2"><li><p><strong>Political Echo Chamber</strong></p></li></ol><p>The political sphere often mirrors a large echo chamber where our beliefs bounce back at us, reinforced. A conservative might find solace in the familiar rhetoric of conservative news channels, while a liberal might resonate with the liberal ones. This selective tuning in, driven by confirmation bias, cocoons us from opposing viewpoints, reinforcing our existing beliefs while shielding us from any information that might challenge them.</p><ol start="3"><li><p><strong>Investment Decisions</strong>:</p></li></ol><p>The financial markets are a rollercoaster ride, and confirmation bias can sometimes make us cling to the wrong stocks. An investor, swayed by a single positive forecast, might ignore the storm clouds of adverse market trends, holding onto declining stocks with a false sense of hope.</p><p>These examples of confirmation bias in action show how it can subtly yet significantly skew our perception and decisions. 
By acknowledging the presence of this bias, we can strive towards a more balanced and evidence-based approach to our judgments and decisions, navigating through life with a clearer lens.</p><p>One way to tackle confirmation bias is with a clear <a href="https://suchanek.co/p/how-to-make-better-decisions">decision-making process</a>.</p>]]></content:encoded></item><item><title><![CDATA[AI's socratic journey]]></title><description><![CDATA[Many years ago, the Greek philosopher Socrates was told by the Oracle of Delphi that he was the smartest man alive.]]></description><link>https://suchanek.co/p/ais-socratic-journey</link><guid isPermaLink="false">https://suchanek.co/p/ais-socratic-journey</guid><dc:creator><![CDATA[Tim Suchanek]]></dc:creator><pubDate>Tue, 31 Oct 2023 13:51:45 GMT</pubDate><content:encoded><![CDATA[<p>Many years ago, the Greek philosopher Socrates was told by the Oracle of Delphi that he was the smartest man alive. Socrates wasn't so sure, so he went on a journey to talk to other wise people in different professions.</p><p>During his conversations, he noticed that these so-called wise people often had strong opinions about things they didn't know much about. They hadn't really thought things through. This made Socrates realize he was wise because he knew what he didn't know.</p><p>Today we have Large Language Models (LLMs). These are computer programs capable of generating text that appears to be written by a human. 
They have been trained on a lot of data, which allows them to talk about a variety of topics.</p><p>However, like the wise men with whom Socrates spoke, LLMs sometimes give answers that seem confident but are in fact wrong or incomplete. Our challenge now is to help LLMs understand what they know and what they don't know. They <a href="https://www.beren.io/2023-03-19-LLMs-confabulate-not-hallucinate/">confabulate</a>.</p><p>If we can do this, LLMs will be much more useful because they will provide us with better and more reliable information. In short, like Socrates, LLMs must learn the limits of their knowledge.</p><p>Any LLM that can do this will be truly wise and will stand out from the rest. 
Remembering the words of Socrates: &#8220;I know that I am intelligent, because I know that I know nothing&#8221; will be the key to achieving this.</p><p>The LLM that can understand and live by this philosophy will indeed be the wisest.</p>]]></content:encoded></item><item><title><![CDATA[Everything Diffusion]]></title><description><![CDATA[While Transformers are seen as the most significant innovation in AI of recent years, powering the AI summer that we&#8217;re finding ourselves in, there&#8217;s another extremely powerful technology, which is discussed less: Diffusion.]]></description><link>https://suchanek.co/p/everything-diffusion</link><guid isPermaLink="false">https://suchanek.co/p/everything-diffusion</guid><dc:creator><![CDATA[Tim Suchanek]]></dc:creator><pubDate>Mon, 30 Oct 2023 14:42:33 GMT</pubDate><content:encoded><![CDATA[<p>While <a href="https://arxiv.org/abs/1706.03762">Transformers</a> are seen as the most significant innovation in AI of recent years, powering the AI summer that we&#8217;re finding ourselves in, there&#8217;s another extremely powerful technology, which is discussed less: Diffusion.</p><p>One of the hardest and most essential parts of training neural networks is high-quality evaluation. By what metric should we train the model? Before Stable Diffusion, vision models would be trained on recognizing objects in images through bounding boxes, semantic segmentation, or mere instance classification. Researchers trying to create the best models measured themselves against the benchmarks available at the time, and those benchmarks did not yet enable training models that create &#8220;beautiful art&#8221;.</p><p>Enter diffusion. The idea behind diffusion is quite simple. 
To create an evaluation function for the neural net we&#8217;re training, we take an image, apply progressively stronger random distortion, and teach the neural net to go from a more distorted version to a less distorted one. This has been hooked up with the CLIP model, which embeds images and text in the same latent space &#8594; now we can describe an image in text and, with a multi-iteration approach, arrive at a high-quality image output.</p><p>This idea of starting with something of lower quality and increasing its &#8220;resolution&#8221; or quality to train a neural net isn&#8217;t limited to vision models. We&#8217;ll see this idea find application in all kinds of areas:</p><ul><li><p><a href="https://arxiv.org/abs/2310.17680">CodeFusion</a>: An LLM using diffusion for better code generation</p></li><li><p><a href="https://arxiv.org/abs/2208.06193">Diffusion-QL</a>: Diffusion-based policies in reinforcement learning</p></li><li><p><a href="https://arxiv.org/abs/2112.10752">Stable Diffusion</a>: Image synthesis through diffusion</p></li></ul><p>There are dozens more applications of diffusion &#8212; a more comprehensive list can be found <a href="https://github.com/YangLing0818/Diffusion-Models-Papers-Survey-Taxonomy">here</a>. The results of the recently released <a href="https://arxiv.org/abs/2310.17680">CodeFusion</a> are very impressive. With just 75M parameters, it&#8217;s able to compete with models of up to 175B parameters. This is very promising news, because it means you won&#8217;t need hundreds of A100 GPUs to make an impact in AI research. What else will we be able to disrupt with diffusion? 
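</p><p>The forward "distortion" process described above can be sketched in a few lines. This is a toy sketch: a short list of numbers stands in for an image, and a hand-picked set of noise strengths stands in for a real diffusion noise schedule:</p>

```python
import random

random.seed(0)

def add_noise(signal, strength):
    """Forward process: distort a clean signal with Gaussian noise of a given strength."""
    return [x + random.gauss(0.0, strength) for x in signal]

def make_training_pairs(signal, levels):
    """Pair each version with the next-less-distorted one: the net's job is to
    map the noisier input back toward the cleaner target."""
    versions = [add_noise(signal, s) for s in levels]
    return list(zip(versions[1:], versions[:-1]))  # (input: noisier, target: cleaner)

clean = [0.0, 1.0, 0.5, -1.0]        # stand-in for an image
levels = [0.0, 0.1, 0.3, 0.6, 1.0]   # progressively stronger distortion
pairs = make_training_pairs(clean, levels)
print(len(pairs))  # 4 denoising steps to train on
```

<p>At sampling time, the trained net runs in the opposite direction, stepping iteratively from pure noise toward a clean output.</p><p>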
And more importantly, what will be the next big idea that fully changes the game again?</p>]]></content:encoded></item><item><title><![CDATA[Breeding prompts]]></title><description><![CDATA[Native AI applications such as chatbots, voice assistants, or agents are powered by LLMs.]]></description><link>https://suchanek.co/p/breeding-prompts</link><guid isPermaLink="false">https://suchanek.co/p/breeding-prompts</guid><dc:creator><![CDATA[Tim Suchanek]]></dc:creator><pubDate>Mon, 30 Oct 2023 02:23:15 GMT</pubDate><content:encoded><![CDATA[<p>Native AI applications such as chatbots, voice assistants, or agents are powered by LLMs. Their performance depends heavily on how the inputs or "prompts" given to the model are formulated. Small changes to prompts can have a significant impact on model output.</p><p>DeepMind researchers have found that providing reasoning hints, which guide the model through logical steps, significantly improves performance in mathematics, informal reasoning, and complex tasks.</p><p>For example, a prompt like &#8220;Let's break this problem down step by step&#8221; will yield a more precise solution than simply stating the problem.</p><p>The challenge is that effective prompts must be designed manually for each task.</p><p>New research introduces a method that automatically develops better prompts for a given problem.</p><p>Here's how it works:</p><ol><li><p>The prompt 
is initialized with a general description of the task (e.g., solving math word problems), a "thinking style" prompt (e.g., breaking the problem into smaller parts), and "mutant prompts" that create variations of prompts.</p></li><li><p>It generates a set of variations on the fly and tests them on sample problems to determine their "fit" or performance.</p></li><li><p>The best-performing prompts are varied and combined to create a new generation of prompt variations. This cycle repeats, retaining good mutations and eliminating less effective ones.</p></li><li><p>Importantly, the system also evolves its mutant prompts over several generations, continually improving the way it finds better prompts.</p></li></ol><p>They call this system "<a href="https://arxiv.org/abs/2309.16797">Promptbreeder</a>". In tests, it outperformed advanced hand-crafted prompting techniques in mathematics, commonsense reasoning, and other tasks. It was able to generate prompts tailored to each problem area automatically.</p><p>This represents an exciting new ability for AI systems to recursively improve themselves by continuously creating and testing variations on the fly. As models become bigger, prompt breeding may become an increasingly important technique for exploiting their full potential without the need for costly retraining. 
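</p><p>The steps above amount to a generic evolutionary search over prompts. Here is a toy sketch of that loop &#8211; not the actual Promptbreeder implementation; the <code>fitness</code> and <code>mutate</code> functions are hypothetical stand-ins for LLM-based scoring on sample problems and LLM-driven mutation:</p>

```python
import random

random.seed(42)

def fitness(prompt):
    """Toy fitness: reward prompts that mention reasoning cues.
    (Promptbreeder instead scores prompts by running an LLM on sample problems.)"""
    return sum(cue in prompt for cue in ("step by step", "break", "check"))

def mutate(prompt):
    """Toy mutation: append a random hint fragment.
    (Promptbreeder instead uses LLM-generated "mutant prompts".)"""
    return prompt + " " + random.choice(
        ["break it down", "check your work", "think step by step"])

def evolve(population, generations=5, keep=2):
    for _ in range(generations):
        # score the current generation and retain the best-performing prompts
        population = sorted(population, key=fitness, reverse=True)
        survivors = population[:keep]
        # vary the survivors to create the next generation
        population = survivors + [mutate(p) for p in survivors]
    return max(population, key=fitness)

best = evolve(["Solve the math problem.", "Let's think step by step."])
print(best)
```

<p>Because the best prompts survive each generation, the winning prompt's score can only improve on the initial population's.</p><p>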
The prompts we provide to AI, and how they shape its reasoning, may end up being just as important as the underlying algorithms.</p><p>However, keep in mind that when "breeding" prompts, the OpenAI bill can quickly explode.</p>]]></content:encoded></item><item><title><![CDATA[Not-invented-here syndrome]]></title><description><![CDATA[For most startups, the priority is to leverage proven solutions when it comes to fundamentals like databases, authentication, and storage.]]></description><link>https://suchanek.co/p/not-invented-here-syndrome</link><guid isPermaLink="false">https://suchanek.co/p/not-invented-here-syndrome</guid><dc:creator><![CDATA[Tim Suchanek]]></dc:creator><pubDate>Sat, 28 Oct 2023 13:13:08 GMT</pubDate><content:encoded><![CDATA[<p>For most startups, the priority is to leverage proven solutions when it comes to fundamentals like databases, authentication, and storage. The focus should be on meeting customers' specific needs and solving the unique problems they face, rather than getting bogged down in reinventing the wheel.</p><p>In the early stages of a startup, time and resources are precious. Choosing to use a custom parser or cache may offer a slight performance advantage, but at what cost? The focus should be on getting the product to market and iterating based on customer feedback, rather than looking for marginal performance gains.</p><p>It is important for startups to identify and establish their core competencies, investing resources in these areas to create a solid foundation. However, for non-essential functions, existing libraries and tools should be used to avoid unnecessary waste of time and resources.</p><p>Once a startup has achieved product-market fit and has the financial resources to support further growth, it can consider innovating specific parts of its technology. 
This should only be done if it promises to deliver substantial improvements, on the order of 10x, that will have a significant impact on the business or user experience.</p><p>The truly innovative and revolutionary work comes later. Startups should first focus on getting to market quickly by reusing existing code and solutions. This approach ensures the bases are covered, allowing innovation to take center stage where it will have the greatest impact.</p><p>In short, while there is a time and place to reinvent the wheel, it should not be the first course of action for most startups. Leveraging existing solutions to validate business ideas should be a priority. Once a startup has identified what truly sets it apart from the competition, it can and should boldly reimagine and reshape the wheel where it delivers the most strategic value.</p><p>Engineers are often not able to see this treasure of interesting problems hiding at the other end of the rainbow. Engineers want to solve hard problems, I get it. However, it&#8217;s the job of strong leadership to guide everyone there, using the boring tools so that later the really interesting problems can be solved.</p>]]></content:encoded></item><item><title><![CDATA[Tit for tat]]></title><description><![CDATA[The Prisoner's Dilemma is a famous thought experiment that illustrates the challenges of cooperation and competition.]]></description><link>https://suchanek.co/p/tit-for-tat</link><guid isPermaLink="false">https://suchanek.co/p/tit-for-tat</guid><dc:creator><![CDATA[Tim Suchanek]]></dc:creator><pubDate>Fri, 27 Oct 2023 13:18:19 GMT</pubDate><content:encoded><![CDATA[<p>The <a href="https://en.wikipedia.org/wiki/Prisoner%27s_dilemma">Prisoner's Dilemma</a> is a famous thought experiment that illustrates the challenges of cooperation and competition. Imagine two prisoners are arrested for a crime. 
The police offer each prisoner a deal: if one denounces the other and the other stays silent, the rat will be released, and the other will receive a long sentence. If both accuse each other, they both receive a moderate sentence. If both stay silent, they both receive only a short sentence.</p><p>The dilemma is that each prisoner is incentivized to denounce the other, even though both would benefit from cooperating and remaining silent. This mirrors real-life situations like the Cold War arms race, where each side built weapons out of fear the other would do the same, even though both would be safer with fewer weapons.</p><p>In the 1980s, political scientist Robert Axelrod organized a computer tournament where different strategies for solving the prisoner's dilemma competed. People submitted programs with names like Jesus, Lucifer, and Tester. Axelrod received submissions ranging from a few lines of code to tens of thousands.</p><p>The simplest strategy, called Tit for Tat, won. It starts with cooperation, then simply copies what the opponent did last time. It therefore rewards cooperation but punishes betrayal, while leaving room for forgiveness. This "nice but retaliatory" strategy has proven effective and parallels real-world strategies such as the "live and let live" system that emerged in World War I trench warfare. 
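</p><p>Tit for Tat itself fits in a few lines. Here is a toy sketch of the strategy in an iterated prisoner's dilemma, using the standard payoff values (3/3 for mutual cooperation, 1/1 for mutual defection, 5/0 for defecting against a cooperator) &#8211; not Axelrod's original tournament code:</p>

```python
def tit_for_tat(my_history, their_history):
    """Cooperate first, then copy the opponent's previous move."""
    return "C" if not their_history else their_history[-1]

def always_defect(my_history, their_history):
    return "D"

# Standard iterated-prisoner's-dilemma payoffs, points per round:
PAYOFF = {("C", "C"): (3, 3), ("D", "D"): (1, 1),
          ("D", "C"): (5, 0), ("C", "D"): (0, 5)}

def play(strategy_a, strategy_b, rounds=10):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pa, pb = PAYOFF[(move_a, move_b)]
        score_a += pa
        score_b += pb
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))   # mutual cooperation: (30, 30)
print(play(tit_for_tat, always_defect)) # loses only the first round: (9, 14)
```

<p>Against itself it cooperates every round; against a pure defector it loses only the opening round and retaliates from then on.</p><p>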
The story illustrates how cooperation, reinforced by measured retaliation, can arise even in difficult conditions.</p>]]></content:encoded></item><item><title><![CDATA[Scenius]]></title><description><![CDATA[As I&#8217;ve been visiting the AI Pioneers Summit today and seeing the high caliber of people on the stage, representing GitHub Copilot, Salesforce AI, Walmart, LlamaIndex, Langchain, Cloudflare, Mosaic, Replit, Nvidia, AutoGPT (and so on), I&#8217;m reflecting on the powerful network effects San Francisco has and how all these companies are neighbors in the same town.]]></description><link>https://suchanek.co/p/scenius</link><guid isPermaLink="false">https://suchanek.co/p/scenius</guid><dc:creator><![CDATA[Tim Suchanek]]></dc:creator><pubDate>Thu, 26 Oct 2023 23:55:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!i8gM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a3d98d9-b5ff-49e1-b26b-af3c0f7cd3ae_1280x960.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As I&#8217;ve been visiting the <strong><a href="https://lu.ma/cna2nkex">AI Pioneers Summit</a> </strong>today and seeing the high caliber of people on the stage, representing GitHub Copilot, Salesforce AI, Walmart, LlamaIndex, Langchain, Cloudflare, Mosaic, Replit, Nvidia, AutoGPT (and so on), I&#8217;m reflecting on the powerful network effects San Francisco has and how all these companies are neighbors in the same town.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i8gM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a3d98d9-b5ff-49e1-b26b-af3c0f7cd3ae_1280x960.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!i8gM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a3d98d9-b5ff-49e1-b26b-af3c0f7cd3ae_1280x960.jpeg 424w, https://substackcdn.com/image/fetch/$s_!i8gM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a3d98d9-b5ff-49e1-b26b-af3c0f7cd3ae_1280x960.jpeg 848w, https://substackcdn.com/image/fetch/$s_!i8gM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a3d98d9-b5ff-49e1-b26b-af3c0f7cd3ae_1280x960.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!i8gM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a3d98d9-b5ff-49e1-b26b-af3c0f7cd3ae_1280x960.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i8gM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a3d98d9-b5ff-49e1-b26b-af3c0f7cd3ae_1280x960.jpeg" width="1280" height="960" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7a3d98d9-b5ff-49e1-b26b-af3c0f7cd3ae_1280x960.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:960,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:316376,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!i8gM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a3d98d9-b5ff-49e1-b26b-af3c0f7cd3ae_1280x960.jpeg 424w, https://substackcdn.com/image/fetch/$s_!i8gM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a3d98d9-b5ff-49e1-b26b-af3c0f7cd3ae_1280x960.jpeg 848w, https://substackcdn.com/image/fetch/$s_!i8gM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a3d98d9-b5ff-49e1-b26b-af3c0f7cd3ae_1280x960.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!i8gM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a3d98d9-b5ff-49e1-b26b-af3c0f7cd3ae_1280x960.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Brian Eno introduced the concept of "scenius" to describe the collective intelligence and creativity of a cultural scene or group. The term combines "scene" and "genius" to highlight a community's power in fostering innovation and excellence. Scenius is the idea that genius doesn't just exist in individuals, but can also arise from the collaborative efforts of a group.</p><p>There are several factors that can help to nurture a scenius, according to Kevin Kelly:</p><ol><li><p>Mutual appreciation: Recognizing and valuing the contributions of each member of the group.</p></li><li><p>Rapid exchange of tools and techniques: Sharing knowledge and skills to help everyone improve.</p></li><li><p>Network effects of success: Celebrating and building on the successes of the group as a whole.</p></li><li><p>Local tolerance for novelties: Being open to new and different ideas.</p></li></ol><p>Some examples of scenius include:</p><ol><li><p><strong>Xerox PARC</strong>: A research and development company that has produced many innovative technologies.</p></li><li><p><strong>Paypal Mafia</strong>: A group of former Paypal employees who have gone on to found successful companies like Tesla and LinkedIn.</p></li><li><p><strong>Silicon Valley</strong>: A hub for technology and innovation in California.</p></li><li><p><strong>YCombinator</strong>: A startup accelerator that has helped launch successful companies like Dropbox and Airbnb.</p></li><li><p><strong>Burning Man</strong>: An annual event that fosters creativity and community.</p></li></ol><p>To create a scenius, it's important to find the right balance between openness and structure. 
A space that is too open can lack focus, while too much structure can stifle creativity. It's also important to embrace a certain amount of inefficiency and wastefulness, as these can be necessary for the group's vitality.</p>]]></content:encoded></item><item><title><![CDATA[Who benefits from OSS LLMs?]]></title><description><![CDATA[As use cases for LLMs become clearer, companies look to save costs.]]></description><link>https://suchanek.co/p/who-benefits-from-oss-llms</link><guid isPermaLink="false">https://suchanek.co/p/who-benefits-from-oss-llms</guid><dc:creator><![CDATA[Tim Suchanek]]></dc:creator><pubDate>Wed, 25 Oct 2023 13:17:12 GMT</pubDate><content:encoded><![CDATA[<p>As use cases for LLMs become clearer, companies look to save costs. Startups like <a href="https://openpipe.ai/">OpenPipe</a>, a company from the last YC batch, provide model distillation as a service. Record all your GPT-4 or Claude 2 interactions and teach them to a smaller model. Not only can you save a lot of money with this, but you can potentially even gain higher quality than before. Simple tasks that previously required GPT-3.5 can now be delegated to a smaller model, such as Mistral 7B.</p><p>How did we get here? Since the Llama release in March, we have experienced a Cambrian explosion of open-source LLMs. Hundreds of fine-tuned variants of Llama 1, and since then Llama 2, have been released, fighting for first place on the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard">LLM leaderboard</a>. I believe that Mark Zuckerberg did this very intentionally. He knew that Meta didn&#8217;t yet have a model at the GPT-4 level that they&#8217;d be able to monetize, but at least he&#8217;d be able to &#8220;eat the bottom&#8221; of OpenAI&#8217;s revenue by kickstarting this revolution. In other words, this can be seen as asymmetric warfare. 
Instead of attacking someone directly, you give power to the people, so that the delta between OpenAI&#8217;s offering and what&#8217;s out there shrinks. The game is on; OpenAI needs to deliver to stay in the number-one spot.</p><p>So who benefits from this? Everyone besides the few big AI labs.</p><ul><li><p>Enterprises win, as they can now safely deploy legitimate GPT-3.5 alternatives in their own data centers without having to transmit PII</p></li><li><p>Startups win, as OSS models create a variety of opportunities around inference, training, and fine-tuning</p></li><li><p>Everyone wins, as these models can run locally</p></li><li><p>Everyone wins, because we&#8217;re now enabling anyone to do AI research at home on their laptop by fine-tuning these models</p></li></ul><p>Just OpenAI, well&#8230; <a href="https://www.theinformation.com/articles/openais-corporate-sales-come-under-pressure-as-ai-customers-eye-cheaper-options">They might get into trouble</a>.</p>]]></content:encoded></item><item><title><![CDATA[How agents acquire capabilities]]></title><description><![CDATA[Currently, LLM-powered autonomous agents can solve simple tasks like sending an email, filling in the blanks in a spreadsheet, or even making phone calls.]]></description><link>https://suchanek.co/p/how-agents-acquire-capabilities</link><guid isPermaLink="false">https://suchanek.co/p/how-agents-acquire-capabilities</guid><dc:creator><![CDATA[Tim Suchanek]]></dc:creator><pubDate>Tue, 24 Oct 2023 12:58:43 GMT</pubDate><content:encoded><![CDATA[<p>Currently, LLM-powered autonomous agents can solve simple tasks like sending an email, filling in the blanks in a spreadsheet, or even making phone calls. 
However, as soon as tasks get more complex, like implementing a whole database from scratch, controlling a robot in the real world, or controlling a browser universally, current agents fall short.</p><p>One of the main reasons is their limited reasoning and planning capabilities, but another is simply the ability to use tools.</p><p>There are numerous cases on the internet where someone mentions that GPT-4 can&#8217;t do something, but later someone else makes it happen. How can an LLM-powered autonomous agent acquire new capabilities it doesn&#8217;t have today?</p><p><a href="https://arxiv.org/pdf/2308.11432.pdf">A recent survey</a> about LLM-powered autonomous agents provides an excellent overview:</p><h4><strong>Fine-tuning</strong></h4><ul><li><p><strong>With Human-Annotated Datasets: </strong>The fine-tuning dataset is constructed from human feedback, which is turned into natural language and used directly to fine-tune the model. An example of this is the <a href="https://arxiv.org/abs/2207.01206">WebShop Dataset</a>, where researchers set up an artificial webshop, let 1,600 humans use it, and collected the usage data.</p></li><li><p><strong>With LLM-Generated Datasets: </strong>As recruiting humans to annotate data can be time-intensive and costly, a promising alternative approach is to automate dataset generation. An example is the recently released <a href="https://laion.ai/blog/strategic-game-dataset/">Strategic Game Dataset</a> from Laion, which includes 3.2 billion chess games, 236 billion Rubik&#8217;s Cube moves, and 39 billion maze moves.</p></li><li><p><strong>With Real-World Datasets: </strong>Besides building up artificial labeling processes, collecting data from real-world applications can also be very powerful. This can be real-world product usage, which doesn&#8217;t have an explicit human labeling component but yields enough data to build a dataset. 
An example is the <a href="https://arxiv.org/abs/2306.06070">Mind2Web</a> dataset, which aims to create a generalist dataset for the web.</p></li></ul><h4>Prompt Engineering</h4><p>One can provide a few few-shot examples to the LLM in the same prompt to improve planning or reasoning. Some agent implementations also utilize prompts around the agent&#8217;s beliefs and state of mind.</p><h4>Mechanism Engineering</h4><ul><li><p><strong>Trial and Error: </strong>In this method, an agent performs an action and is then invoked again to judge the action just performed. If the action is deemed unsatisfactory, the feedback is incorporated, and the agent iterates.</p></li><li><p><strong>Crowd-sourcing: </strong>The prompt is delegated to multiple agents. If they don&#8217;t respond with a consistent answer, solutions from the other agents are used to update each response. In this way, (sub-)agents can incorporate other agents' opinions to improve the overall outcome.</p></li><li><p><strong>Experience Accumulation: </strong>The agent looks up whether it has seen a similar task or prompt before and uses that experience to improve the result.</p></li><li><p><strong>Self-driven Evolution: </strong>The agent sets its own goals and adjusts its approach based on feedback from the environment and a reward function, gradually moving towards the overall goal. With that, the agent can acquire new capabilities in a self-driven manner.</p></li></ul>]]></content:encoded></item></channel></rss>