The potential of GPT-4V
As we head toward the OpenAI Dev Day on Monday, people are curious about what they’ll release. Some of the common predictions I’ve seen so far:
Code interpreter API
I see a GPT-4 Vision API as the most likely option right now, since OpenAI already demoed the API at the AI Engineer Summit. At the summit, they showed a neat little demo around blog post generation. What could it unlock in the real world?
Some ideas on how GPT-4V could be used for good:
Build your own rewind.ai: Rewind AI gathers context about your life through vision: it takes regular screenshots and analyzes them. Over time that makes Rewind very useful, since you can ask questions about anything you ever saw. You could build something similar with OCR, but OCR alone wouldn't let you ask semantic questions about an image, such as "What can we see here?" or "What's important in this screenshot, given my context?". Typical RAG applications vectorize text from websites, APIs, and databases and run similarity searches over it, but it'll be hard to build a fully comprehensive AI agent without tapping into the most general interface of them all: vision. There's more to building your own rewind.ai, but the components to do so are being commoditized as we speak.
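The core loop of such a tool is small. Here's a minimal sketch of the "screenshot in, semantic question out" step, assuming the chat-completions request shape OpenAI demoed for vision (the model name `gpt-4-vision-preview` is a placeholder, and the helper names are mine):

```python
import base64


def encode_screenshot(png_bytes: bytes) -> str:
    """Encode a raw PNG screenshot as a data URL the vision API can ingest."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{b64}"


def build_vision_query(question: str, screenshot_png: bytes) -> dict:
    """Build a chat-completions payload asking a semantic question
    about one captured screenshot. Model name is an assumption."""
    return {
        "model": "gpt-4-vision-preview",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": encode_screenshot(screenshot_png)}},
            ],
        }],
        "max_tokens": 300,
    }


# A rewind-style tool would capture screenshots on a timer (e.g. with mss
# or pyautogui), store them, and fire off queries like this one:
payload = build_vision_query(
    "What's important in this screenshot, given my context?",
    b"\x89PNG...",  # stand-in for real screenshot bytes
)
```

The interesting part is everything around this call: storing the screenshots, indexing the answers, and deciding which frames are worth the API cost.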
Improved browser automation: Browser engines ultimately render a website to be consumed by a human, not a machine. We can parse HTML with machines and try to make sense of it, but at some point that always hits limitations, such as z-index calculations (which element is actually visible on top). Being able to combine text and images will let people build far more powerful scrapers and browser automation going forward.
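One concrete pattern: screenshot the page, ask the model where to click, and parse its answer. The prompt and parser below are a hypothetical sketch; vision models often wrap JSON in a markdown fence, so the parser tolerates that:

```python
import json
import re

# Hypothetical prompt sent alongside a page screenshot.
CLICK_PROMPT = (
    "Here is a screenshot of a web page. Return only JSON with the pixel "
    'coordinates of the "Sign up" button, e.g. {"x": 120, "y": 340}.'
)


def parse_click_target(model_reply: str) -> tuple[int, int]:
    """Extract the first JSON object from the model's reply
    (stripping any markdown fence) and return (x, y) coordinates."""
    match = re.search(r"\{.*\}", model_reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object in model reply")
    data = json.loads(match.group(0))
    return int(data["x"]), int(data["y"])
```

With a tool like Playwright you could then feed the coordinates straight into `page.mouse.click(x, y)`, sidestepping brittle CSS selectors entirely.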
Generative UI: When asked to implement a UI from a screenshot, GPT-4V is already surprisingly good at it. The results might not be production-ready yet, but they show the potential of the technology. Instead of starting from zero as with v0.dev, what if you could provide your favorite website designs as a mood board and let the AI do the rest?
Turn images into APIs: Given an image of groceries, GPT-4V was able to return perfect JSON containing a list of items, including their calories. With GPT-4V, we'll be able to turn any image or video into an API that can be accessed programmatically and used as a building block for a new frontier of apps. LlamaIndex and LangChain will no doubt add support for this too.
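Once the model reliably answers in JSON, the image really does behave like an API endpoint. A small sketch of the consuming side, using a made-up example reply standing in for the model's actual output:

```python
import json


def total_calories(model_reply: str) -> int:
    """Treat the model's JSON reply as an API response and aggregate it.
    Expects a JSON array of {"item": ..., "calories": ...} records."""
    items = json.loads(model_reply)
    return sum(item["calories"] for item in items)


# Stand-in for what GPT-4V might return for a groceries photo:
reply = '[{"item": "banana", "calories": 105}, {"item": "oat milk", "calories": 120}]'
total = total_calories(reply)  # 225
```

From here it's ordinary software: the vision call is just another function that happens to take pixels as input and return structured data.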
The potential of GPT-4V is vast. However, as with any new technology, it's not yet perfect. A recent study found that dedicated OCR models still beat GPT-4V at text extraction, and a certain level of prompt injection is possible through text embedded in images. It's also clear that it's not affordable to run GPT-4V over millions of images: a blockbuster movie has roughly 300k-600k frames, so turning a whole film into an API might still be a bit far out. We also know, though, that costs keep decreasing and models keep getting faster.
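The movie math is worth doing on the back of an envelope. Assuming a ballpark of $0.01 per image (an assumption, not a published price), per-frame processing is clearly out of reach, while sampling one frame per second starts to look plausible:

```python
def vision_cost_usd(frames: int, cost_per_image: float = 0.01) -> float:
    """Back-of-envelope cost of running a vision model over N frames.
    $0.01/image is an assumed ballpark, not a published price."""
    return frames * cost_per_image


FRAMES = 500_000          # mid-range of the 300k-600k estimate above
FPS = 24                  # typical film frame rate

full = vision_cost_usd(FRAMES)           # every single frame
sampled = vision_cost_usd(FRAMES // FPS)  # ~1 frame per second
```

Even the sampled run costs a few hundred dollars per film under these assumptions, which is why "a whole movie as an API" waits on the cost curve rather than on capability.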
What can one build today that’s just not yet possible but will be possible with this technology soon?