Imagine you have a picture book with different images, like animals, toys, and foods. You want to know more about the pictures, like how many animals are there, what toys are in the picture, or which food is the biggest.
Right now, computers use something called end-to-end models to figure out these things. It's like they look at the whole picture and try to guess the answers, but sometimes, they can't explain how they got those answers.
In this paper, the authors came up with a new way for computers to understand pictures, just like how you put puzzles together, piece by piece. This new way is called ViperGPT, and it uses a helper called GPT-3 Codex. This helper is like a super smart friend who knows a lot about different things.
So, when you want to know something about a picture, you can ask GPT-3 Codex a question, like "how many animals are there?" GPT-3 Codex will then create a set of instructions to find the animals and count them. It's like teaching the computer to look at the picture step-by-step, just like you would do.
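To make this concrete, here is a rough sketch of the kind of short program such a code-generating model might produce for the query "how many animals are there?". The visual API below (`ImagePatch`, `find`) is hypothetical, loosely inspired by the paper's idea of composing vision modules; the detector is stubbed with canned results so the example runs on its own.

```python
class ImagePatch:
    """Stand-in for an image region; a real system would wrap pixel data."""
    def __init__(self, label):
        self.label = label

def find(image, category):
    """Hypothetical detector: return patches matching `category`.
    Stubbed with fixed detections purely for illustration."""
    detections = {
        "animal": [ImagePatch("cat"), ImagePatch("dog")],
        "toy": [ImagePatch("ball")],
    }
    return detections.get(category, [])

# The step-by-step program the language model might generate for the query:
def execute_query(image):
    animal_patches = find(image, "animal")  # step 1: locate the animals
    return len(animal_patches)              # step 2: count them

print(execute_query(image=None))  # prints the count of stubbed detections
```

The point is not the stub itself but the structure: each line of the generated program is a readable step, so you can inspect exactly how the answer was reached.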
This new way of looking at pictures is really helpful because it can explain how it found the answers, and it's also easy to teach the computer new things or improve how it understands the pictures. This makes it better at answering questions about the images in your picture book.
People naturally combine individual steps to understand the visual world, employing compositional reasoning. However, the field of computer vision predominantly relies on end-to-end models, which lack compositional reasoning and interpretability. While end-to-end models have advanced tasks such as object recognition and depth estimation, they struggle to generalize and remain uninterpretable. The authors present a new framework that leverages code-generating large language models, like GPT-3 Codex, to flexibly compose vision models based on textual queries.