How OpenAI's GPT-4V Is Advancing the Convergence of Text and Vision

The latest innovation from OpenAI, GPT-4 with Vision (GPT-4V), lets users instruct the model to interpret image inputs. By merging the capabilities of text and vision, GPT-4V promises to redefine user experiences and tackle new challenges.

Deployment Insights

OpenAI introduced GPT-4V to a select group of users earlier this year. Among them was 'Be My Eyes', an organization that aids the visually impaired. They collaborated to create 'Be My AI', a tool that describes the visual world for those who can't see. Feedback from these early users has been invaluable, highlighting both the potential of GPT-4V and the areas for improvement.

Visual Vulnerabilities

Interestingly, the order in which images are presented to GPT-4V can influence its output. For instance, when asked to recommend a state to move to based on two flags, the model tends to favor the first flag shown. This highlights the challenges of ensuring robustness and reliability in AI models.
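A simple way to probe this behavior is to query the model twice with the image order swapped and compare the answers. The sketch below is illustrative only: `ask_model` is a hypothetical stand-in for a real multimodal API call, stubbed here to simulate a position-biased model.

```python
# Illustrative order-sensitivity probe. `ask_model` is a hypothetical
# stand-in for a real multimodal API call; this stub simulates a model
# biased toward whichever image appears first.
def ask_model(images: list[str], question: str) -> str:
    # Stub: a position-biased model simply favors the first image.
    return images[0]

def is_order_sensitive(img_a: str, img_b: str, question: str) -> bool:
    """Return True if swapping the image order changes the answer."""
    first = ask_model([img_a, img_b], question)
    swapped = ask_model([img_b, img_a], question)
    return first != swapped

# With the biased stub, swapping the two flags flips the recommendation.
print(is_order_sensitive("flag_a.png", "flag_b.png",
                         "Which state should I move to?"))  # True
```

Running such a probe across many image pairs gives a rough measure of how often presentation order, rather than content, drives the model's answer.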

Safety Measures

OpenAI has always prioritized safety. GPT-4V benefits from safety measures implemented in previous models like GPT-4 and DALL·E. For instance, the model is designed to refuse certain requests, especially those involving images of people, such as identifying them or making ungrounded inferences about them. OpenAI also uses OCR tools to detect and moderate text within images, ensuring that users can't bypass safety measures.
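The OCR-based moderation step can be pictured as a small pipeline: extract any text embedded in the image, then run it through a moderation check before the request proceeds. The sketch below is a minimal illustration under loose assumptions: `extract_text` is a hypothetical stub for a real OCR engine (such as Tesseract), and the blocklist is a toy example, not OpenAI's actual policy.

```python
# Minimal sketch of moderating text embedded in images.
# `extract_text` is a hypothetical OCR stand-in; the blocklist is a
# toy example and not a real moderation policy.
BLOCKLIST = {"banned_phrase", "prohibited_term"}

def extract_text(image_path: str) -> str:
    # Stub: pretend the OCR engine read this string from the image.
    return "an image caption containing a banned_phrase"

def moderate_image_text(image_path: str) -> bool:
    """Return True if the image's embedded text passes moderation."""
    words = extract_text(image_path).lower().split()
    return not any(word in BLOCKLIST for word in words)

print(moderate_image_text("user_upload.png"))  # False: text was flagged
```

The point of such a check is that text rendered inside an image is treated the same as text typed into the prompt, so it cannot be used to slip past the text-level safety filters.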

Evaluations: Ensuring GPT-4V's Readiness

OpenAI conducted both qualitative and quantitative evaluations:

  • Addressing Biases: OpenAI studied the model's performance on sensitive trait attributions across demographics, ensuring that the AI doesn't perpetuate harmful stereotypes.
  • Privacy Concerns: The model's ability to identify people in photos was scrutinized, with OpenAI implementing measures to ensure user privacy.
  • Avoiding Misinformation: The model was tested for its susceptibility to disinformation, emphasizing the importance of responsible AI usage.

External Red Teaming: A Critical Eye on GPT-4V

OpenAI collaborated with external experts to assess GPT-4V's limitations and risks:

  • Scientific Proficiency: While GPT-4V showcased impressive capabilities in interpreting complex scientific images, it also exhibited certain limitations, emphasizing the need for cautious reliance on the model for scientific tasks.
  • Medical Advice: The model's inconsistent performance in the medical domain highlighted the importance of not substituting professional medical advice with AI-generated insights.
  • Addressing Stereotypes: OpenAI is actively working to reduce the model's tendencies to make ungrounded inferences or reinforce biases.
  • Disinformation Risks: The combination of image and text generation capabilities poses potential disinformation risks, underscoring the importance of ethical AI usage.
  • Hateful Content: OpenAI is continuously refining GPT-4V to ensure it doesn't inadvertently promote or engage with hateful content.

Future Directions

GPT-4V is a significant leap forward, but there's more to come. OpenAI is actively seeking public feedback on several ethical and operational questions. For instance, should the model identify public figures from images? Should it infer attributes such as gender or race? As AI becomes more global, improving the model's proficiency in various languages and enhancing its image recognition capabilities for diverse audiences is crucial. OpenAI is also working on refining how the model handles sensitive image data.
