
Hard thing about the hard thing: LLMs

Limitations with LLMs, Tree of Thoughts, LIMA, Proprietary LLMs and much more

As we navigate the landscape of AI and Machine Learning, we're often awestruck by the promise of Large Language Models (LLMs) like GPT-3.5-turbo, Claude, or the most recent iteration, GPT-4. These models offer transformative opportunities in various industries. However, they are not without their limitations and eccentricities.

Just last week, we conducted a deep dive into prompt engineering with GPT models, an experience that was as enlightening as it was challenging. Today we will delve into the constraints and challenges we uncovered, both during that session and through other engagements with these models, and shed light on the areas that remain a work in progress.

Limitation of Context Window

LLMs have an inherent constraint: a limit on the number of tokens they can process in a single request, referred to as the context window. This limit encompasses everything, including your inputs, the LLM's potential outputs, and any supplementary data you want to inject. The size of this window can significantly influence the model's ability to provide coherent and contextually relevant responses. So, as powerful as LLMs may be, what they can take into account for any one response is bound by this window.
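
To make the budgeting concrete, here is a minimal sketch of how one might check what fits in the window before sending a request. The four-characters-per-token ratio is only a rough assumption standing in for a real tokenizer, and the names `estimate_tokens` and `fit_context` are hypothetical helpers, not part of any library.

```python
import math
from typing import List

# Assumption: roughly 4 characters per token. A real tokenizer will differ.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a string using the assumed ratio."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fit_context(
    prompt: str,
    documents: List[str],
    context_window: int,
    reserved_for_output: int,
) -> List[str]:
    """Return the leading documents that fit next to the prompt.

    The window must hold the prompt, the supplementary documents, and the
    tokens reserved for the model's answer.
    """
    budget = context_window - reserved_for_output - estimate_tokens(prompt)
    if budget < 0:
        raise ValueError("prompt and reserved output exceed the context window")

    kept = []
    for doc in documents:
        cost = estimate_tokens(doc)
        if cost > budget:
            break  # stop at the first document that no longer fits
        kept.append(doc)
        budget -= cost
    return kept
```

The point of the sketch is that output tokens are reserved up front: supplementary data only gets whatever is left once the prompt and the expected answer are accounted for.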


The Latency Issue

Performance is crucial in any AI application, and LLMs are no exception. Despite being the best models available at present, commercial LLMs like GPT-3.5-turbo and Claude often take several seconds to generate a valid response. This latency can range from a mere couple of seconds to over 15 seconds, contingent on factors such as the model, natural language input, schema size, schema composition, and instructions in the prompt. GPT-4, although accessible, is still far too slow for real-time applications.

Many have proposed the idea of using LangChain to chain together LLM calls and improve outputs. However, this approach exacerbates the latency issue, since each call adds its own delay, and introduces the risk of compounding inaccuracies due to 'compound probabilities': an error at any step carries into every step after it. Of course, there are ways to reduce them, but nothing compares to the near-real-time response that's needed in many cases.
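
A back-of-the-envelope sketch shows why chaining hurts on both counts. It assumes the calls run one after another and that each call's errors are independent of the others, which is a simplification; the function names and the sample numbers are illustrative, not measurements.

```python
from typing import List


def chain_success_probability(per_call_accuracy: float, num_calls: int) -> float:
    """Chance that every call in the chain is correct.

    Assumes each call is right with the same probability and that errors
    are independent, so the probabilities multiply.
    """
    return per_call_accuracy ** num_calls


def chain_latency(per_call_seconds: List[float]) -> float:
    """Total wall-clock time for calls executed sequentially."""
    return sum(per_call_seconds)


# Illustrative numbers only: five chained calls, each 95% reliable.
print(chain_success_probability(0.95, 5))
print(chain_latency([2.0, 3.0, 4.0, 3.0, 2.0]))
```

Under these assumptions, five calls that are each right 95% of the time produce a fully correct chain only about 77% of the time, while the delays simply add up.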


The Art of Prompt Engineering
