Building AI Products using Large Language Models (LLMs)

July 11, 2024 - 7 min read

Gaurav Chandak


A very simplified overview of LLMs

LLMs (Large Language Models) are language models that have been trained on a large amount of data. Their job is to predict the next token (simplified: word) just like an autocomplete system.

The model finds the next word by computing, for every word in its vocabulary, the probability of it occurring next given the previous words. Because they are trained on a huge corpus of high-quality data, LLMs can effectively predict the next word given the previous context.

The base models trained on this large corpus are then fine-tuned using:

  • instruction datasets (instruction and output) to create instruct models
  • chat datasets (conversations) to create chat models

These models predict the next word, but do so repeatedly until the response is complete. Whenever a new message is sent, all the previous messages (System Message, Human Messages, AI Messages) are also sent to provide context.
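
As a minimal sketch of this, here is how a multi-turn conversation is typically sent to a chat model using the OpenAI Python SDK (the model name and messages are illustrative placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Every request includes the full conversation so far:
# the system message plus all human and AI messages.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a token?"},
    {"role": "assistant", "content": "A token is a small chunk of text, roughly a word."},
    {"role": "user", "content": "How are tokens used during generation?"},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=messages,
)

print(response.choices[0].message.content)
```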

While the models are good at following instructions and answering questions, they often hallucinate or are unable to solve complex tasks. Understanding that LLMs are next word (token) predictors will help you understand how to make them work effectively for you.

Which model to use? [Updated 11th July 2024]

There are many open-weights and commercial LLMs available.

Some of the popular commercial models are:

  • OpenAI: GPT models (gpt-4o, gpt-4o-mini, gpt-3.5-turbo)
  • Anthropic: Claude models (Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku)
  • Google: Gemini models (Gemini 1.5 Pro, Gemini 1.5 Flash)
  • Mistral: Mistral commercial models (Mistral Large, Mistral Small)

There are also open-weights models that you can either self-host or use via hosted versions. Some of the popular ones are:

  • Meta: Llama models (Llama 3.1 405b, Llama 3.1 70b, Llama 3.1 8b)
  • Mistral: Mistral models (Mixtral 8x22B, Mixtral 8x7B, Mistral 7B)
  • Google: Gemma models (Gemma 2 27b, Gemma 2 9b)

Currently, gpt-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro and Llama 3.1 405b are the state-of-the-art models.

In general, smaller models are faster and cheaper than larger models, but are less capable on complex tasks. AI research labs are constantly working on improving the performance of these models.

The smaller models mentioned above are faster and cheaper than the state-of-the-art models and generally good enough for simpler tasks.

You can decide which model to use based on your budget and the complexity of the task you want to solve. It is often worth using different models for different tasks, routing each task to the cheapest model that handles it well, as in the sketch below.
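
Here is a minimal sketch of that kind of routing, assuming an OpenAI-style API; the complexity labels and model names are hypothetical placeholders:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical routing table: a cheaper model for simple tasks,
# a stronger model for complex ones. Model names are placeholders.
MODEL_BY_COMPLEXITY = {
    "simple": "gpt-4o-mini",
    "complex": "gpt-4o",
}

def run_task(prompt: str, complexity: str = "simple") -> str:
    model = MODEL_BY_COMPLEXITY[complexity]
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(run_task("Summarize this sentence: LLMs predict the next token."))
print(run_task("Write a detailed migration plan for our database.", complexity="complex"))
```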

For hosting open-weights models, you can use your preferred cloud provider (AWS, GCP, Azure, etc.). There are also companies like Groq, Together AI, and Anyscale that provide inference endpoints for the popular models.

LLMs are not just used for generating text from text prompts. There are many multimodal models that can accept and/or produce other modalities, such as images, audio, and video. Different models are trained on different modalities. Some popular multimodal models are gpt-4o, the Claude 3 and 3.5 models, the Gemini 1.5 models, etc.
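
As a minimal sketch, here is how an image can be passed alongside text using the OpenAI Python SDK (the model name and image URL are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Multimodal input: a text question plus an image URL in the same message.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any vision-capable model works
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```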

Prompt Engineering

There are certain best practices that can help make the model generate better responses and/or better follow instructions. These practices are called prompt engineering.

You can read more about prompt engineering in the below articles:

To get better at prompt engineering, it is important to practice these techniques. You can do this by trying them out on different AI chat products or by self-hosting a basic UI and connecting it to an LLM API.
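
As an illustrative (not prescriptive) example, a prompt that combines a few common techniques, a clear role and task, delimiters around the input, an explicit output format, and a few-shot example, might look like this:

```python
from openai import OpenAI

client = OpenAI()

# System message states the role, the task, and the expected output format.
system_prompt = (
    "You are a support assistant. Classify the customer message delimited by "
    "triple backticks as one of: billing, technical, other. "
    "Respond with only the label."
)

# One few-shot example showing the expected behavior.
few_shot = [
    {"role": "user", "content": "```I was charged twice this month.```"},
    {"role": "assistant", "content": "billing"},
]

user_message = {"role": "user", "content": "```The app crashes when I open settings.```"}

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "system", "content": system_prompt}, *few_shot, user_message],
)

print(response.choices[0].message.content)  # expected: "technical"
```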

Understanding UX for AI Products

Chat-based AI products are the most popular AI products. They can be used for a variety of use-cases like customer support, content generation, etc. While most LLM-based AI Products are chat-based, it is important to understand that it might not be the best UX for all AI products.

Typing can be cumbersome and slow, and there are many user personas and use-cases where chat is not the best UX. Click-based UIs with limited typing, voice-based UIs, etc. can be a better fit for many use-cases. Based on your user personas and use-cases, you can decide the best UX for your AI product.

Evals

While building AI products, it is important to evaluate the product through different techniques and metrics. Some of the popular evaluation techniques are:

  • Human Eval: This involves getting humans to evaluate the capabilities of the AI product. Thumbs-up/thumbs-down and ratings are popular ways of doing human eval in-product. It can also be based on analytics data collected while the product is being used, or on giving human evaluators a set of tasks and asking them to rate how well the product handles them.
  • LLM-based Eval: This involves asking an LLM to evaluate the responses from the AI product on various metrics, for example relevance and accuracy (a minimal sketch follows this list).
  • Unit Tests: This involves creating an evaluation dataset and running tests on it. The tests can be based on the tasks that the product is supposed to solve, and on the input given to and the output generated by the AI product.
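
Here is a minimal sketch of an LLM-based eval, assuming an OpenAI-style API; the grading prompt, the metric, and the evaluation dataset are illustrative:

```python
from openai import OpenAI

client = OpenAI()

def llm_relevance_score(question: str, answer: str) -> int:
    """Ask a judge model to rate how relevant an answer is to a question (1-5)."""
    grading_prompt = (
        "Rate how relevant the answer is to the question on a scale of 1 to 5. "
        "Respond with only the number.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": grading_prompt}],
    )
    return int(response.choices[0].message.content.strip())

# Illustrative evaluation dataset: (question, product's answer) pairs.
eval_set = [
    ("How do I reset my password?", "Go to Settings > Account > Reset Password."),
    ("How do I reset my password?", "Our office is open 9am to 5pm."),
]

scores = [llm_relevance_score(q, a) for q, a in eval_set]
print("Average relevance:", sum(scores) / len(scores))
```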

There are other evaluation techniques as well. It is important to evaluate the AI product through different techniques to get a better understanding of the product's capabilities.

Frameworks

There are many frameworks available that can help build AI products faster. They provide abstractions over different LLM providers, data sources, embedding models, vector databases, etc. Some of the popular frameworks are:

  • Langchain
  • Llamaindex
  • Haystack

While these frameworks make it easier to build AI products, many people prefer to build their own opinionated setup to have more control over the product.
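
As a minimal sketch of what these abstractions look like, here is a LangChain call, assuming the langchain-openai package is installed (the model name is a placeholder):

```python
from langchain_openai import ChatOpenAI

# The framework wraps the provider-specific API behind a common interface,
# so swapping providers mostly means swapping this class.
llm = ChatOpenAI(model="gpt-4o-mini")  # placeholder model name

response = llm.invoke("Explain retrieval augmented generation in one sentence.")
print(response.content)
```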

Observability, Monitoring, and Debugging (LLMOps)

Once the AI product is built, it is important to monitor it and look at the inputs and outputs for different use-cases. This helps in understanding how the product is being used and what can be improved. It also helps debug issues that might arise in the product.

It is also important to track metrics like latency, error rate, etc. to understand the performance of the product. There are many LLMOps tools available that can help in monitoring and debugging the AI product.
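
As a minimal sketch (without any specific LLMOps tool), here is what logging inputs, outputs, latency, and errors around an LLM call might look like; the model name is a placeholder:

```python
import logging
import time

from openai import OpenAI

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm")

client = OpenAI()

def call_llm(prompt: str) -> str:
    start = time.perf_counter()
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        output = response.choices[0].message.content
        latency = time.perf_counter() - start
        # Log input, output, and latency so issues can be debugged later.
        logger.info("prompt=%r output=%r latency=%.2fs", prompt, output, latency)
        return output
    except Exception:
        logger.exception("LLM call failed for prompt=%r", prompt)
        raise

call_llm("Suggest a name for a note-taking app.")
```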

Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) is a strategy that boosts the capabilities of large language models (LLMs). It does so by retrieving data from a trustworthy knowledge base and adding it as context along with the prompt.
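
As a minimal sketch of this flow, assuming the OpenAI embeddings and chat APIs (the documents and model names are placeholders):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# Illustrative knowledge base.
documents = [
    "Refunds are processed within 5 business days.",
    "Our support team is available Monday to Friday.",
    "Premium plans include priority support.",
]

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

doc_vectors = embed(documents)

def answer(question: str) -> str:
    # Retrieve: find the document most similar to the question.
    query_vector = embed([question])[0]
    scores = doc_vectors @ query_vector  # embeddings are unit-normalized
    context = documents[int(np.argmax(scores))]

    # Augment + generate: pass the retrieved context along with the prompt.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(answer("How long do refunds take?"))
```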

You can read more about RAG in the below articles:

Conclusion

Building AI products using LLMs is not very different from building other software products. It involves understanding the basics of how the systems work, understanding your users, and building a product that solves their problems.

Starting with basic prompt engineering, figuring out the best UX for your product, looking at different input-output samples, and then iterating towards more advanced techniques like RAG can help you build better AI products.