Building RAG tools: which components should we use?


By José Ignacio Orlando

March 20, 2024

Retrieval Augmented Generation (RAG) delivers human-like conversational abilities by integrating retrieval, encoding, and generation mechanisms. The retrieval component gives the system access to relevant information from a vast corpus of knowledge; the encoding stage turns that information into a format suitable for generating coherent responses; and the generation module synthesizes the encoded information into natural language, creating a fluid conversational experience for users.
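To make those three stages concrete, here is a minimal, self-contained sketch in Python. The corpus, the keyword-overlap scoring, and the prompt template are toy stand-ins invented for illustration; a production system would use vector search for retrieval and send the final prompt to a real LLM.

```python
# Toy illustration of the three RAG stages; the corpus and scoring below
# are hypothetical stand-ins, not a real retrieval backend.
corpus = {
    "doc1": "RAG grounds LLM answers in retrieved documents.",
    "doc2": "Embeddings map text to vectors for similarity search.",
}

def retrieve(query: str, k: int = 1) -> list[str]:
    # Stage 1 (retrieval): score documents by naive keyword overlap
    # (a stand-in for real vector search) and return the top-k passages.
    scores = {
        doc_id: len(set(query.lower().split()) & set(text.lower().split()))
        for doc_id, text in corpus.items()
    }
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [corpus[doc_id] for doc_id in top]

def build_prompt(query: str, passages: list[str]) -> str:
    # Stage 2 (encoding): pack the retrieved context into a prompt
    # the generative model can consume.
    context = "\n".join(passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Stage 3 (generation): in a real system this prompt would be sent to an
# LLM, e.g. through a vendor API; here we just print it.
question = "How does RAG ground answers?"
print(build_prompt(question, retrieve(question)))
```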

One of the key advantages of RAG systems lies in their ability to leverage pre-existing knowledge from large datasets, thereby enhancing the quality and relevance of generated responses. By combining the strengths of retrieval-based methods, which excel at providing accurate and factual information, with generation-based techniques, which offer creativity and fluency in language generation, RAG systems achieve a harmonious balance between precision and contextuality.

In this article we focus on the components needed to craft these solutions, including their costs and the alternatives available in the market. Our goal is to guide you through implementation, so you know the options you have and which scenarios each of them suits best.

Services or infrastructure?

The whole scaffolding of a RAG tool can rely on APIs from third-party vendors that provide access to ML models and retrieval infrastructure, or you can deploy your own custom solution on cloud or on-premise infrastructure. In general, this decision depends on how sensitive the data you access is (e.g. whether you need to be HIPAA compliant) and the level of demand you expect. Most existing APIs charge on a per-request basis, meaning that you only pay for what you use. But if data-sharing restrictions prevent you from accepting the vendors' terms and conditions, you can deploy custom models on your own cloud infrastructure instead.

Bear in mind, however, that relying on APIs reduces implementation complexity, accelerating development and minimizing the team needed to build the solution, while ensuring 24/7 operation. Deploying custom components on the same cloud infrastructure used to operate the RAG system, by contrast, may require contracting additional services and computational power. On-premise infrastructure for RAG, in turn, demands high-performance computing hardware (e.g. GPUs in top-tier servers) and high-speed connectivity (to prevent service interruptions when connections are overloaded).

APIs for extracting embeddings and GenAI

OpenAI, Azure, and Google’s Vertex AI provide embedding and generative models at reasonable costs, charging by usage with a price per token (i.e. each word or sub-word unit processed by the model, prompts included). In general, embeddings are relatively cheap (a few cents per million tokens), although prices vary with the size and complexity of the model used. GenAI, on the other hand, is more expensive. For example, OpenAI’s GPT-3.5 costs about a dollar per million tokens, while GPT-4 is far pricier, at roughly 30 dollars per million. The good news is that for most applications, especially those involving only text, GPT-3.5 is already enough (as long as you hire a team of good prompters, of course). If you also plan to retrieve images and process their content, however, you will need to budget for GPT-4, which charges depending on the size of the images.
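As a reference for how these per-token APIs are consumed, here is a hedged sketch using OpenAI's official Python SDK (v1 interface). Model names and prices change frequently, so treat the identifiers below as examples and check the vendor's current documentation before budgeting.

```python
# Sketch of per-token billed API calls with the openai Python SDK (v1).
# Assumes `pip install openai` and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Embeddings: cheap, billed per input token.
embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input="Retrieval Augmented Generation combines search with generation.",
)
print(len(embedding.data[0].embedding))  # dimensionality of the vector

# Generation: billed per input and output token, pricier than embeddings.
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize RAG in one sentence."}],
)
print(completion.choices[0].message.content)
```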

Vector databases for knowledge retrieval

There are several vendors and libraries for knowledge retrieval in RAG systems. Pinecone is perhaps the most popular, providing its own managed storage for keeping embeddings safe and retrieving them efficiently. Its costs depend on three main factors: the number of reads, the number of writes, and the overall size of the database; you can find details on their official website. Alternatively, your AI implementation partner can build an in-house vector index with open-source tools such as Faiss (Meta's similarity-search library) or the vector stores bundled with LlamaIndex. These are very useful tools that enable efficient content retrieval without storing your data with an external vendor.
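For the in-house route, a minimal Faiss sketch might look like the following. The random vectors are stand-ins for embeddings produced by a real model, and the exact-L2 index is just one of several index types Faiss offers; this assumes faiss-cpu and numpy are installed.

```python
# Minimal in-house vector retrieval with Faiss.
# Assumes `pip install faiss-cpu numpy`; vectors are random stand-ins
# for embeddings from a real model.
import faiss
import numpy as np

dim = 384                                    # embedding dimensionality
doc_vectors = np.random.rand(1000, dim).astype("float32")

index = faiss.IndexFlatL2(dim)               # exact L2-distance search
index.add(doc_vectors)                       # store document embeddings

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)      # top-5 nearest documents
print(ids[0])                                # indices into your document store
```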

Implementation libraries

When planning the implementation of a RAG system, your AI implementation partner might propose teams for development. When assessing their skills, make sure to double-check their experience with libraries like LangChain or LlamaIndex. These are the most popular tools nowadays for implementing systems that rely on LLMs, as they provide predefined interfaces to multiple types of data sources and ML models. There is no need to reinvent the wheel when these libraries already support interconnecting all of these components. With these two tools, you can have a baseline RAG system operative in only a couple of months.
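To illustrate how much these libraries abstract away, here is a sketch of a baseline pipeline following LlamaIndex's documented quickstart pattern. The module paths assume the v0.10+ layout; an OpenAI API key in the environment and a local ./data folder with your documents are also assumed.

```python
# Baseline RAG pipeline with LlamaIndex (v0.10+ module layout).
# Assumes `pip install llama-index`, OPENAI_API_KEY in the environment,
# and a ./data folder containing your documents.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()   # ingest files
index = VectorStoreIndex.from_documents(documents)      # embed + index
query_engine = index.as_query_engine()                  # retrieval + LLM

response = query_engine.query("What does our refund policy say?")
print(response)
```

A handful of lines covers ingestion, embedding, indexing, retrieval, and generation, which is exactly why experience with these libraries is worth verifying before a team starts building from scratch.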

Building RAG tools requires careful consideration of components such as retrieval mechanisms, generative models, and knowledge retrieval databases. Whether you opt for APIs or custom solutions, the goal remains the same: to create seamless, human-like interactions powered by advanced NLP. At Arionkoder, we specialize in AI development and are ready to help you embark on your RAG journey. Schedule a free consultation with us today at hello@arionkoder.com to explore how we can turn your vision into reality. Let’s revolutionize conversational AI together.