Implementing RAG with Amazon Bedrock and AWS Lambda

Heeki Park
Dec 1, 2023


Building on my prior blog post, I wanted to continue tinkering with generative AI concepts using Amazon Bedrock and AWS Lambda. This time, I wanted to dig in further by understanding and implementing RAG (retrieval augmented generation), which supplements a foundation model with additional context data.

As an example, I used the following as my prompt:

How much concurrency does AWS Lambda add per minute for Lambda functions subscribed to SQS queues? Explain this in detail.

This particular prompt is interesting because that specific scaling behavior changed last month (November 2023). I figured the foundation models were trained on data that predates the change, so their built-in knowledge would be outdated. Could I get updated responses to the prompt by using RAG?

Starting with conceptual research

I started by watching Spotify’s CTO, Gustav Söderström, explain generative AI concepts in layman’s terms. This was helpful in understanding what foundation models are doing — in particular, how text is converted into numerical vectors, and how those vectors are then used to generate the next most likely word. He explains a few key parameters, some background history, and how neural networks get compressed into the concept of embeddings.

A key parameter is temperature, which controls how the model picks the next word. A higher temperature introduces more randomness and thus results in a more creative response; this is also where hallucinated responses tend to show up. A lower temperature results in more likely or expected responses. I hesitate to say more factual responses, as I’ve found that models can still return factually incorrect answers even at lower temperatures. In my example, I chose a lower temperature, like 0.1 or 0.2.
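
As a concrete illustration, this is roughly how a low temperature can be passed to Claude v2 through the Bedrock runtime API in Node.js (a minimal sketch; the region and max_tokens_to_sample value are assumptions on my part):

    // Assumed sketch: invoking Anthropic Claude v2 on Bedrock with a low temperature (AWS SDK v3)
    import { BedrockRuntimeClient, InvokeModelCommand } from "@aws-sdk/client-bedrock-runtime";

    const client = new BedrockRuntimeClient({ region: "us-east-1" });

    const response = await client.send(new InvokeModelCommand({
      modelId: "anthropic.claude-v2",
      contentType: "application/json",
      accept: "application/json",
      body: JSON.stringify({
        prompt: "\n\nHuman: How much concurrency does AWS Lambda add per minute for Lambda functions subscribed to SQS queues?\n\nAssistant:",
        max_tokens_to_sample: 512,
        temperature: 0.1 // lower temperature -> less randomness when sampling the next token
      })
    }));

    // Claude v2 returns the generated text in the "completion" field of the response body
    console.log(JSON.parse(new TextDecoder().decode(response.body)).completion);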

The foundation models used in generative AI are built on layers of research and progress. The talk briefly covers neural networks and diffusion models. As depicted below, imagine having a picture of a cat and slowly introducing noise into the picture.

Source: Gustav Söderström’s Spotify talk on AI

Each pass of adding noise to the picture could be a layer in the neural network. Eventually, the picture becomes complete noise. The diffusion model is the reverse process of taking the picture of complete noise and attempting to reconstitute the original image. A conditioned diffusion model does that reverse process but with additional provided clues. This reverse process is what enables generation.

This generation can be done with text, audio, images, you name it. This is because you can convert everything down to some numerical representation. Text, audio, and images are all already digitized and represented as 1s and 0s.

However, rather than representing a word (in the text example) with just a single number, it can be represented with a set of numbers, or a vector. Each number in that vector can represent characteristics or properties that are relevant to the search space.
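
To make that concrete, here is a small sketch (mine, not from the talk) that asks the Amazon Titan text embeddings v1 model on Bedrock to turn a short piece of text into its vector representation; the region is an assumption:

    // Assumed sketch: converting text into a numerical vector with Amazon Titan text embeddings v1
    import { BedrockRuntimeClient, InvokeModelCommand } from "@aws-sdk/client-bedrock-runtime";

    const client = new BedrockRuntimeClient({ region: "us-east-1" });

    const response = await client.send(new InvokeModelCommand({
      modelId: "amazon.titan-embed-text-v1",
      contentType: "application/json",
      accept: "application/json",
      body: JSON.stringify({ inputText: "Lambda functions subscribed to SQS queues" })
    }));

    // The response body contains an "embedding" field: an array of floating point numbers
    const { embedding } = JSON.parse(new TextDecoder().decode(response.body));
    console.log(embedding.length, embedding.slice(0, 5));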

Ok, enough theory. Let’s build!

Investigating an implementation approach

I started looking around for resources to implement RAG in my code. Most of the resources that I found were specific to developing in Python, which makes sense, since a lot of this work is done in Python. Fortunately, LangChain has not only a Python implementation but also a JavaScript implementation. While I normally prefer developing in Python, I chose to use JavaScript/Node.js because Lambda response streaming is natively supported with the Node.js runtime.
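
For reference, a streaming handler in the Node.js runtime looks roughly like the sketch below (a simplified skeleton, not my actual function):

    // Assumed sketch: a Lambda handler using the Node.js runtime's native response streaming.
    // awslambda.streamifyResponse is provided globally by the Lambda Node.js runtime (no import needed).
    export const handler = awslambda.streamifyResponse(async (event, responseStream, context) => {
      // Anything written here is streamed back to the caller as it is produced,
      // for example tokens coming back from Bedrock's streaming invoke API.
      responseStream.write("first chunk of the model response...");
      responseStream.write("next chunk...");
      responseStream.end();
    });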

The LangChain documentation showed that there are two main components for RAG: 1/ indexing and 2/ retrieval and generation.

For indexing, the following process is required:

  1. Load data using DocumentLoaders
  2. Split the data into smaller chunks using a TextSplitter
  3. Generate the vector embeddings from the splits
  4. Store and index the embeddings in a VectorStore
Source: LangChain documentation (https://python.langchain.com/docs/use_cases/question_answering/)

For retrieval and generation, the following process is used:

  1. Retrieve the relevant embeddings from the VectorStore based on the input using a Retriever
  2. Generate an LLM response with the prompt and retrieved data
Source: LangChain documentation (https://python.langchain.com/docs/use_cases/question_answering/)

Indexing the data

To provide context for the prompt that I outlined at the beginning of this post, I exported the AWS blog post that details the updated scaling behavior for Lambda with SQS as an event source to a PDF and stored it in S3.

Next I wrote code to do all the indexing work during function initialization. All of this work should normally be done ahead of the retrieval and generation step, but I chose to keep it simple in my Lambda function since this was just a prototype.

The function initialization starts with downloading the S3 object and storing it in /tmp. I then used PDFLoader to read the PDF document into memory and RecursiveCharacterTextSplitter to split the document into configurable chunk sizes. I played with a few options (128, 256, 512, 1024, 2048) and landed on 512 for my use case.

I also tried to use the S3Loader but found that I needed to use the unstructured library, which could be installed as a local API server or used as a managed service. I didn’t have time to explore implementing this as a Lambda extension and have left this as a backlog to-do.

The function initialization then generates the vector embeddings using the Amazon Titan text embeddings v1 model. I implemented three different vector stores: a managed vector store (Pinecone) and two local options, Faiss and the in-memory ephemeral vector store.

Because this was a basic prototype, I chose to use the MemoryVectorStore to keep it simple. This means that every Lambda execution environment performs the indexing process and stores its own copy of the vector embeddings in memory, so this is not intended as an at-scale production pattern.
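
Putting those pieces together, the function initialization might look something like the sketch below using LangChain’s JavaScript library. The bucket name, object key, and chunk overlap are placeholders/assumptions, and the import paths reflect the langchain package as it existed in late 2023 (PDFLoader also requires the pdf-parse package):

    // Assumed sketch: indexing during Lambda function initialization (bucket/key are placeholders)
    import { writeFile } from "node:fs/promises";
    import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
    import { PDFLoader } from "langchain/document_loaders/fs/pdf";
    import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
    import { BedrockEmbeddings } from "langchain/embeddings/bedrock";
    import { MemoryVectorStore } from "langchain/vectorstores/memory";

    // 1. Download the exported PDF from S3 into /tmp
    const s3 = new S3Client({});
    const obj = await s3.send(new GetObjectCommand({ Bucket: "my-context-bucket", Key: "sqs-scaling.pdf" }));
    await writeFile("/tmp/sqs-scaling.pdf", await obj.Body.transformToByteArray());

    // 2. Load the PDF and split it into 512-character chunks (chunkOverlap is my own assumption)
    const docs = await new PDFLoader("/tmp/sqs-scaling.pdf").load();
    const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 512, chunkOverlap: 50 });
    const splits = await splitter.splitDocuments(docs);

    // 3 and 4. Generate embeddings with Amazon Titan and index them in the in-memory vector store
    const embeddings = new BedrockEmbeddings({ region: "us-east-1", model: "amazon.titan-embed-text-v1" });
    const vectorStore = await MemoryVectorStore.fromDocuments(splits, embeddings);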

Generating responses with updated context

Now that the context data is stored in an in-memory vector store, I can send my prompt to the Anthropic Claude v2 model that I implemented in my last blog post, but now with the additional context!
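
One way to wire up the retrieval and generation step with LangChain’s JavaScript library is sketched below, continuing from the vectorStore above (the exact wiring in my function may differ; the region, temperature, and token limit are assumptions):

    // Assumed sketch: retrieval and generation with Claude v2, continuing from the vectorStore above
    import { Bedrock } from "langchain/llms/bedrock";
    import { RetrievalQAChain } from "langchain/chains";

    const llm = new Bedrock({
      model: "anthropic.claude-v2",
      region: "us-east-1",
      temperature: 0.1, // assumed value, matching the lower temperatures discussed earlier
      maxTokens: 1024
    });

    // The retriever pulls the most relevant chunks from the vector store,
    // and the chain stuffs them into the prompt that is sent to Claude
    const chain = RetrievalQAChain.fromLLM(llm, vectorStore.asRetriever());

    const result = await chain.call({
      query: "How much concurrency does AWS Lambda add per minute for Lambda functions subscribed to SQS queues? Explain this in detail."
    });
    console.log(result.text);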

When querying just the Claude model without additional context, I get the following response, which incorrectly states that Lambda scales in increments of 100 concurrent executions per minute (it was 60):

AWS Lambda automatically scales concurrency for Lambda functions that are triggered by SQS queues. The concurrency is scaled up gradually based on the number of messages in the queue. Here are some key details:
- Lambda adds additional concurrency in increments of 100 concurrent executions per minute, up to a maximum of 1000 concurrent executions.
- ...

When querying the Claude model with the additional context, I now get the following (correct and updated) response:

Based on the context provided, AWS Lambda can scale up Lambda functions that subscribe to an SQS queue by adding up to 300 concurrent executions per minute, up to a maximum of 1,250 concurrent executions.
Specifically, the context mentions:
- Lambda functions that subscribe to an SQS queue can scale up to five times faster for queues that see a spike in message backlog, adding up to 300 concurrent executions per minute.
- ...

Conclusion

As you look to build generative AI use cases, RAG enhances the responses of foundation models and allows you to customize those responses with your specific context.

At re:Invent this year, AWS also released a new Amazon Bedrock feature called Knowledge Bases, which simplifies the process of creating vector embeddings and augmenting foundation models with RAG context. The work behind this blog post was done prior to the availability of that feature. Regardless, it was a fun exploratory exercise, and hands-on is how I learn these concepts best!
