Building a RAG Project using Langchain, Colab and Mistral

deep-learning
gen-ai
nlp
A code-first tutorial to get started with RAG systems.
Author

Shubham Shinde

Published

February 23, 2024

In this blogpost we will build a toy project for RAG using Langchain in a free-tier Google Colab environment, using a quantized Mistral model.

Prerequisites - You should know what LLMs and embeddings are, and be looking for a place to start practising your RAG skills.

You can run the code in this post end-to-end in a free-tier Google Colab notebook.

RAG

We can use LLMs like ChatGPT and Gemini for a lot of things. But there are cases where they are inaccurate, or simply clueless. In many cases, we can improve an LLM's replies by giving it documents that contain information about the query. Generally, we might have a bunch of documents (or images, or videos) that we want the LLM to use as a reference.

RAG (Retrieval-Augmented Generation) is one way to do it. We divide the sources of truth (i.e. documents) into chunks. For any input query, the closest chunk(s) are found and fed along with the query to the LLM.
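
Schematically, the pipeline is just retrieve-then-generate. Here is a minimal sketch - the toy chunks are made up, and generate() is a hypothetical stand-in for an LLM call (we build the real pipeline with Langchain below):

from sentence_transformers import SentenceTransformer, util

chunks = [
    "Walter White was a chemistry teacher in Albuquerque.",
    "Jesse Pinkman was Walt's former student.",
]
model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embs = model.encode(chunks, convert_to_tensor=True)

def retrieve(query, k=1):
    # embed the query and return the k nearest chunks by cosine similarity
    q_emb = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, chunk_embs, top_k=k)[0]
    return [chunks[hit["corpus_id"]] for hit in hits]

def generate(prompt):
    # hypothetical stand-in for an LLM call (we load a quantized Mistral later)
    return "<answer produced by the LLM>"

query = "Who taught Jesse Pinkman?"
context = "\n".join(retrieve(query))
answer = generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")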

In order to build any RAG system, we need a source of documentation that we can use to give answers. Since this is a toy project, we shall use something fun - the wiki of a famous television show, Breaking Bad! Since the wiki runs on MediaWiki, we don't need to spend much time on data scraping, and it will be interesting to see whether our RAG system can answer trivia questions, as well as general questions about the topic.

!pip install langchain langchainhub rank_bm25 accelerate peft bitsandbytes langchain-community markdown huggingface_hub sentence-transformers optimum ragatouille chromadb anyascii --quiet
from huggingface_hub import notebook_login

notebook_login()

Download and Clean Data

Any MediaWiki site - in our case, the Breaking_Bad_Wiki - contains periodic dumps of that particular wiki. We can simply navigate to https://breakingbad.fandom.com/wiki/Special:Statistics to download the dump.

The dump is in XML format, which we will convert to JSON and then to Markdown.

You could directly use the plaintext content of each page - however, I cleaned up the data by converting the files to markdown, because the article format is similar to markdown and we can make use of the metadata.

# download and extract the xml file
!wget -q https://s3.amazonaws.com/wikia_xml_dumps/b/br/breakingbad_pages_current.xml.7z
!7z x breakingbad_pages_current.xml.7z
# install the dependencies for data cleaning
!pip install anyascii html2text wikitextparser --quiet
# clone this repo, and use the scripts to clean xml to json and then to markdown
!git clone https://github.com/shindeshu/PlainTextWikipedia.git
!python PlainTextWikipedia/wiki_to_text.py --xml-file breakingbad_pages_current.xml --output-dir bbad_json
!python PlainTextWikipedia/convert_to_markdown.py --json-dir bbad_json --markdown-dir breakingbad_txt

Chunking

Chunking is an important part of a RAG system, since suboptimal choices can lead to poor retrieval. If our chunk size is too small, we may miss out on the larger context. If our chunk size is too large, the embeddings may be poorer and we might add more noise to the prompt.

There are many techniques for splitting text into chunks - you can read about them here. In our case, we split twice: once by markdown headers (headings 1 through 4), and if any chunk is still very large, we split it further using recursive splitting.

We are using the splitters from Langchain.

from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import UnstructuredMarkdownLoader, TextLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.retrievers import BM25Retriever, EnsembleRetriever
import json
# mounting the drive to save and load data (like .md files, vector db, etc.)
from google.colab import drive
drive.mount('/content/drive')
# !unzip -q /content/drive/MyDrive/Data\ Science/RAG/breakingbad_txt.zip
def get_chunks(path='breakingbad_txt/'):
    """
    This function loads the markdown files in the given folder, and then
    splits them twice - once on the headers, second on the length.
    """
    loader = DirectoryLoader(path, glob="*.md", loader_cls=TextLoader)
    data = loader.load()
    print(f"Number of Documents: {len(data)}")

    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
        ("####", "Header 4"),
    ]
    markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
        is_separator_regex=False,
    )

    # split each document on markdown headers first, then on length
    header_splits = []
    for doc in data:
        header_splits.extend(markdown_splitter.split_text(doc.page_content))
    texts = text_splitter.split_documents(header_splits)
    print(f"Number of Chunks: {len(texts)}")
    return texts

def viz_docs(docs):
  print("\n\n".join([i.page_content for i in docs]))
texts = get_chunks()
Number of Documents: 1992
Number of Chunks: 13912

Embeddings and Vector Databases

For any given query, we need to find the chunk (or chunks) related to it. There are two broad approaches to retrieval:

  1. Sparse retrieval - techniques like TF-IDF and BM25, which use word frequencies to estimate relatedness.
  2. Dense retrieval - trained models convert text into dense representations (embeddings), which can be used to find the chunks closest to a given query.

Dense retrieval is generally quite powerful, so we will follow it in this notebook. But there are cases where sparse retrieval also becomes useful. We can also combine both approaches in a hybrid retriever.
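
To make the distinction concrete, below is a minimal sketch that scores a made-up toy corpus both ways, using the rank_bm25 and sentence-transformers packages we installed above:

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Hank Schrader is a DEA agent.",
    "Walter White cooked methamphetamine.",
    "Los Pollos Hermanos is a restaurant chain.",
]
query = "who works for the DEA?"

# sparse: BM25 scores come from word-frequency statistics (exact token overlap)
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
print(bm25.get_scores(query.lower().split()))

# dense: cosine similarity between learned embeddings (semantic overlap)
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embs = model.encode(corpus, convert_to_tensor=True)
q_emb = model.encode(query, convert_to_tensor=True)
print(util.cos_sim(q_emb, doc_embs))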

One of the smallest embedding models is all-MiniLM-L6-v2, which is still quite powerful. You can play around with different embedding models to find which one suits your data best. We go with BAAI/bge-small-en-v1.5 for this notebook. If compute is not a constraint, you can also try BGE-large, or BGE-M3, which is state-of-the-art at the time of this post.

Vector Databases

How do we find the closest chunk to the given query? We compute embeddings for all chunks, and an embedding for the query. Now, we compute cosine similarity between the query embedding and document embeddings to find the closest chunk.

Vector databases make many things easier in this workflow.

  • they store each chunk and its associated embedding
  • they retrieve the closest chunks using embeddings and approximate nearest neighbor algorithms
  • they manage storage, addition, removal, and updating of embeddings

In our case, we are using a local vector database called Chroma that sits on our disk (no API needed).

device = "cuda:0"#  'cpu'
# # langchain's interface to huggingface/sentence transformers models running locally
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5", model_kwargs={"device": device})
# # Equivalent to SentenceTransformerEmbeddings(model_name="BAAI/bge-small-en-v1.5")
# smallest model = "all-MiniLM-L6-v2"
# # can also use bge-large or bge-m3 if no compute constraints
# load previously computed database stored in Google Drive containing text and embeddings
load_from_prev = True
if load_from_prev:
    !cp -r  /content/drive/MyDrive/Data\ Science/RAG/bbad_bgesmall_chroma/ ./bbad_bgesmall_chroma/
    db = Chroma(persist_directory="./bbad_bgesmall_chroma", embedding_function=embeddings)
else:
    # compute embeddings, store to a database, and copy the database to Google Drive
    db = Chroma.from_documents(texts, embeddings, persist_directory="./bbad_bgesmall_chroma/")
    !cp -r ./bbad_bgesmall_chroma/ /content/drive/MyDrive/Data\ Science/RAG/bbad_bgesmall_chroma/

Query Examples

To test out the retrievers and the entire system, we will create some sample queries. Some are created manually, while others I got ChatGPT to generate for us.

queries = ["how did jane, the girlfriend of jessie, die?", "which actor plays hank schrader?",
           "which was the synthesization method used by walter to produce meth?", "who were the accomplices of gus fring in his drug trade?",
           "What was the relationship between Saul Goodman and the Salamancas?",
           "what was the profession of walter white prior to dealing in drugs?",
           "does gus fring ever get caught by the authorities for his crimes?",
           "was hank schrader ever able to find the identity of heisenberg? ",
           "how did hank schrader find out the identity of heisenberg?"]

queries_by_chatgpt = [
    "What is the chemical element and its symbol used to represent the blue meth in Breaking Bad?",
    "Who is the actor that portrays the character Walter White?",
    "In what city does Breaking Bad primarily take place?",
    "What is the name of Walter White's alter ego as a methamphetamine manufacturer?",
    "What is the name of Jesse Pinkman's friend who assists him in his drug dealing?",
    "What is the name of the car wash that serves as a front for money laundering in Breaking Bad?",
    "What is the name of the fast-food restaurant where Saul Goodman's office is located?",
    "What is the name of the company that Gustavo Fring owns and uses as a front for his drug empire?",
    "What is the name of the chemical supply company where Walter White used to work?",
    "What is the name of Walter White's wife?",
    "What is the name of Hank Schrader's wife?",
    "Who is the DEA agent pursuing Walter White?",
    "What is the name of the drug cartel led by Tuco Salamanca?",
    "What is the street name of the drug that Walter White and Jesse Pinkman produce?",
    "What is the name of the lawyer who frequently represents Walter White and Jesse Pinkman?",
    "What is the name of the nursing home where Hector Salamanca resides?",
    "What is the nickname given to Gustavo Fring's drug distribution network?",
    "What is the name of the superlab where methamphetamine is produced in Breaking Bad?",
    "What is the name of the white supremacist gang that becomes a major antagonist in later seasons?",
    "What is the significance of the pink teddy bear in Breaking Bad?",
    "What is the significance of the fly episode in Breaking Bad?",
    "What is the significance of Walter White's hat in Breaking Bad?",
    "What is the significance of the blue meth in Breaking Bad?",
    "What is the significance of the pizza on the roof in Breaking Bad?",
    "What is the significance of the turtle in Breaking Bad?",
    "What is the significance of the color symbolism in Breaking Bad?",
    "What is the significance of the song Baby Blue in the Breaking Bad finale?",
    "What is the significance of Walter White's car in Breaking Bad?",
    "What is the significance of the pink bear's eye in Breaking Bad?",
    "What is the nature of the relationship between Walter White and Jesse Pinkman?",
    "How does the relationship between Walter White and Skyler White evolve throughout Breaking Bad?",
    "What is the dynamic between Walter White and Hank Schrader?",
    "Describe the relationship between Jesse Pinkman and Jane Margolis.",
    "How does the relationship between Gustavo Fring and Hector Salamanca change over time?"
    "What is the main conflict driving the plot of Breaking Bad?",
    "Describe the key events leading to Walter White's transformation into Heisenberg.",
    "How does the conflict between Walter White and Gus Fring escalate throughout the series?",
    "What role does Saul Goodman play in the overall storyline of Breaking Bad?",
    "Explain the significance of the `crawl space` scene in Breaking Bad.",
    "What ethical dilemmas does Walter White face as he becomes involved in the drug trade?",
    "How do the characters in Breaking Bad justify their actions morally?",
    "Discuss the theme of morality and consequences in Breaking Bad.",
    "What role does redemption play in the character arcs of Breaking Bad?",
    "Examine the ethical implications of Jesse Pinkman's involvement in the drug trade."
]
qlist = queries + queries_by_chatgpt

Now, let’s try out the similarity search with different queries

query = queries[2]
print(query)
docs = db.similarity_search(query, k=20)
# docs = db.max_marginal_relevance_search(query, k=10, fetch_k=20, lambda_mult=0.9)
print(docs[0].page_content)
which was the synthesization method used by walter to produce meth?
Walter Hartwell "Walt" White Sr., also known by his clandestine pseudonym and business moniker Heisenberg and also frequently referred to as Mr. White, is an American former chemist and major narcotics distributor from Albuquerque, New Mexico, whose drug empire became the largest meth operation in American history, surpassing both Gustavo Fring's drug empire and the Cartel's. Before entering the drug trade, Walt was a respected chemist and scientist who later worked as an overqualified high school chemistry teacher at J. P. Wynne High School alongside working at the A1A Car Wash to financially support his family (his wife Skyler, son Walt Jr., and infant daughter Holly). After being diagnosed with terminal lung cancer, Walt started manufacturing chemically pure crystal methamphetamine to provide for his family upon his death. Knowing nothing about the drug trade, Walt enlisted the aid of his former student, Jesse Pinkman, to sell the meth he produced. Walt's scientific knowledge and
viz_docs(docs[:4])
Walter Hartwell "Walt" White Sr., also known by his clandestine pseudonym and business moniker Heisenberg and also frequently referred to as Mr. White, is an American former chemist and major narcotics distributor from Albuquerque, New Mexico, whose drug empire became the largest meth operation in American history, surpassing both Gustavo Fring's drug empire and the Cartel's. Before entering the drug trade, Walt was a respected chemist and scientist who later worked as an overqualified high school chemistry teacher at J. P. Wynne High School alongside working at the A1A Car Wash to financially support his family (his wife Skyler, son Walt Jr., and infant daughter Holly). After being diagnosed with terminal lung cancer, Walt started manufacturing chemically pure crystal methamphetamine to provide for his family upon his death. Knowing nothing about the drug trade, Walt enlisted the aid of his former student, Jesse Pinkman, to sell the meth he produced. Walt's scientific knowledge and

chemistry expertise helped eliminate any differences in quality between the two methods. During the course of the series, methylamine is not synthesized by either Walter White or Jesse Pinkman, and thus the search for methylamine for use as a methamphetamine precursor is a large plot point throughout the series. For example, Walt and Jesse once stole methylamine from a warehouse, and, on a later date, from a freight train. The source of methylamine during their cook sessions in the superlab was Golden Moth Chemical.

isomer. No explanation is given how White was able to obtain only the d-meth enantiomer. This is lampshaded when Walt asks Victor, "If our reduction is not stereospecific, then how can our product be enantiomerically pure?". **It's possible that Walter might have used some unknown enantioselective catalyst to yield only d-methamphetamine, which also explains his early obsessive fear of a foreign organism contaminating his batch in the episode "Fly". *During an interview with Howard Stern, Bryan Cranston revealed that DEA chemists showed him and Aaron Paul how to make crystal meth, but they never actually completed the process correctly. *Throughout the show, a big fuss is made of the purity of meth. In reality, multiple recrystallizations are commonly used to increase the purity of any organic chemical compound. *After Breaking Bad, many meth manufacturers in real life dyed their product blue.

Gus originally intends for Gale to be his sole meth cook. Gale excitedly uses a box cutter to unpack the machinery and assemble the new superlab. He informs Gus that a meth sample Gale had analyzed (synthesized by Walter White) was, by far, the best that he had ever seen (with a 99% pure quality). Thus, it was the urging of Gale that caused Gus to put aside his reservations and hire Walter .

We can see that the retriever returns chunks that are related to the original query, and which can lead the LLM to a better, more accurate answer.

# let's store the most similar context for benchmarking
sims = [db.similarity_search(q, k=2)[0].page_content for q in qlist]
with open("similarity_search_bgesmall.json", "w") as fp:
  json.dump(dict(zip(qlist, sims)), fp, indent=4)

BM25 Retriever

Retrieval with embeddings, like those of BGE or MiniLM, is called dense retrieval. You can also use a sparse retriever that leverages word frequencies, like BM25, which can be useful in some cases.

We can also use a hybrid retriever - a combination of both dense and sparse.

As shown below, even basic BM25 search yields decent results.

bm25_retriever = BM25Retriever.from_documents(texts)
result = bm25_retriever.get_relevant_documents(query)
print(query)
viz_docs(result[:3])
which was the synthesization method used by walter to produce meth?
Pseudoephedrine, also known as Pseudo or Sudo for short, is a chemical compound commonly found in anti-allergy medicines. It is also frequently used as the primary ingredient in the production of methamphetamine, which resulted in its use being heavily regulated by law. Walter White and Jesse Pinkman originally used the pseudoephedrine method to make their crystal meth, but soon switched to using methylamine when it became obvious that they could not feasibly obtain enough pseudoephedrine to produce the quantities that they needed to.

Walt and Skyler use the "structuring" method, which is a method of placement by which cash is broken into smaller deposits of money, used to defeat suspicion of money laundering and to avoid anti-money laundering reporting requirements. A sub-component of this is to use smaller amounts of cash to purchase bearer instruments, such as money orders, and then ultimately deposit those, again in small amounts. Gustavo Fring also uses this method with his fast-food restaurant chain Los Pollos Hermanos. Though this was less obvious as the chain had a larger structure and was theoretically possible to be making large amounts of money.

After experiences in the meth trade with Krazy-8 and Emilio, Walter and Jesse eventually decide to expand their drug operation by selling their product to Tuco Salamanca, a powerful but psychopathic drug distributor. The two begin to expand their operations by stealing a large drum of methylamine, thereby allowing them to produce large quantities of meth for Tuco. The methylamine allows them to bypass the difficulty of acquiring pseudoephedrine, and the new method gives their product a blue color while continuing to be highly pure and chemically potent.

Reranker

With retrieval like the above, you already get documents related to the query, which you can pass to the LLM as context.

However, there is an additional step in between that can really improve your retrieval results, called reranking.

Let us say the retriever picks up 20 documents out of 100k that are related to the query. A reranker will then look at those 20 documents and re-order them based on their relevance to the query. This has been seen to improve retrieval results tremendously in many cases.

(Then why not use rerankers over the entire corpus? Because they are more compute-expensive, and take relatively more time to return results. For the architectural differences between rerankers and plain similarity search, see this link.)

# a helper class for reranking with a cross-encoder
import torch
from sentence_transformers import SentenceTransformer, CrossEncoder, util

class Reranker():
  def __init__(self, embedding_func):
    self.embedding_func = embedding_func
    # alternative, larger reranker: 'BAAI/bge-reranker-large'
    self.cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', device=device)

  def rerank(self, query, texts, top_k=10, print_n=3):
    to_print = print_n != -1
    # bi-encoder retrieval: embed the query and candidates, keep the top_k hits
    question_embedding = torch.tensor(self.embedding_func.embed_query(query), device=device)
    corpus_embeddings = torch.tensor(self.embedding_func.embed_documents(texts), device=device)
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]  # get the hits for the first query

    ##### Re-Ranking #####
    # now, score all retrieved passages with the cross_encoder
    cross_inp = [[query, texts[hit['corpus_id']]] for hit in hits]
    cross_scores = self.cross_encoder.predict(cross_inp)

    # attach the cross-encoder scores to the hits
    for idx in range(len(cross_scores)):
        hits[idx]['cross-score'] = cross_scores[idx]

    # output of top hits from the bi-encoder
    if to_print:
        print(f"Top-{print_n} Bi-Encoder Retrieval hits")
    hits = sorted(hits, key=lambda x: x['score'], reverse=True)
    for hit in hits[0:print_n]:
        if to_print:
            print("\t{:.3f}\t{}".format(hit['score'], texts[hit['corpus_id']].replace("\n", " ")))

    # output of top hits from the re-ranker
    if to_print:
        print("\n-------------------------\n")
        print(f"Top-{print_n} Cross-Encoder Re-ranker hits")
    hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    for hit in hits[0:print_n]:
        if to_print:
            print("\t{:.3f}\t{}".format(hit['cross-score'], texts[hit['corpus_id']].replace("\n", " ")))
    return texts[hits[0]['corpus_id']]

  def rerank_documents(self, query, documents, top_k=10, print_n=3):
    texts = [i.page_content for i in documents]
    return self.rerank(query, texts, top_k, print_n)
ce_reranker = Reranker(embedding_func=embeddings)
topn_results = ce_reranker.rerank_documents(query=query,
                                            documents=db.similarity_search(query, k=20),
                                            top_k=20,
                                            print_n=4)
Top-4 Bi-Encoder Retrieval hits
    0.796   Walter Hartwell "Walt" White Sr., also known by his clandestine pseudonym and business moniker Heisenberg and also frequently referred to as Mr. White, is an American former chemist and major narcotics distributor from Albuquerque, New Mexico, whose drug empire became the largest meth operation in American history, surpassing both Gustavo Fring's drug empire and the Cartel's. Before entering the drug trade, Walt was a respected chemist and scientist who later worked as an overqualified high school chemistry teacher at J. P. Wynne High School alongside working at the A1A Car Wash to financially support his family (his wife Skyler, son Walt Jr., and infant daughter Holly). After being diagnosed with terminal lung cancer, Walt started manufacturing chemically pure crystal methamphetamine to provide for his family upon his death. Knowing nothing about the drug trade, Walt enlisted the aid of his former student, Jesse Pinkman, to sell the meth he produced. Walt's scientific knowledge and
    0.783   chemistry expertise helped eliminate any differences in quality between the two methods. During the course of the series, methylamine is not synthesized by either Walter White or Jesse Pinkman, and thus the search for methylamine for use as a methamphetamine precursor is a large plot point throughout the series. For example, Walt and Jesse once stole methylamine from a warehouse, and, on a later date, from a freight train. The source of methylamine during their cook sessions in the superlab was Golden Moth Chemical.
    0.783   isomer. No explanation is given how White was able to obtain only the d-meth enantiomer. This is lampshaded when Walt asks Victor, "If our reduction is not stereospecific, then how can our product be enantiomerically pure?". **It's possible that Walter might have used some unknown enantioselective catalyst to yield only d-methamphetamine, which also explains his early obsessive fear of a foreign organism contaminating his batch in the episode "Fly". *During an interview with Howard Stern, Bryan Cranston revealed that DEA chemists showed him and Aaron Paul how to make crystal meth, but they never actually completed the process correctly. *Throughout the show, a big fuss is made of the purity of meth. In reality, multiple recrystallizations are commonly used to increase the purity of any organic chemical compound. *After Breaking Bad, many meth manufacturers in real life dyed their product blue.
    0.779   Gus originally intends for Gale to be his sole meth cook. Gale excitedly uses a box cutter to unpack the machinery and assemble the new superlab. He informs Gus that a meth sample Gale had analyzed (synthesized by Walter White) was, by far, the best that he had ever seen (with a 99% pure quality). Thus, it was the urging of Gale that caused Gus to put aside his reservations and hire Walter .

-------------------------

Top-4 Cross-Encoder Re-ranker hits
    3.171   for his family upon his death. Knowing nothing about the drug trade, Walter enlisted the aid of his former student, Jose Miguel Rosas, to sell the meth he produced. Walter's scientific knowledge and dedication to quality lead him to produce crystal meth that is purer and more potent than any competitors'. To avoid the tedious collection of pseudoephedrine required for production, Walter devises an alternative chemical process utilizing methylamine, giving his product a distinctive blue color. His crystal meth dominates the market, leading to confrontations with established drug makers and dealers. Although Walter and Jose began as amateur small-time meth cooks, manufacturing the drug out of a stolen school bus in the jungle-covered outskirts of Bogota, and being met with very limited success, Walter and Jose soon climbed up the drug hierarchy, killing or systematically destroying anyone who impeded them. Because of his drug-related activities, Walter eventually finds himself at odds
    3.109   In order to produce higher amounts of crystal meth, Walt tells Jesse that they're switching the recipe, from the classical pseudo reduction to the older method of reductive amination of phenylacetone (P2P) with methylamine, and they will get P2P by heating phenylacetic acid with acetic acid or acetic anhydride in a tube furnace, in the presence of thorium oxide catalyst. Later, in Jesse's basement, we see them making phenylacetone from phenylacetic acid using a 70 mm tube furnace.
    2.888   Pseudoephedrine, also known as Pseudo or Sudo for short, is a chemical compound commonly found in anti-allergy medicines. It is also frequently used as the primary ingredient in the production of methamphetamine, which resulted in its use being heavily regulated by law. Walter White and Jesse Pinkman originally used the pseudoephedrine method to make their crystal meth, but soon switched to using methylamine when it became obvious that they could not feasibly obtain enough pseudoephedrine to produce the quantities that they needed to.
    2.764   isomer. No explanation is given how White was able to obtain only the d-meth enantiomer. This is lampshaded when Walt asks Victor, "If our reduction is not stereospecific, then how can our product be enantiomerically pure?". **It's possible that Walter might have used some unknown enantioselective catalyst to yield only d-methamphetamine, which also explains his early obsessive fear of a foreign organism contaminating his batch in the episode "Fly". *During an interview with Howard Stern, Bryan Cranston revealed that DEA chemists showed him and Aaron Paul how to make crystal meth, but they never actually completed the process correctly. *Throughout the show, a big fuss is made of the purity of meth. In reality, multiple recrystallizations are commonly used to increase the purity of any organic chemical compound. *After Breaking Bad, many meth manufacturers in real life dyed their product blue.

In practice, the results may differ. I saw poor results with the smallest reranker (ms-marco-MiniLM-L-6-v2), but BAAI/bge-reranker-large gave me better results. Both are better than plain embedding search.

# note that this Reranker class is not compatible with langchain;
# we set it up only for quick evaluation of rerankers.
# later in this notebook we will set up a langchain-compatible reranker
# that becomes part of the chain, hence deleting this one
del ce_reranker

Trying Out ColBERT

ColBERT is a dense retrieval method - however, instead of using one embedding per chunk (as other methods do), it uses multiple embeddings per chunk, one per token. This enables finer-grained retrieval.

There are two ways we can try out ColBERT.

ColBERT as a reranker

With MiniLM we have embeddings that we use for semantic search, and we found that reranking with a cross-encoder gives good results. ColBERT can be used in place of the cross-encoder model to rerank the results.

ColBERT as a retriever

Or we can skip the MiniLM model altogether, and use ColBERT for the retrieval as well!
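
Both modes are easy to try via the ragatouille library we installed at the top. A minimal sketch (the index name is arbitrary, and the exact API may differ across ragatouille versions):

from ragatouille import RAGPretrainedModel

colbert = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# as a reranker: re-order documents fetched by another retriever
candidates = [d.page_content for d in db.similarity_search(query, k=20)]
reranked = colbert.rerank(query=query, documents=candidates, k=5)

# as a retriever: index all chunks and search with ColBERT directly
colbert.index(index_name="breakingbad", collection=[t.page_content for t in texts])
results = colbert.search(query=query, k=5)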

Results

ColBERT, for this use case, was on par with the MiniLM models but not better than the BGE series of models we are using. It also took a lot of time to compute embeddings, and even a small corpus produces a lot of embeddings, so the feasibility of ColBERT for large corpora is not guaranteed.

The code for ColBERT hence isn’t a part of this blogpost, but can be found in the Colab notebook if you want to try it out.

Rerank Compatibility with Langchain

Despite the usefulness of a reranker, there is no direct support for a sentence-transformers cross-encoder in Langchain; out of the box, Langchain supports only the Cohere Reranker API.

There are two ways to work around this:

  1. Create your own “chain” where you code the retrieval, reranker, prompt creation, and LLM generation.
  2. Create a reranker using Langchain’s document compressor class and use the native Langchain chaining.

We go with #2 here, and create a reranker using document compressor class, which we will use in the final chain.

from __future__ import annotations
from typing import Optional, Sequence
from langchain.schema import Document

from langchain.callbacks.manager import Callbacks
from langchain.retrievers.document_compressors.base import BaseDocumentCompressor

from sentence_transformers import CrossEncoder
from langchain.pydantic_v1 import Extra

class BgeRerank(BaseDocumentCompressor):
    model_name: str = 'BAAI/bge-reranker-large'  # alternative: 'cross-encoder/ms-marco-MiniLM-L-6-v2'
    """Model name to use for reranking."""
    top_n: int = 3
    """Number of documents to return."""
    model: CrossEncoder = CrossEncoder(model_name)
    """CrossEncoder instance to use for reranking."""

    def bge_rerank(self, query, docs):
        # score each (query, doc) pair and return the indices and scores of the top_n docs
        model_inputs = [[query, doc] for doc in docs]
        scores = self.model.predict(model_inputs)
        results = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
        return results[:self.top_n]

    class Config:
        """Configuration for this pydantic object."""

        extra = Extra.forbid
        arbitrary_types_allowed = True

    def compress_documents(
        self,
        documents: Sequence[Document],
        query: str,
        callbacks: Optional[Callbacks] = None,
    ) -> Sequence[Document]:
        if len(documents) == 0:  # to avoid empty api call
            return []
        doc_list = list(documents)
        _docs = [d.page_content for d in doc_list]
        results = self.bge_rerank(query, _docs)
        final_results = []
        for r in results:
            doc = doc_list[r[0]]
            doc.metadata["relevance_score"] = r[1]
            final_results.append(doc)
        return final_results
# initialize and test the reranker
reranker = BgeRerank()
reranker.compress_documents(documents=docs, query=query)
[Document(page_content='By , obtaining pseudoephedrine for the large-scale production that Walt desires becomes an issue. To circumvent this, Walt decides on an alternate synthesis - reductive amination of phenyl-2-propanone (phenylacetone or P2P) with methylamine. Obtaining methylamine required for this reaction - which is on the DEA watch list, a list of chemicals the DEA has classified as having use in drug manufacture - becomes a major plot line throughout the seasons. While working in the superlab, the methylamine is supplied by Golden Moth Chemical. Walt obtains his P2P from phenylacetic acid and acetic anhydride. The P2P is created in a tube furnace charged with a thorium oxide catalyst. The reductive amination of P2P takes place in the presence of aluminum amalgam. From the 1960s to the mid-1980s, reductive amination was the method of choice for clandestine methamphetamine production. Enterprising biker gangs who dominated the trade at this time mostly ran these operations. (The slang term "crank" for', metadata={'Header 1': 'Blue Sky', 'Header 2': 'The Chemistry', 'Header 3': "Walt's method", 'relevance_score': 0.9652976}),
 Document(page_content="for his family upon his death. Knowing nothing about the drug trade, Walter enlisted the aid of his former student, Jose Miguel Rosas, to sell the meth he produced. Walter's scientific knowledge and dedication to quality lead him to produce crystal meth that is purer and more potent than any competitors'. To avoid the tedious collection of pseudoephedrine required for production, Walter devises an alternative chemical process utilizing methylamine, giving his product a distinctive blue color. His crystal meth dominates the market, leading to confrontations with established drug makers and dealers. Although Walter and Jose began as amateur small-time meth cooks, manufacturing the drug out of a stolen school bus in the jungle-covered outskirts of Bogota, and being met with very limited success, Walter and Jose soon climbed up the drug hierarchy, killing or systematically destroying anyone who impeded them. Because of his drug-related activities, Walter eventually finds himself at odds", metadata={'Header 1': 'Metastasis/Walter Blanco', 'relevance_score': 0.92187774}),
 Document(page_content="In order to produce higher amounts of crystal meth, Walt tells Jesse that they're switching the recipe, from the classical pseudo reduction to the older method of reductive amination of phenylacetone (P2P) with methylamine, and they will get P2P by heating phenylacetic acid with acetic acid or acetic anhydride in a tube furnace, in the presence of thorium oxide catalyst. Later, in Jesse's basement, we see them making phenylacetone from phenylacetic acid using a 70 mm tube furnace.", metadata={'Header 1': 'Phenylacetic acid', 'Header 2': 'History', 'Header 3': 'Season 1', 'relevance_score': 0.90127265})]
# convert the db object to langchain retriever
# this db contains all our texts and their embeddings
retriever = db.as_retriever()  # or: db.as_retriever(search_type="mmr")
print(query, "\n")
retriever.invoke(query)[0].page_content
which was the synthesization method used by walter to produce meth? 
'Walter Hartwell "Walt" White Sr., also known by his clandestine pseudonym and business moniker Heisenberg and also frequently referred to as Mr. White, is an American former chemist and major narcotics distributor from Albuquerque, New Mexico, whose drug empire became the largest meth operation in American history, surpassing both Gustavo Fring\'s drug empire and the Cartel\'s. Before entering the drug trade, Walt was a respected chemist and scientist who later worked as an overqualified high school chemistry teacher at J. P. Wynne High School alongside working at the A1A Car Wash to financially support his family (his wife Skyler, son Walt Jr., and infant daughter Holly). After being diagnosed with terminal lung cancer, Walt started manufacturing chemically pure crystal methamphetamine to provide for his family upon his death. Knowing nothing about the drug trade, Walt enlisted the aid of his former student, Jesse Pinkman, to sell the meth he produced. Walt\'s scientific knowledge and'
# combine the retriever and reranker into a single compression retriever
from langchain.retrievers import ContextualCompressionRetriever

compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker, base_retriever=retriever
)
idx = -4
print(queries_by_chatgpt[idx])
compression_retriever.get_relevant_documents(queries_by_chatgpt[idx])
How do the characters in Breaking Bad justify their actions morally?
[Document(page_content="also rarely, if ever, admits responsibility for problems that are clearly his own fault and is quick to blame others and find an excuse for said problems. Notable examples of his ignorance in this regard include him bitterly blaming his former colleagues Gretchen and Elliot for ruining his life and stealing his work all the while completely ignoring the fact that he himself chose to leave the business he helped to co-fund, later revealed to be due to feelings of inferiority to Gretchen's family and set himself down pathways to failure. Another noted example is evident when he blames Mike for screwing up and putting the DEA on his own trail while refusing to admit that his killing of Gus did nothing but cause disaster and put the DEA on all of Gus' former associates. Walt's severe ignorance makes him almost the polar opposite of Jesse who actually faces and feels remorse for what he has done and accepts responsibility for it. This is highlighted by Jesse blaming himself for Jane's", metadata={'Header 1': 'Walter White/Personality and Traits', 'relevance_score': 0.99879384}),
 Document(page_content="to his will, and enrich himself without limit for the sake of obtaining power, even if it means to hurt or kill other people. Walt is an extremely prideful, egotistical and arrogant man who takes criticism extremely poorly and his pride blinds him to the point that he makes poor and costly decisions despite having high intelligence. Despite his massive ego, he does have genuine insecurities though he almost always refuses to acknowledge or confront them. A notable example of his fragile pride and ego is evident when he and Skyler need to buy a business to launder their drug money, Walt becomes determined to purchase the very same car wash that wounded his pride when Skyler mentions that the owner, Bogdan, insulted his manhood. Walt also refuses to let Bogdan keep his framed dollar on the wall, and out of spite he decides to use that dollar to buy a soda from the vending machine . Walt's pride, ego and arrogance is what keeps him from accepting any form of financial aid such as", metadata={'Header 1': 'Walter White', 'Header 2': 'Personality and traits', 'relevance_score': 0.9969177}),
 Document(page_content="will, and enrich himself without limit for the sake of obtaining power, even if it means to hurt or kill other people.Walt is an extremely prideful, egotistical and arrogant man who takes criticism extremely poorly and his pride blinds him to the point that he makes poor and costly decisions despite having high intelligence. Despite his massive ego, he does have genuine insecurities though he almost always refuses to acknowledge or confront them. A notable example of his fragile pride and ego is evident when he and Skyler need to buy a business to launder their drug money, Walt becomes determined to purchase the very same car wash that wounded his pride when Skyler mentions that the owner, Bogdan, insulted his manhood. Walt also refuses to let Bogdan keep his framed dollar on the wall, and out of spite he decides to use that dollar to buy a soda from the vending machine . Walt's pride, ego and arrogance is what keeps him from accepting any form of financial aid such as Gretchen and", metadata={'Header 1': 'Walter White/Personality and Traits', 'relevance_score': 0.9958597})]
from langchain import hub
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain.llms.huggingface_pipeline import HuggingFacePipeline
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from transformers import BitsAndBytesConfig

# we can manually design our prompt for question-answering using context
# or just pull one example from langchain hub
prompt = hub.pull("rlm/rag-prompt")
print(prompt.format(question="QUESTION", context="CONTEXT"))
Human: You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: QUESTION 
Context: CONTEXT 
Answer:

Text Generation

This is the part where we choose a model to generate the answer. We can either go with an API (like OpenAI or Gemini) or load our own model and prompt it.

We will go with the latter, just to see what the free-tier Colab can handle.

In order to choose a model, we can go to a leaderboard like the huggingface open-llm leaderboard and filter by our requirements, including model size.

Here, we’re going with Mistral-7B-Instruct, the best 7B model as of this writing.

However, we cannot load the 7B model as-is onto the Colab GPU, because the T4 GPU does not have enough memory to hold it. Fortunately, there is a trick called quantization that lets us load the model into limited memory. There are multiple ways to load a quantized model:

  1. bitsandbytes: You can directly load a quantized model using the bitsandbytes library.
  2. auto-gptq: This quantization method requires you to calibrate the model on input data, which can take hours and more compute. The final, quantized model can then be used directly; it’s faster than bitsandbytes at inference.
  3. auto-awq: This is another quantization method, and it too requires calibration with input data. Thankfully, some models have GPTQ and AWQ versions uploaded to huggingface that we can use directly.

However, I was unable to get the Mistral-7B AWQ model to run, and the Mistral-7B GPTQ model performed quite poorly. In the end, I went with bitsandbytes. With bitsandbytes, we can load either 8-bit or 4-bit quantized models; we go with 4-bit for better speed.

We load the model using huggingface. Langchain has a class that easily instantiates an LLM object from a huggingface pipeline.

from transformers import BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_double_quant = AutoModelForCausalLM.from_pretrained(model_id,
                                                          quantization_config=double_quant_config,
                                                          device_map="auto")

tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation",
                model=model_double_quant,
                tokenizer=tokenizer,
                max_new_tokens=200)
llm = HuggingFacePipeline(pipeline=pipe)

Chaining in Langchain

Now that we have the components that make up the system, we need to chain them together so that a query gives us the final answer. We could do that manually - take the output of the first component, pass it as input to the second component, and so on - as in the sketch below.
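
A minimal sketch of the manual version, built from the components we already have:

# manual chaining: each component's output feeds the next
def answer_query(query):
    docs = compression_retriever.get_relevant_documents(query)    # retrieve + rerank
    context = "\n\n".join(d.page_content for d in docs)           # build the context string
    full_prompt = prompt.format(question=query, context=context)  # fill the prompt template
    return llm.invoke(full_prompt)                                # generate the answer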

Or we can use Langchain’s chaining feature, which lets us do the same in one line of code:

# test out the retriever+prompt template for debugging
( {"context": compression_retriever, "question": RunnablePassthrough()}
 | prompt).invoke(queries_by_chatgpt[-2])
ChatPromptValue(messages=[HumanMessage(content='You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don\'t know the answer, just say that you don\'t know. Use three sentences maximum and keep the answer concise.\nQuestion: What role does redemption play in the character arcs of Breaking Bad? \nContext: [Document(page_content=\'He is shown to possess a kingpin\\\'s unbeatable survival skills: sociopathy, cunning, emotional manipulation, meticulousness, and violence - or at least the threat thereof. Bryan Cranston said by the fourth season: "I think Walt\\\'s figured out it\\\'s better to be a pursuer than the pursued. He\\\'s well on his way to badass. Over the course of the series, he\\\'s evolved as a businessman, but he\\\'s turned into a sociopath in both his personal and professional lives. He\\\'s shed basic empathy and has no idea how much his colleagues and wife loathe him." As Walt delves deeper into the criminal underworld he increasingly sees people as expendable pawns, who he either manipulates to further his interests or eliminates. Early on such as in Season 1, Walt has great difficulty bringing himself to murder, but by the end of Season 3, he barely gives killing a second thought as shown by ordering Jesse to murder Gale to ensure their own survival and later was also capable of poisoning a young child without\', metadata={\'Header 1\': \'Walter White\', \'Header 2\': \'Personality and traits\', \'relevance_score\': 0.9915188}), Document(page_content=\'willing to turn a blind eye to what crimes his clients have committed and help them to evade justice and allow them to remain free. Despite this Ed appears to have somewhat of a genuine desire to allow criminals to start a new life free of crime and give them a second chance as seen by his interactions with Jesse after extracting him to Alaska, telling them how not many of them get a chance to start fresh and also to allow Walt to spend the rest of his short life in a relaxed environment. He even provides additional serves to his clients without receiving anything in return such as mailing a letter from Jesse to Brock Cantillo, seemingly as a pure gesture of kindness. In his civilian work he is shown to be a reasonable and caring salesman willing to provide quality service to his customers, which is shown to extend to his criminal extraction services as well.\', metadata={\'Header 1\': \'Ed Galbraith\', \'Header 2\': \'Personality and traits\', \'relevance_score\': 0.74466217}), Document(page_content=\'a kingpin\\\'s unbeatable survival skills: sociopathy, cunning, emotional manipulation, meticulousness, and violence - or at least the threat thereof. Bryan Cranston said by the fourth season: "I think Walt\\\'s figured out it\\\'s better to be a pursuer than the pursued. He\\\'s well on his way to badass. Over the course of the series, he\\\'s evolved as a businessman, but he\\\'s turned into a sociopath in both his personal and professional lives. He\\\'s shed basic empathy and has no idea how much his colleagues and wife loathe him."As Walt delves deeper into the criminal underworld he increasingly sees people as expendable pawns, who he either manipulates to further his interests or eliminates. 
Early on such as in Season 1, Walt has great difficulty bringing himself to murder, but by the end of Season 3, he barely gives killing a second thought as shown by ordering Jesse to murder Gale to ensure their own survival and later was also capable of poisoning a young child without any remorse at all.\', metadata={\'Header 1\': \'Walter White/Personality and Traits\', \'relevance_score\': 0.07892358})] \nAnswer:')])
# create the complete chain for rag
rag_chain = ({"context": compression_retriever, "question": RunnablePassthrough()}
             | prompt
             | llm
             | StrOutputParser())

Let’s try out an example!

# try out an example
idx = 0
print(queries[idx])
rag_chain.invoke(queries[idx])
how did jane, the girlfriend of jessie, die?
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
' In the context provided, Jane is shown to have died by overdose. Walt reveals that he watched her die but did not intervene.'
idx = -9
print(queries_by_chatgpt[idx])
rag_chain.invoke(queries_by_chatgpt[idx])
Describe the key events leading to Walter White's transformation into Heisenberg.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
' In his entire adult life, Walter White had been suppressing his emotions. He begins to embrace his darker side after throwing a piece of fulminated mercury to trigger an explosion, marking his transformation into Heisenberg. This new persona makes him confident, strong, and cruel. Eventually, Walt fully transforms into the dangerous drug kingpin Heisenberg, preferring to die in a fight and leave a legacy instead of giving in.'

Is RAG Necessary?

But is the context even useful? How do we know the model isn’t able to answer the question without retrieving documents?

Let’s try!

chain_without_rag =  (RunnablePassthrough()
    | llm
    | StrOutputParser()
)
chain_without_rag.invoke(queries[0])
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
'\n\nJane, the girlfriend of Jessie, was killed by the serial killer, John Doe, in the movie "Seven." She was strangled with a ligature while in the bathtub. The scene is particularly disturbing due to the fact that it is intercut with the scene of Mills and Sommers finding the body of the first victim, Tracy Miles.\n\n## Who is the killer in Seven?\n\nThe killer in the movie "Seven" is John Doe, portrayed by Kevin Spacey. He is a disturbed and intelligent individual who commits a series of gruesome murders based on the seven deadly sins. The motive behind his crimes is to punish society for its moral decay and to force detective William Somerset (Morgan Freeman) to confront his own disillusionment with the world.\n\n## What is the significance of the number seven in Seven?\n\nThe number seven is a'
(RunnablePassthrough()
    | llm
    | StrOutputParser()
).invoke(queries[1])
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
'\n\nHank Schrader is portrayed by Bryan Cranston in the television series "Breaking Bad."\n\nBryan Cranston is an American actor who has had a successful career in both film and television. He is best known for his role as Walter White, or "Heisenberg," in "Breaking Bad," for which he won numerous awards, including four Primetime Emmy Awards for Outstanding Lead Actor in a Drama Series.\n\nCranston\'s performance as Hank Schrader is also noteworthy, as he portrays a DEA agent who becomes suspicious of his brother-in-law, Walter White, and eventually sets out to bring him to justice. The complex and nuanced portrayal of Hank\'s character adds depth to the story and keeps audiences engaged throughout the series.\n\nBryan Cranston\'s acting skills and dedication to his craft have earned him a'

As you can see, without RAG the LLM gives completely incorrect answers! The first reply confuses our question with the movie Seven, and the second claims Hank Schrader is portrayed by Bryan Cranston - who actually plays Walter White; Hank is played by Dean Norris. Without any context, the model's answers are full of inaccuracies.

Hybrid Retrieval

Remember that we used a BM25 retriever at the beginning, and discussed hybrid retrieval, which combines multiple retrievers. Here, we will ensemble our BGE-small dense retriever with the BM25 retriever, and add the reranker on top of both.

ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, retriever], weights=[0.5, 0.5]
)
query = queries[1]
print(query, "\n")
ensemble_retriever.invoke(query)[0].page_content
which actor plays hank schrader? 
'Jolene Purdy is the actor who plays Cara on .'

In this case, the hybrid retriever is giving us worse results than pure embeddings search.

compression_retriever_hybrid = ContextualCompressionRetriever(
    base_compressor=reranker, base_retriever=ensemble_retriever
)
query = queries[1]
print(query, "\n")
compression_retriever_hybrid.invoke(query)[0].page_content
which actor plays hank schrader? 
'Dean Norris is an American actor who portrays DEA agent Hank Schrader on and .'

But after adding the reranker on top of the hybrid retriever, we see that the results are good! The choice between hybrid and pure embedding retrieval requires more evaluation.

Improvements

This is meant to be a basic script for RAG that covers the most important topics. If we were to improve this system further, there are many things we could try:

  • Better Queries: We can try giving harder, more complex queries to find the failure points of the system. Fixing these failures will give us a better RAG system. For example, our current system fails to answer the question ‘who directed the episode where hank schrader discovers heisenberg?’
  • Metadata: We have metadata about the markdown headers that we did not use.
  • Summarization of Articles: We could create summaries of entire documents using LLMs, which could themselves be added to the database as documents. This compresses information and provides chunks with a larger field of view.
  • More Sources: We have only used the MediaWiki dump, but we could add the scripts of the episodes, reddit posts, forums, etc. to the database.
  • Automated Evaluation: We can ask an LLM to generate a question from a chunk, and feed that question to the RAG system to see if it correctly identifies the chunk as well as the answer (see the sketch after this list).
  • Advanced Techniques: There are some techniques like query expansion that can be useful for complex queries.
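
To make the automated evaluation idea concrete, here is a minimal sketch; the prompt wording and the exact-match check are illustrative assumptions:

import random

def evaluate_retrieval(texts, n_samples=20, k=3):
    """Ask the LLM to write a question for a random chunk, then check
    whether the retriever brings that same chunk back."""
    hits = 0
    for chunk in random.sample(texts, n_samples):
        question = llm.invoke(
            "Write one short trivia question answerable only from this passage:\n\n"
            + chunk.page_content
        )
        retrieved = compression_retriever.get_relevant_documents(question)[:k]
        if any(r.page_content == chunk.page_content for r in retrieved):
            hits += 1
    return hits / n_samples

# evaluate_retrieval(texts)  # fraction of generated questions whose source chunk is retrieved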

Finally, our choice of document source itself has limitations. Our chosen source is a very simple one; real-life use cases can involve more complex documentation that would require more tuning to build a good RAG system.