Tuesday, March 12, 2024

Build an Open Source Research Chat Assistant with Ollama and RAG

In my previous post, "Build a Chat Application with Ollama and Open Source Models," I went through the steps of building a Streamlit chat application that used Ollama to run the open source model Mistral locally on my machine. Refer to that post for help in setting up Ollama and Mistral. In this post, I will extend some of those ideas and show how to create a "Research Assistant" using Ollama, Mistral, RAG, LlamaIndex, and Streamlit.

This application will have two parts:

  1. Document retrieval: I will build a page that will use the arXiv repository API to pull the most relevant documents for a topic into a vector based index using LlamaIndex.
  2. Document chat: Based on all of the documents that have been pulled into the vector database, I will build a chat interface page that allows the user to chat on topics that are in the database using either Mistral or OpenAI - the user will be able to pick which LLM they want to use to chat with all of the documents that have been built up in the database.

Here are screenshots of the two pages in the application:

Document Chat


Data Acquisition (downloads data from arXiv)



But first what is RAG?

Retrieval Augmented Generation (RAG) systems are AI systems that improve an output's relevance and accuracy by combining the strengths of large language models with information retrieval. The basic idea is to enhance the language model's output by retrieving and incorporating relevant information from a large corpus of text or a knowledge base. This approach aims to address the limitations of traditional language models, which can sometimes generate factually incorrect or inconsistent text due to their limited knowledge or understanding of the world.

The key advantages of retrieval augmented generation include:

  • Improved factual accuracy and consistency - reducing hallucinations: By incorporating relevant information from external sources, the generated text is more likely to be factually accurate and consistent with real-world knowledge.
  • Enhanced knowledge coverage: The model can leverage a vast amount of information from a knowledge base, effectively expanding its knowledge beyond what is encoded in a language model.
  • Adaptability: The retrieval can be tailored to specific domains or knowledge sources, allowing the model to generate text that is relevant and accurate for a particular domain or task.
  • Overcoming a model's training cutoff date: Language models have an effective cutoff date that they have been trained on and cannot respond accurately on events that happened after that date. By using RAG with new documents, the LLM can have access to knowledge past its cutoff date.
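
To make the retrieve-then-generate idea concrete, here is a minimal sketch in plain Python. It is only an illustration: the word-overlap scoring stands in for real vector search, the three-sentence corpus is made up, and the function names (tokenize, retrieve, build_prompt) are hypothetical and not part of LlamaIndex.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=2):
    """Rank documents by word overlap with the question (a stand-in for vector search)."""
    question_tokens = tokenize(question)
    scored = []
    for doc in documents:
        doc_tokens = tokenize(doc)
        union = question_tokens | doc_tokens
        score = len(question_tokens & doc_tokens) / len(union) if union else 0.0
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop documents that share no words with the question
    return [doc for score, doc in scored[:top_k] if score > 0]

def build_prompt(question, context_docs):
    """Augment the question with the retrieved passages before sending it to an LLM."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

# Made-up corpus for illustration only
corpus = [
    "Mamba is a selective state space model for sequence modeling.",
    "QLoRA fine-tunes quantized language models with low-rank adapters.",
    "Streamlit turns Python scripts into web apps.",
]
question = "What is Mamba?"
prompt = build_prompt(question, retrieve(question, corpus, top_k=1))
print(prompt)
```

The prompt that reaches the LLM now carries the retrieved passage, which is what lets the model answer from documents it was never trained on.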

In this application I will use LlamaIndex to implement RAG. LlamaIndex is great at ingesting data from a wide variety of sources (PDFs, Word files, images, audio, PPT, etc.). It has a very convenient class called SimpleDirectoryReader that reads through all of the files in a directory and loads every file whose type it supports. These files will be stored as vector based embeddings. From the LlamaIndex documentation:

"Embeddings are used in LlamaIndex to represent your documents using a sophisticated numerical representation. Embedding models take text as input, and return a long list of numbers used to capture the semantics of the text. These embedding models have been trained to represent text this way, and help enable many applications, including search!"
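
As a rough illustration of "a long list of numbers used to capture the semantics of the text," the sketch below turns text into count vectors and compares them with cosine similarity. This is a toy: real embedding models learn dense vectors, and the embed function and sample texts here are assumptions made up for the example.

```python
import math
import re
from collections import Counter

def embed(text, vocabulary):
    """Toy embedding: one count per vocabulary word (real models learn dense vectors)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return [float(counts[word]) for word in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up texts for illustration only
texts = [
    "state space models for long sequences",
    "low rank adapters for fine tuning",
]
vocabulary = sorted({w for t in texts for w in re.findall(r"[a-z0-9]+", t.lower())})
vectors = [embed(t, vocabulary) for t in texts]

# Search: embed the query the same way and pick the closest stored vector
query_vector = embed("long sequences with state space models", vocabulary)
scores = [cosine_similarity(query_vector, v) for v in vectors]
best = texts[scores.index(max(scores))]
print(best)
```

A vector index does the same comparison, only with learned embeddings and data structures that scale to many documents.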

But before we can embed some documents to search and chat with, we need to get some documents for our database. This is the data acquisition page from above. On this page, the user will enter a topic such as "Mamba in AI" and the code will use the arXiv API to download the most relevant and recent PDF documents. arXiv is a repository of several million scholarly documents from everything from computer science to physics.

The code will then create embeddings for those documents and make those embeddings available to chat with. In this application, I am using OpenAI embeddings (text-embedding-ada-002) for the GPT path and a local embedding model, with the vectors stored in Qdrant, for the Mistral path. In a real application you would most likely use only one LLM with one kind of embedding, but for illustration purposes I'm doing both OpenAI and Mistral. Qdrant is an open source vector database, and the LlamaIndex documentation for Qdrant can be found here. So if you want a completely open source stack for the LLM, the embeddings, and the vector store, and don't want to worry about token pricing, then Mistral with a local embedding model and Qdrant is one of many possible options.

The code to pull the documents for the topic from arXiv is in "research.py." It uses a Python module, appropriately called arxiv, that makes the arXiv API a little easier to use.

pip install arxiv
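
For context, arXiv exposes a public HTTP query API, and the arxiv package is a wrapper around it. The sketch below only builds a query URL and does not send a request. The helper name build_arxiv_query_url and the "all:" field prefix are my assumptions for illustration; the exact request the arxiv package sends may differ.

```python
from urllib.parse import urlencode

ARXIV_API_URL = "http://export.arxiv.org/api/query"

def build_arxiv_query_url(topic, max_results=10):
    """Build a relevance-sorted arXiv API query URL for a free-text topic."""
    params = {
        "search_query": f"all:{topic}",  # assumption: search every field
        "start": 0,
        "max_results": max_results,
        "sortBy": "relevance",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API_URL}?{urlencode(params)}"

url = build_arxiv_query_url("Mamba in AI", max_results=5)
print(url)
```

The API answers with an Atom feed, which is the part the arxiv package parses into result objects with titles, summaries, authors, and PDF links.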

research.py:

''' Get papers from arXiv using the arXiv API '''

import os

import arxiv
import pandas as pd

def get_arxiv(query, num_documents):

    # Ask arXiv for the most relevant papers on the topic
    search = arxiv.Search(
        query=query,
        max_results=num_documents,
        sort_by=arxiv.SortCriterion.Relevance,
        sort_order=arxiv.SortOrder.Descending,
    )

    titles, summaries, authors, published, links = [], [], [], [], []

    # Make sure the download folder exists before saving PDFs into it
    os.makedirs("documents", exist_ok=True)

    # Note: arxiv 2.x deprecates search.results() in favor of arxiv.Client().results(search)
    for result in search.results():
        titles.append(result.title)
        summaries.append(result.summary)
        authors.append(', '.join(author.name for author in result.authors))
        published.append(result.published)
        links.append(', '.join(str(link) for link in result.links))

        # Save the PDF so it can be indexed later
        result.download_pdf(dirpath="./documents")

    df = pd.DataFrame({'title': titles, 'summary': summaries, 'authors': authors,
                       'published': published, 'links': links})

    # Newest papers first
    df = df.sort_values(by='published', ascending=False)
    df = df.reset_index(drop=True)

    # Append to the metadata file if it already exists, otherwise create it
    if os.path.exists('documents.csv'):
        df.to_csv('documents.csv', mode='a', header=False, index=False)
    else:
        df.to_csv('documents.csv', index=False)

    return df

The code uses the arxiv "Search" class to get the most relevant articles for the user's topic, up to the requested number of documents, and stores those documents in the "documents" folder. I then write the metadata for the articles, appending it to a "documents.csv" file. One change that you could make here is to instead store this metadata in a database.
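
As a sketch of that database alternative, the example below stores the same five metadata fields in SQLite using only the standard library. The table name, the helper names (save_papers, load_papers), and the sample record are hypothetical. Making the title the primary key is my assumption; it stands in for the drop_duplicates step that the CSV approach needs.

```python
import sqlite3

def save_papers(conn, papers):
    """Insert paper metadata, skipping titles that are already stored."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS papers (
               title TEXT PRIMARY KEY,
               summary TEXT,
               authors TEXT,
               published TEXT,
               links TEXT
           )"""
    )
    conn.executemany(
        "INSERT OR IGNORE INTO papers VALUES (:title, :summary, :authors, :published, :links)",
        papers,
    )
    conn.commit()

def load_papers(conn):
    """Return stored papers, newest first, as a list of dicts."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute("SELECT * FROM papers ORDER BY published DESC").fetchall()
    return [dict(row) for row in rows]

conn = sqlite3.connect(":memory:")  # use a file path such as "documents.db" to persist
paper = {
    "title": "Example paper",  # placeholder record, not a real arXiv entry
    "summary": "Placeholder summary.",
    "authors": "A. Author, B. Author",
    "published": "2024-01-15",
    "links": "https://example.org/paper",
}
save_papers(conn, [paper])
save_papers(conn, [paper])  # second insert is ignored because the title already exists
print(len(load_papers(conn)))
```

Because duplicates are rejected at insert time, the page that displays the table would no longer need to de-duplicate on every read.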

This arxiv function is called from the Streamlit UI page: "1 - Data Acquistion.py." This Streamlit page will ask the user for a topic and the maximum number of documents to retrieve. After receiving the documents from the arxiv function, it will create the embeddings and the LlamaIndex client (query_engine) in "client.py."

1 - Data Acquistion.py:

import os

import pandas as pd
import streamlit as st

from pages.utilities.research import get_arxiv
from pages.utilities.client import get_mistral_query_engine, get_gpt_query_engine

if __name__ == "__main__":

    st.set_page_config(layout="wide")
    st.title('Research Assistant')
    st.divider()

    with st.sidebar:
        max_documents = st.number_input("Max number of documents:", value=10)

    topic = st.text_input('Research Topic:')

    with st.spinner('Thinking...'):

        # Download papers and their metadata for the requested topic
        if len(topic) > 0:
            get_arxiv(topic, max_documents)

        # Show everything collected so far, without duplicate titles
        if os.path.exists('documents.csv'):
            df = pd.read_csv("documents.csv")
            df = df.drop_duplicates(subset=['title'])
            st.dataframe(df)

        # Rebuild both indexes so the new documents are available for chat
        if topic:
            try:
                get_mistral_query_engine(True)
                get_gpt_query_engine(True)
            except Exception as err:
                # Report the failure instead of hiding it
                st.warning(f"Could not rebuild the index: {err}")

    

client.py:

import os

import qdrant_client
import streamlit as st

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import load_index_from_storage, ServiceContext
from llama_index.llms.ollama import Ollama
from llama_index.core.storage.storage_context import StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore

@st.cache_resource
def get_mistral_query_engine(data_changed):

    llm_model = Ollama(model="mistral")
    collection_name = "storage"

    # Keep one Qdrant client, vector store, and context per session
    if 'qdrant_client' not in st.session_state:
        st.session_state.qdrant_client = qdrant_client.QdrantClient(path="./qdrant_data")

    if 'vector_store' not in st.session_state:
        st.session_state.vector_store = QdrantVectorStore(
            client=st.session_state.qdrant_client, collection_name=collection_name)

    if 'service_context' not in st.session_state:
        # embed_model="local" uses a locally run embedding model instead of OpenAI
        st.session_state.service_context = ServiceContext.from_defaults(
            llm=llm_model, embed_model="local")

    if 'storage_context' not in st.session_state:
        st.session_state.storage_context = StorageContext.from_defaults(
            vector_store=st.session_state.vector_store)

    qdrant_persist_dir = "./qdrant_data/collection/storage"

    # Re-index only when nothing is stored yet or new documents have arrived
    if not os.path.exists(qdrant_persist_dir) or data_changed:
        documents = SimpleDirectoryReader("documents").load_data()
        index = VectorStoreIndex.from_documents(
            documents,
            service_context=st.session_state.service_context,
            storage_context=st.session_state.storage_context)
        index.storage_context.persist()
    else:
        index = VectorStoreIndex.from_vector_store(
            vector_store=st.session_state.vector_store,
            service_context=st.session_state.service_context)

    query_engine = index.as_query_engine(streaming=False)
    return query_engine

@st.cache_resource
def get_gpt_query_engine(data_changed):

    gpt_persist_dir = "./storage"

    if not os.path.exists(gpt_persist_dir) or data_changed:
        # Uses the LlamaIndex defaults (OpenAI LLM and embeddings)
        documents = SimpleDirectoryReader("documents").load_data()
        index = VectorStoreIndex.from_documents(documents)
        index.storage_context.persist()
    else:
        storage_context = StorageContext.from_defaults(persist_dir=gpt_persist_dir)
        index = load_index_from_storage(storage_context)

    query_engine = index.as_query_engine()
    return query_engine

There are two functions: one for GPT (get_gpt_query_engine) and one for Mistral/Qdrant (get_mistral_query_engine). The GPT function is the more straightforward of the two. If the data has changed (which it has when a user has gotten more documents with a new topic), SimpleDirectoryReader will read all of the documents in the "documents" folder, index them, and persist that vector database to the "storage" folder. If the data has not changed, it will not try to re-index all of the documents again, but will use the existing vector database. We only want to re-index when necessary - saving time and tokens.
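
The build-or-reload decision can be sketched on its own, without LlamaIndex. In this illustration the build, save, and load helpers are hypothetical stand-ins (a small dict saved as JSON) for the expensive indexing step and its persisted result; only the branching mirrors the functions above.

```python
import json
import os
import tempfile

def get_index(persist_dir, data_changed, build, load, save):
    """Rebuild only when nothing is persisted yet or the data changed; otherwise reload."""
    if not os.path.exists(persist_dir) or data_changed:
        index = build()
        os.makedirs(persist_dir, exist_ok=True)
        save(index, persist_dir)
        return index, "built"
    return load(persist_dir), "loaded"

def build():
    # Stand-in for the expensive step (reading PDFs and computing embeddings)
    return {"paper-1.pdf": [0.1, 0.2], "paper-2.pdf": [0.3, 0.4]}

def save(index, persist_dir):
    with open(os.path.join(persist_dir, "index.json"), "w") as f:
        json.dump(index, f)

def load(persist_dir):
    with open(os.path.join(persist_dir, "index.json")) as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as tmp:
    persist_dir = os.path.join(tmp, "storage")
    _, first = get_index(persist_dir, False, build, load, save)      # nothing stored yet
    index, second = get_index(persist_dir, False, build, load, save)  # reuse stored index
    _, third = get_index(persist_dir, True, build, load, save)        # data changed
    print(first, second, third)
```

Only the first and third calls pay the indexing cost; the second call just reads what was already persisted.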

For Mistral/Qdrant we follow the same pattern with a couple of exceptions. First, we will be saving the indexed data in a "qdrant_data" folder. Second, we will be using QdrantClient and QdrantVectorStore to store our vector embeddings. The session-state checks around the client are important because we only want one instance of QdrantClient per session - if we create more than one, we will get errors.
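
The one-client-per-session pattern can be sketched without Streamlit or Qdrant. Here a plain dict stands in for st.session_state, and FakeClient is a hypothetical class that refuses to open the same path twice, which is my simplified model of the error mentioned above.

```python
class FakeClient:
    """Hypothetical stand-in for a client that must only be opened once per path."""
    open_paths = set()

    def __init__(self, path):
        if path in FakeClient.open_paths:
            raise RuntimeError(f"{path} is already in use by another client")
        FakeClient.open_paths.add(path)
        self.path = path

def get_client(session_state, path="./qdrant_data"):
    """Create the client on first use and reuse it on every later rerun."""
    if "qdrant_client" not in session_state:
        session_state["qdrant_client"] = FakeClient(path)
    return session_state["qdrant_client"]

session_state = {}  # plain dict standing in for st.session_state
first = get_client(session_state)
second = get_client(session_state)  # simulated rerun: no second client is created
print(first is second)
```

Streamlit reruns the whole script on every interaction, so without the membership check each rerun would try to open the same storage folder again.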

Both functions return a "query_engine" that can be used the same way to chat with our vector database of documents.

In our "main" program that has the chat interface, called "Research Assistant.py," we can use "query_engine."

Research Assistant.py

import streamlit as st

from pages.utilities.client import get_mistral_query_engine, get_gpt_query_engine

if 'message_list' not in st.session_state:
  st.session_state.message_list = []    
      
if __name__ == "__main__":

    st.title('Research Assistant')
    st.divider()
    
    with st.sidebar:
      
      st.markdown('# Models')
      
      selected_model = st.selectbox('s', ['Mistral', 'GPT-4'], label_visibility='hidden')

    message = st.chat_message("assistant")
    message.write("Hello human!")
    
    prompt = st.chat_input("Ask a question")
    
    try:
      if selected_model == 'Mistral':
        query_engine = get_mistral_query_engine(False)
      else:
        query_engine = get_gpt_query_engine(False)
    except Exception:
        with st.chat_message("assistant"):
          st.write(str("Error loading model. Please try again."))
          st.stop()
        
    for l in st.session_state.message_list:
                
      if 'user' in l:
        with st.chat_message("user"):
          st.write(l['user'])
      if 'assistant' in l:
        with st.chat_message("assistant"):
          st.write(l['assistant'])
        
    if prompt:
            
      with st.spinner('Thinking...'):

        response = query_engine.query(prompt)
                
        with st.chat_message("user"):
          st.write(prompt)
        with st.chat_message("assistant"):
          st.write(str(response))

        a = {
          "user": prompt,
          "assistant": str(response)         
        }
            
        st.session_state.message_list.append(a)

We can call query_engine.query with the user prompt (the question the user wants to ask of the database), and it will use the chosen LLM and vector database to answer the question.
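
Stripped of the Streamlit widgets, the chat loop comes down to picking an engine, calling .query, and appending the exchange to a history list. The sketch below shows that flow with a hypothetical StubQueryEngine that only echoes the prompt; a real engine would come from the functions in "client.py."

```python
class StubQueryEngine:
    """Hypothetical stand-in exposing the same .query(prompt) call used above."""

    def __init__(self, name):
        self.name = name

    def query(self, prompt):
        return f"[{self.name}] answer to: {prompt}"

# One engine per entry in the model selector
ENGINES = {
    "Mistral": StubQueryEngine("mistral"),
    "GPT-4": StubQueryEngine("gpt-4"),
}

def ask(message_list, selected_model, prompt):
    """Query the selected engine and record the exchange in the history."""
    response = ENGINES[selected_model].query(prompt)
    message_list.append({"user": prompt, "assistant": str(response)})
    return message_list[-1]["assistant"]

history = []  # plays the role of st.session_state.message_list
ask(history, "Mistral", "What is Mamba?")
ask(history, "GPT-4", "What is QLoRA?")
for turn in history:
    print("user:", turn["user"])
    print("assistant:", turn["assistant"])
```

Because both engines expose the same method, the rest of the page does not need to know which model the user picked.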

Let's try it out!

In the data acquisition page I've asked it to get documents related to Mamba in AI and separately to get documents related to QLoRA - both are relatively new topics in AI.


As you can see, it used information from our database and not information it had been trained on. We know this because Mamba came out in December of 2023 - after the training cutoff for both models.

Let's ask it a very specific question that our database knows about:


This answer is not only correct, but it's about a paper that came out in January of 2024.

Even though it does everything we set out to do, there are several improvements that could be made to this application. For example, beyond the excellent ideas laid out in the LlamaIndex documentation, one obvious improvement would be to not re-index all of the documents each time a user enters a new topic and gets a new batch of documents, but instead index just the new documents and add those to the existing embeddings. Furthermore, you could store the metadata in a database rather than a CSV file. And instead of pulling documents from arXiv, you could have a drag and drop file dialog box that allows the user to add their own documents.
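
One way to approach the first improvement is to keep a manifest of file names that have already been indexed and only process the difference. The sketch below covers just that bookkeeping. The manifest file name and the helper names are hypothetical, and the actual embedding step is left out.

```python
import json
import os
import tempfile

def load_manifest(manifest_path):
    """Return the set of file names that were indexed on earlier runs."""
    if not os.path.exists(manifest_path):
        return set()
    with open(manifest_path) as f:
        return set(json.load(f))

def find_new_files(documents_dir, manifest_path):
    """Return PDFs in documents_dir that are not listed in the manifest yet."""
    current = {name for name in os.listdir(documents_dir) if name.lower().endswith(".pdf")}
    return sorted(current - load_manifest(manifest_path))

def mark_indexed(manifest_path, new_files):
    """Add newly indexed file names to the manifest."""
    indexed = load_manifest(manifest_path) | set(new_files)
    with open(manifest_path, "w") as f:
        json.dump(sorted(indexed), f)

with tempfile.TemporaryDirectory() as tmp:
    documents_dir = os.path.join(tmp, "documents")
    manifest_path = os.path.join(tmp, "indexed_files.json")
    os.makedirs(documents_dir)

    # First download: one new paper (empty placeholder file)
    open(os.path.join(documents_dir, "a.pdf"), "w").close()
    first_batch = find_new_files(documents_dir, manifest_path)
    mark_indexed(manifest_path, first_batch)

    # Second download: only the newly added paper should be reported
    open(os.path.join(documents_dir, "b.pdf"), "w").close()
    second_batch = find_new_files(documents_dir, manifest_path)
    print(first_batch, second_batch)
```

With LlamaIndex, the new files could then probably be loaded through SimpleDirectoryReader's input_files argument and added to the existing index with its insert method, though that wiring is untested against this application.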

All of the code for this application can be found at this GitHub repository. Because this is a Streamlit application that uses pages, the directory structure is important, and that structure can be seen in the repository. I also included in the repository a notes.md that lists the pip installs that need to be done.

And that's it!

We now have a Streamlit application that can retrieve documents on topics, build up a vector based embedding database, and allow us to chat with those topics in that database.
