Ask your Bitbucket: RAG with LlamaIndex and Bitbucket

Introduction

In this article, I present Ask Your Bitbucket, a proof-of-concept (POC) application that helps developers search for information in a repository. Have you heard of Ask Your PDF, Ask Your Notion, and similar tools? This is Ask Your Bitbucket.

I will walk you through the steps I took to build this tool, which combines OpenAI models with LlamaIndex to perform Retrieval-Augmented Generation (RAG).

What is LlamaIndex?

LLMs, or Large Language Models, serve as a natural language interface between users and data, leveraging pre-training on extensive publicly available datasets. However, these models are not trained on specific or private data, which is often stored in APIs, SQL databases, PDFs, or other formats. LlamaIndex addresses this by connecting to these sources and supplying the relevant user-specific data to the LLM at query time, a process known as RAG.

What is RAG?

Retrieval-Augmented Generation (RAG) refers to a hybrid approach in natural language processing (NLP) that combines elements of both retrieval-based models and generative models. This methodology aims to leverage the strengths of each approach to enhance the overall performance in generating human-like and contextually relevant responses.

RAG allows users to query, transform, and gain insights from their data using LLMs. This technology enables tasks such as asking data-related questions, creating chatbots, building semi-autonomous agents, and more.
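The retrieve-then-generate idea can be sketched in a few lines of plain Python. This is only a toy illustration, not how LlamaIndex works internally: it ranks text chunks by bag-of-words cosine similarity instead of learned embeddings, and the helper names (tokenize, retrieve, build_prompt) and the sample chunks are invented for this example. The resulting prompt is what would be sent to the LLM.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, chunks, top_k=1):
    """Retrieval step: return the top_k chunks most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(q, Counter(tokenize(chunk))),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, context_chunks):
    """Augmentation step: put the retrieved context in front of the question."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Hypothetical chunks standing in for indexed repository content.
chunks = [
    "SocketConfig enables a simple broker for /topic destinations.",
    "Dockerfile copies a jar and exposes port 8080.",
]
question = "Which destinations does the broker use?"
prompt = build_prompt(question, retrieve(question, chunks))
print(prompt)
```

In a real RAG pipeline the similarity is computed on embedding vectors and the prompt is passed to the model, but the two steps stay the same: retrieve relevant chunks, then generate an answer from them.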

What is BitbucketReader?

BitbucketReader is a data loader that I previously created to be used with LlamaIndex.

To retrieve the contents of every file in a repository, BitbucketReader calls the Bitbucket REST API, so you need a Bitbucket username and API key to use it. It then wraps the text of each file in a Document, the input format that LlamaIndex expects.

I modified the loader that I had previously submitted to LlamaHub. With the new functionality, you can choose which repository you wish to query. If you don’t specify one, the loader reads the contents of every repository it finds in the project.
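The selection rule described above can be sketched as a small pure function. The name select_slugs and its parameters are hypothetical and exist only for this illustration. In the reader itself, the list of project repositories comes from the Bitbucket API.

```python
from typing import List, Optional


def select_slugs(repository: Optional[str], project_slugs: List[str]) -> List[str]:
    """Pick the repository slugs to read.

    repository: slug chosen by the user, or None to read the whole project.
    project_slugs: slugs of all repositories found in the project.
    """
    if repository is not None:
        # A specific repository was requested: read only that one.
        return [repository]
    # No repository given: fall back to every repository in the project.
    return list(project_slugs)


print(select_slugs("ms-spring-messaging", ["ms-spring-messaging", "ms-gateway"]))
print(select_slugs(None, ["ms-spring-messaging", "ms-gateway"]))
```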

Review my PR to LlamaHub in the following link: https://github.com/run-llama/llama-hub/pull/781

"""bitbucket reader"""

from typing import List, Optional

import base64
import os
import requests
from llama_index.readers.base import BaseReader
from llama_index.readers.schema.base import Document


class BitbucketReader(BaseReader):
    """Bitbucket reader.

    Reads the content of files in Bitbucket repositories.

    """

    def __init__(
        self,
        base_url: Optional[str] = None,
        project_key: Optional[str] = None,
        branch: Optional[str] = "refs/heads/develop",
        repository: Optional[str] = None,
        extensions_to_skip: Optional[List[str]] = None,
    ) -> None:
        """Initialize with parameters."""
        if os.getenv("BITBUCKET_USERNAME") is None:
            raise ValueError("Could not find a Bitbucket username.")
        if os.getenv("BITBUCKET_API_KEY") is None:
            raise ValueError("Could not find a Bitbucket api key.")
        if base_url is None:
            raise ValueError("You must provide a base url for Bitbucket.")
        if project_key is None:
            raise ValueError("You must provide a project key for Bitbucket repository.")
        self.base_url = base_url
        self.project_key = project_key
        self.branch = branch
        # Avoid a shared mutable default: use a fresh list when none is given.
        self.extensions_to_skip = extensions_to_skip or []
        self.repository = repository

    def get_headers(self):
        username = os.getenv("BITBUCKET_USERNAME")
        api_token = os.getenv("BITBUCKET_API_KEY")
        auth = base64.b64encode(f"{username}:{api_token}".encode()).decode()
        return {"Authorization": f"Basic {auth}"}

    def get_slugs(self) -> List:
        """
        Get the slugs to read: the selected repository, or every repository in the project.
        """
        # A specific repository was requested: read only that one.
        if self.repository is not None:
            return [self.repository]

        slugs = []
        repos_url = (
            f"{self.base_url}/rest/api/latest/projects/{self.project_key}/repos/"
        )
        headers = self.get_headers()

        response = requests.get(repos_url, headers=headers)

        if response.status_code == 200:
            repositories = response.json()["values"]
            for repo in repositories:
                slugs.append(repo["slug"])
        return slugs

    def load_all_file_paths(self, slug, branch, directory_path="", paths=None):
        """
        Walk the repository recursively and collect the path of every file.
        """
        # Create the accumulator here rather than as a shared mutable default.
        if paths is None:
            paths = []
        content_url = f"{self.base_url}/rest/api/latest/projects/{self.project_key}/repos/{slug}/browse/{directory_path}"

        query_params = {
            "at": branch,
        }
        headers = self.get_headers()
        response = requests.get(content_url, headers=headers, params=query_params)
        response = response.json()
        if "errors" in response:
            raise ValueError(response["errors"])
        children = response["children"]
        for value in children["values"]:
            if value["type"] == "FILE":
                # Use .get in case the API omits the extension for files that have none.
                if value["path"].get("extension") not in self.extensions_to_skip:
                    paths.append(
                        {
                            "slug": slug,
                            "path": f'{directory_path}/{value["path"]["toString"]}',
                        }
                    )
            elif value["type"] == "DIRECTORY":
                # Recurse into the subdirectory, sharing the same accumulator.
                self.load_all_file_paths(
                    slug=slug,
                    branch=branch,
                    directory_path=f'{directory_path}/{value["path"]["toString"]}',
                    paths=paths,
                )
        return paths

    def load_text_by_paths(self, slug, file_path, branch) -> List:
        """
        Fetch the lines of a single file from the repository.
        """
        content_url = f"{self.base_url}/rest/api/latest/projects/{self.project_key}/repos/{slug}/browse{file_path}"

        query_params = {
            "at": branch,
        }
        headers = self.get_headers()
        response = requests.get(content_url, headers=headers, params=query_params)
        children = response.json()
        if "errors" in children:
            raise ValueError(children["errors"])
        if "lines" in children:
            return children["lines"]
        return []

    def load_text(self, paths) -> List:
        text_dict = []
        for path in paths:
            lines_list = self.load_text_by_paths(
                slug=path["slug"], file_path=path["path"], branch=self.branch
            )
            concatenated_string = ""

            for line_dict in lines_list:
                text = line_dict.get("text", "")
                concatenated_string = concatenated_string + " " + text

            text_dict.append(concatenated_string)
        return text_dict

    def load_data(self) -> List[Document]:
        """Return a list of Document made of each file in Bitbucket."""
        slugs = self.get_slugs()
        paths = []
        for slug in slugs:
            self.load_all_file_paths(
                slug=slug, branch=self.branch, directory_path="", paths=paths
            )
        texts = self.load_text(paths)
        return [Document(text=text) for text in texts]

POC development

Let’s start developing the POC application. I will use a Google Colab environment to build this tool. First, we must install a couple of dependencies: llama_index and openai.

				
!pip install openai llama_index

Now we need to define the OpenAI API key, the Bitbucket API key, and the Bitbucket username used to call the Bitbucket API.

				
import os
os.environ["OPENAI_API_KEY"] = 'sk-x'
os.environ['BITBUCKET_USERNAME'] = 'lejdiprifti'
os.environ['BITBUCKET_API_KEY'] = 'xxxxxxxxxxxxxx'
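Because the reader raises an error when a credential is missing, it can help to check the environment before creating the loader. The following is a minimal sketch. The helper missing_variables and the REQUIRED_VARIABLES list are names made up for this example, not part of LlamaIndex or the reader.

```python
import os

# The three variables used in this article.
REQUIRED_VARIABLES = ["OPENAI_API_KEY", "BITBUCKET_USERNAME", "BITBUCKET_API_KEY"]


def missing_variables(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


# Example with an explicit mapping instead of the real environment.
print(missing_variables({"OPENAI_API_KEY": "sk-x", "BITBUCKET_USERNAME": ""}))
```

Calling missing_variables() without arguments checks the real environment of the notebook.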

The next step is to create the loader. For this we use BitbucketReader, which reads from base_url, the URL of the Bitbucket server that you or your company uses. project_key is, as the name suggests, the key of the project. We also specify the branch to read from, the file extensions to skip, and the repository to query.

				
from llama_index import VectorStoreIndex

base_url = "https://server.company.org/bitbucket"
project_key = 'ASKYOURBITBUCKET'

loader = BitbucketReader(
    base_url=base_url, project_key=project_key, branch='refs/heads/develop',
    extensions_to_skip=['gitkeep'], repository='ms-spring-messaging',
)
documents = loader.load_data()

What we need to do now is create a service context, initialize an OpenAI-based Large Language Model (LLM), and set up a vector store index.

An instance of the OpenAI LLM is created with specific parameters:

temperature: A parameter that controls the randomness of the generated responses. Higher values make the responses more random.
model: Specifies the OpenAI model to be used, in this case “gpt-3.5-turbo”, a GPT-3.5 model.
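To see why a higher temperature makes responses more random, here is a small sketch of the standard softmax-with-temperature formula. It illustrates the general technique only and makes no claim about how OpenAI implements sampling internally. The function name and the example scores are assumptions made for this illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]
low = softmax_with_temperature(logits, 0.2)
high = softmax_with_temperature(logits, 2.0)
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

With the low temperature almost all probability goes to the highest-scoring token, while the high temperature spreads it more evenly, so sampling picks less likely tokens more often.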

A service context is established with default settings, incorporating the OpenAI LLM and setting the chunk size to 512. The service context includes configuration details for interaction with the LLM.

The index is initialized with a set of documents, which were defined above.

The query engine will be used to interact with and query the index for information.

				
from llama_index import ServiceContext
from llama_index.llms import OpenAI

# OpenAI accepts temperatures between 0 and 2; a low value keeps answers focused.
llm = OpenAI(temperature=0.5, model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm, chunk_size=512)
index = VectorStoreIndex.from_documents(documents=documents, service_context=service_context)
query_engine = index.as_query_engine()
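To give an intuition for the chunk size, here is a simplified sketch of splitting a document into chunks before indexing. It is not LlamaIndex's splitter: whitespace-separated words stand in for tokens, and the function split_into_chunks and its overlap parameter are assumptions made for this illustration.

```python
def split_into_chunks(text, chunk_size=512, overlap=0):
    """Split text into chunks of at most chunk_size words.

    Words stand in for tokens here; consecutive chunks share `overlap` words.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# A synthetic 1200-word document.
text = " ".join(f"w{i}" for i in range(1200))
chunks = split_into_chunks(text, chunk_size=512, overlap=20)
print(len(chunks), [len(c.split()) for c in chunks])
```

Smaller chunks give more precise retrieval but less context per chunk, so the chunk size is a trade-off worth experimenting with.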

By using the query_engine, we can query the documents in our index.

				
response = query_engine.query("How is the websocket configuration done?")
print(response.response)

Are you curious about the response? This is what Ask Your Bitbucket had to say.

The websocket configuration is done by creating a class called SocketConfig that implements the WebSocketMessageBrokerConfigurer interface. This class is annotated with @Configuration and @EnableWebSocketMessageBroker. Inside the SocketConfig class, the configureMessageBroker method is overridden to enable a simple broker for "/topic" destinations and set the application destination prefix to "/app". The registerStompEndpoints method is also overridden to register the "/websocket" endpoint with SockJS.

Note: You can freely use the material in this page, but remember to cite the author.

Lejdi Prifti
