Commit a409078
Parent(s): 498e5e4
feat: add more sources for data

Files changed:
- README.md +30 -7
- data_upload/data_upload_page.py +10 -34
- data_upload/input_sources_utils/image_util.py +46 -0
- data_upload/input_sources_utils/pdf_util.py +22 -0
- data_upload/input_sources_utils/text_util.py +24 -0
- data_upload/input_sources_utils/website_util.py +56 -0
- requirements.txt +3 -0
- vectordb.py +19 -3
- weights/adapter_model.pt +3 -0
README.md
CHANGED
@@ -21,21 +21,24 @@ short_description: 🧵 Multimodal RAG that "weaves" together text and images
 ![GitHub](https://img.shields.io/github/license/NotShrirang/LoomRAG)
 ![GitHub last commit](https://img.shields.io/github/last-commit/NotShrirang/LoomRAG)
 ![GitHub repo size](https://img.shields.io/github/repo-size/NotShrirang/LoomRAG)
-<a href="https://
+<a href="https://huggingface.co/spaces/NotShrirang/LoomRAG"><img src="https://img.shields.io/badge/Streamlit%20App-red?style=flat-rounded-square&logo=streamlit&labelColor=white"/></a>
 
-This project implements a Multimodal Retrieval-Augmented Generation (RAG) system, named **LoomRAG**, that leverages OpenAI's CLIP model for neural cross-modal retrieval and semantic search. The system allows users to input text queries and retrieve both text and image responses seamlessly through vector embeddings. It also supports uploading images and PDFs for enhanced interaction and intelligent retrieval capabilities through a Streamlit-based interface.
+This project implements a Multimodal Retrieval-Augmented Generation (RAG) system, named **LoomRAG**, that leverages OpenAI's CLIP model for neural cross-modal retrieval and semantic search. The system allows users to input text queries and retrieve both text and image responses seamlessly through vector embeddings. It features a comprehensive annotation interface for creating custom datasets and supports CLIP model fine-tuning with configurable parameters for domain-specific applications. The system also supports uploading images and PDFs for enhanced interaction and intelligent retrieval capabilities through a Streamlit-based interface.
 
 Experience the project in action:
 
-[![LoomRAG Streamlit App](https://img.shields.io/badge/Streamlit%20App-red?style=for-the-badge&logo=streamlit&labelColor=white)](https://
+[![LoomRAG Streamlit App](https://img.shields.io/badge/Streamlit%20App-red?style=for-the-badge&logo=streamlit&labelColor=white)](https://huggingface.co/spaces/NotShrirang/LoomRAG)
 
 ---
 
 ## 📸 Implementation Screenshots
 
-| ![Screenshot
+| ![Screenshot 2025-01-01 184852](https://github.com/user-attachments/assets/ad79d0f0-d200-4a82-8c2f-0890a9fe8189) | ![Screenshot 2025-01-01 222334](https://github.com/user-attachments/assets/7307857d-a41f-4f60-8808-00d6db6e8e3e) |
 | ---------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
-|
+| Data Upload Page | Data Search / Retrieval |
+| | |
+| ![Screenshot 2025-01-01 222412](https://github.com/user-attachments/assets/e38273f4-426b-444d-80f0-501fa9563779) | ![Screenshot 2025-01-01 223948](https://github.com/user-attachments/assets/21724a92-ef79-44ae-83e6-25f8de29c45a) |
+| Data Annotation Page | CLIP Fine-Tuning |
 
 ---
 
@@ -46,6 +49,9 @@ Experience the project in action:
 - 📤 **Upload Options**: Allows users to upload images and PDFs for AI-powered processing and retrieval
 - 🧠 **Embedding-Based Search**: Uses OpenAI's CLIP model to align text and image embeddings in a shared latent space
 - 📝 **Augmented Text Generation**: Enhances text results using LLMs for contextually rich outputs
+- 🏷️ **Image Annotation**: Enables users to annotate uploaded images through an intuitive interface
+- 🎯 **CLIP Fine-Tuning**: Supports custom model training with configurable parameters including test dataset split size, learning rate, optimizer, and weight decay
+- 🎨 **Fine-Tuned Model Integration**: Seamlessly load and utilize fine-tuned CLIP models for enhanced search and retrieval
 
 ---
 
@@ -63,10 +69,23 @@ Experience the project in action:
    - The system performs a nearest neighbor search in the vector database to retrieve relevant text and images
 
 3. **Response Generation**:
+
    - For text results: Optionally refined or augmented using a language model
    - For image results: Directly returned or enhanced with image captions
    - For PDFs: Extracts text content and provides relevant sections
 
+4. **Image Annotation**:
+
+   - Dedicated annotation page for managing uploaded images
+   - Support for creating and managing multiple datasets simultaneously
+   - Flexible annotation workflow for efficient data labeling
+   - Dataset organization and management capabilities
+
+5. **Model Fine-Tuning**:
+   - Custom CLIP model training on annotated images
+   - Configurable training parameters for optimization
+   - Integration of fine-tuned models into the search pipeline
+
 ---
 
 ## 🚀 Installation
@@ -98,6 +117,9 @@ Experience the project in action:
 - Access the interface in your browser to:
   - Submit natural language queries
  - Upload images or PDFs to retrieve contextually relevant results
+  - Annotate uploaded images
+  - Fine-tune CLIP models with custom parameters
+  - Use fine-tuned models for improved search results
 
 2. **Example Queries**:
    - **Text Query**: "sunset over mountains"
@@ -112,12 +134,13 @@ Experience the project in action:
 - 🔍 **Vector Database**: It uses FAISS for efficient similarity search
 - 🤖 **Model**: Uses OpenAI CLIP for neural embedding generation
 - ✍️ **Augmentation**: Optional LLM-based augmentation for text responses
+- 🎛️ Fine-Tuning: Configurable parameters for model training and optimization
 
 ---
 
 ## 🗺️ Roadmap
 
-- [
+- [x] Fine-tuning CLIP for domain-specific datasets
 - [ ] Adding support for audio and video modalities
 - [ ] Improving the re-ranking system for better contextual relevance
 - [ ] Enhanced PDF parsing with semantic section segmentation
@@ -132,7 +155,7 @@ Contributions are welcome! Please open an issue or submit a pull request for any
 
 ## 📄 License
 
-This project is licensed under the
+This project is licensed under the Apache-2.0 License. See the [LICENSE](LICENSE) file for details.
 
 ---
 
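The README's "Embedding-Based Search" and nearest-neighbor retrieval steps boil down to: encode the query with CLIP, then search a FAISS index of stored embeddings. A minimal sketch of that flow (the index filename, CLIP variant, and helper name are assumptions, not code from this commit):

```python
# Sketch only: text-to-image retrieval with CLIP embeddings and a FAISS index.
# "image_index.index" and "ViT-B/32" are assumed names, not taken from this commit.
import clip
import faiss
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
index = faiss.read_index("image_index.index")

def search_images(query: str, k: int = 3):
    with torch.no_grad():
        tokens = clip.tokenize([query]).to(device)
        text_features = clip_model.encode_text(tokens)
    # FAISS expects float32; results are row positions of the k nearest stored embeddings.
    query_vec = text_features.cpu().numpy().astype("float32")
    distances, indices = index.search(query_vec, k)
    return indices[0]

print(search_images("sunset over mountains"))
```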
data_upload/data_upload_page.py
CHANGED
@@ -1,44 +1,20 @@
 import os
 import streamlit as st
 import sys
-
+
+from data_upload.input_sources_utils import image_util, pdf_util, website_util
 
 sys.path.append(os.path.dirname(os.path.abspath(__file__)))
 
 
 def data_upload(clip_model, preprocess, text_embedding_model):
     st.title("Data Upload")
-    upload_choice = st.selectbox(options=["Upload Image", "Upload PDF"], label="Select Upload Type")
+    upload_choice = st.selectbox(options=["Upload Image", "Add Image from URL / Link", "Upload PDF", "Website Link"], label="Select Upload Type")
     if upload_choice == "Upload Image":
-        st.subheader("Add Image to Database")
-        images = st.file_uploader("Upload Image", type=["jpg", "jpeg", "png"], accept_multiple_files=True)
-        if images:
-            cols = st.columns(5, vertical_alignment="center")
-            for count, image in enumerate(images[:4]):
-                with cols[count]:
-                    st.image(image)
-            with cols[4]:
-                if len(images) > 5:
-                    st.info(f"and more {len(images) - 5} images...")
-            st.info(f"Total {len(images)} files selected.")
-            if st.button("Add Images"):
-                progress_bar = st.progress(0)
-                for image in images:
-                    add_image_to_index(image, clip_model, preprocess)
-                    progress_bar.progress((images.index(image) + 1) / len(images), f"{images.index(image) + 1}/{len(images)}")
-                st.success("Images Added to Database")
-    else:
-        st.subheader("Add PDF to Database")
-        st.warning("Please note that the images in the PDF will also be extracted and added to the database.")
-        pdfs = st.file_uploader("Upload PDF", type=["pdf"], accept_multiple_files=True)
-        if pdfs:
-            st.info(f"Total {len(pdfs)} files selected.")
-            if st.button("Add PDF"):
-                for pdf in pdfs:
-                    add_pdf_to_index(
-                        pdf=pdf,
-                        clip_model=clip_model,
-                        preprocess=preprocess,
-                        text_embedding_model=text_embedding_model,
-                    )
-                st.success("PDF Added to Database")
+        image_util.upload_image(clip_model, preprocess)
+    elif upload_choice == "Add Image from URL / Link":
+        image_util.image_from_url(clip_model, preprocess)
+    elif upload_choice == "Upload PDF":
+        pdf_util.upload_pdf(clip_model, preprocess, text_embedding_model)
+    elif upload_choice == "Website Link":
+        website_util.data_from_website(clip_model, preprocess, text_embedding_model)
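With this change, `data_upload` only dispatches to the per-source utility modules. For orientation, a hypothetical caller (the CLIP variant and text encoder below are assumptions, not what the app necessarily loads) could wire the page up like this:

```python
# Hypothetical entry point; model names are assumptions.
import clip
import torch
from sentence_transformers import SentenceTransformer

from data_upload.data_upload_page import data_upload

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)    # assumed CLIP variant
text_embedding_model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed text encoder

data_upload(clip_model, preprocess, text_embedding_model)
```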
data_upload/input_sources_utils/image_util.py
ADDED
@@ -0,0 +1,46 @@
+import os
+import requests
+import streamlit as st
+import sys
+from vectordb import add_image_to_index, add_pdf_to_index
+
+sys.path.append(os.path.dirname(os.path.abspath(__file__)))
+
+def image_from_url(clip_model, preprocess):
+    st.title("Image from URL")
+    url = st.text_input("Enter Image URL")
+    correct_url = False
+    if url:
+        try:
+            st.image(url)
+            correct_url = True
+        except:
+            st.error("Invalid URL")
+            correct_url = False
+    if correct_url:
+        if st.button("Add Image"):
+            response = requests.get(url)
+            if response.status_code == 200:
+                add_image_to_index(response.content, clip_model, preprocess)
+                st.success("Image Added to Database")
+            else:
+                st.error("Invalid URL")
+
+def upload_image(clip_model, preprocess):
+    st.subheader("Add Image to Database")
+    images = st.file_uploader("Upload Image", type=["jpg", "jpeg", "png"], accept_multiple_files=True)
+    if images:
+        cols = st.columns(5, vertical_alignment="center")
+        for count, image in enumerate(images[:4]):
+            with cols[count]:
+                st.image(image)
+        with cols[4]:
+            if len(images) > 5:
+                st.info(f"and more {len(images) - 5} images...")
+        st.info(f"Total {len(images)} files selected.")
+        if st.button("Add Images"):
+            progress_bar = st.progress(0)
+            for image in images:
+                add_image_to_index(image, clip_model, preprocess)
+                progress_bar.progress((images.index(image) + 1) / len(images), f"{images.index(image) + 1}/{len(images)}")
+            st.success("Images Added to Database")
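`image_from_url` above treats any exception from `st.image(url)` as an invalid link and then downloads the URL again on submit. A stricter variant (a sketch, not part of this commit; the helper name is made up) would fetch the bytes once and verify them with PIL before indexing:

```python
# Sketch of a stricter URL-to-image path; add_image_from_url_checked is a made-up helper.
import io

import requests
from PIL import Image

from vectordb import add_image_to_index

def add_image_from_url_checked(url: str, clip_model, preprocess, timeout: float = 10.0):
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()                        # fail loudly on HTTP errors
    Image.open(io.BytesIO(response.content)).verify()  # raises if the payload is not an image
    add_image_to_index(response.content, clip_model, preprocess)
```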
data_upload/input_sources_utils/pdf_util.py
ADDED
@@ -0,0 +1,22 @@
+import os
+import streamlit as st
+import sys
+from vectordb import add_image_to_index, add_pdf_to_index
+
+sys.path.append(os.path.dirname(os.path.abspath(__file__)))
+
+def upload_pdf(clip_model, preprocess, text_embedding_model):
+    st.subheader("Add PDF to Database")
+    st.warning("Please note that the images in the PDF will also be extracted and added to the database.")
+    pdfs = st.file_uploader("Upload PDF", type=["pdf"], accept_multiple_files=True)
+    if pdfs:
+        st.info(f"Total {len(pdfs)} files selected.")
+        if st.button("Add PDF"):
+            for pdf in pdfs:
+                add_pdf_to_index(
+                    pdf=pdf,
+                    clip_model=clip_model,
+                    preprocess=preprocess,
+                    text_embedding_model=text_embedding_model,
+                )
+            st.success("PDF Added to Database")
data_upload/input_sources_utils/text_util.py
ADDED
@@ -0,0 +1,24 @@
+import bs4
+import os
+from langchain_text_splitters import CharacterTextSplitter
+import requests
+import streamlit as st
+import sys
+from vectordb import add_image_to_index, add_pdf_to_index, update_vectordb
+
+sys.path.append(os.path.dirname(os.path.abspath(__file__)))
+
+
+def process_text(text: str, text_embedding_model):
+    text_splitter = CharacterTextSplitter(
+        separator="\n",
+        chunk_size=1200,
+        chunk_overlap=200,
+        length_function=len,
+        is_separator_regex=False,
+    )
+    chunks = text_splitter.split_text(text)
+    text_embeddings = text_embedding_model.encode(chunks)
+    for chunk, embedding in zip(chunks, text_embeddings):
+        index = update_vectordb(index_path="text_index.index", embedding=embedding, text_content=chunk)
+    return index
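`process_text` splits the input on newlines into chunks of up to 1200 characters with 200 characters of overlap, embeds each chunk, and appends it to the text index via `update_vectordb`. A standalone usage sketch (the SentenceTransformer model name is an assumption):

```python
# Illustrative call; "all-MiniLM-L6-v2" is an assumed embedding model.
from sentence_transformers import SentenceTransformer

from data_upload.input_sources_utils.text_util import process_text

text_embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
long_text = "\n".join(f"Paragraph {i}: " + "lorem ipsum " * 40 for i in range(20))

# Each ~1200-character chunk is embedded and appended to text_index.index.
index = process_text(long_text, text_embedding_model)
```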
data_upload/input_sources_utils/website_util.py
ADDED
@@ -0,0 +1,56 @@
+import bs4
+import os
+import requests
+import streamlit as st
+import sys
+from vectordb import add_image_to_index, add_pdf_to_index
+from data_upload.input_sources_utils import text_util
+
+sys.path.append(os.path.dirname(os.path.abspath(__file__)))
+
+
+def data_from_website(clip_model, preprocess, text_embedding_model):
+    st.title("Data from Website")
+    website_url = st.text_input("Enter Website URL")
+    if website_url:
+        st.write(f"URL: {website_url}")
+        if st.button("Extract and Add Data"):
+            response = requests.get(website_url)
+            if response.status_code == 200:
+                st.success("Data Extracted Successfully")
+            else:
+                st.error("Invalid URL")
+
+            soup = bs4.BeautifulSoup(response.content, features="lxml")
+            images = soup.find_all("img")
+            image_dict = []
+            if not images:
+                st.info("No Images Found!")
+            else:
+                st.info(f"Found {len(images)} Images")
+            progress_bar = st.progress(0, f"Extracting Images... | 0/{len(images)}")
+            cols = st.columns(5)
+            for count, image in enumerate(images):
+                try:
+                    image_url = image["src"].replace("//", "https://")
+                    response = requests.get(image_url)
+                    if response.status_code == 200:
+                        image_dict.append({"src": image_url, "content": response.content})
+                        add_image_to_index(response.content, clip_model, preprocess)
+                        len_image_dict = len(image_dict)
+                        if len_image_dict <= 4:
+                            with cols[len_image_dict - 1]:
+                                st.image(image_url, caption=image_url, use_container_width=True)
+                        elif len_image_dict == 5:
+                            with cols[4]:
+                                st.info(f"and more {len(images) - 4} images...")
+                except:
+                    pass
+                progress_bar.progress((count + 1) / len(images), f"Extracting Images... | {count + 1}/{len(images)}")
+            progress_bar.empty()
+
+            main_content = soup.find('main')
+            sample_text = main_content.text.strip().replace(r'\n', '')
+            with st.spinner("Processing Text..."):
+                text_util.process_text(main_content.text, text_embedding_model)
+            st.success("Data Added to Database")
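One caveat in `data_from_website`: `image["src"].replace("//", "https://")` replaces every `//`, so already-absolute URLs get mangled and relative paths stay unresolved. A common alternative (not what this commit does) is `urllib.parse.urljoin`, sketched below:

```python
# Sketch: resolve relative, protocol-relative, and absolute image sources against the page URL.
from urllib.parse import urljoin

def resolve_image_url(page_url: str, src: str) -> str:
    return urljoin(page_url, src)

print(resolve_image_url("https://example.com/blog/post", "//cdn.example.com/a.png"))
# https://cdn.example.com/a.png
print(resolve_image_url("https://example.com/blog/post", "/static/b.jpg"))
# https://example.com/static/b.jpg
```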
requirements.txt
CHANGED
@@ -6,6 +6,7 @@ annotated-types==0.7.0
 anyio==4.7.0
 async-timeout==4.0.3
 attrs==24.3.0
+beautifulsoup4==4.12.3
 blinker==1.9.0
 cachetools==5.5.0
 certifi==2024.12.14
@@ -47,6 +48,7 @@ langchain-core==0.3.28
 langchain-experimental==0.3.4
 langchain-text-splitters==0.3.4
 langsmith==0.1.147
+lxml==5.1.0
 markdown-it-py==3.0.0
 MarkupSafe==3.0.2
 marshmallow==3.23.2
@@ -91,6 +93,7 @@ sentence-transformers==3.3.1
 six==1.17.0
 smmap==5.0.1
 sniffio==1.3.1
+soupsieve==2.6
 SQLAlchemy==2.0.36
 streamlit==1.41.1
 streamlit-option-menu==0.4.0
vectordb.py
CHANGED
@@ -57,7 +57,10 @@ def update_vectordb(index_path: str, embedding: torch.Tensor, image_path: str =
 
 
 def add_image_to_index(image, model: clip.model.CLIP, preprocess):
-
+    if hasattr(image, "name"):
+        image_name = image.name
+    else:
+        image_name = f"{time.time()}.png"
     image_name = image_name.replace(" ", "_")
     os.makedirs("./images", exist_ok=True)
     os.makedirs("./vectorstore", exist_ok=True)
@@ -65,7 +68,10 @@ def add_image_to_index(image, model: clip.model.CLIP, preprocess):
     try:
         f.write(image.read())
     except:
-
+        if hasattr(image, "data"):
+            image = io.BytesIO(image.data)
+        else:
+            image = io.BytesIO(image)
         f.write(image.read())
     image = Image.open(f"./images/{image_name}")
     with torch.no_grad():
@@ -106,7 +112,7 @@ def add_pdf_to_index(pdf, clip_model: clip.model.CLIP, preprocess, text_embeddin
         pdf_texts.append(page_text)
         if page_text != "" or page_text.strip() != "":
             chunks = text_splitter.split_text(page_text)
-            text_embeddings
+            text_embeddings = text_embedding_model.encode(chunks)
             for i, chunk in enumerate(chunks):
                 update_vectordb(index_path="text_index.index", embedding=text_embeddings[i], text_content=chunk)
                 pdf_pages_data.append({f"page_number": page_num, "content": chunk, "type": "text"})
@@ -114,6 +120,16 @@
         progress_bar.progress(percent_complete, f"Processing Page {page_num + 1}/{len(pdf_reader.pages)}")
     return pdf_pages_data
 
+def search_image_index_with_image(image_features, index: faiss.IndexFlatL2, clip_model: clip.model.CLIP, k: int = 3):
+    with torch.no_grad():
+        distances, indices = index.search(image_features.cpu().numpy(), k)
+    return indices
+
+
+def search_text_index_with_image(text_embeddings, index: faiss.IndexFlatL2, text_embedding_model: SentenceTransformer, k: int = 3):
+    distances, indices = index.search(text_embeddings, k)
+    return indices
+
 
 def search_image_index(text_input: str, index: faiss.IndexFlatL2, clip_model: clip.model.CLIP, k: int = 3):
     with torch.no_grad():
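The new `search_image_index_with_image` and `search_text_index_with_image` helpers take precomputed embeddings rather than raw queries. A usage sketch for the image-to-image path (the CLIP variant and index filename are assumptions): preprocess the query image, encode it with CLIP, and pass the features through.

```python
# Sketch: image-to-image search via the new helper; index file and CLIP variant are assumed.
import clip
import faiss
import torch
from PIL import Image

from vectordb import search_image_index_with_image

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
index = faiss.read_index("image_index.index")

query = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = clip_model.encode_image(query)

print(search_image_index_with_image(image_features, index, clip_model, k=3))
```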
weights/adapter_model.pt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:64a44152945986519fcaae3d8aa16d0000c4e2b7743992c5e5d35136c89e3dc1
+size 7876804