title: Chatbot-with-MaritacaAI-for-PDFs
emoji: π
colorFrom: indigo
colorTo: blue
sdk: streamlit
sdk_version: 1.28.0
app_file: app.py
pinned: false
Chatbot with MaritacaAI for PDFs
This project implements a conversational Retrieval-Augmented Generation (RAG) system using Streamlit, LangChain, and large language models from MaritacaAI, a Brazilian startup that specializes language models for specific domains and languages, in this case Brazilian Portuguese. The application lets users upload PDF documents, ask questions about their content, and keep a chat history that provides context for ongoing conversations.
Author
Reinaldo Chaves
Features
- Streamlit user interface with dark theme and responsive layout
- Upload and processing of multiple PDF files
- Document processing using LangChain and FAISS
- Answer generation using MaritacaAI's sabia-3 model specialized in Brazilian Portuguese
- Text embeddings using Hugging Face's all-MiniLM-L6-v2 model
- Persistent chat history to maintain conversation context
- Sidebar with important user guidelines
- Token count per response
- Special formatting for legal documents and FOI (Freedom of Information) requests
Requirements
- Python 3.7+
- Streamlit
- LangChain
- FAISS
- PyPDF2
- maritalk (MaritacaAI Python client)
- HuggingFace Embeddings
- Other dependencies listed in requirements.txt
Installation
Clone this repository:
git clone https://github.com/reichaves/chatbotmaritacaai.git
cd chatbotmaritacaai
Install dependencies:
pip install streamlit langchain langchain_huggingface maritalk faiss-cpu tenacity cachetools
Configure the necessary API keys:
- Maritaca AI API key (https://plataforma.maritaca.ai/)
- Hugging Face API token (https://huggingface.co/docs/hub/security-tokens)
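If you prefer not to type the keys on every run, one option is to read them from environment variables. The following is a minimal sketch and not the app's own behavior (the app asks for the keys in its interface); the variable names `MARITACA_API_KEY` and `HF_TOKEN` and the helper `load_api_keys` are assumptions made for illustration.

```python
import os

# Hypothetical variable names; the project itself does not define them.
REQUIRED_KEYS = ("MARITACA_API_KEY", "HF_TOKEN")


def load_api_keys(env=None):
    """Return a dict of API keys, raising if any is missing or empty."""
    env = os.environ if env is None else env
    # Strip whitespace so a stray space or newline does not end up in a key.
    keys = {name: env.get(name, "").strip() for name in REQUIRED_KEYS}
    missing = [name for name, value in keys.items() if not value]
    if missing:
        raise RuntimeError("Missing API keys: " + ", ".join(missing))
    return keys
```

Calling `load_api_keys()` with no argument reads from the process environment; passing a plain dict makes the helper easy to test.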
Usage
Run the Streamlit application:
streamlit run app.py
Open your browser and access the local address shown in the terminal.
Enter your API keys when prompted.
Upload one or more PDF files.
Ask questions about the documents' content in the text input box.
How it Works
- Document Upload: Users upload PDF files, which are processed and split into smaller chunks.
- Embedding Creation: The text is converted into embeddings using Hugging Face's all-MiniLM-L6-v2 model.
- Vector Storage: Embeddings are stored in a FAISS database for efficient retrieval.
- Question Processing: User questions are contextualized based on chat history.
- Information Retrieval: The system retrieves the most relevant text chunks based on the question.
- Answer Generation: MaritacaAI's sabia-3 model generates an answer in Brazilian Portuguese based on the retrieved chunks and the question.
- History Maintenance: Chat history is maintained to provide context in ongoing conversations.
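The steps above can be sketched in plain Python. This is an illustration of the flow only, not the project's code: a bag-of-words counter stands in for the all-MiniLM-L6-v2 embeddings, a sorted cosine-similarity scan stands in for the FAISS search, and the history is simply prepended to the prompt rather than used to rewrite the question. All function names, the chunk size, and the overlap are assumptions.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def embed(text):
    """Toy embedding: word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question (stand-in for a vector search)."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks, history):
    """Assemble chat history, retrieved context, and the question into one prompt."""
    lines = [f"{role}: {message}" for role, message in history]
    lines.append("Context:")
    lines.extend(context_chunks)
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

In a real deployment the resulting prompt would be sent to the language model, and the question and answer would then be appended to `history` as `(role, message)` pairs so that the next turn can use them.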
Special Features
- Special formatting for legal document analysis
- Detailed processing of Freedom of Information (FOI) documents
- Cache system for better performance
- Robust error handling
- Adaptive interface that maintains conversation context
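As an illustration of the caching and error-handling items above, the sketch below shows one common pattern: retrying a failing call with exponential backoff and keeping recent answers in a small time-limited cache. It uses only the standard library and is not taken from the project (the install command lists tenacity and cachetools, which offer similar functionality); the names `retry` and `AnswerCache` and all default values are assumptions.

```python
import functools
import time


def retry(max_attempts=3, base_delay=0.5, exceptions=(Exception,)):
    """Retry the wrapped function, doubling the delay after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # give up and surface the last error
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


class AnswerCache:
    """Tiny in-memory cache with a time-to-live (in seconds) per entry."""

    def __init__(self, ttl=300.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be tested without waiting
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]  # drop the expired entry
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)
```

A typical use would be to check the cache for a question first, and only on a miss call the retry-wrapped model request and store its result.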
Important Notices
- Do not share documents containing sensitive or confidential information
- AI-generated responses may contain errors or inaccuracies
- Always verify information with original sources
- This project is for educational and demonstration purposes
- Use responsibly and in compliance with API usage policies
Contributions
Contributions are welcome! Please:
- Fork the project
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
License
Citation
If you use this project in your research or application, please cite:
@software{chatbot-maritacaai-pdfs,
author = {Reinaldo Chaves},
title = {Chatbot with MaritacaAI for PDFs},
year = {2024},
url = {https://github.com/reichaves/chatbotmaritacaai/}
}