### Deliverable 1: Describe the Default Chunking Strategy You Will Use

The default chunking strategy that I will use is based on the **`RecursiveCharacterTextSplitter`** method. This splitter divides text into manageable chunks while maintaining semantic coherence, ensuring that chunks do not break in the middle of thoughts or sentences. It allows for flexible and dynamic chunking based on the nature of the document.

#### Key Details of the Default Strategy:

- **Adaptive Chunk Sizes**: The splitter first attempts to split the text into large sections (e.g., paragraphs). If a chunk exceeds a certain length, it recursively breaks it down into smaller units (sentences), ensuring each chunk remains within the ideal size for embedding (e.g., 1,000 tokens).
- **Flexibility**: It works well for both structured and unstructured documents, making it suitable for a variety of AI-related documents like the *AI Bill of Rights* and *NIST RMF*.
- **Context Preservation**: Since it operates recursively, the splitter minimizes the risk of breaking meaningful content, preserving important relationships between concepts.
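As a minimal sketch, the default splitter might be configured along these lines with LangChain; the chunk size, overlap, and separator hierarchy are illustrative starting points, not tuned values:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Illustrative settings: sizes are in characters by default; a token-based
# variant is available via RecursiveCharacterTextSplitter.from_tiktoken_encoder.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # target upper bound per chunk
    chunk_overlap=100,  # overlap preserves context across chunk boundaries
    separators=["\n\n", "\n", ". ", " ", ""],  # paragraphs, then lines, sentences, words
)

sample = "Automated systems should be safe and effective.\n\n" * 50
chunks = splitter.split_text(sample)  # use split_documents(...) for loaded Document objects
```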
### Deliverable 2: Articulate a Chunking Strategy You Would Also Like to Test Out

In addition to the default strategy, I would like to test a **Section- and Topic-Based Chunking Strategy** combined with **`SemanticChunker`**. This strategy splits the documents based on predefined sections or topics, allowing the chunking process to align more closely with the structure and meaning of the document.

#### Key Details of the Alternative Strategy:

- **Section-Based Chunking**: This strategy first divides the document into sections or sub-sections based on headers, topics, or principles (e.g., the five principles in the *AI Bill of Rights* or the different phases in the *NIST RMF*). This ensures that each chunk retains a logical structure.
- **`SemanticChunker` Integration**: The `SemanticChunker` further refines chunking by considering the content's coherence, creating semantically meaningful segments rather than simply splitting based on length (see the sketch below). This works particularly well for documents like the *AI Bill of Rights*, where each principle is discussed with examples and cases.
- **Adaptability**: The strategy can be adapted to each specific document, improving retrieval for highly structured documents while maintaining the flexibility to handle less-structured ones.
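A sketch of the `SemanticChunker` half of this strategy, assuming each section has already been isolated by the section-based pass; `section_text` is a hypothetical variable holding one such section, and the embedding model shown is the one selected later in this write-up:

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings

# SemanticChunker embeds neighboring sentences and splits where the embedding
# distance between them spikes, keeping each chunk topically coherent.
chunker = SemanticChunker(
    HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l"),
    breakpoint_threshold_type="percentile",  # also: "standard_deviation", "interquartile"
)

# `section_text` (hypothetical) holds one section from the section-based pass.
semantic_chunks = chunker.split_text(section_text)
```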
### Deliverable 3: Describe How and Why You Made These Decisions

#### 1. **Default Chunking Strategy**:

- **Rationale**: The decision to use `RecursiveCharacterTextSplitter` as the default is driven by its versatility and efficiency. It balances chunk size and coherence without relying on predefined structures, which makes it robust across various document types, both structured (like the *AI Bill of Rights*) and unstructured (user-uploaded PDFs). It is particularly useful for retrieval systems where chunk size affects the performance of embedding models.
- **Why It Works**: This strategy handles document diversity well and ensures that chunks remain contextually rich, which is crucial for accurate retrieval in a conversational AI system.

#### 2. **Alternative Section-Based Chunking**:

- **Rationale**: The section-based chunking strategy is more targeted toward highly structured documents. For documents like the *NIST AI RMF*, which have clear sections and subsections, breaking the text down by these categories ensures that the system can retrieve contextually related chunks for more precise answers.
- **Why It's Worth Testing**: This strategy enhances retrieval relevance by aligning chunks with specific sections and principles, making it easier to answer detailed or multi-part questions. In combination with the `SemanticChunker`, it preserves meaning across larger contexts.

#### 3. **Combining Performance and Coherence**:

- **Decisions**: I made these decisions to ensure that both performance and coherence are maximized. The default method is fast, flexible, and works well across a variety of documents, while the section-based strategy is designed to improve the quality of responses for documents with clearly defined structures.
- **Efficiency Consideration**: By choosing a performant embedding model and efficient chunking strategies, I aimed to balance speed and relevance in the retrieval process, ensuring that the system remains scalable and responsive.

### Summary:

- **Default Strategy**: `RecursiveCharacterTextSplitter` for its adaptability across document types.
- **Test Strategy**: Section-based chunking with `SemanticChunker` for enhancing the accuracy of retrieval from structured documents.
- **Decision Rationale**: Both strategies were chosen to balance flexibility, coherence, and performance, ensuring that the system can effectively handle diverse document structures and retrieval needs.

# Problem Statement

People are concerned about the implications of AI, and no one seems to understand the right way to think about building ethical and useful AI applications for enterprises.

# Understanding the Data

## Blueprint for an AI Bill of Rights

The "Blueprint for an AI Bill of Rights," published by the White House Office of Science and Technology Policy in October 2022, outlines a framework to ensure that automated systems, including those powered by AI, respect civil rights, privacy, and democratic values. The document is structured around five core principles:

1. **Safe and Effective Systems**: Automated systems should be designed with input from diverse communities and experts, undergo rigorous pre-deployment testing, and be monitored to ensure safety and effectiveness. This includes protecting users from foreseeable harm and ensuring that systems are not based on inappropriate or irrelevant data.
2. **Algorithmic Discrimination Protections**: Automated systems must be designed and used in ways that prevent discrimination based on race, gender, religion, and other legally protected categories. This principle includes proactive testing and continuous monitoring to prevent algorithmic bias and discrimination.
3. **Data Privacy**: Individuals should have control over how their data is collected and used, with automated systems adhering to privacy safeguards by default. This principle emphasizes informed consent, minimizing unnecessary data collection, and prohibiting the misuse of sensitive data, such as in areas of health or finance.
4. **Notice and Explanation**: People should be aware when automated systems are affecting their rights, opportunities, or access to services, and should be provided with understandable explanations of how these systems operate and influence outcomes.
5. **Human Alternatives, Consideration, and Fallback**: Users should have the ability to opt out of automated systems in favor of human alternatives where appropriate. There should be mechanisms for people to contest and resolve issues arising from decisions made by automated systems, especially in high-stakes areas like healthcare, education, and criminal justice.

The framework aims to protect the public from harmful outcomes of AI while allowing for innovation, recommending transparency, accountability, and fairness across sectors that deploy automated systems. However, the Blueprint is non-binding, meaning it does not constitute enforceable U.S. government policy but instead serves as a guide for best practices.
## NIST AI Risk Management Framework

The document titled **NIST AI 600-1** outlines the **Artificial Intelligence Risk Management Framework (AI RMF)**, with a specific focus on managing risks related to **Generative Artificial Intelligence (GAI)**. Published by the **National Institute of Standards and Technology (NIST)** in July 2024, this framework provides a profile for organizations to manage the risks associated with GAI, consistent with President Biden's Executive Order (EO) 14110 on "Safe, Secure, and Trustworthy AI."

### Key aspects of the document include:

1. **AI Risk Management Framework (AI RMF)**: This framework offers organizations a voluntary guideline for integrating trustworthiness into AI systems. It addresses the unique risks associated with GAI, such as confabulation (AI hallucinations), bias, privacy, security, and misuse for malicious activities.
2. **Suggested Risk Management Actions**: The document provides detailed actions across various phases of AI development and deployment, such as governance, testing, monitoring, and decommissioning, to mitigate risks from GAI.
3. **Generative AI-Specific Risks**: The document discusses risks unique to GAI, including:
   - **Data privacy risks** (e.g., personal data leakage, sensitive information memorization)
   - **Environmental impacts** (e.g., high energy consumption during model training)
   - **Harmful content generation** (e.g., violent or misleading content)
   - **Bias amplification and model homogenization**
   - **Security risks**, such as prompt injection and data poisoning
4. **Recommendations for Organizations**: It emphasizes proactive governance, transparency, human oversight, and tailored policies to manage AI risks throughout the entire lifecycle of AI systems.

This framework aims to ensure that organizations can deploy GAI systems in a responsible and secure manner while balancing innovation with potential societal impacts.

## Sample Questions from the Internet

Here is a consolidated set of real user questions regarding AI, ethics, privacy, and risk management, with source URLs:

1. **How can companies ensure AI does not violate data privacy laws?** Users are concerned about how AI handles personal data, especially with incidents like data spillovers where information leaks unintentionally across systems. Source: [Stanford HAI](https://hai.stanford.edu/news/privacy-ai-era-how-do-we-protect-our-personal-information), [Transcend](https://transcend.io/blog/ai-and-your-privacy-understanding-the-concerns).
2. **What steps can organizations take to minimize bias in AI models?** Concerns about fairness in AI applications, particularly in hiring, lending, and law enforcement. Source: [ISC2](https://www.isc2.org/Articles/AI-Ethics-Dilemmas-in-Cybersecurity), [JDSupra](https://www.jdsupra.com/legalnews/five-ethics-questions-to-ask-about-your-5303517/).
3. **How do we balance AI-driven cybersecurity with privacy?** Striking a balance between enhancing security and avoiding over-collection of personal data. Source: [ISC2](https://www.isc2.org/Articles/AI-Ethics-Dilemmas-in-Cybersecurity), [HBS Working Knowledge](https://hbswk.hbs.edu/item/navigating-consumer-data-privacy-in-an-ai-world).
4. **What are the legal consequences if an AI system makes an unethical decision?** Understanding liability and compliance when AI systems cause ethical or legal violations. Source: [JDSupra](https://www.jdsupra.com/legalnews/five-ethics-questions-to-ask-about-your-5303517/), [Transcend](https://transcend.io/blog/ai-and-your-privacy-understanding-the-concerns).
5. **How can organizations ensure transparency in AI decision-making?** Ensuring explainability and transparency, especially in high-stakes applications like healthcare and criminal justice. Source: [ISC2](https://www.isc2.org/Articles/AI-Ethics-Dilemmas-in-Cybersecurity), [HBS Working Knowledge](https://hbswk.hbs.edu/item/navigating-consumer-data-privacy-in-an-ai-world).
6. **How can we design AI systems to be ethics- and compliance-oriented from the start?** Building AI systems with ethical oversight and controls from the beginning. Source: [JDSupra](https://www.jdsupra.com/legalnews/five-ethics-questions-to-ask-about-your-5303517/).
7. **What are the security risks posed by AI systems?** Addressing the growing risks of security breaches and data leaks with AI technologies. Source: [Transcend](https://transcend.io/blog/ai-and-your-privacy-understanding-the-concerns), [ISC2](https://www.isc2.org/Articles/AI-Ethics-Dilemmas-in-Cybersecurity).
8. **How can AI's impact on job displacement be managed ethically?** Addressing ethical concerns around job displacement due to AI automation. Source: [ISC2](https://www.isc2.org/Articles/AI-Ethics-Dilemmas-in-Cybersecurity).
9. **What measures should be in place to ensure AI systems are transparent and explainable?** Ensuring that AI decisions are explainable, particularly in critical areas like healthcare and finance. Source: [ISC2](https://www.isc2.org/Articles/AI-Ethics-Dilemmas-in-Cybersecurity), [HBS Working Knowledge](https://hbswk.hbs.edu/item/navigating-consumer-data-privacy-in-an-ai-world).
10. **How do companies comply with different AI regulations across regions like the EU and US?** Navigating the differences between GDPR in Europe and US privacy laws. Source: [Transcend](https://transcend.io/blog/ai-and-your-privacy-understanding-the-concerns), [HBS Working Knowledge](https://hbswk.hbs.edu/item/navigating-consumer-data-privacy-in-an-ai-world).

These links provide direct access to discussions about AI ethics, privacy, and risk management.

Additional compliance-oriented questions:

- Do the organization's personnel and partners receive AI risk management training to enable them to perform their duties and responsibilities consistent with related policies, procedures, and agreements?
- Will customer data be used to train artificial intelligence, machine learning, automation, or deep learning?
- Does the organization have an AI Development and Management Policy?
- Does the organization have policies and procedures in place to define and differentiate roles and responsibilities for human-AI configurations and oversight of AI systems?
- Who is the third-party AI technology behind your product/service?
- Has the third-party AI processor been appropriately vetted for risk? If so, what certifications have they obtained?
- Does the organization implement post-deployment AI system monitoring, including mechanisms for capturing and evaluating user input and other relevant AI actors, appeal and override, decommissioning, incident response, recovery, and change management?
- Does the organization communicate incidents and errors to relevant AI actors and affected communities and follow documented processes for tracking, responding to, and recovering from incidents and errors?
- Does your company engage with generative AI/AGI tools internally or throughout your company's product line?
- If generative AI/AGI is incorporated into the product, please describe any governance policies or procedures.
- Describe the controls in place to ensure our data is transmitted securely and is logically and/or physically segmented from those of other customers.

## Document Structure

- Both of these documents follow a structure that makes them easier to chunk, but the implementation of such a section/topic-based strategy is complex and time-consuming because it needs to adapt dynamically to whatever document is uploaded.
- We could chunk the PDF into sections, then sub-sections, then pages, then sentences/paragraphs. This breaks the document down nicely while preserving its structure (see the sketch below).
- There is a chance the user may upload a document that is not structured, which means the assumption that the document will always be in a structured format will not hold.
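A rough sketch of that section-first pass under stated assumptions: `HEADING_RE` is a placeholder pattern that would need tuning per document, and `full_text` stands for the concatenated text of a loaded PDF (assumed defined earlier):

```python
import re
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Placeholder pattern for numbered headings like "2.3 Data Privacy"; a real
# implementation would detect heading styles dynamically per document.
HEADING_RE = re.compile(r"^\d+(\.\d+)*\s+\S.*$", re.MULTILINE)

def split_by_sections(text: str) -> list[str]:
    """Cut the text at each detected heading, keeping any preamble as its own section."""
    starts = [0] + [m.start() for m in HEADING_RE.finditer(text)]
    bounds = starts + [len(text)]
    sections = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s]

# Inside each section, fall back to the recursive splitter so no chunk
# outgrows the embedding model's comfortable input size.
inner = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
# `full_text` (hypothetical): concatenated page text of one loaded PDF.
chunks = [c for section in split_by_sections(full_text) for c in inner.split_text(section)]
```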
# Dealing with the Data

Considering all of the above, a generic approach that covers all uses is the sensible choice. We will use the usual `PyMuPDFLoader` library to load the PDFs and chunk the documents with `RecursiveCharacterTextSplitter` to begin with.

I would also like to use `PyPDFium2Loader`, but it is very slow compared to `PyMuPDFLoader`: it took 2 minutes 30 seconds to load these two PDFs. If our use case requires populating the vector store beforehand, we could go with this loader. Comparing the quality of the output, there is not much difference between the two, so we will use `PyMuPDFLoader`.
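A minimal loading-and-splitting sketch reflecting these choices; the file names are placeholders for the two PDFs:

```python
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Placeholder file names for the two source documents.
docs = []
for path in ["ai_bill_of_rights.pdf", "nist_ai_600_1.pdf"]:
    docs.extend(PyMuPDFLoader(path).load())  # one Document per page, with metadata

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)
```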
To improve the quality of retrieval, we can group the documents by similar context, which provides better context retrieval.

Chunking strategies compared: `RecursiveCharacterTextSplitter` and `SemanticChunker`. Expected benefits of the context-aware approach:

- **Improved Coherence**: Chunks are more likely to contain complete thoughts or ideas.
- **Better Retrieval Relevance**: By preserving context, retrieval accuracy may be enhanced.
- **Adaptability**: The chunking method can be adjusted based on the nature of the documents and retrieval needs.
- **Potential for Better Understanding**: LLMs or downstream tasks may perform better with more coherent text segments.

Advanced retrieval techniques tried:

1. **Context enrichment**: Creates some duplicates; need to investigate why later.
2. **Contextual compression**: Creates better responses but takes time. Will need to check whether streaming helps. More for later.

Experimenting with the above chunking strategies, I found that `RecursiveCharacterTextSplitter` with contextual compression provides the better results.

# Choice of Embedding Model

The quality of generation is directly proportional to the quality of retrieval, and at the same time we wanted a smaller model that is performant. I chose the `snowflake-arctic-embed-l` embedding model because it is small (334 million parameters) and supports 1024-dimensional embeddings. It currently sits at rank 27 on the MTEB leaderboard, which suggests it competes efficiently with much larger models. The sketch below shows how it could be wired into the retrieval pipeline.
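A sketch of how the embedding model and contextual compression might be wired together, assuming the `chunks` produced by the loading step above; the vector store, collection name, and compressor LLM are illustrative choices rather than fixed decisions:

```python
from langchain_community.vectorstores import Qdrant
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Embed chunks with snowflake-arctic-embed-l (1024-dim) into an in-memory store.
embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
vectorstore = Qdrant.from_documents(
    chunks, embeddings, location=":memory:", collection_name="ai_policy_docs"
)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Contextual compression: an LLM extracts only the query-relevant passages from
# each retrieved chunk, trading latency for a tighter context window.
compressor = LLMChainExtractor.from_llm(ChatOpenAI(model="gpt-4o-mini"))
retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=base_retriever
)

results = retriever.invoke("How can companies ensure AI does not violate data privacy laws?")
```

The extra LLM call per query is the source of the latency noted above; a cheaper compressor (or streaming the final generation) is the obvious lever to pull if it becomes a bottleneck.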
# Consolidation

### 1. **Aligning Chunking Strategy with Context**
- **Current Strategy**: You mention using `RecursiveCharacterTextSplitter` and `SemanticChunker`, which is a good start.
- **Improvement**: Since both documents are well structured (the NIST RMF includes clear sections, and the AI Bill of Rights is principles-based), it would be beneficial to first chunk based on sections and subsections, combined with context-based chunking. Instead of committing to one chunking method, adapt based on the structure of each document.
- **Dynamic Chunking**: Also mention how the method would dynamically adapt to less-structured documents if uploaded in the future, ensuring scalability.

### 2. **Specific Chunking Examples**
- **Blueprint for AI Bill of Rights**:
  - Principles can form separate chunks (e.g., *Safe and Effective Systems*, *Algorithmic Discrimination Protections*).
  - Subsections can further break down the examples or cases cited under each principle.
- **NIST AI RMF**:
  - Since each section (such as "Suggested Actions to Manage Risks" or "GAI Risk Overview") has detailed subcategories, chunk them accordingly.
  - Include how you will preserve context when chunking specific actions.

### 3. **Incorporating Expected Questions**
- You have already listed good examples of user questions. However, to improve retrieval:
  - **Enhance Contextual Retrieval**: Tailor your vector store to group similar questions by topic, such as data privacy, bias prevention, and AI safety. This allows better retrieval of relevant chunks across both documents when users ask questions.
  - **Example**: A question about "data privacy" should retrieve answers both from the *Data Privacy* section of the AI Bill of Rights and the *Data Privacy Risks* section of the NIST RMF, creating a more comprehensive answer.

### 4. **Document Summarization in the Vector Store**
- If possible, create summaries of the sections and topics within both documents and store them in your vector database. Summaries enable quick lookup without requiring a deep scan through every chunk.

### 5. **Advanced Techniques**
- **Context Enrichment**: Mention that it needs further investigation but is a promising avenue. Focus on eliminating duplication by refining preprocessing or filtering steps when enriching.
- **Contextual Compression**: Explain how you might use this to generate concise answers that retain meaning, which could be useful for long or dense document sections.

### 6. **Handling Duplicate Content**
- Add a comment about how duplicate information across different sections can be handled by maintaining a cache or reference of repeated content to avoid redundancy across chunks.

### 7. **Performance and Efficiency**
- Since `PyPDFium2Loader` is slower, clarify that you will use it only if high-quality, OCR-accurate extraction is critical; `PyMuPDFLoader` remains the preferred option for efficiency and initial loading. This could be useful for streaming applications.

### Enhanced Structure for Response:
1. **Problem Statement**: Continue with the problem definition, but expand on the real-world implications of ethics and risk management in AI.
2. **Understanding the Data**: Break the two documents down clearly into sections and discuss specific strategies for how chunking can preserve meaning within these sections.
3. **Advanced Retrieval & Chunking**: Expand this section to include the chunking methods you've outlined, and specify the improvements you will explore (e.g., dynamic chunking, context-based grouping).
4. **Performance Considerations**: Detail how you will balance quality and performance based on user needs and document types.

This would strengthen your approach, improving both the technical accuracy and the user experience.