Pleias-Pico is a 353-million-parameter specialized language model designed by PleIAs for Retrieval-Augmented Generation (RAG).

Like its base model, Pleias-350m, Pleias-Pico aims to be a fully open model (weights, code, data), trained only on content under permissive licenses and fully compliant with the European AI Act.

Description

Pleias-Pico is a continued pretrain of Pleias-350m on a new dataset of 45,088,768,000 tokens modeling common retrieval tasks. All the content of the dataset ultimately comes from Common Corpus.

Pleias-Pico includes the main features of the original base model:

  • Only trained on open data under permissive licenses and in compliance with the European AI Act. By design, all Pleias models are unable to output copyrighted content.
  • Extensive multilingual support for main European languages: English, French, German, Spanish, Italian, Dutch, Latin, Portuguese and Polish.
  • Extremely low level of toxicity and problematic content.

Pleias-Pico supports retrieval-augmented generation with enhanced verifiability, source analysis and grounding on submitted sources. This includes:

  • Standardized structure and special tokens to include queries, sources, and references.
  • Anticipation of various query forms in multiple languages, from fully drafted questions to unstructured lists of search keywords.
  • Source analysis/criticism, which also acts as an integrated reranking step.
  • Generation of grounded answers with references and excerpts linked to the original sources.

Initial tests have shown that the RAG design has significantly improved the factuality and verifiability of the model. Even when the grounding does not work perfectly, the information remains much closer to the original sources.

As a result, Pleias-Pico has already been tested and integrated into multiple applied RAG projects, including Pleias's flagship application Scholasticai.

Training

Pleias-Pico was trained at Jean-Zay on 16 H100s with Nanotron, the pretraining library from Hugging Face. We provide the complete settings as a YAML file as part of our release.

Pleias-Pico derives from the last checkpoint of Pleias-350m (step 518,000). The training schedule reused the last learning rate value (6e-5) without decay for 90,000 steps.

Training covers the entire RAG dataset we designed out of Common Corpus, for 1 epoch.

Further experiments were made with different learning rate values: none of these tests provided better convergence than the one obtained with the final learning rate from the base model.

Inference

Pleias-Pico relies on special tokens to encode the core RAG functionalities.

The bare structure is like this:

<|query_start|>Is Wikipedia considered reliable?<|query_end|>
<|source_start|><|source_id|>[numeric id] [text]<|source_end|>
<|source_start|><|source_id|>[numeric id] [text]<|source_end|>
<|source_start|><|source_id|>[numeric id] [text]<|source_end|>
<|query_analysis_start|>
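
This structure can be assembled programmatically. The helper below is a minimal sketch (the function name and signature are our own, not part of the model release) that builds a prompt from a query and a list of source texts:

```python
# Sketch of a prompt builder for the Pleias-Pico RAG format.
# The function name and signature are illustrative, not part of the release.
def build_rag_prompt(query: str, sources: list[str]) -> str:
    parts = [f"<|query_start|>{query}<|query_end|>"]
    for i, text in enumerate(sources, start=1):
        # Each source gets a numeric id right after the <|source_id|> token.
        parts.append(f"<|source_start|><|source_id|>{i} {text}<|source_end|>")
    # The prompt ends with the opening token of the query analysis,
    # which the model is expected to complete.
    parts.append("<|query_analysis_start|>")
    return "\n".join(parts)
```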

A typical example, with excerpts drawn from the Wikipedia article on Wikipedia:

<|query_start|>Is Wikipedia considered reliable?<|query_end|>
<|source_start|><|source_id|>1 Articles for traditional encyclopedias such as Encyclopædia Britannica are written by experts, lending such encyclopedias a reputation for accuracy. However, a peer review in 2005 of forty-two scientific entries on both Wikipedia and Encyclopædia Britannica by the science journal Nature found few differences in accuracy, and concluded that 'the average science entry in Wikipedia contained around four inaccuracies; Britannica, about three.' Joseph Reagle suggested that while the study reflects 'a topical strength of Wikipedia contributors' in science articles.<|source_end|>
<|source_start|><|source_id|>2 As a consequence of the open structure, Wikipedia 'makes no guarantee of validity' of its content, since no one is ultimately responsible for any claims appearing in it. Concerns have been raised by PC World in 2009 regarding the lack of accountability that results from users' anonymity, the insertion of false information, vandalism, and similar problems. Legal Research in a Nutshell (2011), cites Wikipedia as a 'general source' that 'can be a real boon' in 'coming up to speed in the law governing a situation' and, 'while not authoritative, can provide basic facts as well as leads to more in-depth resources'.<|source_end|>
<|source_start|><|source_id|>3 Wikipedia's open structure inherently makes it an easy target for Internet trolls, spammers, and various forms of paid advocacy seen as counterproductive to the maintenance of a neutral and verifiable online encyclopedia. In response to paid advocacy editing and undisclosed editing issues, Wikipedia was reported in an article in The Wall Street Journal to have strengthened its rules and laws against undisclosed editing. The article stated that: 'Beginning Monday [from the date of the article, June 16, 2014], changes in Wikipedia's terms of use will require anyone paid to edit articles to disclose that arrangement.<|source_end|>
<|query_analysis_start|>

The expected output for the draft stage looks like this:

<|query_analysis_start|>
The query seeks to understand the relationship between Wikipedia's open structure and its impact on user trust and content quality. It's a complex question that touches upon both technical aspects (open structure) and social implications (trust and quality). The query appears to come from someone with some knowledge of Wikipedia, as it assumes familiarity with the concept of open structure. The question is answerable but requires specific information about Wikipedia's open nature and its effects on user behavior.
<|query_analysis_end|>

<|query_report_start|>
ready to be answered
<|query_report_end|>

<|source_analysis_start|>
The sources provide relevant information for answering this query:
- <|source_id|>1 offers direct evidence about Wikipedia's open structure and its impact on accuracy, citing specific studies and expert opinions.
- <|source_id|>2 provides context about Wikipedia's open structure and its relationship to legal accountability, though it's more focused on legal implications.
- <|source_id|>3 is less relevant as it focuses on specific technical aspects of Wikipedia's structure rather than the broader impact on user trust and quality.

The sources appear to be from academic or professional contexts, with <|source_id|>1 being particularly valuable as it directly addresses the query's main points. While the sources are somewhat dated (references to 2005 studies), they provide sufficient evidence to construct a meaningful answer.
<|source_analysis_end|>

<|draft_start|>
The answer should be structured around three main points:

1. Wikipedia's Open Structure and Its Impact on Accuracy:
- Drawing from <|source_id|>1 to explain how Wikipedia's open structure affects accuracy
- Using the specific example of accuracy rates (4 inaccuracies in Britannica)

2. Impact on User Trust:
- Using <|source_id|>1's findings about Wikipedia's reputation and credibility
- Incorporating the expert opinion from <|source_id|>1 about the "topical strength" of Wikipedia

3. Quality Implications:
- Drawing from <|source_id|>1's discussion of accuracy rates and their relationship to quality
- Including the broader context of Wikipedia's role in maintaining credibility

The tone should be analytical and evidence-based, acknowledging both the technical aspects and social implications of Wikipedia's open structure. The answer should emphasize how these factors interact to affect both user trust and content quality.
<|draft_end|>

And for the actual answer:

<|answer_start|>
The relationship between Wikipedia's open structure and its impact on user trust and content quality reveals fascinating dynamics that shape our digital landscape. Let me walk you through this complex interplay.

At its core, Wikipedia's open structure creates a dynamic environment where accuracy matters significantly. While Wikipedia maintains a topical strength, it's worth noting that even experts like Joseph Reagle have found that even minor inaccuracies in entries can be problematic, as evidenced by a study of around four inaccuracies in BritannicaJoseph Reagle suggested that while the study reflects 'a topical strength of Wikipedia contributors' in science articles.Britannica, about three. About four inaccuracies; Britannica, about three.. This suggests that while the structure itself may not guarantee absolute accuracy, it does influence how users perceive and interact with the content.

The impact on user trust is particularly noteworthy. Research shows that Wikipedia's reputation is significantly influenced by its accuracy rate. A study of forty-two scientific entries found that approximately four inaccuracies in Britannica were attributed to Wikipedia's qualitya peer review in 2005 of forty-two scientific entries on both Wikipedia and Encyclopædia Britannica by the science journal Nature found few differences in accuracy, and concluded that 'the average science entry in Wikipedia contained around four inaccuracies; Britannica, about three.'. This finding highlights how the structure itself can affect how users perceive and interact with the content.

Quality considerations play a crucial role in maintaining Wikipedia's credibility. Wikipedia's reputation is intrinsically tied to its accuracy, which is why it's considered a reliable source of informationWikipedia's open structure inherently makes it an easy target for Internet trolls, spammers, and various forms of paid advocacy seen as counterproductive to the maintenance of a neutral and verifiable online encyclopedia.. However, this doesn't mean that Wikipedia's reliability is absolute.

The broader context of Wikipedia's role in maintaining credibility adds another layer of complexity. Wikipedia's reputation is intrinsically tied to its accuracy, which is why it's considered a reliable source of informationWikipedia's open structure inherently makes it an easy target for Internet trolls, spammers, and various forms of paid advocacy seen as counterproductive to the maintenance of a neutral and verifiable online encyclopedia.. This creates a dynamic relationship between the structure and the content that affects both user trust and content quality.

The impact on user trust extends beyond mere accuracy concerns. When users trust Wikipedia's reliability, they're more likely to engage with the content. This creates a feedback loop where the structure itself can influence how users perceive and interact with the content, ultimately shaping the quality of the information they encounter.

It's important to note that while Wikipedia's open structure provides a framework for understanding its reliability, it's just one piece of a larger puzzle. The relationship between structure and quality continues to evolve as the digital landscape changes and new challenges emerge. Understanding these dynamics helps us appreciate how the structure of Wikipedia can serve as a catalyst for both content quality and user trust.
<|answer_end|>
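
Because every section of the output is delimited by paired special tokens, a completion can be split back into structured fields after generation. The helper below is a minimal sketch (the names are our own, not part of the release):

```python
import re

# Sketch: split a Pleias-Pico completion into its delimited sections.
# Section names follow the special tokens shown above; the helper itself
# is illustrative and not part of the model release.
SECTIONS = ["query_analysis", "query_report", "source_analysis", "draft", "answer"]

def parse_output(text: str) -> dict[str, str]:
    result = {}
    for name in SECTIONS:
        match = re.search(
            rf"<\|{name}_start\|>(.*?)<\|{name}_end\|>", text, flags=re.DOTALL
        )
        if match:
            result[name] = match.group(1).strip()
    return result
```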

As a specialized language model, Pleias-Pico will not work properly with prompts that detract from this design.

Standard prompt design

This is a structured prompt, written in Python:

prompt = """<|query_start|>Is Wikipedia considered reliable?<|query_end|> <|source_start|><|source_id|>1 Articles for traditional encyclopedias such as Encyclopædia Britannica are written by experts, lending such encyclopedias a reputation for accuracy. However, a peer review in 2005 of forty-two scientific entries on both Wikipedia and Encyclopædia Britannica by the science journal Nature found few differences in accuracy, and concluded that 'the average science entry in Wikipedia contained around four inaccuracies; Britannica, about three.' Joseph Reagle suggested that while the study reflects 'a topical strength of Wikipedia contributors' in science articles.<|source_end|> <|source_start|><|source_id|>2 As a consequence of the open structure, Wikipedia 'makes no guarantee of validity' of its content, since no one is ultimately responsible for any claims appearing in it. Concerns have been raised by PC World in 2009 regarding the lack of accountability that results from users' anonymity, the insertion of false information, vandalism, and similar problems. Legal Research in a Nutshell (2011), cites Wikipedia as a 'general source' that 'can be a realboon' in 'coming up to speed in the law governing a situation' and, 'while not authoritative, can provide basic facts as well as leads to more in-depth resources'.<|source_end|> <|source_start|><|source_id|>3 Wikipedia's open structure inherently makes it an easy target for Internet trolls, spammers, and various forms of paid advocacy seen as counterproductive to the maintenance of a neutral and verifiable online encyclopedia. In response to paid advocacy editing and undisclosed editing issues, Wikipedia was reported in an article in The Wall Street Journal to have strengthened its rules and laws against undisclosed editing. 
The article stated that: 'Beginning Monday [from the date of the article, June 16, 2014], changes in Wikipedia's terms of use will require anyone paid to edit articles to disclose that arrangement.<|source_end|> <|query_analysis_start|>""" print(f"Prompt: {prompt}")
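
A completion can then be generated with standard causal-LM tooling. The sketch below uses Hugging Face transformers; the repository id "PleIAs/Pleias-Pico" is an assumption and should be adjusted to the actual Hub name:

```python
# Sketch: generating a Pleias-Pico completion with Hugging Face transformers.
# MODEL_ID is an assumption; replace it with the actual Hub repository name.
MODEL_ID = "PleIAs/Pleias-Pico"

def generate(prompt: str, max_new_tokens: int = 1024) -> str:
    # Imported lazily so the sketch can be read without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(prompt, return_tensors="pt")
    # Stop once the model closes the answer section.
    end_id = tokenizer.convert_tokens_to_ids("<|answer_end|>")
    output = model.generate(
        **inputs, max_new_tokens=max_new_tokens, eos_token_id=end_id
    )
    # Return only the newly generated tokens.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:])

if __name__ == "__main__":
    demo = "<|query_start|>Is Wikipedia considered reliable?<|query_end|><|query_analysis_start|>"
    print(generate(demo))
```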

Dataset used to train: LLMDH/rag_2