---
library_name: transformers
license: apache-2.0
base_model:
- severinsimmler/xlm-roberta-longformer-base-16384
---

# SWEb Markdown Extractor

This model is developed by the NLU team at [AI Sweden](https://www.ai.se) as a **primary content extractor** for web pages, and was used to produce the [SWEb dataset](https://huggingface.co./datasets/AI-Sweden-Models/SWEb). For more details, please see [the SWEb paper](https://arxiv.org/abs/2410.04456) and the [SWEb source code](https://github.com/aidotse/SWEb/tree/main).

In our source code, you'll find:

- Our training and test extraction data
- An annotation tool for annotating additional data
- Training and inference scripts

## Model Details

### Model Description

The model can be used to extract the primary content from websites. The example below shows how it can be used (taken from [here](https://github.com/aidotse/SWEb/tree/main)):

```python
import os
import requests
from torch.nn.functional import sigmoid
from pipeline.warc_processing import ConvertToMarkdown
from transformers import AutoTokenizer, AutoModelForTokenClassification

# 1. Download a webpage
resp = requests.get("https://www.ai.se/sv/nyheter/nobelpriset-i-fysik-och-kemi-till-banbrytande-ai-forskning")

# 2. Convert HTML to markdown using pandoc
markdown = ConvertToMarkdown.convert_html_to_markdown(resp.content, pandoc_path=f"{os.environ['HOME']}/bin/pandoc")  # path to pandoc 2.9.2.1, see INSTALL.md

# 3. Extract text by classifying each line using the trained model
tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/SWEb-markdown-extractor")
model = AutoModelForTokenClassification.from_pretrained("AI-Sweden-Models/SWEb-markdown-extractor").eval()

tokens = tokenizer(markdown.replace("\n", tokenizer.sep_token), return_tensors="pt", add_special_tokens=False, truncation=True)
tokens["line_sep_token_ids"] = (tokens.input_ids[0] == tokenizer.sep_token_id).nonzero()[None, :, 0]

logits = model(**tokens)[0]
extracted_lines = [
    line for line, pred in zip(markdown.split("\n"), sigmoid(logits))
    if pred > 0.05
]

# Print extracted text
print("\n".join(extracted_lines))
```

outputs:

```markdown
# Nobelpriset i fysik och kemi till banbrytande AI-forskning
tisdag, oktober 8, 2024
Två Nobelpris till AI 2024\! Det i fysik går till forskning som lagt grunden till maskininlärning och artificiell intelligens, och det i kemi till Google DeepMinds AlphaFold2
*– Det är fantastiskt att det här viktiga arbetet får ett sådant erkännande. Särskilt den tillämpade AI som uppmärksammas i Kemipriset*, säger Johanna Bergman, Director of Strategic Initiatives på AI Sweden.
...
```

### Model Sources

- **Repository:** https://github.com/aidotse/SWEb/tree/main
- **Paper:** https://arxiv.org/abs/2410.04456

## Uses

We propose using model-based extractors as they provide more flexibility in shaping the extraction through data rather than rules. This model was trained on Scandinavian webpages in particular, so we expect extraction to work better for webpages in these languages than for others. However, annotating additional data with our [tool](https://github.com/aidotse/SWEb/tree/main/annotation_tool) is swift, and the model learns from small amounts of data.
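The example above keeps every line whose sigmoid score exceeds 0.05. Since extraction quality varies by language and domain, it can be useful to inspect the per-line scores and pick a different cut-off: a lower threshold keeps more lines (higher recall), while a higher one keeps only lines the model is confident about (higher precision). Below is a minimal sketch of such a sweep, reusing the `markdown` and `logits` variables from the example above; the candidate threshold values are illustrative assumptions, not recommended settings.

```python
# Minimal sketch: sweep the extraction threshold, reusing `markdown` and
# `logits` from the example above. The candidate values are illustrative only.
from torch.nn.functional import sigmoid

lines = markdown.split("\n")
line_scores = sigmoid(logits)  # one score per markdown line

for threshold in (0.01, 0.05, 0.5):
    kept = [line for line, score in zip(lines, line_scores) if score > threshold]
    # Lower thresholds favour recall; higher thresholds favour precision.
    print(f"threshold={threshold}: kept {len(kept)} of {len(lines)} lines")
```

The 0.05 used in the example is a relatively permissive cut-off; if you retrain the extractor on your own annotations, it may be worth re-validating the threshold on held-out pages.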
## Bias, Risks, and Limitations

The text extraction model presented here is designed to extract primary content from webpages, but it is important to acknowledge its inherent biases, risks, and limitations. The following aspects should be considered when using this model and the datasets derived from it.

- Incomplete Context and Information: Webpages often contain a mix of primary content, supplementary information, and context in surrounding elements (such as comments, metadata, or links). The text extraction model focuses on extracting the "main" content, which can lead to a loss of nuance or essential context. This limitation may affect the quality and usefulness of the pretraining datasets, especially in scenarios where contextual information is crucial for understanding.
- Domain-Specific Limitations: The effectiveness of the text extraction model may vary depending on the domain or structure of the webpages. For example, pages with heavy advertisements, complex layouts, or dynamically generated content might lead to extraction errors or incomplete outputs. These limitations can lead to a dataset that underrepresents content from such domains or introduces noise due to incorrect extraction.
- Content Filtering and Ethical Concerns: The extracted text may include offensive, explicit, or otherwise harmful content. Without adequate content filtering, this material could end up in pretraining datasets, affecting the behavior of downstream language models. Users must be aware of the ethical implications and potential harms of training models on unfiltered web data.
- Regional and Language Bias: The model's training data consists predominantly of webpages in the Scandinavian languages and regions, which can lead to an overrepresentation of these languages in the extracted data.

These biases, risks, and limitations emphasize the need for careful curation, filtering, and post-processing of the extracted content to mitigate negative impacts on downstream applications. Users of this model should consider integrating diverse sources, employing bias mitigation techniques, and conducting ongoing evaluations to reduce the potential harms associated with large-scale pretraining.

## Training Details

### Training Data

Please find our training and test data [here](https://github.com/aidotse/SWEb/blob/main/annotation_tool/backend/data/data.jsonl).

### Training Script

Please find our training script [here](https://github.com/aidotse/SWEb/blob/main/pipeline/line_classification/train.py).

## Citation

To cite this work, please use the following:

```
@misc{norlund2024sweblargewebdataset,
  title={SWEb: A Large Web Dataset for the Scandinavian Languages},
  author={Tobias Norlund and Tim Isbister and Amaru Cuba Gyllensten and Paul Dos Santos and Danila Petrelli and Ariel Ekgren and Magnus Sahlgren},
  year={2024},
  eprint={2410.04456},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.04456},
}
```