SWEb Markdown Extractor
This model was developed by the NLU team at AI Sweden to extract the primary content from web pages, and was used to produce the SWEb dataset. For more details, please see the SWEb paper and SWEb source code.
In our source code, you'll find:
- Our training and test extraction data
- An annotation tool for labeling additional data
- Training and inference scripts
Model Details
Model Description
The model extracts the primary content from web pages. The example below shows how to use it (taken from here):
import os
import requests
from torch.nn.functional import sigmoid
from pipeline.warc_processing import ConvertToMarkdown
from transformers import AutoTokenizer, AutoModelForTokenClassification
# 1. Download a webpage
resp = requests.get("https://www.ai.se/sv/nyheter/nobelpriset-i-fysik-och-kemi-till-banbrytande-ai-forskning")
# 2. Convert HTML to markdown using pandoc
markdown = ConvertToMarkdown.convert_html_to_markdown(resp.content, pandoc_path=f"{os.environ['HOME']}/bin/pandoc") # path to pandoc 2.9.2.1, see INSTALL.md
# 3. Extract text by classifying each line using trained model
tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/SWEb-markdown-extractor")
model = AutoModelForTokenClassification.from_pretrained("AI-Sweden-Models/SWEb-markdown-extractor").eval()
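# Replace newlines with the tokenizer's separator token and record each separator's
# position, so the model can produce one content-classification logit per markdown line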
tokens = tokenizer(markdown.replace("\n", tokenizer.sep_token), return_tensors="pt", add_special_tokens=False, truncation=True)
tokens["line_sep_token_ids"] = (tokens.input_ids[0] == tokenizer.sep_token_id).nonzero()[None, :, 0]
logits = model(**tokens)[0]
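# Apply sigmoid to turn the per-line logits into probabilities and keep lines above the 0.05 threshold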
extracted_lines = [
    line for line, pred in zip(markdown.split("\n"), sigmoid(logits))
    if pred > 0.05
]
# Print extracted text
print("\n".join(extracted_lines))
outputs:
# Nobelpriset i fysik och kemi till banbrytande AI-forskning
tisdag, oktober 8, 2024
Två Nobelpris till AI 2024\! Det i fysik går till forskning som lagt grunden till maskininlärning och artificiell intelligens, och det i kemi till Google DeepMinds AlphaFold2
*– Det är fantastiskt att det här viktiga arbetet får ett sådant erkännande. Särskilt den tillämpade AI som uppmärksammas i Kemipriset*, säger Johanna Bergman, Director of Strategic Initiatives på AI Sweden.
...
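For repeated use, the classification step can be wrapped in a small helper. The sketch below is a hypothetical convenience wrapper around the example above; the function name extract_lines and its default threshold are our own choices and not part of the released code.

import torch
from torch.nn.functional import sigmoid
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load once, reuse across documents (hypothetical helper, not part of the released code)
tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/SWEb-markdown-extractor")
model = AutoModelForTokenClassification.from_pretrained("AI-Sweden-Models/SWEb-markdown-extractor").eval()

def extract_lines(markdown: str, threshold: float = 0.05) -> str:
    """Return the lines of `markdown` classified as primary content."""
    tokens = tokenizer(markdown.replace("\n", tokenizer.sep_token), return_tensors="pt", add_special_tokens=False, truncation=True)
    # One separator token per line; the model emits one logit per separator position
    tokens["line_sep_token_ids"] = (tokens.input_ids[0] == tokenizer.sep_token_id).nonzero()[None, :, 0]
    with torch.no_grad():
        logits = model(**tokens)[0]
    return "\n".join(line for line, p in zip(markdown.split("\n"), sigmoid(logits)) if p > threshold)

Lowering the threshold keeps more boilerplate-like lines (higher recall); raising it favors precision.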
Model Sources
- Repository: https://github.com/aidotse/SWEb/tree/main
- Paper: https://arxiv.org/abs/2410.04456
Uses
We propose using model-based extractors as they provide more flexibility in shaping the extraction through data rather than rules. This model was trained on Scandinavian webpages in particular, so we expect extraction to work better for webpages in these languages than for other languages. However, annotating additional data with our tool is quick, and the model learns from small amounts of data.
Bias, Risks, and Limitations
The text extraction model presented here is designed to extract primary content from webpages, but it is important to acknowledge its inherent biases, risks, and limitations. The following aspects should be considered when using this model and the datasets derived from it.
- Incomplete Context and Information: Webpages often contain a mix of primary content, supplementary information, and context in surrounding elements (such as comments, metadata, or links). The text extraction model focuses on extracting the "main" content, which can lead to a loss of nuance or essential context. This limitation may affect the quality and usefulness of the pretraining datasets, especially in scenarios where contextual information is crucial for understanding.
- Domain-Specific Limitations: The effectiveness of the text extraction model may vary depending on the domain or structure of the webpages. For example, pages with heavy advertisements, complex layouts, or dynamically generated content might lead to extraction errors or incomplete outputs. These limitations can lead to a dataset that underrepresents content from such domains or introduces noise due to incorrect extraction.
- Content Filtering and Ethical Concerns: The extracted text may include offensive, explicit, or otherwise harmful content. Without adequate content filtering, this material could end up in pretraining datasets, affecting the behavior of downstream language models. Users must be aware of the ethical implications and potential harms of training models on unfiltered web data.
- Regional and Language Bias: The model was trained predominantly on webpages in the Scandinavian languages and regions, which can lead to an overrepresentation of these languages in the extracted data.
These biases, risks, and limitations emphasize the need for careful curation, filtering, and post-processing of the extracted content to mitigate negative impacts on downstream applications. Users of this model should consider integrating diverse sources, employing bias mitigation techniques, and conducting ongoing evaluations to reduce the potential harms associated with large-scale pretraining.
Training Details
Training Data
Please find our training and test data here.
Training Script
Please find our training script here.
Citation
To cite this work, please use the following:
@misc{norlund2024sweblargewebdataset,
title={SWEb: A Large Web Dataset for the Scandinavian Languages},
author={Tobias Norlund and Tim Isbister and Amaru Cuba Gyllensten and Paul Dos Santos and Danila Petrelli and Ariel Ekgren and Magnus Sahlgren},
year={2024},
eprint={2410.04456},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.04456},
}