SwiLTra-Bench: The Swiss Legal Translation Benchmark
Abstract
In Switzerland, legal translation is uniquely important due to the country's four official languages and requirements for multilingual legal documentation. However, this process traditionally relies on professionals who must be both legal experts and skilled translators -- creating bottlenecks and impacting effective access to justice. To address this challenge, we introduce SwiLTra-Bench, a comprehensive multilingual benchmark of over 180K aligned Swiss legal translation pairs comprising laws, headnotes, and press releases across all Swiss official languages along with English, designed to evaluate LLM-based translation systems. Our systematic evaluation reveals that frontier models achieve superior translation performance across all document types, while specialized translation systems excel specifically in laws but underperform in headnotes. Through rigorous testing and human expert validation, we demonstrate that while fine-tuning open small language models (SLMs) significantly improves their translation quality, they still lag behind the best zero-shot prompted frontier models such as Claude-3.5-Sonnet. Additionally, we present SwiLTra-Judge, a specialized LLM evaluation system that aligns best with human expert assessments.
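As a rough illustration of how a benchmark of aligned translation pairs like this could be consumed, the sketch below loads a dataset and builds a zero-shot German-to-French translation prompt of the kind used to evaluate frontier models. The repository ID, split, and column names are assumptions for illustration only; the actual identifiers are listed on the paper's dataset pages.

```python
# Minimal sketch: load an aligned Swiss legal translation set and build a
# zero-shot DE->FR translation prompt. The dataset ID, split, and column
# names ("de", "fr") are ASSUMPTIONS for illustration; consult the published
# SwiLTra-Bench dataset cards for the real identifiers.
from datasets import load_dataset

# Hypothetical repository ID; replace with the actual SwiLTra-Bench dataset.
ds = load_dataset("SwiLTra-Bench/law-pairs", split="test")

example = ds[0]
prompt = (
    "You are an expert Swiss legal translator. Translate the following law "
    "article from German into French, preserving legal terminology and "
    "article structure.\n\n"
    f"German source:\n{example['de']}\n\nFrench translation:"
)
print(prompt)
```

The resulting prompt string would then be sent to the model under evaluation, and its output compared against the aligned French reference, whether by automatic metrics or an LLM judge such as the paper's SwiLTra-Judge.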
Community
The following similar papers were recommended by the Semantic Scholar API:
- When LLMs Struggle: Reference-less Translation Evaluation for Low-resource Languages (2025)
- AFRIDOC-MT: Document-level MT Corpus for African Languages (2025)
- BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models (2025)
- Can Large Language Models Predict the Outcome of Judicial Decisions? (2025)
- Beyond English: The Impact of Prompt Translation Strategies across Languages and Tasks in Multilingual LLMs (2025)
- Automatic Input Rewriting Improves Translation with Large Language Models (2025)
- Beyond Translation: LLM-Based Data Generation for Multilingual Fact-Checking (2025)