BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training Paper • 2409.04599 • Published Sep 6, 2024
Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models Paper • 2311.09194 • Published Nov 15, 2023
Toxicity of the Commons: Curating Open-Source Pre-Training Data Paper • 2410.22587 • Published Oct 29, 2024
People cannot distinguish GPT-4 from a human in a Turing test Paper • 2405.08007 • Published May 9, 2024
Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement Paper • 2403.13754 • Published Mar 20, 2024
A Bit of a Problem: Measurement Disparities in Dataset Sizes Across Languages Paper • 2403.00686 • Published Mar 1, 2024
When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages Paper • 2311.09205 • Published Nov 15, 2023