Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs
Abstract
Paywalls, licenses and copyright rules often restrict the broad dissemination and reuse of scientific knowledge. We take the position that it is both legally and technically feasible to extract the scientific knowledge in scholarly texts. Current methods, like text embeddings, fail to reliably preserve factual content, and simple paraphrasing may not be legally sound. We urge the community to adopt a new idea: convert scholarly documents into Knowledge Units using LLMs. These units use structured data capturing entities, attributes and relationships without stylistic content. We provide evidence that Knowledge Units: (1) form a legally defensible framework for sharing knowledge from copyrighted research texts, based on legal analyses of German copyright law and U.S. Fair Use doctrine, and (2) preserve most (~95%) factual knowledge from original text, measured by MCQ performance on facts from the original copyrighted text across four research domains. Freeing scientific knowledge from copyright promises transformative benefits for scientific research and education by allowing language models to reuse important facts from copyrighted text. To support this, we share open-source tools for converting research documents into Knowledge Units. Overall, our work posits the feasibility of democratizing access to scientific knowledge while respecting copyright.
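To make the idea of a Knowledge Unit concrete, here is a minimal Python sketch of one possible representation. It is an illustration under stated assumptions, not the schema used by the authors' released tools: the class names `KnowledgeUnit` and `Relationship`, the field layout, and the helper `mcq_retention` are all hypothetical. In particular, treating retention as the ratio of MCQ answers correct with Knowledge Units to MCQ answers correct with the original text is an assumed reading of the "~95%" figure, not a definition given in the abstract.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Relationship:
    """A (subject, predicate, object) triple linking two entities. Hypothetical layout."""
    subject: str
    predicate: str
    obj: str


@dataclass
class KnowledgeUnit:
    """Style-free facts from a passage: entities, attributes, relationships.

    Hypothetical schema for illustration; the paper's actual format may differ.
    """
    entities: List[str] = field(default_factory=list)
    # Maps an entity name to its attribute name/value pairs.
    attributes: Dict[str, Dict[str, str]] = field(default_factory=dict)
    relationships: List[Relationship] = field(default_factory=list)

    def to_json(self) -> str:
        # Serialize to plain structured data, with no prose from the source text.
        return json.dumps(asdict(self), sort_keys=True)


def mcq_retention(correct_with_units: int, correct_with_original: int) -> float:
    """Assumed metric: share of MCQ performance kept when answering from Knowledge Units."""
    if correct_with_original <= 0:
        raise ValueError("correct_with_original must be positive")
    return correct_with_units / correct_with_original


# Illustrative example, not taken from the paper's data.
unit = KnowledgeUnit(
    entities=["aspirin", "COX-1"],
    attributes={"aspirin": {"class": "NSAID"}},
    relationships=[Relationship("aspirin", "inhibits", "COX-1")],
)
print(unit.to_json())
print(mcq_retention(95, 100))
```

Keeping the unit as plain entities, attributes and triples means nothing of the source's wording or style needs to be stored, which is the property the legal argument in the abstract relies on.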
Community
Freeing scientific knowledge from copyright promises transformative benefits for scientific research and education by allowing language models to reuse important facts from copyrighted text. We take the position that it is both legally and technically feasible to extract the scientific knowledge in scholarly texts, and provide a pathway to do that!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Generative AI Training and Copyright Law (2025)
- Science Across Languages: Assessing LLM Multilingual Translation of Scientific Papers (2025)
- All That Glitters is Not Novel: Plagiarism in AI Generated Research (2025)
- NLP-AKG: Few-Shot Construction of NLP Academic Knowledge Graph Based on LLM (2025)
- SelfElicit: Your Language Model Secretly Knows Where is the Relevant Evidence (2025)
- LLMs as Repositories of Factual Knowledge: Limitations and Solutions (2025)
- LegalGuardian: A Privacy-Preserving Framework for Secure Integration of Large Language Models in Legal Practice (2025)