---
license: mit
language:
- en
pipeline_tag: text-generation
tags:
- linux
- gpl2
- mit
widget:
- text: "<|start_file|>"
- temperature: 0.3
- do_sample: True
---

# LaaM - Linux as a Model
What happens when we train a simple transformer model to memorize the GPL2 source of the Linux kernel?

## Source
[https://github.com/mjbommar/laam](https://github.com/mjbommar/laam)

## Motivation
Simply put, the OSI is making a grave mistake by ignoring the most important transitive dependency in AI: the training data.

As of the latest version of [The Open Source AI Definition (draft v. 0.0.8)](https://opensource.org/deepdive/drafts/the-open-source-ai-definition-draft-v-0-0-8),
the OSI has decided that the legal status of training data is irrelevant to their subsequent "approval" of models as "open."

The argument in favor of this omission is that such a requirement would be inconvenient and legally ambiguous
in some jurisdictions.

This would be like Creative Commons encouraging the authors of textual or audiovisual works to ignore
the terms of copyleft licenses.

**Simply put, organizations like the OSI must take a clear, common-sense stance: "AI" models like text or multimodal LLMs cannot be considered "open" if they are trained on "stolen" or "closed source" data.**

## Details
To demonstrate how ridiculous the OSI's position is, I have trained simple transformer models to memorize the
source code of Linux version 1.0, which is licensed under the GPL2.

This model is documented and trained in perfect compliance with the OSI's draft guidance on Data Information, Code,
and Model sections. All source code is available in the GitHub repository, all dependencies are open source,
all input training data is directly described by the source code, and all model weights are available on
Hugging Face.

## Example Model - 5M parameter Llama2 architecture
This 5M parameter model can be trained on practically any device in minutes to hours. The model trivially
emits copies of Linux 1.0 source code. For example, using the Hugging Face Hub copy at `mjbommar/linux-as-a-model-5M`:

```python
>>> from transformers import pipeline
>>> p = pipeline('text-generation', 'mjbommar/linux-as-a-model-5M')
>>> print(p('', max_new_tokens=256, do_sample=True, temperature=0.2)[0]['generated_text'])
linux/drivers/net/3c503.c /* 3c503.c: A shared-memory NS8390 ethernet driver for linux. */
/*
Written 1992,1993 by Donald Becker.

Copyright 1993 United States Government as represented by the
Director, National Security Agency. This software may be used and
distributed according to the terms of the GNU Public License,
incorporated herein by reference.

This driver should work with the 3c503 and 3c503/16. It should be used
in shared memory mode for best performance, although it may also work
in programmed-I/O mode.

The Author may be reached as [email protected] or
C/O Supercomputing Research Ctr., 17100 Science Dr., Bowie MD 20715
*/

```

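One rough way to check how much of a generation is copied verbatim is to look for the longest substring it shares with the original file. The sketch below uses only the standard library; the helper names `longest_verbatim_overlap` and `verbatim_fraction` are hypothetical and not part of the repository, and the two strings are short stand-ins for a real Linux 1.0 source file and a real pipeline output.

```python
from difflib import SequenceMatcher


def longest_verbatim_overlap(generated: str, reference: str) -> str:
    """Return the longest substring that appears in both texts."""
    # autojunk=False disables difflib's popularity heuristic, which would
    # otherwise ignore frequent characters in long inputs.
    matcher = SequenceMatcher(None, generated, reference, autojunk=False)
    match = matcher.find_longest_match(0, len(generated), 0, len(reference))
    return generated[match.a : match.a + match.size]


def verbatim_fraction(generated: str, reference: str) -> float:
    """Fraction of the generated text covered by the longest shared substring."""
    if not generated:
        return 0.0
    return len(longest_verbatim_overlap(generated, reference)) / len(generated)


# Hypothetical stand-ins: in practice `reference` would be read from the
# original source file and `generated` taken from the pipeline output.
reference = (
    "/* 3c503.c: A shared-memory NS8390 ethernet driver for linux. */\n"
    "/*\n"
    "Written 1992,1993 by Donald Becker.\n"
)
generated = (
    "linux/drivers/net/3c503.c "
    "/* 3c503.c: A shared-memory NS8390 ethernet driver for linux. */"
)

overlap = longest_verbatim_overlap(generated, reference)
print(repr(overlap))
print(round(verbatim_fraction(generated, reference), 2))
```

This only measures the single longest shared run, so it understates copying when a generation stitches together several verbatim passages. It is also quadratic in the worst case, which is fine for a few hundred tokens but slow for whole files.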
## License
For the sake of demonstration, I have licensed the model source **and weights** under the MIT terms,
and the OSI should support this model as completely open and compliant with their draft guidance.

## Train your own model
```
# ensure poetry is available
# curl -sSL https://install.python-poetry.org | python3 -

# setup poetry environment
$ poetry install --no-root

# optionally install flash-attn
# poetry run pip install wheel
# MAX_JOBS=4 poetry run pip install flash-attn --no-build-isolation

# train a tokenizer with fixed vocab size on linux version 1.0
$ PYTHONPATH=. poetry run python3 -m laam.commands.train_tokenizer \
    --version v1.0/1.0 \
    --vocab-size 32768

# train a 5M parameter model on it

# stage 1: large batch size, 1e-3 learning rate to safely converge near solution
$ PYTHONPATH=. poetry run accelerate launch \
    laam/commands/train_llama.py \
    --version v1.0/1.0 \
    --precision bfloat16 \
    --hidden_size 64 \
    --intermediate_size 256 \
    --num_hidden_layers 8 \
    --num_attention_heads 32 \
    --max_position_embeddings 512 \
    --learning_rate 0.001 \
    --batch_size 64 \
    --epochs 100

# stage 2: single sample batches with 1e-4 learning rate to memorize
$ PYTHONPATH=. poetry run accelerate launch \
    laam/commands/train_llama.py \
    --version v1.0/1.0 \
    --precision bfloat16 \
    --reload \
    --learning_rate 0.0001 \
    --batch_size 1 \
    --epochs 100
```
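
As a sanity check on the "5M" label, the parameter count implied by the flags above can be worked out by hand. This is a sketch under stated assumptions, not taken from the training script: it assumes the standard Llama layout (bias-free linear layers, one RMSNorm weight vector per norm, as many key/value heads as attention heads) and untied input and output embeddings. The function name `llama_param_count` is hypothetical.

```python
def llama_param_count(
    vocab_size: int,
    hidden_size: int,
    intermediate_size: int,
    num_hidden_layers: int,
    tie_embeddings: bool = False,
) -> int:
    """Estimate parameters of a Llama-style decoder (assumed standard layout)."""
    embed = vocab_size * hidden_size            # token embedding matrix
    attn = 4 * hidden_size * hidden_size        # q, k, v, o projections, no bias
    mlp = 3 * hidden_size * intermediate_size   # gate, up, down projections
    norms = 2 * hidden_size                     # two RMSNorm weights per layer
    per_layer = attn + mlp + norms
    final_norm = hidden_size
    # Assumption: the output head is a separate matrix unless embeddings are tied.
    lm_head = 0 if tie_embeddings else vocab_size * hidden_size
    return embed + num_hidden_layers * per_layer + final_norm + lm_head


# Values taken from the tokenizer and stage 1 flags above.
total = llama_param_count(
    vocab_size=32768,
    hidden_size=64,
    intermediate_size=256,
    num_hidden_layers=8,
)
print(f"{total:,}")  # 4,719,680
```

Under these assumptions, roughly 4.2M of the 4.7M parameters sit in the two 32768 x 64 embedding matrices, and the eight transformer layers contribute only about 0.5M. The number of attention heads does not affect the count, since it only changes how the 64-dimensional projections are split.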