---
license: llama3
language:
- en
pipeline_tag: text-generation
tags:
- pytorch
- llama
- llama-3
datasets:
- AnonymousAuthors/OSS-License-Terms
base_model:
- meta-llama/Meta-Llama-3-8B
---

# License-Llama3-8B

## Introduction

We developed License-Llama3-8B, the first large language model (LLM) specifically designed for identifying terms in open-source software (OSS) licenses. We first constructed a domain-specific dataset from 3,238 OSS licenses, then performed domain-adaptive pre-training (DAPT) and supervised fine-tuning (SFT) on the meta-llama/Meta-Llama-3-8B model.

License-Llama3-8B identifies 27 common license terms and, for each term, one of three attitudes (CAN, CANNOT, MUST). In our experiments, License-Llama3-8B achieves a precision of 92.63% and a recall of 83.89% in license term identification. In the combined task of term and attitude identification, it achieves a precision of 90.04% and a recall of 81.55%.

## Use with transformers

With transformers >= 4.42.4, you can run inference using the Transformers pipeline abstraction or the Auto classes with the `generate()` function. Make sure to update your transformers installation via `pip install --upgrade transformers`.

````python
import json

import torch
import transformers

# Definition of license terms and attitudes
Terms = {
    'Place Warranty': 'offer warranty protection (or other support), place warranty on the software licensed',
    'Add License Terms': 'provide additional license terms',
    'Add Own Notices': 'add own notices in derivative works',
    'Ask Distribution Fee': 'ask a fee to distribute a copy',
    'Combine Libraries': 'place side by side with a library (that is not an application or covered work)',
    'Copy': 'reproduce the original work in copies',
    'Distribute': 'distribute original or modified derivative works',
    'Modify': 'modify the software and create derivatives',
    'Grant Patents': 'grant rights to use copyrighted patents by the licensor, practice patent claims of contributors to the code',
    'Publicly Display': 'display the original work publicly',
    'Publicly Perform': 'perform the original work publicly',
    'Sublicense': 'incorporate the work into something that has a more restrictive license',
    'Commercial Use': 'use the software for commercial purposes',
    'Private Use': 'use or modify the software freely or privately without distributing it',
    'State Changes': 'state significant changes made to the software, cause modified files to carry prominent notices',
    'Add Statement For Additional Terms': 'place a statement of the additional terms that apply',
    'Retain Copyright Notice': 'retain the copyright notice in all copies or substantial uses of the work.',
    'Include License': 'include the full text of license(license copy) in modified software',
    'Include Notice': 'notice text needs to be distributed (if it exists) with any derivative work',
    'Offer Source Code': 'disclose your source code when you distribute the software and make the source for the library available',
    'Rename': 'the name of the derivative work must differ from original, change software name as to not misrepresent them as the original software',
    'Retain Disclaimer': 'redistributions of source code must retain disclaimer',
    'Use TradeMark': 'use contributor’s name, trademark or logo',
    'Give Credit': 'give explicit credit or acknowledgement to the author with the software',
    'Include Install Instructions': 'include build & install instructions necessary to modify and reinstall the software',
    'Liable for Damages': 'the licensor cannot be held liable for any damages arising from the use of the software',
    'Keep Same License': 'distribute the modified or derived work of the software under the terms and conditions of this license'
}

Attitudes = {
    "CAN": "Indicates that the licensee can perform the actions, commonly used expressions include: hereby grants to you, you may, you can",
    "CANNOT": "Indicates that the licensee is not allowed to perform the actions, commonly used expressions include: you may not, you can not, without, prohibit, refuse, disallow, decline, against",
    "MUST": "Indicates that the licensee must perform the actions, commonly used expressions include: you must, you should, as long as, shall, provided that, ensure that, ask that, have to"
}


# Create the prompt
def create_prompt(term_definition, attitude_definition, license_text):
    # Example of the expected JSON output, embedded at the end of the prompt
    exm = {
        "Distribute": "CAN",
        "Use": "CAN",
        "Modify": "CANNOT"
    }

    prompt = f"""### OBJECTIVE
Identify the terms and corresponding attitudes contained in the given license text based on the definition of license terms and attitudes.

### DEFINITION OF TERMS
{term_definition}

### DEFINITION OF ATTITUDES
{attitude_definition}

### LICENSE TEXT
{license_text}

### RESPONSE
Output the results in the form of JSON key-value pairs, where the key is the term name and the value is the corresponding attitude name.

### Output Format Example
```
{json.dumps(exm, indent=2)}
```
"""
    return prompt


# Load the model and create a pipeline
model_id = "AnonymousAuthors/License-Llama3-8B"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",  # place the model on the available devices (requires accelerate)
)

# An example of extracting license terms
license_text = "you may convey modified covered source (with the effect that you shall also become a licensor) provided that you: a) retain notices as required in subsection 3.2; and b) add a notice to the modified covered source stating that you have modified it, with the date and brief description of how you have modified it."
prompt = create_prompt(Terms, Attitudes, license_text)

# Stop generation at either the EOS token or the Llama 3 end-of-turn token
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=512,
    eos_token_id=terminators,
    pad_token_id=pipeline.tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.3,
    top_p=0.7,
)

# The pipeline returns the prompt followed by the completion; keep only the completion
response = outputs[0]["generated_text"][len(prompt):]
print(f"License Text: {license_text}\n")
print(f"LLM Response: {response}\n")
````

## Use with vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving.

Install vLLM with pip:

```bash
pip install vllm==0.3.1
```

Run the following command to start the vLLM server:

```bash
python -m vllm.entrypoints.openai.api_server \
    --served-model-name llama3-8b \
    --model /YOUR_LOCAL_PATH/AnonymousAuthors/License-Llama3-8B \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 1 \
    --host 0.0.0.0 \
    --port 8000
```

Then you can request the server to identify license terms:

```python
from openai import OpenAI

# The vLLM server does not check the API key, so any placeholder value works
client = OpenAI(
    api_key='EMPTY',
    base_url='http://127.0.0.1:8000/v1',
)


def license_extract(query, model_type='llama3-8b', max_tokens=2048, temperature=0.3, top_p=0.7):
    # Send a plain completion request to the OpenAI-compatible endpoint
    resp = client.completions.create(
        model=model_type,
        prompt=query,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        seed=42)
    response = resp.choices[0].text
    return response


# An example of extracting license terms
license_text = "you may convey modified covered source (with the effect that you shall also become a licensor) provided that you: a) retain notices as required in subsection 3.2; and b) add a notice to the modified covered source stating that you have modified it, with the date and brief description of how you have modified it."

# For the definitions of Terms, Attitudes, and create_prompt, please refer to the previous section
prompt = create_prompt(Terms, Attitudes, license_text)

response = license_extract(prompt, model_type='llama3-8b', max_tokens=1500, temperature=0.3, top_p=0.7)
print(f"License Text: {license_text}\n")
print(f"LLM Response: {response}\n")
```

⚠️ **NOTE**: In our repeated tests, the model performed better when vLLM was used for inference under the same inference parameters.
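
Both inference paths return the model's answer as raw text, while the prompt asks for JSON key-value pairs. The sketch below shows one possible way to turn that text into a Python dict. It is not part of the released code: the helper name `parse_license_response` and the sample string are hypothetical, and it assumes the response contains a single flat JSON object, optionally wrapped in a code fence.

```python
import json
import re

# Attitude labels from the Attitudes definition in the first section
VALID_ATTITUDES = {"CAN", "CANNOT", "MUST"}


def parse_license_response(response):
    """Hypothetical helper: return {term: attitude} from the first flat JSON
    object found in the response text, or {} if nothing parses."""
    # Non-greedy match from the first "{" to the first "}" (flat objects only)
    match = re.search(r"\{.*?\}", response, flags=re.DOTALL)
    if match is None:
        return {}
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    if not isinstance(parsed, dict):
        return {}
    # Keep only entries whose value is one of the three known attitudes
    return {
        term: attitude
        for term, attitude in parsed.items()
        if isinstance(attitude, str) and attitude in VALID_ATTITUDES
    }


# Hand-written sample for illustration only; this is not real model output
fence = "`" * 3
sample = f'{fence}\n{{\n  "Distribute": "CAN",\n  "State Changes": "MUST"\n}}\n{fence}'
print(parse_license_response(sample))
# → {'Distribute': 'CAN', 'State Changes': 'MUST'}
```

Filtering on the three attitude labels drops malformed entries instead of raising, which may or may not suit your use case; a stricter caller could compare the parsed keys against `Terms` as well.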