kistepAI/SPARK-Summarization

1. Description

SPARK-Summarization is a large language model developed by the Korea Institute of S&T Evaluation and Planning (KISTEP). It specializes in summarization and uses Chain of Density (CoD) reasoning to produce dense, concise summaries in both Korean and English.

2. Key Features

Enhanced Summarization through CoD: Uses the Chain of Density approach, which starts from an initial summary and folds in additional entities over successive passes, so the output stays comprehensive yet concise.
Multilingual Support: Processes documents and generates summaries in both Korean and English.
Structured Output: Returns summaries as a list of points for readability and quick comprehension.
Base Model: Built on Mistral-Nemo as the foundation model.
Training Method: Trained with Supervised Fine-Tuning (SFT).
Context Length: The maximum context length of the training data is 16,384 tokens.
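
The 16,384 figure describes the training data, not a hard limit of the model. A minimal sketch of a length check, assuming you choose to treat that figure as a practical budget for prompt plus generated tokens. The helper names `fits_context` and `truncate_to_budget` are hypothetical and not part of the model or of any library:

```python
# Training-data context length stated in the feature list (assumed here to be
# a sensible upper bound for prompt + generated tokens).
MAX_CONTEXT_TOKENS = 16_384


def fits_context(num_prompt_tokens: int, max_new_tokens: int = 512,
                 limit: int = MAX_CONTEXT_TOKENS) -> bool:
    """Return True if the prompt plus the generation budget stays within the limit."""
    return num_prompt_tokens + max_new_tokens <= limit


def truncate_to_budget(token_ids: list, max_new_tokens: int = 512,
                       limit: int = MAX_CONTEXT_TOKENS) -> list:
    """Keep only as many leading tokens as leave room for max_new_tokens."""
    budget = max(limit - max_new_tokens, 0)
    return token_ids[:budget]
```

In practice `num_prompt_tokens` would be the length of the tokenized prompt, for example `input_ids.shape[-1]` after applying the chat template.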

3. Data

Source: KISTEP Documents
Count: 24,417

4. Usage

When using Ollama, you can use the Modelfile.
Python code:

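import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "kistepAI/SPARK-Summarization"

# Load the tokenizer and the model weights in bfloat16.
# device_map="auto" requires the accelerate package.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Switch to inference mode (disables dropout).
model.eval()

# A minimal chat turn; the content is Korean for "Hello."
messages = [
    {"role": "user", "content": "안녕하세요."}
]

# Render the chat template and append the assistant generation prompt.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Stop generation on the tokenizer's EOS token or on "</s>".
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("</s>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.3,
    top_p=0.95,
)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))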

Recommended Prompt Template (input: {TITLE}, {DOCUMENT})

prompt_template: |
    당신은 요약 전문가입니다. 주어진 텍스트를 참고하여 요약을 작성하세요.
    
    ## 요약 단계:
    1. 텍스트 분석:
        - 문서 제목과 텍스트를 주의 깊게 읽고, 문서의 주요 주제를 파악하세요.
    2. 주요 주장(key_argument) 식별:
        - 다음 질문에 답변하기: "이 텍스트의 주요 주장 또는 핵심 논점은 무엇인가?"
    3. 주요 개체(entities) 추출: 
        - 5단어 이하의 주요 개체 3개를 뽑아주세요.
    4. 요약문의 주제(title) 생성: 
        - 제공된 텍스트에 대한 간결한 한문장의 주제를 생성하세요.
    5. 요약(summary) 작성: 
        - 주요 주장과 주요 개체, 주제를 참고하여 텍스트의 주요 내용을 요약하세요.
        
    ## 향상 단계
    6. 밀도 향상:
        - 초기 요약에 포함되지 않은 1~3개의 추가 설명 개체를 식별하세요.
        - 이전 및 새 개체를 모두 통합하여 요약의 밀도가 높은 버전을 작성하세요.
    7. 중요도 평가:
        - 이전 요약에서 필수적인 부분을 강조하고 덜 중요한 부분을 줄여서 수정하세요.
        - 새 요약이 주요 주장과 밀접하게 일치하는지 확인하세요.
    8. 유창성 향상:
        - 문법, 단어 선택, 표현을 다듬어 가독성과 자연스러운 흐름을 향상시키세요.
        - 요약 세부내용의 정확성과 완전성을 유지하면서 문장 구조를 개선하세요.
    
    ## 작성 방식:
        - 문서를 소개하는 대신 요약 내용만 작성하세요.
        - 구체적인 데이터나 수치보다는 전체 흐름과 방향을 설명하세요.
        - 주어진 내용에만 기반해 객관적으로 작성하세요.
        - 한국어로 작성하되, 영어 기술 용어와 고유 명사는 그대로 사용하세요.
    
    
    ## 입력:
    ### 문서 제목:
    {TITLE}
    ### 텍스트:
    {DOCUMENT}
    ## 출력 형식:
    <reason>
    초기 주요 주장: [초기 주요 주장]
    초기 주요 개체: [초기 주요 개체 목록]
    초기 제목: [초기 제목]
    초기 요약: [초기 요약 내용]
    
    밀도 향상 단계:
    새로 추가된 주요 개체: [새로 추가된 주요 개체 목록(with bullet points)]
    사고 과정: [주요 개체 선택 및 요약 작성에 대한 설명]
    업데이트 제목: [업데이트 제목]
    업데이트 요약: [업데이트 요약 내용]
    
    중요도 평가 단계:
    사고 과정: [요약 관련성 향상을 위한 중요도 평가 및 변경된 사항에 대한 설명]
    업데이트 제목: [업데이트 제목]
    업데이트 요약: [업데이트 요약 내용]
    
    언어 유창성 단계:
    사고 과정: [언어 명확성과 유창성을 개선하기 위해 변경된 사항에 대한 설명]
    업데이트 제목: [업데이트 제목]
    Updated Summary: [요약의 각 문장 목록(with bullet points)]
    </reason>
    
    <output>
        <key_argument>[주요 주장(한국어)]</key_argument>
        <entities>[주요 개체 목록, 쉼표로 구분]</entities>
        <title>[주제(한국어)]</title>
        <summary>
            <point>[첫번째 요약 문장(한국어)]</point>
            <point>[두번째 요약 문장(한국어)]</point>
            ...
        </summary>
    </output>
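
The template takes two inputs, {TITLE} and {DOCUMENT}, and asks for the final result inside an `<output>` block. A minimal sketch of filling the placeholders and reading the fields back, assuming the model follows the output format exactly. `build_prompt`, `parse_output`, the shortened `TEMPLATE`, and the sample response are all hypothetical illustrations, not part of the model repository:

```python
import re

# Shortened stand-in for the recommended template; paste the full text in practice.
TEMPLATE = "## 입력:\n### 문서 제목:\n{TITLE}\n### 텍스트:\n{DOCUMENT}\n"


def build_prompt(template: str, title: str, document: str) -> str:
    """Substitute the {TITLE} and {DOCUMENT} placeholders."""
    # str.replace is used instead of str.format so that stray braces in the
    # template cannot raise a formatting error.
    return template.replace("{TITLE}", title).replace("{DOCUMENT}", document)


def parse_output(response: str) -> dict:
    """Extract the fields of the <output> block from a model response."""
    # Restrict the search to the <output> block; fall back to the whole text.
    block = re.search(r"<output>(.*?)</output>", response, re.DOTALL)
    text = block.group(1) if block else response

    def field(tag: str) -> str:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        return match.group(1).strip() if match else ""

    return {
        "key_argument": field("key_argument"),
        # Entities are comma-separated according to the template.
        "entities": [e.strip() for e in field("entities").split(",") if e.strip()],
        "title": field("title"),
        "summary": [p.strip() for p in re.findall(r"<point>(.*?)</point>", text, re.DOTALL)],
    }


# Hand-written sample response for illustration only, not real model output.
SAMPLE_RESPONSE = """<reason>
(reasoning steps omitted)
</reason>

<output>
    <key_argument>Example argument</key_argument>
    <entities>KISTEP, SPARK, CoD</entities>
    <title>Example title</title>
    <summary>
        <point>First point.</point>
        <point>Second point.</point>
    </summary>
</output>"""

prompt = build_prompt(TEMPLATE, "Example title", "Example document text.")
messages = [{"role": "user", "content": prompt}]  # pass to apply_chat_template
parsed = parse_output(SAMPLE_RESPONSE)
print(parsed["title"])
print(parsed["summary"])
```

Because the `<reason>` section comes before `<output>` and can be long, `max_new_tokens` may need to be larger than the 512 used in the greeting example for the closing tag to be reached.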

5. Benchmark

TBD
