Model weights release?

#1
by rbhatia46 - opened

Hi,
I had tried KaLM-embedding-multilingual-mini-v1 before this, and it would be great to give the instruct-finetuned model a try. When will you be releasing the model weights for this one?

Thanks

HITsz-Text Machine Group org


1. Instruct

The basic query instruct format for asymmetric tasks is as follows (symmetric tasks such as STS use no instruct or prompt):

Instruct: {instruct} \n Query: {query}
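As a sketch, the template above can be applied in Python. The exact whitespace around the \n separator is an assumption read off the template; check the model card for the canonical string.

```python
# Minimal sketch of the query-instruct format for asymmetric tasks.
# The spacing around "\n" is an assumption based on the template above.
def format_query(instruct: str, query: str) -> str:
    """Prefix an asymmetric-task query with its instruct."""
    return f"Instruct: {instruct} \n Query: {query}"

print(format_query(
    "Given a query, retrieve documents that answer the query.",
    "how do I reset my password",
))
```

Passages (documents) are embedded as-is, without any prefix.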

1.1 Detailed instruct for testing

(1) Retrieval & Reranking

Given a query, retrieve documents that answer the query.

This instruct applies only to the HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1 model; for the HIT-TMG/KaLM-embedding-multilingual-mini-v1 model, no instruct is needed for Retrieval and Reranking tasks.

(2) Classification

These follow Alibaba-NLP/gte-Qwen2-7B-instruct, with slight changes:

{
            'AmazonCounterfactualClassification': 'Given an Amazon review, judge whether it is counterfactual.',
            'AmazonPolarityClassification': 'Classifying Amazon reviews into positive or negative sentiment',
            'AmazonReviewsClassification': 'Classifying the given Amazon review into its appropriate rating category',
            'Banking77Classification': 'Given a online banking query, find the corresponding intents',
            'EmotionClassification': 'Classifying the emotion expressed in the given Twitter message into one of the six emotions: anger, fear, joy, love, sadness, and surprise',
            'ImdbClassification': 'Classifying the sentiment expressed in the given movie review text from the IMDB dataset',
            'MassiveIntentClassification': 'Given a user utterance as query, find the user intents',
            'MassiveScenarioClassification': 'Given a user utterance as query, find the user scenarios',
            'MTOPDomainClassification': 'Classifying the intent domain of the given utterance in task-oriented conversation',
            'MTOPIntentClassification': 'Classifying the intent of the given utterance in task-oriented conversation',
            'ToxicConversationsClassification': 'Classifying the given comments as either toxic or not toxic',
            'TweetSentimentExtractionClassification': 'Classifying the sentiment of a given tweet as either positive, negative, or neutral',
            # C-MTEB eval instructions
            'TNews': 'Categorizing the given news title',
            'IFlyTek': 'Given an App description text, find the appropriate fine-grained category',
            'MultilingualSentiment': 'Classifying sentiment of the customer review into positive, or negative',
            'JDReview': 'Classifying sentiment of the customer review for iPhone into positive or negative',
            'OnlineShopping': 'Classifying sentiment of the customer review into positive or negative',
            'Waimai': 'Classify the customer review from a food takeaway platform into positive or negative',
            # MTEB-fr eval instructions
            'MasakhaNEWSClassification': 'Classifying the category of french news.',
            # MTEB-pl eval instructions
            "CBD": "Classifying the sentiment of polish tweet reviews",
            "PolEmo2.0-IN": "Classifying the sentiment of in-domain (medicine and hotels) online reviews",
            "PolEmo2.0-OUT": "Classifying the sentiment of out-of-domain (products and school) online reviews",
            "AllegroReviews": "Classifying the sentiment of reviews from e-commerce marketplace Allegro",
            "PAC": "Classifying the sentence into one of the two types: \"BEZPIECZNE_POSTANOWIENIE_UMOWNE\" and \"KLAUZULA_ABUZYWNA\"",
            # MTEB-ru eval instructions
            "GeoreviewClassification": "Classifying the sentiment of Russian reviews.",
            "HeadlineClassification": "Classifying the topic of Russian headlines.",
            "InappropriatenessClassification": "Detecting inappropriate messages on sensitive topics",
            "KinopoiskClassification": "Classifying the sentiment of Kinopoisk reviews.",
            "RuReviewsClassification": "Classifying the sentiment of Russian product reviews.",
            "RuSciBenchGRNTIClassification": "Classifying the topic of Russian scientific papers.",
            "RuSciBenchOECDClassification": "Classifying the topic of Russian scientific papers.",
            "CEDRClassification": "Classification of sentences by emotions.",
            "SensitiveTopicsClassification": "Detecting inappropriate messages on sensitive topics.",
        }

(3) Clustering

{
            'ArxivClusteringP2P': 'Identify the main and secondary category of Arxiv papers based on the titles and abstracts',
            'ArxivClusteringS2S': 'Identify the main and secondary category of Arxiv papers based on the titles',
            'BiorxivClusteringP2P': 'Identify the main category of Biorxiv papers based on the titles and abstracts',
            'BiorxivClusteringS2S': 'Identify the main category of Biorxiv papers based on the titles',
            'MedrxivClusteringP2P': 'Identify the main category of Medrxiv papers based on the titles and abstracts',
            'MedrxivClusteringS2S': 'Identify the main category of Medrxiv papers based on the titles',
            'RedditClustering': 'Identify the topic or theme of Reddit posts based on the titles',
            'RedditClusteringP2P': 'Identify the topic or theme of Reddit posts based on the titles and posts',
            'StackExchangeClustering': 'Identify the topic or theme of StackExchange posts based on the titles',
            'StackExchangeClusteringP2P': 'Identify the topic or theme of StackExchange posts based on the given paragraphs',
            'TwentyNewsgroupsClustering': 'Identify the topic or theme of the given news articles',
            # C-MTEB eval instructions
            'CLSClusteringS2S': 'Identify the main category of scholar papers based on the titles',
            'CLSClusteringP2P': 'Identify the main category of scholar papers based on the titles and abstracts',
            'ThuNewsClusteringS2S': 'Identify the topic or theme of the given news articles based on the titles',
            'ThuNewsClusteringP2P': 'Identify the topic or theme of the given news articles based on the titles and contents',
            # MTEB-fr eval instructions
            "AlloProfClusteringP2P": "Identify the main category of Allo Prof document based on the titles and descriptions",
            "AlloProfClusteringS2S": "Identify the main category of Allo Prof document based on the titles",
            "HALClusteringS2S": "Identify the main category of academic passage based on the titles and contents",
            "MasakhaNEWSClusteringP2P": "Identify the topic or theme of the given news articles based on the titles and contents",
            "MasakhaNEWSClusteringS2S": "Identify the topic or theme of the given news articles based on the titles",
            "MLSUMClusteringP2P": "Identify the topic or theme of the given articles based on the titles and contents",
            "MLSUMClusteringS2S": "Identify the topic or theme of the given articles based on the titles",
            # MTEB-pl eval instructions
            "EightTagsClustering": "Identify the category of headlines from social media posts in Polish into 8 categories: film, history, food, medicine, motorization, work, sport and technology",
            # MTEB-ru eval instructions
            "GeoreviewClusteringP2P": "Identify the topic or theme of the Russian reviews.",
            "RuSciBenchGRNTIClusteringP2P": "Identify the topic or theme of the Russian articles.",
            "RuSciBenchOECDClusteringP2P": "Identify the topic or theme of the Russian articles.",
        }
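Putting the tables together, query construction can be sketched as a simple lookup: asymmetric tasks get their instruct prefixed, while symmetric tasks (e.g. STS) embed the raw text. The helper below is hypothetical and reproduces only a small subset of the instructs above.

```python
# Hypothetical helper: pick the instruct for a task name and build the
# final query text. Only a subset of the instruct tables is shown here.
TASK_INSTRUCTS = {
    "TNews": "Categorizing the given news title",
    "RedditClustering": "Identify the topic or theme of Reddit posts based on the titles",
}

def build_query(task_name: str, query: str) -> str:
    instruct = TASK_INSTRUCTS.get(task_name)
    if instruct is None:  # symmetric task (or unlisted): embed the raw text
        return query
    return f"Instruct: {instruct} \n Query: {query}"

print(build_query("TNews", "some headline"))
print(build_query("STS12", "a sentence to compare"))
```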

1.2 Detailed instruct for fine-tuning

You can follow the test instructs above for fine-tuning. It is better to add more diverse instruct data for training, generated via an LLM; you can refer to the paper Improving Text Embeddings with Large Language Models.

2. Model weights

We are releasing the model weights now, so you will be able to download them soon.

We are continuously optimizing our model and releasing new versions. The specific technical details will be disclosed at an appropriate time. If you have any further questions, please feel free to ask.
