metadata
language:
  - en
tags:
  - roberta
  - marketing mix
  - multi-label
  - classification
  - microblog
  - tweets
widget:
  - text: >-
      Best cushioning ever!!! 🤗🤗🤗 my zoom vomeros are the bomb🏃🏽‍♀️💨!!!
      @nike #run #training
  - text: >-
      Why is @BestBuy always sold-out of Apple's new airpods in their online
      shop 🤯😡?
  - text: They’re closing the @Aldo at the Lehigh Vally Mall and KOP 😭
  - text: >-
      @Sony’s XM3’s ain’t as sweet as my bro’s airpod pros but got a real steal
      🤑 the other day #deal #headphonez
  - text: >-
      Nike needs to sponsor more e-sports atheletes with Air Jordans! #nike
      #esports
  - text: >-
      Say what you want about @Abercrombie's 90s shirtless males ads, they made
      dang good woll sweaters back in the day. This is one of 3 I have from the
      late 90s.
  - text: >-
      To celebrate this New Year, @Nordstrom is DOUBLING all donations up to
      $25,000! 🎉 Your donation will help us answer 2X the calls, texts, and
      chats that come in, and allow us to train 2X more volunteers!
  - text: >-
      It's inspiring to see religious leaders speaking up for workers' rights
      and fair wages. Every voice matters in the #FightFor15! 💪🏽✊🏼
      #Solidarity #WorkersRights

Model Card for: mmx_classifier_microblog_ENv02

Multi-label classifier that identifies which marketing mix variable(s) a microblog post pertains to.

Version: 0.2 from August 16, 2023

Model Details

You can use this classifier to determine which of the 4P's of marketing, also known as marketing mix variables, a microblog post (e.g., Tweet) pertains to:

  1. Product
  2. Place
  3. Price
  4. Promotion

Model Description

This classifier is a fine-tuned checkpoint of [cardiffnlp/twitter-roberta-large-2022-154m](https://huggingface.co/cardiffnlp/twitter-roberta-large-2022-154m). It was trained on 15K Tweets that mentioned at least one of 699 brands. The Tweets were first cleaned and then labeled using OpenAI's GPT-4.

Because this is a multi-label classification problem, fine-tuning uses binary cross-entropy (BCE) with logits loss, which combines a sigmoid layer and BCELoss in a single class. The model therefore outputs raw logits: to obtain the probability for each label (i.e., marketing mix variable), you need to pass the predictions through a sigmoid function. The accompanying Python notebook already does this.
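To make the loss and the sigmoid step concrete, here is a minimal pure-Python sketch using only the standard library. It is not the training code: the logits and targets are made-up numbers, the function names are illustrative, and the label order is an assumption (the real order comes from `model.config.id2label`).

```python
import math

LABELS = ["Product", "Place", "Price", "Promotion"]  # assumed order, for illustration only


def sigmoid(x):
    """Map a raw logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed directly from logits.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)),
    which equals -[y*log(sigmoid(x)) + (1 - y)*log(1 - sigmoid(x))].
    """
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


def naive_bce(logits, targets):
    """Same loss via an explicit sigmoid followed by BCE (less stable numerically)."""
    probs = [sigmoid(x) for x in logits]
    losses = [
        -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
        for p, y in zip(probs, targets)
    ]
    return sum(losses) / len(losses)


logits = [2.0, -1.5, 0.3, -3.0]  # made-up model outputs for one tweet
targets = [1.0, 0.0, 1.0, 0.0]   # made-up gold labels: Product and Price

# Each label gets its own probability; they do not have to sum to 1.
probs = [sigmoid(x) for x in logits]
predicted = [label for label, p in zip(LABELS, probs) if p >= 0.5]
print(predicted)  # ['Product', 'Price']
```

Unlike a softmax, the sigmoid scores every label independently, so a single tweet can be assigned several marketing mix variables or none at all.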

IMPORTANT: At the time of writing this description, Hugging Face's pipeline did not support multi-label classifiers.

Working Paper

Download the working paper from SSRN: "Creating Synthetic Experts with Generative AI"

Quickstart

# Imports
import re, warnings, torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from bs4 import BeautifulSoup
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')
# Helper Functions
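def clean_and_parse_tweet(tweet):
    """Return the cleaned tweet text, or None if nothing usable remains."""
    # Replace links with a URL placeholder, then strip any HTML markup.
    tweet = re.sub(r"https?://\S+|www\.\S+", " URL ", tweet)
    soup = BeautifulSoup(tweet, "html.parser")
    parsed = None if "filename" in str(soup) else soup.get_text()
    if not parsed:
        return None
    # Flatten newlines, drop leading dots/colons, and collapse repeated spaces.
    text = re.sub(r"^[.:]+", "", re.sub(r"\\n+|\n+", " ", parsed)).strip()
    return re.sub(r" +", " ", text)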
def predict_tweet(tweet, model, tokenizer, device, threshold=0.5):
    inputs = tokenizer(tweet, return_tensors="pt", padding=True, truncation=True, max_length=128).to(device)
    probs = torch.sigmoid(model(**inputs).logits).detach().cpu().numpy()[0]
    return probs, [id2label[i] for i, p in enumerate(probs) if id2label[i] in {'Product', 'Place', 'Price', 'Promotion'} and p >= threshold]
# Setup
device = "mps" if torch.backends.mps.is_built() and torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu"
synxp = "dmr76/mmx_classifier_microblog_ENv02"
model = AutoModelForSequenceClassification.from_pretrained(synxp).to(device)
tokenizer = AutoTokenizer.from_pretrained(synxp)
id2label = model.config.id2label
# ---->>> Define your Tweet  <<<----
tweet = "Best cushioning ever!!! 🤗🤗🤗  my zoom vomeros are the bomb🏃🏽‍♀️💨!!!  \n @nike #run #training https://randomurl.ai"
# Clean and Predict
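cleaned_tweet = clean_and_parse_tweet(tweet)
if cleaned_tweet is None:  # cleaning returns None when no usable text remains
    raise ValueError("Tweet is empty after cleaning; nothing to classify.")
probs, labels = predict_tweet(cleaned_tweet, model, tokenizer, device)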
# Print Labels and Probabilities
print("Please don't forget to cite the paper: https://ssrn.com/abstract=4542949 if you use this code")
print(labels, probs)

To classify thousands of tweets conveniently, use the batch-processing Python notebook available in my GitHub repository.
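The notebook itself is not reproduced here. As a rough illustration of the batching idea only, the sketch below splits a list of tweets into fixed-size chunks and converts per-label probabilities into label lists. The helper names (`chunked`, `labels_from_probs`), the batch size, the label order, and the probabilities are all assumptions made for this example. In practice, each chunk would be tokenized with padding and passed through the model and a sigmoid, as in the Quickstart.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def labels_from_probs(prob_rows, labels, threshold=0.5):
    """For each row of probabilities, keep the labels at or above the threshold."""
    return [
        [label for label, p in zip(labels, row) if p >= threshold]
        for row in prob_rows
    ]


labels = ["Product", "Place", "Price", "Promotion"]  # assumed order, for illustration
tweets = [f"tweet {i}" for i in range(5)]            # placeholder texts

batches = list(chunked(tweets, batch_size=2))
print([len(batch) for batch in batches])  # [2, 2, 1]

# Made-up probabilities standing in for sigmoid(model(**inputs).logits)
fake_probs = [[0.91, 0.10, 0.62, 0.05], [0.20, 0.75, 0.30, 0.49]]
print(labels_from_probs(fake_probs, labels))  # [['Product', 'Price'], ['Place']]
```

Processing tweets in chunks keeps memory use bounded, and the batch size can be tuned to the available device.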

Citation

Please cite the following reference if you use synthetic experts in your work:

Ringel, Daniel, Creating Synthetic Experts with Generative Artificial Intelligence (July 15, 2023). Available at SSRN: https://ssrn.com/abstract=4542949

Additional Resources

www.synthetic-experts.ai
GitHub Repository