metadata
language:
  - en
tags:
  - roberta
  - marketing mix
  - multi-label
  - classification
  - microblog
  - tweets
widget:
  - text: >-
      Best cushioning ever!!! 🤗🤗🤗 my zoom vomeros are the bomb🏃🏽‍♀️💨!!!
      @nike #run #training
  - text: >-
      Why is @BestBuy always sold-out of Apple's new airpods in their online
      shop 🤯😡?
  - text: They’re closing the @Aldo at the Lehigh Vally Mall and KOP 😭
  - text: >-
      @Sony’s XM3’s ain’t as sweet as my bro’s airpod pros but got a real steal
      🤑 the other day #deal #headphonez
  - text: >-
      Nike needs to sponsor more e-sports atheletes with Air Jordans! #nike
      #esports
  - text: >-
      Say what you want about @Abercrombie's 90s shirtless males ads, they made
      dang good woll sweaters back in the day. This is one of 3 I have from the
      late 90s.
  - text: >-
      To celebrate this New Year, @Nordstrom is DOUBLING all donations up to
      $25,000! 🎉 Your donation will help us answer 2X the calls, texts, and
      chats that come in, and allow us to train 2X more volunteers!
  - text: >-
      It's inspiring to see religious leaders speaking up for workers' rights
      and fair wages. Every voice matters in the #FightFor15! 💪🏽✊🏼
      #Solidarity #WorkersRights

Model Card for: mmx_classifier_microblog_ENv02

Multi-label classifier that identifies which marketing mix variable(s) a microblog post pertains to.

Version: 0.2 from August 16, 2023

Model Details

You can use this classifier to determine which of the 4 Ps of marketing, also known as marketing mix variables, a microblog post (e.g., a Tweet) pertains to:

  1. Product
  2. Place
  3. Price
  4. Promotion

Model Description

This classifier is a fine-tuned checkpoint of [cardiffnlp/twitter-roberta-large-2022-154m](https://huggingface.co/cardiffnlp/twitter-roberta-large-2022-154m). It was trained on 15K Tweets, each mentioning at least one of 699 brands. The Tweets were first cleaned and then labeled using OpenAI's GPT-4.

Because this is a multi-label classification problem, we use binary cross-entropy (BCE) with logits loss for the fine-tuning. This loss combines a sigmoid layer and BCE loss in a single class, so the model itself outputs raw logits. To obtain the probability for each label (i.e., marketing mix variable), you need to pass the predictions through a sigmoid function. The accompanying Python notebook already does this.
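As a rough illustration of the paragraph above, the following sketch computes BCE with logits and converts logits into per-label probabilities using only the standard library. The label order in `LABELS`, the example logits, and the helper names are assumptions made for illustration. The real label order comes from the model's `id2label` mapping, and this is not the training code.

```python
import math

# Assumed label order, for illustration only; read the real one from model.config.id2label.
LABELS = ["Product", "Place", "Price", "Promotion"]

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from logits.

    Uses the stable form max(x, 0) - x*y + log(1 + exp(-|x|)),
    which equals -[y*log(s(x)) + (1-y)*log(1-s(x))] with s = sigmoid.
    """
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)

def labels_from_logits(logits, threshold=0.5):
    """Apply a sigmoid per label and keep labels at or above the threshold."""
    probs = [sigmoid(x) for x in logits]
    return probs, [name for name, p in zip(LABELS, probs) if p >= threshold]

# Made-up logits for a single post.
logits = [2.0, -1.5, 0.3, -3.0]
probs, labels = labels_from_logits(logits)
print(labels)  # ['Product', 'Price']
print(round(bce_with_logits(logits, [1.0, 0.0, 1.0, 0.0]), 4))
```

Unlike a softmax, the sigmoid scores each label independently, so a post can be assigned several marketing mix variables or none at all.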

IMPORTANT: At the time of writing this description, Hugging Face's pipeline did not support multi-label classifiers.

Working Paper

Download the working paper from SSRN: "Creating Synthetic Experts with Generative AI"

Quickstart

# Imports
import re
import warnings
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from bs4 import BeautifulSoup
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

# Helper Functions
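def clean_and_parse_tweet(tweet):
    """Return the cleaned tweet text, or None if nothing usable remains."""
    # Replace links with a placeholder token.
    tweet = re.sub(r"https?://\S+|www\.\S+", " URL ", tweet)
    soup = BeautifulSoup(tweet, "html.parser")
    # Skip tweets whose parsed markup contains "filename" (behavior kept from the original).
    if "filename" in str(soup):
        return None
    parsed = soup.get_text()
    if not parsed:
        return None
    parsed = re.sub(r"\\n+|\n+", " ", parsed)        # literal "\n" sequences and real newlines
    parsed = re.sub(r"^[.:]+", "", parsed).strip()   # leading dots and colons
    return re.sub(r" +", " ", parsed)                # collapse repeated spaces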

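def predict_tweet(tweet, model, tokenizer, device, threshold=0.5):
    """Return per-label probabilities and the 4P labels at or above the threshold."""
    id2label = model.config.id2label  # read from the model instead of relying on a global
    inputs = tokenizer(tweet, return_tensors="pt", padding=True, truncation=True, max_length=128).to(device)
    with torch.no_grad():  # inference only, so no gradients are needed
        logits = model(**inputs).logits
    # Sigmoid rather than softmax: each label is scored independently.
    probs = torch.sigmoid(logits).cpu().numpy()[0]
    mmx = {"Product", "Place", "Price", "Promotion"}
    return probs, [id2label[i] for i, p in enumerate(probs) if id2label[i] in mmx and p >= threshold]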

# Setup
device = "mps" if torch.backends.mps.is_built() and torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu"
synxp = "dmr76/mmx_classifier_microblog_ENv02"
model = AutoModelForSequenceClassification.from_pretrained(synxp).to(device)
tokenizer = AutoTokenizer.from_pretrained(synxp)
id2label = model.config.id2label

# ---->>> Define your Tweet  <<<----
tweet = "Best cushioning ever!!! 🤗🤗🤗  my zoom vomeros are the bomb🏃🏽‍♀️💨!!!  \n @nike #run #training https://randomurl.ai"

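# Clean and Predict
cleaned_tweet = clean_and_parse_tweet(tweet)
if cleaned_tweet is None:  # the cleaner returns None when no usable text remains
    raise ValueError("Tweet is empty after cleaning; nothing to classify.")
probs, labels = predict_tweet(cleaned_tweet, model, tokenizer, device)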

# Print Labels and Probabilities
print("Please don't forget to cite the paper: https://ssrn.com/abstract=4542949")
print(labels, probs)
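
The printed `probs` array is positional, which makes it hard to read at a glance. A small helper like the one below (hypothetical, not part of the original quickstart) pairs each probability with its label name and sorts by score. The `example_id2label` mapping and the probabilities are made up for the demonstration. In practice you would pass `model.config.id2label` and the `probs` returned by `predict_tweet`.

```python
def label_probabilities(probs, id2label, ndigits=3):
    """Map label names to rounded probabilities, highest first."""
    pairs = [(id2label[i], round(float(p), ndigits)) for i, p in enumerate(probs)]
    return dict(sorted(pairs, key=lambda pair: pair[1], reverse=True))

# Made-up values for illustration only.
example_id2label = {0: "Product", 1: "Place", 2: "Price", 3: "Promotion"}
example_probs = [0.97, 0.02, 0.08, 0.11]

ranked = label_probabilities(example_probs, example_id2label)
for label, prob in ranked.items():
    print(f"{label:<10} {prob:.3f}")
```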

To classify thousands of Tweets, use the batch-processing Python notebook available in my GitHub repository.
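
The notebook itself is not reproduced here. As a minimal sketch of the general idea, assuming you wrap the model call in a function of your own, tweets can be scored in fixed-size chunks so that memory use stays bounded. `chunked`, `predict_in_batches`, and `score_batch` are hypothetical names, and the stand-in scorer below returns constant values so that the example runs without downloading the model.

```python
from typing import Callable, Iterator, List, Sequence

def chunked(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def predict_in_batches(
    tweets: Sequence[str],
    score_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 32,
) -> List[List[float]]:
    """Score tweets chunk by chunk and return one probability list per tweet."""
    results: List[List[float]] = []
    for batch in chunked(tweets, batch_size):
        results.extend(score_batch(batch))
    return results

def fake_scorer(batch: Sequence[str]) -> List[List[float]]:
    """Stand-in for the real model call; returns constant made-up scores."""
    return [[0.5, 0.5, 0.5, 0.5] for _ in batch]

tweets = [f"tweet number {i}" for i in range(5)]
scores = predict_in_batches(tweets, fake_scorer, batch_size=2)
print(len(scores))  # 5
```

In a real `score_batch`, you would pass the list of cleaned tweets to the tokenizer with `padding=True` and `truncation=True`, then apply `torch.sigmoid` to the logits as in the quickstart.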

Citation

Please cite the following reference if you use synthetic experts in your work:

Ringel, Daniel, Creating Synthetic Experts with Generative Artificial Intelligence (July 15, 2023). Available at SSRN: https://ssrn.com/abstract=4542949

Additional Resources

www.synthetic-experts.ai
GitHub Repository