{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Foong_Coding Challenge for Fatima Fellowship", "provenance": [], "collapsed_sections": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "eBpjBBZc6IvA" }, "source": [ "# Fatima Fellowship Quick Coding Challenge (Pick 1)\n", "\n", "Thank you for applying to the Fatima Fellowship. To help us select the Fellows and assess your ability to do machine learning research, we are asking that you complete a short coding challenge. Please pick **1 of these 5** coding challenges, whichever is most aligned with your interests. \n", "\n", "**Due date: 1 week**\n", "\n", "**How to submit**: Please make a copy of this colab notebook, add your code and results, and submit your colab notebook to the submission link below. If you have never used a colab notebook, [check out this video](https://www.youtube.com/watch?v=i-HnvsehuSw).\n", "\n", "**Submission link**: https://airtable.com/shrXy3QKSsO2yALd3" ] }, { "cell_type": "markdown", "metadata": { "id": "sFU9LTOyMiMj" }, "source": [ "# 2. Deep Learning for NLP\n", "\n", "**Fake news classifier**: Train a text classification model to detect fake news articles!\n", "\n", "* Download the dataset here: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset\n", "* Develop an NLP model for classification that uses a pretrained language model\n", "* Fine-tune your model on the dataset, and generate an AUC curve of your model on the test set of your choice. \n", "* [Upload the model to the Hugging Face Hub](https://huggingface.co./docs/hub/adding-a-model), and add a link to your model below.\n", "* *Answer the following question*: Look at some of the news articles that were classified incorrectly. 
Please explain what you might do to improve your model's performance on these news articles in the future (you do not need to implement these suggestions)" ] }, { "cell_type": "code", "source": [ "### WRITE YOUR CODE TO TRAIN THE MODEL HERE\n", "import numpy as np\n", "import pandas as pd\n", "import csv\n", "from sklearn.metrics import accuracy_score, precision_recall_fscore_support\n", "\n" ], "metadata": { "id": "E90i018KyJH3" }, "execution_count": 1, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Data Loading" ], "metadata": { "id": "HUDOBz2tRivY" } }, { "cell_type": "code", "source": [ "# on_bad_lines='skip' silently drops rows that fail to parse,\n", "# so the row counts below may be lower than in the raw CSV files.\n", "real_news = pd.read_csv(\"True.csv\", sep=',', engine='python', encoding='utf8', on_bad_lines='skip')\n", "fake_news = pd.read_csv(\"Fake.csv\", sep=',', engine='python', encoding='utf8', on_bad_lines='skip')\n", "\n", "print(\"real_news: \" + str(real_news.shape))\n", "print(\"fake_news: \" + str(fake_news.shape))" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "d60sCvRjOSWa", "outputId": "99813f74-971d-41e2-8597-4913ca131fe1" }, "execution_count": 2, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "real_news: (21417, 4)\n", "fake_news: (14568, 4)\n" ] } ] }, { "cell_type": "code", "source": [ "fake_news.head()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "ywYW2xTuOVGy", "outputId": "2e442a61-4634-4965-a6f7-822896f45dbb" }, "execution_count": 3, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " title \\\n", "0 Donald Trump Sends Out Embarrassing New Year’... \n", "1 Drunk Bragging Trump Staffer Started Russian ... \n", "2 Sheriff David Clarke Becomes An Internet Joke... \n", "3 Trump Is So Obsessed He Even Has Obama’s Name... \n", "4 Pope Francis Just Called Out Donald Trump Dur... \n", "\n", " text subject \\\n", "0 Donald Trump just couldn t wish all Americans ... News \n", "1 House Intelligence Committee Chairman Devin Nu... 
News \n", "2 On Friday, it was revealed that former Milwauk... News \n", "3 On Christmas day, Donald Trump announced that ... News \n", "4 Pope Francis used his annual Christmas Day mes... News \n", "\n", " date \n", "0 December 31, 2017 \n", "1 December 31, 2017 \n", "2 December 30, 2017 \n", "3 December 29, 2017 \n", "4 December 25, 2017 " ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 3 } ] }, { "cell_type": "markdown", "source": [ "## Add labeling" ], "metadata": { "id": "ZghmfpC2SIVC" } }, { "cell_type": "code", "source": [ "fake_news['label'] = 0 \n", "real_news['label'] = 1" ], "metadata": { "id": "rZ8pF-RtSJ6_" }, "execution_count": 4, "outputs": [] }, { "cell_type": "code", "source": [ "fake_news.head()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "CR_yBlbRR6R4", "outputId": "f2eff41d-8cfc-44cf-d68c-313cb692fb45" }, "execution_count": 5, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " title \\\n", "0 Donald Trump Sends Out Embarrassing New Year’... \n", "1 Drunk Bragging Trump Staffer Started Russian ... \n", "2 Sheriff David Clarke Becomes An Internet Joke... \n", "3 Trump Is So Obsessed He Even Has Obama’s Name... \n", "4 Pope Francis Just Called Out Donald Trump Dur... \n", "\n", " text subject \\\n", "0 Donald Trump just couldn t wish all Americans ... News \n", "1 House Intelligence Committee Chairman Devin Nu... News \n", "2 On Friday, it was revealed that former Milwauk... News \n", "3 On Christmas day, Donald Trump announced that ... News \n", "4 Pope Francis used his annual Christmas Day mes... News \n", "\n", " date label \n", "0 December 31, 2017 0 \n", "1 December 31, 2017 0 \n", "2 December 30, 2017 0 \n", "3 December 29, 2017 0 \n", "4 December 25, 2017 0 " ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 5 } ] }, { "cell_type": "markdown", "source": [ "## Combine Real & Fake News into one dataframe" ], "metadata": { "id": "ZB2C1ImfSUUg" } }, { "cell_type": "code", "source": [ "news = pd.concat([real_news,fake_news],axis=0,ignore_index=True)\n", "news = news.sample(frac = 1).reset_index(drop = True)\n", "news.head()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "RTifEXcHSQJ0", "outputId": "d2e996c9-9068-4cfb-dbe0-1fb84c4b0b2f" }, "execution_count": 6, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " title \\\n", "0 Trump’s Involvement In Houston Chemical Plant... \n", "1 OOPS! Media Forgot Ted Kennedy Asked Russia To... \n", "2 OBAMA GIVES FINAL THOUGHTS On Trump Presidency... \n", "3 CNN ANCHOR DON LEMON: A Republican Winning in ... \n", "4 Trump Confirms He Thinks GOP Healthcare Bill ... \n", "\n", " text subject \\\n", "0 In the aftermath of the historic flooding that... News \n", "1 In 1991 a reporter for the London Times found ... politics \n", "2 The Obama family ended their eight-year reside... politics \n", "3 CNN anchor Don Lemon got snarky during reporti... politics \n", "4 Trump got into a bizarre pissing match with fo... News \n", "\n", " date label \n", "0 September 1, 2017 0 \n", "1 Feb 16, 2017 0 \n", "2 Jan 20, 2017 0 \n", "3 Jun 21, 2017 0 \n", "4 June 25, 2017 0 " ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 6 } ] }, { "cell_type": "code", "source": [ "news['combine'] = news['title'] + ' ' + news['text']\n", "news.head()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 337 }, "id": "N7QZ7Zk5VvDk", "outputId": "1abb083b-33d5-4e82-a14f-7bd943231d9e" }, "execution_count": 7, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " title \\\n", "0 Trump’s Involvement In Houston Chemical Plant... \n", "1 OOPS! Media Forgot Ted Kennedy Asked Russia To... \n", "2 OBAMA GIVES FINAL THOUGHTS On Trump Presidency... \n", "3 CNN ANCHOR DON LEMON: A Republican Winning in ... \n", "4 Trump Confirms He Thinks GOP Healthcare Bill ... \n", "\n", " text subject \\\n", "0 In the aftermath of the historic flooding that... News \n", "1 In 1991 a reporter for the London Times found ... politics \n", "2 The Obama family ended their eight-year reside... politics \n", "3 CNN anchor Don Lemon got snarky during reporti... politics \n", "4 Trump got into a bizarre pissing match with fo... News \n", "\n", " date label combine \n", "0 September 1, 2017 0 Trump’s Involvement In Houston Chemical Plant... \n", "1 Feb 16, 2017 0 OOPS! Media Forgot Ted Kennedy Asked Russia To... \n", "2 Jan 20, 2017 0 OBAMA GIVES FINAL THOUGHTS On Trump Presidency... \n", "3 Jun 21, 2017 0 CNN ANCHOR DON LEMON: A Republican Winning in ... \n", "4 June 25, 2017 0 Trump Confirms He Thinks GOP Healthcare Bill ... " ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 7 } ] }, { "cell_type": "markdown", "source": [ "## TF-IDF Vectorization" ], "metadata": { "id": "_q-ySMGeThhL" } }, { "cell_type": "code", "source": [ "import nltk\n", "from nltk import word_tokenize\n", "from nltk.stem import SnowballStemmer\n", "from nltk.corpus import stopwords\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "nltk.download('punkt')\n", "\n", "# Tokenize title + text into word tokens; str() guards against missing values\n", "# (a NaN becomes the string 'nan').\n", "news['combine'] = news['combine'].apply(lambda x: word_tokenize(str(x)))" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "oLbvalr4Tort", "outputId": "bd3a981d-2e13-46b3-e7df-fe9fc28a867d" }, "execution_count": 8, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "[nltk_data] Downloading package punkt to /root/nltk_data...\n", "[nltk_data] Package punkt is already up-to-date!\n" ] } ] }, { "cell_type": "code", "source": [ "# Stem each token with the Snowball English stemmer\n", "snowball = SnowballStemmer(language='english')\n", "news['combine'] = news['combine'].apply(lambda x: [snowball.stem(y) for y in x])" ], "metadata": { "id": "KedYPGIFTsyz" }, "execution_count": 9, "outputs": [] }, { "cell_type": "code", "source": [ "# Rejoin the stemmed tokens into one string per article for the vectorizer\n", "news['combine'] = news['combine'].apply(lambda x: ' '.join(x))" ], "metadata": { "id": "QvVafpZVT5BJ" }, "execution_count": 10, "outputs": [] }, { "cell_type": "code", "source": [ "# Note: the vectorizer is fit on the full corpus before the train/test split,\n", "# so IDF statistics from the test articles leak into the training features.\n", "tfidf = TfidfVectorizer()\n", "X_text = tfidf.fit_transform(news['combine'])" ], "metadata": { "id": "pMT8lagAT-hJ" }, "execution_count": 11, "outputs": [] }, { "cell_type": "code", "source": [ "from sklearn.model_selection import train_test_split\n", "# Hold out 30% of the articles as the test set\n", "X_train, X_test, y_train, y_test = train_test_split(X_text, news['label'], test_size=0.3, random_state=1)" ], "metadata": { "id": "RJ3rBOD7Sli2" }, "execution_count": 29, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Data Modeling - Support Vector Machine" ], "metadata": { "id": "vBd9mLdpWcg8" } }, { "cell_type": "code", "source": [ "from sklearn.svm import 
LinearSVC\n", "from sklearn.model_selection import cross_val_score\n", "from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score\n", "\n", "clf = LinearSVC(max_iter=100, C=1.0)\n", "clf.fit(X_train, y_train)\n", "\n", "y_pred = clf.predict(X_test)\n", "print(\"Cross validation score:\")\n", "print(cross_val_score(clf, X_text, news['label'], cv=3))\n", "\n", "# Note: scikit-learn metrics take (y_true, y_pred); the calls below pass (y_pred, y_test).\n", "# Accuracy is symmetric, but the confusion matrix is transposed (rows are predictions here)\n", "# and the ROC AUC treats the predictions as ground truth. For a ROC curve, score the\n", "# test set with clf.decision_function(X_test) rather than hard 0/1 predictions.\n", "print(\"\\nAccuracy:\")\n", "print(accuracy_score(y_pred, y_test))\n", "\n", "print(\"\\nConfusion Matrix:\")\n", "print(confusion_matrix(y_pred, y_test))\n", "\n", "print(\"\\nROC AUC:\")\n", "print(roc_auc_score(y_pred, y_test))\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ViT1BRrBWgTi", "outputId": "4ad7ab1b-faeb-4199-a4ad-36c7a2b506cf" }, "execution_count": 30, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Cross validation score:\n", "[0.99566486 0.99433097 0.99516465]\n", "\n", "Accuracy:\n", "0.9954612819562801\n", "\n", "Confusion Matrix:\n", "[[4416 19]\n", " [ 30 6331]]\n", "\n", "ROC AUC:\n", "0.9954998283473117\n" ] } ] }, { "cell_type": "markdown", "source": [ "## Find the Misclassified Articles" ], "metadata": { "id": "Gkb2PvDx7YY6" } }, { "cell_type": "code", "source": [ "# Positions (within the test split) where the prediction differs from the true label.\n", "# To look these articles up in news, map positions to row labels with y_test.index[...].\n", "y_test_1 = np.asarray(y_test)\n", "misclassified = np.where(y_test_1 != clf.predict(X_test))\n", "misclassified" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "HoAm7-MQ67e0", "outputId": "39cc09df-b7c1-4a12-b708-debb97a5493e" }, "execution_count": 32, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(array([ 479, 875, 900, 964, 1115, 1332, 1808, 2002, 2008,\n", " 2364, 2495, 2811, 3009, 3332, 4407, 4495, 4633, 4636,\n", " 4680, 4864, 4934, 5376, 5426, 5519, 5764, 6018, 6021,\n", " 6046, 6202, 6223, 6267, 6537, 6744, 6832, 6938, 7042,\n", " 7305, 7572, 7798, 7986, 8645, 8970, 9176, 9440, 9653,\n", " 10068, 10122, 10229, 10283]),)" ] }, 
"metadata": {}, "execution_count": 32 } ] }, { "cell_type": "code", "source": [ "# Caution: 479 is a position within the test split, while news.iloc[479] is row 479\n", "# of the full shuffled frame. news.loc[y_test.index[479]] would retrieve the article\n", "# that was actually misclassified.\n", "news.iloc[479]" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Mpfzn-PQ7J1c", "outputId": "fe0e4277-03ba-4024-dbd3-321915e7d9e1" }, "execution_count": 33, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "title Russia hits Islamic State with bomb raids, mis...\n", "text MOSCOW (Reuters) - Russia has carried out 18 b...\n", "subject worldnews\n", "date November 3, 2017 \n", "label 1\n", "combine russia hit islam state with bomb raid , missil...\n", "Name: 479, dtype: object" ] }, "metadata": {}, "execution_count": 33 } ] }, { "cell_type": "code", "source": [ "y_pred[479]" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "i8i9FUnw7PkQ", "outputId": "1730ceba-1db0-41a3-e2f4-b3608d1dfd18" }, "execution_count": 36, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0" ] }, "metadata": {}, "execution_count": 36 } ] }, { "cell_type": "code", "source": [ "news.iloc[875]" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ha2q7JvE7dcw", "outputId": "d29668d1-9173-4fec-9b96-8d0139081486" }, "execution_count": 37, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "title Trump Gets HUMILIATED After Whining About Sta...\n", "text Donald Trump threw a temper tantrum on Saturda...\n", "subject News\n", "date July 1, 2017\n", "label 0\n", "combine trump get humili after whine about state refus...\n", "Name: 875, dtype: object" ] }, "metadata": {}, "execution_count": 37 } ] }, { "cell_type": "code", "source": [ "y_pred[875]" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "XZSWP-zk7gNk", "outputId": "51c10cdb-6729-4dda-88b9-2d28fd9919b1" }, "execution_count": 38, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "1" ] }, "metadata": {}, "execution_count": 38 } ] }, { "cell_type": "markdown", "source": 
[ "## Optional Code" ], "metadata": { "id": "0OpbUEpq4IpV" } }, { "cell_type": "code", "source": [ "# news_out = pd.merge(news,y_test,how = 'left',left_index = True, right_index = True)\n", "# temp = news_out[~(news_out[['label_y']].isnull().any(axis=1))]\n", "# temp.loc[(temp['label_x'] == temp['label_y'])]" ], "metadata": { "id": "_pwDcqx1qOby" }, "execution_count": 14, "outputs": [] }, { "cell_type": "markdown", "source": [ "**Write up**: \n", "* Link to the model on Hugging Face Hub: \n", "* Include some examples of misclassified news articles. Please explain what you might do to improve your model's performance on these news articles in the future (you do not need to implement these suggestions)" ], "metadata": { "id": "kpInVUMLyJ24" } } ] }