# Fatima Fellowship Quick Coding Challenge (Pick 1)

Thank you for applying to the Fatima Fellowship. To help us select the Fellows and assess your ability to do machine learning research, we are asking that you complete a short coding challenge. Please pick **1 of these 5** coding challenges, whichever is most aligned with your interests. 

**Due date: 1 week**

**How to submit**: Please make a copy of this colab notebook, add your code and results, and submit your colab notebook to the submission link below. If you have never used a colab notebook, [check out this video](https://www.youtube.com/watch?v=i-HnvsehuSw).

**Submission link**: https://airtable.com/shrXy3QKSsO2yALd3

# 2. Deep Learning for NLP

**Fake news classifier**: Train a text classification model to detect fake news articles!

* Download the dataset here: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset
* Develop an NLP model for classification that uses a pretrained language model
* Finetune your model on the dataset, and generate an ROC curve (reporting its AUC) for your model on the test set of your choice.
* [Upload the model to the Hugging Face Hub](https://huggingface.co./docs/hub/adding-a-model), and add a link to your model below.
* *Answer the following question*: Look at some of the news articles that were classified incorrectly. Please explain what you might do to improve your model's performance on these news articles in the future (you do not need to implement these suggestions).

In [1]:
### WRITE YOUR CODE TO TRAIN THE MODEL HERE
import numpy as np
import pandas as pd



## Data Loading

In [2]:
real_news = pd.read_csv("True.csv", sep=',', engine='python', encoding='utf8',on_bad_lines='skip')
fake_news = pd.read_csv("Fake.csv", sep=',', engine='python', encoding='utf8',on_bad_lines='skip')

print("real_news: " + str(real_news.shape))
print("fake_news: " + str(fake_news.shape))

real_news: (21417, 4)
fake_news: (14568, 4)


In [3]:
fake_news.head()

|   | title | text | subject | date |
|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 |


## Add labeling

In [4]:
fake_news['label'] = 0 
real_news['label'] = 1

In [5]:
fake_news.head()

|   | title | text | subject | date | label |
|---|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 | 0 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 | 0 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 | 0 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 | 0 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 | 0 |


## Combine Real & Fake News into one dataframe

In [6]:
news = pd.concat([real_news,fake_news],axis=0,ignore_index=True)
news = news.sample(frac = 1).reset_index(drop = True)
news.head()

|   | title | text | subject | date | label |
|---|---|---|---|---|---|
| 0 | Trump’s Involvement In Houston Chemical Plant... | In the aftermath of the historic flooding that... | News | September 1, 2017 | 0 |
| 1 | OOPS! Media Forgot Ted Kennedy Asked Russia To... | In 1991 a reporter for the London Times found ... | politics | Feb 16, 2017 | 0 |
| 2 | OBAMA GIVES FINAL THOUGHTS On Trump Presidency... | The Obama family ended their eight-year reside... | politics | Jan 20, 2017 | 0 |
| 3 | CNN ANCHOR DON LEMON: A Republican Winning in ... | CNN anchor Don Lemon got snarky during reporti... | politics | Jun 21, 2017 | 0 |
| 4 | Trump Confirms He Thinks GOP Healthcare Bill ... | Trump got into a bizarre pissing match with fo... | News | June 25, 2017 | 0 |


In [7]:
news['combine'] = news['title'] + ' ' + news['text']
news.head()

|   | title | text | subject | date | label | combine |
|---|---|---|---|---|---|---|
| 0 | Trump’s Involvement In Houston Chemical Plant... | In the aftermath of the historic flooding that... | News | September 1, 2017 | 0 | Trump’s Involvement In Houston Chemical Plant... |
| 1 | OOPS! Media Forgot Ted Kennedy Asked Russia To... | In 1991 a reporter for the London Times found ... | politics | Feb 16, 2017 | 0 | OOPS! Media Forgot Ted Kennedy Asked Russia To... |
| 2 | OBAMA GIVES FINAL THOUGHTS On Trump Presidency... | The Obama family ended their eight-year reside... | politics | Jan 20, 2017 | 0 | OBAMA GIVES FINAL THOUGHTS On Trump Presidency... |
| 3 | CNN ANCHOR DON LEMON: A Republican Winning in ... | CNN anchor Don Lemon got snarky during reporti... | politics | Jun 21, 2017 | 0 | CNN ANCHOR DON LEMON: A Republican Winning in ... |
| 4 | Trump Confirms He Thinks GOP Healthcare Bill ... | Trump got into a bizarre pissing match with fo... | News | June 25, 2017 | 0 | Trump Confirms He Thinks GOP Healthcare Bill ... |


## TF-IDF Vectorization

In [8]:
import nltk
from nltk import word_tokenize
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords 
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('punkt')

# Tokenizing
news['combine'] = news['combine'].apply(lambda x: word_tokenize(str(x)))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [9]:
# Stemming
snowball = SnowballStemmer(language='english')
news['combine'] = news['combine'].apply(lambda x: [snowball.stem(y) for y in x])

In [10]:
news['combine'] = news['combine'].apply(lambda x: ' '.join(x))

In [11]:
tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(news['combine'])

In [29]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_text, news['label'], test_size=0.3, random_state=1)
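
The cells above fit `TfidfVectorizer` on the full corpus before splitting, so the vocabulary and IDF weights are computed with the test articles included. A minimal sketch of a variant that avoids this, assuming a scikit-learn pipeline and a tiny made-up corpus (the texts, labels, and variable names are illustrative and not taken from the dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for news['combine'] and news['label'].
texts = [
    "reuters senate passes budget bill",
    "reuters court rules on trade case",
    "reuters central bank holds rates",
    "reuters minister visits border region",
    "shocking trump humiliated after tantrum",
    "shocking video destroys lying senator",
    "shocking secret plot exposed by blogger",
    "shocking celebrity slams president again",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so the test articles stay unseen.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=1, stratify=labels
)

# The pipeline fits TF-IDF on the training split only, then the classifier.
model = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
model.fit(train_texts, y_train)

test_accuracy = model.score(test_texts, y_test)
print("held-out accuracy:", test_accuracy)
```

Passing the same pipeline object to `cross_val_score` would refit the vectorizer inside each fold, which keeps the cross-validation estimate free of the same leakage.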

## Data Modeling - Support Vector Machine

In [30]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

clf = LinearSVC(max_iter=100, C=1.0)
clf.fit(X_train, y_train)

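y_pred = clf.predict(X_test)
print("Cross validation score:")
# Refits the classifier on 3 folds of the full TF-IDF matrix
print(cross_val_score(clf, X_text, news['label'], cv=3))

print("\nAccuracy:")
print(accuracy_score(y_pred, y_test))

print("\nConfusion Matrix:")
# With this argument order, rows are predicted labels and columns are true labels
print(confusion_matrix(y_pred, y_test))

print("\nROC AUC:")
# Computed from hard 0/1 predictions, with y_pred passed in the y_true position
print(roc_auc_score(y_pred, y_test))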


Cross validation score:
[0.99566486 0.99433097 0.99516465]

Accuracy:
0.9954612819562801

Confusion Matrix:
[[4416   19]
 [  30 6331]]

ROC AUC:
0.9954998283473117
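
scikit-learn metrics take the true labels first, and `roc_auc_score` is a ranking metric that is normally computed from continuous scores rather than hard 0/1 predictions. A minimal sketch of that usage with `LinearSVC.decision_function`, on synthetic data from `make_classification` (an assumption for illustration; the numbers it prints are unrelated to the results above):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF features and labels.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

clf = LinearSVC(C=1.0, max_iter=5000).fit(X_train, y_train)

# Signed distance to the separating hyperplane; higher means more like class 1.
scores = clf.decision_function(X_test)
y_pred = clf.predict(X_test)

# Argument order is (y_true, y_pred) or (y_true, y_score).
cm = confusion_matrix(y_test, y_pred)    # rows: true class, columns: predicted class
roc_auc = roc_auc_score(y_test, scores)  # area under the ROC curve, from scores
fpr, tpr, thresholds = roc_curve(y_test, scores)

print("confusion matrix:\n", cm)
print("ROC AUC:", roc_auc)
```

Plotting `tpr` against `fpr` (for example with matplotlib) gives the ROC curve that the challenge asks for.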


## Find the misclassified articles

In [32]:
y_test_1 = np.asarray(y_test)
misclassified = np.where(y_test_1 != clf.predict(X_test))
misclassified

(array([  479,   875,   900,   964,  1115,  1332,  1808,  2002,  2008,
         2364,  2495,  2811,  3009,  3332,  4407,  4495,  4633,  4636,
         4680,  4864,  4934,  5376,  5426,  5519,  5764,  6018,  6021,
         6046,  6202,  6223,  6267,  6537,  6744,  6832,  6938,  7042,
         7305,  7572,  7798,  7986,  8645,  8970,  9176,  9440,  9653,
        10068, 10122, 10229, 10283]),)

In [33]:
news.iloc[479]

title      Russia hits Islamic State with bomb raids, mis...
text       MOSCOW (Reuters) - Russia has carried out 18 b...
subject                                            worldnews
date                                       November 3, 2017 
label                                                      1
combine    russia hit islam state with bomb raid , missil...
Name: 479, dtype: object

In [36]:
y_pred[479]

0

In [37]:
news.iloc[875]

title       Trump Gets HUMILIATED After Whining About Sta...
text       Donald Trump threw a temper tantrum on Saturda...
subject                                                 News
date                                            July 1, 2017
label                                                      0
combine    trump get humili after whine about state refus...
Name: 875, dtype: object

In [38]:
y_pred[875]

1
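
The indices returned by `np.where` are positions within the test split, whereas `news.iloc[i]` indexes the full shuffled dataframe, so `news.iloc[479]` is not necessarily the article behind `y_pred[479]`. A minimal sketch of mapping test positions back to dataframe rows through `y_test.index`, on a small made-up dataframe (the column values and the always-predict-1 stand-in for the classifier are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the `news` dataframe.
news = pd.DataFrame({
    "title": [f"article {i}" for i in range(10)],
    "label": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# y_test is a Series, so it keeps the original dataframe index after the split.
X_train, X_test, y_train, y_test = train_test_split(
    news["title"], news["label"], test_size=0.3, random_state=1
)

# Stand-in for clf.predict(X_test): predict class 1 for every test row.
y_pred = np.ones(len(y_test), dtype=int)

# Positions within the test split where the prediction is wrong.
wrong_pos = np.where(y_test.to_numpy() != y_pred)[0]

# Translate test positions into dataframe index labels, then look up the rows.
wrong_rows = news.loc[y_test.index[wrong_pos]].assign(predicted=y_pred[wrong_pos])
print(wrong_rows)
```

In the notebook, the equivalent lookup for a single test position `i` would be `news.loc[y_test.index[i]]`, compared against `y_pred[i]`.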

## Optional Code

In [14]:
# news_out = pd.merge(news,y_test,how = 'left',left_index = True, right_index = True)
# temp = news_out[~(news_out[['label_y']].isnull().any(axis=1))]
# temp.loc[(temp['label_x'] == temp['label_y'])]

**Write up**: 
* Link to the model on Hugging Face Hub: 
* Include some examples of misclassified news articles. Please explain what you might do to improve your model's performance on these news articles in the future (you do not need to implement these suggestions).