Using word2vec to Analyze News Headlines and Predict Article Popularity
Headline sentiment distributions for popular news sites
Word embeddings are a powerful way to capture the latent information contained in words and in documents (i.e., collections of words). Using a dataset of article headlines that includes each article's source, sentiment, topic, and popularity (share counts), we can build embeddings for each headline and explore the relationships between these factors.
The goals of this article are to:
Preprocess/clean the text data with NLTK
Create word and headline embeddings with word2vec and visualize them as clusters with t-SNE
Visualize the relationship between headline sentiment and article popularity
Attempt to predict article popularity from the embeddings and other available features
Use model stacking to improve popularity prediction (this step ultimately did not improve the results)
The full notebook with every step is available here:
https://nbviewer.jupyter.org/github/chambliss/Notebooks/blob/master/Word2VecNewsAnalysis.ipynb
Data import and preprocessing
First, import the required packages:
import pandas as pd
import gensim
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import xgboost as xgb
Then read in the data:
main_data = pd.read_csv('News_Final.csv')
main_data.head()
# Grab all the titles
article_titles = main_data['Title']

# Create a list of strings, one for each title
titles_list = [title for title in article_titles]

# Collapse the list of strings into a single long string for processing
big_title_string = ' '.join(titles_list)

from nltk.tokenize import word_tokenize

# Tokenize the string into words
tokens = word_tokenize(big_title_string)

# Remove non-alphabetic tokens, such as punctuation
words = [word.lower() for word in tokens if word.isalpha()]

# Filter out stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [word for word in words if not word in stop_words]

# Print first 10 words
words[:10]
Next, we need to load the pretrained word2vec model (the available models are listed at https://github.com/RaRe-Technologies/gensim-data). Since this is a news dataset, I used the Google News model, which was trained on roughly 100 billion words.
# Load word2vec model (trained on an enormous Google corpus)
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary = True)

# Check dimension of word vectors
model.vector_size
300
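As an aside, if the .bin file is not already on disk, the same vectors can also be fetched through gensim's downloader API. This is a minimal sketch, assuming the gensim-data model name 'word2vec-google-news-300' (it is a large download on first use):

# Alternative (sketch): load the Google News vectors via gensim's downloader API
import gensim.downloader as api
model = api.load('word2vec-google-news-300')  # downloads and caches the vectors on first use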
So the model will generate 300-dimensional word vectors, and all we have to do to create a vector is pass a word through the model. For example:
economy_vec = model['economy']
economy_vec[:20] # First 20 components
word2vec can only create vectors for words that are in its own vocabulary. So when building the full list of word vectors, we need to specify "if word in model.vocab".
# Filter the list of vectors to include only those that Word2Vec has a vector for
vector_list = [model[word] for word in words if word in model.vocab]

# Create a list of the words corresponding to these vectors
words_filtered = [word for word in words if word in model.vocab]

# Zip the words together with their vector representations
word_vec_zip = zip(words_filtered, vector_list)

# Cast to a dict so we can turn it into a DataFrame
word_vec_dict = dict(word_vec_zip)
df = pd.DataFrame.from_dict(word_vec_dict, orient='index')
df.head(3)
Dimensionality reduction with t-SNE
Next, we'll use t-SNE to compress these word vectors (i.e., perform dimensionality reduction). If you want to learn more about how t-SNE works, see https://distill.pub/2016/misread-tsne/.
Choosing the t-SNE parameters matters, because different values can produce very different results. I tested several perplexity values between 0 and 100 and found that they produced roughly the same shape each time (a sketch of that kind of sweep is shown below). I also tested several learning rates between 20 and 400 and decided to leave the learning rate at its default (200).
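For illustration, here is a minimal sketch of such a perplexity sweep; the particular perplexity values and the 400-vector subset are illustrative choices, not the exact sweep used in the analysis:

# Sketch: compare t-SNE layouts across a few perplexity values (illustrative values only)
from sklearn.manifold import TSNE

fig, axes = plt.subplots(1, 3, figsize = (15, 5))
for ax, perp in zip(axes, [5, 30, 100]):
    embedding = TSNE(n_components = 2, perplexity = perp, random_state = 10).fit_transform(df[:400])
    ax.scatter(embedding[:, 0], embedding[:, 1], alpha = 0.5)
    ax.set_title('perplexity = {}'.format(perp))
plt.show()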
To save processing time, I did not use all 20,000 or so word vectors; I used only 400 of them.
from sklearn.manifold import TSNE
# Initialize t-SNE
tsne = TSNE(n_components = 2, init = 'random', random_state = 10, perplexity = 100)

# Use only 400 rows to shorten processing time
tsne_df = tsne.fit_transform(df[:400])
Now we are ready to plot the reduced array of word vectors in two dimensions. I used adjust_text to intelligently spread the words apart so the labels do not overlap.
sns.set()
# Initialize figure
fig, ax = plt.subplots(figsize = (11.7, 8.27))
sns.scatterplot(tsne_df[:, 0], tsne_df[:, 1], alpha = 0.5)

# Import adjustText, initialize list of texts
from adjustText import adjust_text
texts = []
words_to_plot = list(np.arange(0, 400, 10))

# Append words to list
for word in words_to_plot:
    texts.append(plt.text(tsne_df[word, 0], tsne_df[word, 1], df.index[word],
                          fontsize = 14))

# Plot text using adjust_text (because overlapping text is hard to read)
adjust_text(texts, force_points = 0.4, force_text = 0.4,
            expand_points = (2, 1), expand_text = (1, 2),
            arrowprops = dict(arrowstyle = "-", color = 'black', lw = 0.5))
plt.show()
If you're interested in using adjustText for your own plots, it is available at https://github.com/Phlya/adjustText. Note that it is imported in camel case as adjustText, and that adjustText is currently incompatible with matplotlib 3.0 and above.
Even with the embeddings reduced to two dimensions, we can see certain items clustering together. For example, the months group in the left/upper-left, corporate-finance terms sit near the bottom, and more generic, non-topical words (such as "full", "really", "slew") land in the middle.
Keep in mind that if we ran t-SNE again with different parameters, we might get broadly similar results, but they would not be identical. t-SNE is not deterministic, and neither the tightness of the clusters nor the distances between them is particularly meaningful. It is primarily an exploratory tool rather than a definitive indicator of similarity.
Averaging word embeddings
So far we have seen how word embeddings can be applied to this dataset. Next, we'll move on to a more interesting ML application: finding headlines that cluster together and seeing what relationships emerge.
We could use Doc2Vec, but it has no pretrained models available and would therefore require a lengthy training process. Instead, I used a simpler and more efficient approach: averaging the embeddings of the word vectors in each document. In our case, a document is a headline.
We need to redo the preprocessing step, this time keeping each headline intact as its own document rather than splitting everything into one big list of words. Dimitris Spathis has developed a set of functions that work nicely for this use case.
def document_vector(word2vec_model, doc):
    # remove out-of-vocabulary words
    doc = [word for word in doc if word in word2vec_model.vocab]
    return np.mean(word2vec_model[doc], axis=0)

# Our earlier preprocessing was done when we were dealing only with word vectors
# Here, we need each document to remain a document
def preprocess(text):
    text = text.lower()
    doc = word_tokenize(text)
    doc = [word for word in doc if word not in stop_words]
    doc = [word for word in doc if word.isalpha()]
    return doc

# Function that will help us drop documents that have no word vectors in word2vec
def has_vector_representation(word2vec_model, doc):
    """check if at least one word of the document is in the
    word2vec dictionary"""
    return not all(word not in word2vec_model.vocab for word in doc)
# Filter out documents
def filter_docs(corpus, texts, condition_on_doc):
    """
    Filter corpus and texts given the function condition_on_doc which takes a doc.
    The document doc is kept if condition_on_doc(doc) is true.
    """
    number_of_docs = len(corpus)
    if texts is not None:
        texts = [text for (text, doc) in zip(texts, corpus)
                 if condition_on_doc(doc)]
    corpus = [doc for doc in corpus if condition_on_doc(doc)]
    print("{} docs removed".format(number_of_docs - len(corpus)))
    return (corpus, texts)
Now we can use these functions to process the corpus:
# Preprocess the corpus
corpus = [preprocess(title) for title in titles_list]

# Remove docs that don't include any words in W2V's vocab
corpus, titles_list = filter_docs(corpus, titles_list, lambda doc: has_vector_representation(model, doc))

# Filter out any empty docs
corpus, titles_list = filter_docs(corpus, titles_list, lambda doc: (len(doc) != 0))

x = []
for doc in corpus: # append the vector for each document
    x.append(document_vector(model, doc))

X = np.array(x) # list to array
t-SNE, round 2: document vectors
Now that we have successfully created an array of document vectors, we can run them through t-SNE and see whether we get results similar to those above.
# Initialize t-SNE
tsne = TSNE(n_components = 2, init = 'random', random_state = 10, perplexity = 100)

# Again use only 400 rows to shorten processing time
tsne_df = tsne.fit_transform(X[:400])

fig, ax = plt.subplots(figsize = (14, 10))
sns.scatterplot(tsne_df[:, 0], tsne_df[:, 1], alpha = 0.5)

from adjustText import adjust_text
texts = []
titles_to_plot = list(np.arange(0, 400, 40)) # plots every 40th title in first 400 titles

# Append words to list
for title in titles_to_plot:
    texts.append(plt.text(tsne_df[title, 0], tsne_df[title, 1],
                          titles_list[title], fontsize = 14))

# Plot text using adjust_text
adjust_text(texts, force_points = 0.4, force_text = 0.4,
            expand_points = (2, 1), expand_text = (1, 2),
            arrowprops = dict(arrowstyle = "-", color = 'black', lw = 0.5))
plt.show()
We can see that t-SNE has collapsed the document vectors into a two-dimensional space where the documents spread out according to how related their content is to particular areas, such as countries, world leaders, foreign affairs, and technology companies.
Now let's look at article popularity. The common wisdom is that the more sensational or clickbaity a headline is, the more likely the article is to be shared. Does this dataset bear that out? The next section investigates.
Article popularity and headline sentiment
First, we need to drop all of the articles that have no source or whose popularity is unknown. A null popularity measurement is represented in the data as -1.
# Drop all the rows where the article popularities are unknown (this is only about 11% of the data)
main_data = main_data.drop(main_data[(main_data.Facebook == -1) |
                                     (main_data.GooglePlus == -1) |
                                     (main_data.LinkedIn == -1)].index)

# Also drop all rows where we don't know the source
main_data = main_data.drop(main_data[main_data['Source'].isna()].index)
main_data.shape
Even after dropping those articles, we still have around 81,000 to work with. Now let's see whether there is any correlation between headline sentiment and article popularity.
fig, ax = plt.subplots(1, 3, figsize=(15, 10))
subplots = [a for a in ax]
platforms = ['Facebook', 'GooglePlus', 'LinkedIn']
colors = list(sns.husl_palette(10, h=.5)[1:4])

for platform, subplot, color in zip(platforms, subplots, colors):
    sns.scatterplot(x = main_data[platform], y = main_data['SentimentTitle'],
                    ax=subplot, color=color)
    subplot.set_title(platform, fontsize=18)
    subplot.set_xlabel('')

fig.suptitle('Plot of Popularity (Shares) by Title Sentiment', fontsize=24)
plt.show()
It's hard to make out any relationship here, because a handful of articles have dramatically higher share counts than the rest. Let's try log-transforming the x-axis and using regplot, so seaborn overlays a linear regression fit on each scatterplot.
# Our data has over 80,000 rows, so let's also subsample it to make the
# log-transformed scatterplot easier to read
subsample = main_data.sample(5000)

fig, ax = plt.subplots(1, 3, figsize=(15, 10))
subplots = [a for a in ax]

for platform, subplot, color in zip(platforms, subplots, colors):
    # Regression plot, so we can gauge the linear relationship
    sns.regplot(x = np.log(subsample[platform] + 1), y = subsample['SentimentTitle'],
                ax=subplot,
                color=color,
                # Pass an alpha value to regplot's scatterplot call
                scatter_kws={'alpha':0.5})

    # Set a nice title, get rid of x labels
    subplot.set_title(platform, fontsize=18)
    subplot.set_xlabel('')

fig.suptitle('Plot of log(Popularity) by Title Sentiment', fontsize=24)
plt.show()
Contrary to what we might expect, in this dataset there is no direct relationship between attention-grabbing headline sentiment and article popularity. To get a better sense of what popularity itself looks like, let's plot the distribution of (log-transformed) shares on each platform.
fig, ax = plt.subplots(3, 1, figsize=(15, 10))
subplots = [a for a in ax]

for platform, subplot, color in zip(platforms, subplots, colors):
    sns.distplot(np.log(main_data[platform] + 1), ax=subplot, color=color, kde_kws={'shade':True})

    # Set a nice title, get rid of x labels
    subplot.set_title(platform, fontsize=18)
    subplot.set_xlabel('')

fig.suptitle('Plot of Popularity by Platform', fontsize=24)
plt.show()
Next, does the picture change when we look at different publishers? Do headline sentiment distributions look the same from source to source?
# Get the list of top 12 sources by number of articles
source_names = list(main_data['Source'].value_counts()[:12].index)
source_colors = list(sns.husl_palette(12, h=.5))

fig, ax = plt.subplots(4, 3, figsize=(20, 15), sharex=True, sharey=True)
ax = ax.flatten()
for ax, source, color in zip(ax, source_names, source_colors):
    sns.distplot(main_data.loc[main_data['Source'] == source]['SentimentTitle'],
                 ax=ax, color=color, kde_kws={'shade':True})
    ax.set_title(source, fontsize=14)
    ax.set_xlabel('')

plt.xlim(-0.75, 0.75)
plt.show()
The distributions look broadly similar, but it's hard to see the differences when they are in separate plots. Let's try overlaying them all on one plot.
# Overlay each density curve on the same plot for closer comparison
fig, ax = plt.subplots(figsize=(12, 8))

for source, color in zip(source_names, source_colors):
    sns.distplot(main_data.loc[main_data['Source'] == source]['SentimentTitle'],
                 ax=ax, hist=False, label=source, color=color)
    ax.set_xlabel('')

plt.xlim(-0.75, 0.75)
plt.show()
We can see that the sources' headline-sentiment distributions are very similar: no single source stands out as unusually positive or negative, all 12 of the most common sources are centered around 0, and their tails look comparable. But what does the underlying data say? Let's look at the numbers:
# Group by Source, then get descriptive statistics for title sentiment
source_info = main_data.groupby('Source')['SentimentTitle'].describe()

# Recall that source_names contains the top 12 sources
# We'll also sort by highest standard deviation
source_info.loc[source_names].sort_values('std', ascending=False)[['std', 'min', 'max']]
The Wall Street Journal stands out: compared with the other publishers, it has both the largest standard deviation and the largest range. This suggests that The Wall Street Journal tends to run more negatively worded headlines, which is an interesting potential finding. Verifying it rigorously would require a hypothesis test, which is beyond the scope of this article.
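As a rough illustration only (this test was not performed in the original analysis), such a check might compare the variance of WSJ headline sentiment against the other top sources, for example with Levene's test from scipy. The exact source string below is an assumption about how the label appears in main_data['Source']:

# Hedged sketch of a variance test (illustrative; the source label is assumed)
from scipy import stats

wsj_label = 'Wall Street Journal'  # assumption: adjust to the exact string in the data
wsj_sentiment = main_data.loc[main_data['Source'] == wsj_label, 'SentimentTitle']
other_sentiment = main_data.loc[main_data['Source'].isin(source_names) &
                                (main_data['Source'] != wsj_label), 'SentimentTitle']

stat, p_value = stats.levene(wsj_sentiment, other_sentiment)
print('Levene statistic: {:.3f}, p-value: {:.5f}'.format(stat, p_value))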
Predicting article popularity
The first task is to rejoin the document vectors with their respective headlines. Because the corpus and titles_list were processed together during preprocessing, the vectors and the headlines they represent are still in matching order. Meanwhile, in main_data we dropped every article with a popularity of -1, so we also need to drop the vectors that represent those headlines.
Training models on a dataset this large with 300-dimensional vectors would be slow, so we can reduce the dimensionality of the vectors first. I also engineered a new feature from each article's publish date, based on Unix time.
For details on Unix time, see:
https://en.wikipedia.org/wiki/Unix_time
import datetime

# Convert publish date column to make it compatible with other datetime objects
main_data['PublishDate'] = pd.to_datetime(main_data['PublishDate'])

# Time since the Unix epoch
t = datetime.datetime(1970, 1, 1)

# Subtract this time from each article's publish date
main_data['TimeSinceEpoch'] = main_data['PublishDate'] - t

# Create another column for just the days from the timedelta objects
main_data['DaysSinceEpoch'] = main_data['TimeSinceEpoch'].astype('timedelta64[D]')

main_data['TimeSinceEpoch'].describe()
As we can see, all of these articles were published within a window of about 250 days.
from sklearn.decomposition import PCA

pca = PCA(n_components=15, random_state=10)

# As a reminder, x is the array with our 300-dimensional vectors
reduced_vecs = pca.fit_transform(x)

df_w_vectors = pd.DataFrame(reduced_vecs)
df_w_vectors['Title'] = titles_list

# Use pd.concat to match original titles with their vectors
main_w_vectors = pd.concat((df_w_vectors, main_data), axis=1)

# Get rid of vectors that couldn't be matched with the main_df
main_w_vectors.dropna(axis=0, inplace=True)
Now we need to drop the non-numeric and non-dummy columns so we can feed the data into a model. I also applied scaling to the DaysSinceEpoch feature, since it is much larger in magnitude than the other variables.
# Drop all non-numeric, non-dummy columns, for feeding into the models
cols_to_drop = ['IDLink', 'Title', 'TimeSinceEpoch', 'Headline', 'PublishDate', 'Source']
data_only_df = pd.get_dummies(main_w_vectors, columns = ['Topic']).drop(columns=cols_to_drop)

# Standardize DaysSinceEpoch since the raw numbers are larger in magnitude
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Reshape so we can feed the column to the scaler
standardized_days = np.array(data_only_df['DaysSinceEpoch']).reshape(-1, 1)
data_only_df['StandardizedDays'] = scaler.fit_transform(standardized_days)

# Drop the raw column; we don't need it anymore
data_only_df.drop(columns=['DaysSinceEpoch'], inplace=True)

# Look at the new range
data_only_df['StandardizedDays'].describe()

# Get Facebook data only
fb_data_only_df = data_only_df.drop(columns=['GooglePlus', 'LinkedIn'])

# Separate the features and the response
X = fb_data_only_df.drop('Facebook', axis=1)
y = fb_data_only_df['Facebook']

# 80% of data goes to training
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
Now let's run the data through a non-optimized XGBoost model:
from sklearn.metrics import mean_squared_error

# Instantiate an XGBRegressor
xgr = xgb.XGBRegressor(random_state=2)

# Fit the regressor to the training set
xgr.fit(X_train, y_train)

y_pred = xgr.predict(X_test)
mean_squared_error(y_test, y_pred)
353495.42
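For context, one quick sanity check (not part of the original analysis) is to compare this against a trivial baseline that always predicts the training-set mean:

# Hedged sanity check: MSE of a constant, mean-only baseline prediction
baseline_pred = np.full(len(y_test), y_train.mean())
mean_squared_error(y_test, baseline_pred)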
To put it mildly, that result is not very satisfying. Can we improve it with hyperparameter tuning? I adapted a hyperparameter-tuning grid from the following article:
https://www.kaggle.com/jayatou/xgbregressor-with-gridsearchcv
from sklearn.model_selection import GridSearchCV

# Various hyper-parameters to tune
xgb1 = xgb.XGBRegressor()
parameters = {'nthread': [4],
              'objective': ['reg:linear'],
              'learning_rate': [.03, 0.05, .07],
              'max_depth': [5, 6, 7],
              'min_child_weight': [4],
              'silent': [1],
              'subsample': [0.7],
              'colsample_bytree': [0.7],
              'n_estimators': [250]}

xgb_grid = GridSearchCV(xgb1,
                        parameters,
                        cv = 2,
                        n_jobs = 5,
                        verbose=True)

xgb_grid.fit(X_train, y_train)
According to xgb_grid, the best parameters are:
{'colsample_bytree': 0.7, 'learning_rate': 0.03, 'max_depth': 5, 'min_child_weight': 4, 'n_estimators': 250, 'nthread': 4, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.7}
Let's try again with the new parameters:
params = {'colsample_bytree': 0.7, 'learning_rate': 0.03, 'max_depth': 5,
          'min_child_weight': 4, 'n_estimators': 250, 'nthread': 4,
          'objective': 'reg:linear', 'silent': 1, 'subsample': 0.7}

# Try again with new params
xgr = xgb.XGBRegressor(random_state=2, **params)

# Fit the regressor to the training set
xgr.fit(X_train, y_train)

y_pred = xgr.predict(X_test)
mean_squared_error(y_test, y_pred)
351220.13
That is better by roughly 2,000 MSE, which is only a marginal improvement. It suggests that, in its current state, the data simply doesn't give the model enough to work with. Can we improve the features instead? I decided to split the articles into two groups: "duds" (articles with 0 or 1 shares) and non-duds.
If we can give the regressor a new feature that helps it identify which articles are likely to get more shares, the residuals should shrink and the mean squared error should go down.
Detour: detecting dud articles
From the log-transformed plots we made earlier, we can see that there are really two collections of articles: one sitting at 0, and another starting at 1 and stretching upward. I used a few classifiers to identify whether an article is a "dud" (0 or 1 shares), and then fed that classification in as a feature for the final regressor that predicts popularity. This approach is known as model stacking.
# Define a quick function that will return 1 (true) if the article has 0-1 share(s)
def dud_finder(popularity):
    if popularity <= 1:
        return 1
    else:
        return 0

# Create target column using the function
fb_data_only_df['is_dud'] = fb_data_only_df['Facebook'].apply(dud_finder)
fb_data_only_df[['Facebook', 'is_dud']].head()

# 28% of articles can be classified as "duds"
fb_data_only_df['is_dud'].sum() / len(fb_data_only_df)
0.28
Now that the dud labels are in place, we can train the classifiers: a random forest, an optimized XGBClassifier, and a K-Nearest Neighbors classifier. (I'll skip the details of tuning the XGB model here.)
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X = fb_data_only_df.drop(['is_dud', 'Facebook'], axis=1)
y = fb_data_only_df['is_dud']

# 80% of data goes to training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)

# Best params, produced by HP tuning
params = {'colsample_bytree': 0.7, 'learning_rate': 0.03, 'max_depth': 5,
          'min_child_weight': 4, 'n_estimators': 200, 'nthread': 4,
          'silent': 1, 'subsample': 0.7}

# Try xgc again with new params
xgc = xgb.XGBClassifier(random_state=10, **params)
rfc = RandomForestClassifier(n_estimators=100, random_state=10)
knn = KNeighborsClassifier()

preds = {}
for model_name, model in zip(['XGClassifier', 'RandomForestClassifier', 'KNearestNeighbors'],
                             [xgc, rfc, knn]):
    model.fit(X_train, y_train)
    preds[model_name] = model.predict(X_test)
Evaluate the models and get the classification reports:
from sklearn.metrics import classification_report, roc_curve, roc_auc_score

for k in preds:
    print("{} performance:".format(k))
    print()
    print(classification_report(y_test, preds[k]), sep='\n')
The XGBClassifier comes out on top on f1-score, followed by the random forest and then KNN, but KNN does the best job on recall (successfully identifying duds). This alone shows the potential value of model stacking: even an excellent model like XGBoost can underperform on part of the task, and a model like KNN can add diversity to the combined predictions.
# Plot ROC curves
for model in [xgc, rfc, knn]:
    fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
    plt.plot([0, 1], [0, 1], 'k--')
    plt.plot(fpr, tpr)

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves')
plt.show()
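Since roc_auc_score was already imported above, each curve can also be summarized with a single number. A minimal sketch:

# Sketch: summarize each model's ROC curve with its AUC
for model_name, model in zip(['XGClassifier', 'RandomForestClassifier', 'KNearestNeighbors'],
                             [xgc, rfc, knn]):
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print('{} AUC: {:.3f}'.format(model_name, auc))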
Predicting article popularity: round 2
Now we can average the predicted probabilities from the three classifiers and use the result as a feature for the regressor.
averaged_probs = (xgc.predict_proba(X)[:, 1] +
                  knn.predict_proba(X)[:, 1] +
                  rfc.predict_proba(X)[:, 1]) / 3

X['prob_dud'] = averaged_probs
y = fb_data_only_df['Facebook']
This was followed by another round of hyperparameter tuning with the new feature included, which I'll omit here. The results are as follows:
xgr = xgb.XGBRegressor(random_state=2, **params)

# Fit the regressor to the training set
xgr.fit(X_train, y_train)

y_pred = xgr.predict(X_test)
mean_squared_error(y_test, y_pred)
314551.59
The MSE is in the same general ballpark as before the stacking step, which mainly tells us that MSE weights outliers heavily. In fact, we can also compute the mean absolute error (MAE) to assess how much influence the big outliers have. In mathematical terms, MAE computes the L1 norm of the residuals (the mean of their absolute values), whereas MSE is based on the L2 norm. We can compare the MAE to the square root of the MSE, also known as the root mean squared error (RMSE).
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred), np.sqrt(mean_squared_error(y_test, y_pred))
(180.528167661507, 560.8489939081992)
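To make the L1/L2 distinction concrete, here is a minimal sketch that recomputes both quantities directly from the residuals; it is only an illustration of the definitions above:

# Sketch: MAE and RMSE computed by hand from the residuals
residuals = y_test - y_pred
mae_manual = np.mean(np.abs(residuals))         # L1: mean absolute residual
rmse_manual = np.sqrt(np.mean(residuals ** 2))  # L2: root mean squared residual
print(mae_manual, rmse_manual)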
The mean absolute error is only about a third of the RMSE, so the model is doing quite a bit better than the MSE alone suggested. As a last step, let's look at how important each feature is to the XGBRegressor:
for feature, importance in zip(list(X.columns), xgr.feature_importances_):
    print('Model weight for feature {}: {}'.format(feature, importance))
It turns out that prob_dud is the most important feature, and our custom StandardizedDays feature is the second most important. (Features 0 through 14 correspond to the reduced headline embedding vectors.)
So even though this round of model stacking didn't improve the overall results, we did succeed in identifying an important source of variation in the data.
If I were to continue this project, I would consider augmenting the data with external sources, including Source as a variable via binning or hashing, running the models on the raw 300-dimensional vectors, and using "sliced" data about each article's popularity at different points in time to predict its eventual popularity. The hashing idea for the Source column is sketched below.
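A minimal sketch of that hashing idea, using scikit-learn's FeatureHasher; the number of hash features is an arbitrary illustrative choice, not something tested in this analysis:

# Sketch: hash the Source column into a small set of numeric features
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=16, input_type='string')
# FeatureHasher expects an iterable of token sequences, so wrap each source name in a list
source_hashed = hasher.transform([[str(s)] for s in main_data['Source']]).toarray()
source_hashed.shape  # (n_articles, 16)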
The original dataset used in this article is available here:
https://archive.ics.uci.edu/ml/datasets/News+Popularity+in+Multiple+Social+Media+Platforms