Using word2vec to Analyze News Headlines and Predict Article Popularity
Headline sentiment distributions for popular news sites
Word embeddings are a powerful way to capture the latent information contained in words and in documents (i.e., collections of words). Using a dataset of article headlines that includes each article's source, sentiment, topic, and popularity (share counts), we can build embeddings for each headline and explore the relationships between these factors.
The goals of this article are to:
Preprocess/clean the text data with NLTK
Create word and headline embeddings with word2vec and visualize them as clusters with t-SNE
Visualize the relationship between headline sentiment and article popularity
Attempt to predict article popularity from the embeddings and other available features
Use model stacking to improve popularity prediction (this step ultimately did not improve the results)
The full notebook with every step is available here:
https://nbviewer.jupyter.org/github/chambliss/Notebooks/blob/master/Word2VecNewsAnalysis.ipynb
Data import and preprocessing
First, import the required packages:
import pandas as pd
import gensim
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import xgboost as xgb
Then read in the data:
main_data = pd.read_csv('News_Final.csv')
main_data.head()
# Grab all the titles
article_titles = main_data['Title']

# Create a list of strings, one for each title
titles_list = [title for title in article_titles]

# Collapse the list of strings into a single long string for processing
big_title_string = ' '.join(titles_list)

from nltk.tokenize import word_tokenize

# Tokenize the string into words
tokens = word_tokenize(big_title_string)

# Remove non-alphabetic tokens, such as punctuation
words = [word.lower() for word in tokens if word.isalpha()]

# Filter out stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [word for word in words if not word in stop_words]

# Print first 10 words
words[:10]
Next, we need to load the pretrained word2vec model (the available models are listed at https://github.com/RaRe-Technologies/gensim-data). Since this is a news dataset, I used the Google News model, which was trained on roughly 100 billion words.
# Load word2vec model (trained on an enormous Google corpus)
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary = True)

# Check dimension of word vectors
model.vector_size
300
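As an aside, if the .bin file is not already on disk, the same vectors can also be fetched through gensim's downloader API. This is a minimal sketch, assuming the gensim-data model name 'word2vec-google-news-300' (it is a large download on first use):

# Alternative (sketch): load the Google News vectors via gensim's downloader API
import gensim.downloader as api
model = api.load('word2vec-google-news-300')  # downloads and caches the vectors on first use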
So the model will generate 300-dimensional word vectors, and all we have to do to create a vector is pass a word through the model. For example:
economy_vec = model['economy']
economy_vec[:20] # First 20 components
word2vec can only create vectors for words that are in its own vocabulary. So when building the full list of word vectors, we need to specify "if word in model.vocab".
# Filter the list of vectors to include only those that Word2Vec has a vector for
vector_list = [model[word] for word in words if word in model.vocab]

# Create a list of the words corresponding to these vectors
words_filtered = [word for word in words if word in model.vocab]

# Zip the words together with their vector representations
word_vec_zip = zip(words_filtered, vector_list)

# Cast to a dict so we can turn it into a DataFrame
word_vec_dict = dict(word_vec_zip)
df = pd.DataFrame.from_dict(word_vec_dict, orient='index')
df.head(3)
Dimensionality reduction with t-SNE
Next, we'll use t-SNE to compress these word vectors (i.e., perform dimensionality reduction). If you want to learn more about how t-SNE works, see https://distill.pub/2016/misread-tsne/.
Choosing the t-SNE parameters matters, because different values can produce very different results. I tested several perplexity values between 0 and 100 and found that they produced roughly the same shape each time (a sketch of that kind of sweep is shown below). I also tested several learning rates between 20 and 400 and decided to leave the learning rate at its default (200).
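For illustration, here is a minimal sketch of such a perplexity sweep; the particular perplexity values and the 400-vector subset are illustrative choices, not the exact sweep used in the analysis:

# Sketch: compare t-SNE layouts across a few perplexity values (illustrative values only)
from sklearn.manifold import TSNE

fig, axes = plt.subplots(1, 3, figsize = (15, 5))
for ax, perp in zip(axes, [5, 30, 100]):
    embedding = TSNE(n_components = 2, perplexity = perp, random_state = 10).fit_transform(df[:400])
    ax.scatter(embedding[:, 0], embedding[:, 1], alpha = 0.5)
    ax.set_title('perplexity = {}'.format(perp))
plt.show()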
To save processing time, I did not use all 20,000 or so word vectors; I used only 400 of them.
from sklearn.manifold import TSNE
# Initialize t-SNE
tsne = TSNE(n_components = 2, init = 'random', random_state = 10, perplexity = 100)

# Use only 400 rows to shorten processing time
tsne_df = tsne.fit_transform(df[:400])
Now we are ready to plot the reduced array of word vectors in two dimensions. I used adjust_text to intelligently spread the words apart so the labels do not overlap.
sns.set()
# Initialize figure
fig, ax = plt.subplots(figsize = (11.7, 8.27))
sns.scatterplot(tsne_df[:, 0], tsne_df[:, 1], alpha = 0.5)

# Import adjustText, initialize list of texts
from adjustText import adjust_text
texts = []
words_to_plot = list(np.arange(0, 400, 10))

# Append words to list
for word in words_to_plot:
    texts.append(plt.text(tsne_df[word, 0], tsne_df[word, 1], df.index[word],
                          fontsize = 14))

# Plot text using adjust_text (because overlapping text is hard to read)
adjust_text(texts, force_points = 0.4, force_text = 0.4,
            expand_points = (2, 1), expand_text = (1, 2),
            arrowprops = dict(arrowstyle = "-", color = 'black', lw = 0.5))
plt.show()
If you're interested in using adjustText for your own plots, it is available at https://github.com/Phlya/adjustText. Note that it is imported in camel case as adjustText, and that adjustText is currently incompatible with matplotlib 3.0 and above.
Even with the embeddings reduced to two dimensions, we can see certain items clustering together. For example, the months group in the left/upper-left, corporate-finance terms sit near the bottom, and more generic, non-topical words (such as "full", "really", "slew") land in the middle.
Keep in mind that if we ran t-SNE again with different parameters, we might get broadly similar results, but they would not be identical. t-SNE is not deterministic, and neither the tightness of the clusters nor the distances between them is particularly meaningful. It is primarily an exploratory tool rather than a definitive indicator of similarity.
Averaging word embeddings
So far we have seen how word embeddings can be applied to this dataset. Next, we'll move on to a more interesting ML application: finding headlines that cluster together and seeing what relationships emerge.
We could use Doc2Vec, but it has no pretrained models available and would therefore require a lengthy training process. Instead, I used a simpler and more efficient approach: averaging the embeddings of the word vectors in each document. In our case, a document is a headline.
We need to redo the preprocessing step, this time keeping each headline intact as its own document rather than splitting everything into one big list of words. Dimitris Spathis has developed a set of functions that work nicely for this use case.
def document_vector(word2vec_model, doc):
    # remove out-of-vocabulary words
    doc = [word for word in doc if word in word2vec_model.vocab]
    return np.mean(word2vec_model[doc], axis=0)

# Our earlier preprocessing was done when we were dealing only with word vectors
# Here, we need each document to remain a document
def preprocess(text):
    text = text.lower()
    doc = word_tokenize(text)
    doc = [word for word in doc if word not in stop_words]
    doc = [word for word in doc if word.isalpha()]
    return doc

# Function that will help us drop documents that have no word vectors in word2vec
def has_vector_representation(word2vec_model, doc):
    """check if at least one word of the document is in the
    word2vec dictionary"""
    return not all(word not in word2vec_model.vocab for word in doc)
# Filter out documents
def filter_docs(corpus, texts, condition_on_doc):
    """
    Filter corpus and texts given the function condition_on_doc which takes a doc.
    The document doc is kept if condition_on_doc(doc) is true.
    """
    number_of_docs = len(corpus)
    if texts is not None:
        texts = [text for (text, doc) in zip(texts, corpus)
                 if condition_on_doc(doc)]
    corpus = [doc for doc in corpus if condition_on_doc(doc)]
    print("{} docs removed".format(number_of_docs - len(corpus)))
    return (corpus, texts)
Now we can use these functions to process the corpus:
# Preprocess the corpus
corpus = [preprocess(title) for title in titles_list]

# Remove docs that don't include any words in W2V's vocab
corpus, titles_list = filter_docs(corpus, titles_list, lambda doc: has_vector_representation(model, doc))

# Filter out any empty docs
corpus, titles_list = filter_docs(corpus, titles_list, lambda doc: (len(doc) != 0))

x = []
for doc in corpus: # append the vector for each document
    x.append(document_vector(model, doc))

X = np.array(x) # list to array
t-SNE, round 2: document vectors
Now that we have successfully created an array of document vectors, we can run them through t-SNE and see whether we get results similar to those above.
# Initialize t-SNE
tsne = TSNE(n_components = 2, init = 'random', random_state = 10, perplexity = 100)

# Again use only 400 rows to shorten processing time
tsne_df = tsne.fit_transform(X[:400])

fig, ax = plt.subplots(figsize = (14, 10))
sns.scatterplot(tsne_df[:, 0], tsne_df[:, 1], alpha = 0.5)

from adjustText import adjust_text
texts = []
titles_to_plot = list(np.arange(0, 400, 40)) # plots every 40th title in first 400 titles

# Append words to list
for title in titles_to_plot:
    texts.append(plt.text(tsne_df[title, 0], tsne_df[title, 1],
                          titles_list[title], fontsize = 14))

# Plot text using adjust_text
adjust_text(texts, force_points = 0.4, force_text = 0.4,
            expand_points = (2, 1), expand_text = (1, 2),
            arrowprops = dict(arrowstyle = "-", color = 'black', lw = 0.5))
plt.show()
We can see that t-SNE has collapsed the document vectors into a two-dimensional space where the documents spread out according to how related their content is to particular areas, such as countries, world leaders, foreign affairs, and technology companies.
Now let's look at article popularity. The common wisdom is that the more sensational or clickbaity a headline is, the more likely the article is to be shared. Does this dataset bear that out? The next section investigates.
Article popularity and headline sentiment
First, we need to drop all of the articles that have no source or whose popularity is unknown. A null popularity measurement is represented in the data as -1.
# Drop all the rows where the article popularities are unknown (this is only about 11% of the data)
main_data = main_data.drop(main_data[(main_data.Facebook == -1) |
                                     (main_data.GooglePlus == -1) |
                                     (main_data.LinkedIn == -1)].index)

# Also drop all rows where we don't know the source
main_data = main_data.drop(main_data[main_data['Source'].isna()].index)
main_data.shape
Even after dropping those articles, we still have around 81,000 to work with. Now let's see whether there is any correlation between headline sentiment and article popularity.
fig, ax = plt.subplots(1, 3, figsize=(15, 10))
subplots = [a for a in ax]
platforms = ['Facebook', 'GooglePlus', 'LinkedIn']
colors = list(sns.husl_palette(10, h=.5)[1:4])

for platform, subplot, color in zip(platforms, subplots, colors):
    sns.scatterplot(x = main_data[platform], y = main_data['SentimentTitle'],
                    ax=subplot, color=color)
    subplot.set_title(platform, fontsize=18)
    subplot.set_xlabel('')

fig.suptitle('Plot of Popularity (Shares) by Title Sentiment', fontsize=24)
plt.show()
It's hard to make out any relationship here, because a handful of articles have dramatically higher share counts than the rest. Let's try log-transforming the x-axis and using regplot, so seaborn overlays a linear regression fit on each scatterplot.
# Our data has over 80,000 rows, so let's also subsample it to make the
# log-transformed scatterplot easier to read
subsample = main_data.sample(5000)

fig, ax = plt.subplots(1, 3, figsize=(15, 10))
subplots = [a for a in ax]

for platform, subplot, color in zip(platforms, subplots, colors):
    # Regression plot, so we can gauge the linear relationship
    sns.regplot(x = np.log(subsample[platform] + 1), y = subsample['SentimentTitle'],
                ax=subplot,
                color=color,
                # Pass an alpha value to regplot's scatterplot call
                scatter_kws={'alpha':0.5})

    # Set a nice title, get rid of x labels
    subplot.set_title(platform, fontsize=18)
    subplot.set_xlabel('')

fig.suptitle('Plot of log(Popularity) by Title Sentiment', fontsize=24)
plt.show()
Contrary to what we might expect, in this dataset there is no direct relationship between attention-grabbing headline sentiment and article popularity. To get a better sense of what popularity itself looks like, let's plot the distribution of (log-transformed) shares on each platform.
fig, ax = plt.subplots(3, 1, figsize=(15, 10))
subplots = [a for a in ax]

for platform, subplot, color in zip(platforms, subplots, colors):
    sns.distplot(np.log(main_data[platform] + 1), ax=subplot, color=color, kde_kws={'shade':True})

    # Set a nice title, get rid of x labels
    subplot.set_title(platform, fontsize=18)
    subplot.set_xlabel('')

fig.suptitle('Plot of Popularity by Platform', fontsize=24)
plt.show()
Next, does the picture change when we look at different publishers? Do headline sentiment distributions look the same from source to source?
# Get the list of top 12 sources by number of articles
source_names = list(main_data['Source'].value_counts()[:12].index)
source_colors = list(sns.husl_palette(12, h=.5))

fig, ax = plt.subplots(4, 3, figsize=(20, 15), sharex=True, sharey=True)
ax = ax.flatten()
for ax, source, color in zip(ax, source_names, source_colors):
    sns.distplot(main_data.loc[main_data['Source'] == source]['SentimentTitle'],
                 ax=ax, color=color, kde_kws={'shade':True})
    ax.set_title(source, fontsize=14)
    ax.set_xlabel('')

plt.xlim(-0.75, 0.75)
plt.show()
The distributions look broadly similar, but it's hard to see the differences when they are in separate plots. Let's try overlaying them all on one plot.
# Overlay each density curve on the same plot for closer comparison
fig, ax = plt.subplots(figsize=(12, 8))

for source, color in zip(source_names, source_colors):
    sns.distplot(main_data.loc[main_data['Source'] == source]['SentimentTitle'],
                 ax=ax, hist=False, label=source, color=color)
    ax.set_xlabel('')

plt.xlim(-0.75, 0.75)
plt.show()
We can see that the sources' headline-sentiment distributions are very similar: no single source stands out as unusually positive or negative, all 12 of the most common sources are centered around 0, and their tails look comparable. But what does the underlying data say? Let's look at the numbers:
# Group by Source, then get descriptive statistics for title sentiment
source_info = main_data.groupby('Source')['SentimentTitle'].describe()

# Recall that source_names contains the top 12 sources
# We'll also sort by highest standard deviation
source_info.loc[source_names].sort_values('std', ascending=False)[['std', 'min', 'max']]
The Wall Street Journal stands out: compared with the other publishers, it has both the largest standard deviation and the largest range. This suggests that The Wall Street Journal tends to run more negatively worded headlines, which is an interesting potential finding. Verifying it rigorously would require a hypothesis test, which is beyond the scope of this article.
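As a rough illustration only (this test was not performed in the original analysis), such a check might compare the variance of WSJ headline sentiment against the other top sources, for example with Levene's test from scipy. The exact source string below is an assumption about how the label appears in main_data['Source']:

# Hedged sketch of a variance test (illustrative; the source label is assumed)
from scipy import stats

wsj_label = 'Wall Street Journal'  # assumption: adjust to the exact string in the data
wsj_sentiment = main_data.loc[main_data['Source'] == wsj_label, 'SentimentTitle']
other_sentiment = main_data.loc[main_data['Source'].isin(source_names) &
                                (main_data['Source'] != wsj_label), 'SentimentTitle']

stat, p_value = stats.levene(wsj_sentiment, other_sentiment)
print('Levene statistic: {:.3f}, p-value: {:.5f}'.format(stat, p_value))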
Predicting article popularity
The first task is to rejoin the document vectors with their respective headlines. Because the corpus and titles_list were processed together during preprocessing, the vectors and the headlines they represent are still in matching order. Meanwhile, in main_data we dropped every article with a popularity of -1, so we also need to drop the vectors that represent those headlines.
Training models on a dataset this large with 300-dimensional vectors would be slow, so we can reduce the dimensionality of the vectors first. I also engineered a new feature from each article's publish date, based on Unix time.
For details on Unix time, see:
https://en.wikipedia.org/wiki/Unix_time
import datetime

# Convert publish date column to make it compatible with other datetime objects
main_data['PublishDate'] = pd.to_datetime(main_data['PublishDate'])

# Time since the Unix epoch
t = datetime.datetime(1970, 1, 1)

# Subtract this time from each article's publish date
main_data['TimeSinceEpoch'] = main_data['PublishDate'] - t

# Create another column for just the days from the timedelta objects
main_data['DaysSinceEpoch'] = main_data['TimeSinceEpoch'].astype('timedelta64[D]')

main_data['TimeSinceEpoch'].describe()
As we can see, all of these articles were published within a window of about 250 days.
from sklearn.decomposition import PCA

pca = PCA(n_components=15, random_state=10)

# As a reminder, x is the array with our 300-dimensional vectors
reduced_vecs = pca.fit_transform(x)

df_w_vectors = pd.DataFrame(reduced_vecs)
df_w_vectors['Title'] = titles_list

# Use pd.concat to match original titles with their vectors
main_w_vectors = pd.concat((df_w_vectors, main_data), axis=1)

# Get rid of vectors that couldn't be matched with the main_df
main_w_vectors.dropna(axis=0, inplace=True)
Now we need to drop the non-numeric and non-dummy columns so we can feed the data into a model. I also applied scaling to the DaysSinceEpoch feature, since it is much larger in magnitude than the other variables.
# Drop all non-numeric, non-dummy columns, for feeding into the models
cols_to_drop = ['IDLink', 'Title', 'TimeSinceEpoch', 'Headline', 'PublishDate', 'Source']
data_only_df = pd.get_dummies(main_w_vectors, columns = ['Topic']).drop(columns=cols_to_drop)

# Standardize DaysSinceEpoch since the raw numbers are larger in magnitude
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Reshape so we can feed the column to the scaler
standardized_days = np.array(data_only_df['DaysSinceEpoch']).reshape(-1, 1)
data_only_df['StandardizedDays'] = scaler.fit_transform(standardized_days)

# Drop the raw column; we don't need it anymore
data_only_df.drop(columns=['DaysSinceEpoch'], inplace=True)

# Look at the new range
data_only_df['StandardizedDays'].describe()

# Get Facebook data only
fb_data_only_df = data_only_df.drop(columns=['GooglePlus', 'LinkedIn'])

# Separate the features and the response
X = fb_data_only_df.drop('Facebook', axis=1)
y = fb_data_only_df['Facebook']

# 80% of data goes to training
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
Now let's run the data through a non-optimized XGBoost model:
from sklearn.metrics import mean_squared_error

# Instantiate an XGBRegressor
xgr = xgb.XGBRegressor(random_state=2)

# Fit the regressor to the training set
xgr.fit(X_train, y_train)

y_pred = xgr.predict(X_test)
mean_squared_error(y_test, y_pred)
353495.42
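For context, one quick sanity check (not part of the original analysis) is to compare this against a trivial baseline that always predicts the training-set mean:

# Hedged sanity check: MSE of a constant, mean-only baseline prediction
baseline_pred = np.full(len(y_test), y_train.mean())
mean_squared_error(y_test, baseline_pred)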
To put it mildly, that result is not very satisfying. Can we improve it with hyperparameter tuning? I adapted a hyperparameter-tuning grid from the following article:
https://www.kaggle.com/jayatou/xgbregressor-with-gridsearchcv
from sklearn.model_selection import GridSearchCV

# Various hyper-parameters to tune
xgb1 = xgb.XGBRegressor()
parameters = {'nthread': [4],
              'objective': ['reg:linear'],
              'learning_rate': [.03, 0.05, .07],
              'max_depth': [5, 6, 7],
              'min_child_weight': [4],
              'silent': [1],
              'subsample': [0.7],
              'colsample_bytree': [0.7],
              'n_estimators': [250]}

xgb_grid = GridSearchCV(xgb1,
                        parameters,
                        cv = 2,
                        n_jobs = 5,
                        verbose=True)

xgb_grid.fit(X_train, y_train)
According to xgb_grid, the best parameters are:
{'colsample_bytree': 0.7, 'learning_rate': 0.03, 'max_depth': 5, 'min_child_weight': 4, 'n_estimators': 250, 'nthread': 4, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.7}
Let's try again with the new parameters:
params = {'colsample_bytree': 0.7, 'learning_rate': 0.03, 'max_depth': 5,
          'min_child_weight': 4, 'n_estimators': 250, 'nthread': 4,
          'objective': 'reg:linear', 'silent': 1, 'subsample': 0.7}

# Try again with new params
xgr = xgb.XGBRegressor(random_state=2, **params)

# Fit the regressor to the training set
xgr.fit(X_train, y_train)

y_pred = xgr.predict(X_test)
mean_squared_error(y_test, y_pred)
351220.13
That is better by roughly 2,000 MSE, which is only a marginal improvement. It suggests that, in its current state, the data simply doesn't give the model enough to work with. Can we improve the features instead? I decided to split the articles into two groups: "duds" (articles with 0 or 1 shares) and non-duds.
If we can give the regressor a new feature that helps it identify which articles are likely to get more shares, the residuals should shrink and the mean squared error should go down.
Detour: detecting dud articles
From the log-transformed plots we made earlier, we can see that there are really two collections of articles: one sitting at 0, and another starting at 1 and stretching upward. I used a few classifiers to identify whether an article is a "dud" (0 or 1 shares), and then fed that classification in as a feature for the final regressor that predicts popularity. This approach is known as model stacking.
# Define a quick function that will return 1 (true) if the article has 0-1 share(s)
def dud_finder(popularity):
    if popularity <= 1:
        return 1
    else:
        return 0

# Create target column using the function
fb_data_only_df['is_dud'] = fb_data_only_df['Facebook'].apply(dud_finder)
fb_data_only_df[['Facebook', 'is_dud']].head()

# 28% of articles can be classified as "duds"
fb_data_only_df['is_dud'].sum() / len(fb_data_only_df)
0.28
Now that the dud labels are in place, we can train the classifiers: a random forest, an optimized XGBClassifier, and a K-Nearest Neighbors classifier. (I'll skip the details of tuning the XGB model here.)
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X = fb_data_only_df.drop(['is_dud', 'Facebook'], axis=1)
y = fb_data_only_df['is_dud']

# 80% of data goes to training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)

# Best params, produced by HP tuning
params = {'colsample_bytree': 0.7, 'learning_rate': 0.03, 'max_depth': 5,
          'min_child_weight': 4, 'n_estimators': 200, 'nthread': 4,
          'silent': 1, 'subsample': 0.7}

# Try xgc again with new params
xgc = xgb.XGBClassifier(random_state=10, **params)
rfc = RandomForestClassifier(n_estimators=100, random_state=10)
knn = KNeighborsClassifier()

preds = {}
for model_name, model in zip(['XGClassifier', 'RandomForestClassifier', 'KNearestNeighbors'],
                             [xgc, rfc, knn]):
    model.fit(X_train, y_train)
    preds[model_name] = model.predict(X_test)
Evaluate the models and get the classification reports:
from sklearn.metrics import classification_report, roc_curve, roc_auc_score

for k in preds:
    print("{} performance:".format(k))
    print()
    print(classification_report(y_test, preds[k]), sep='\n')
The XGBClassifier comes out on top on f1-score, followed by the random forest and then KNN, but KNN does the best job on recall (successfully identifying duds). This alone shows the potential value of model stacking: even an excellent model like XGBoost can underperform on part of the task, and a model like KNN can add diversity to the combined predictions.
# Plot ROC curves
for model in [xgc, rfc, knn]:
    fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
    plt.plot([0, 1], [0, 1], 'k--')
    plt.plot(fpr, tpr)

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves')
plt.show()
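Since roc_auc_score was already imported above, each curve can also be summarized with a single number. A minimal sketch:

# Sketch: summarize each model's ROC curve with its AUC
for model_name, model in zip(['XGClassifier', 'RandomForestClassifier', 'KNearestNeighbors'],
                             [xgc, rfc, knn]):
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print('{} AUC: {:.3f}'.format(model_name, auc))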
Predicting article popularity: round 2
Now we can average the predicted probabilities from the three classifiers and use the result as a feature for the regressor.
averaged_probs = (xgc.predict_proba(X)[:, 1] +
                  knn.predict_proba(X)[:, 1] +
                  rfc.predict_proba(X)[:, 1]) / 3

X['prob_dud'] = averaged_probs
y = fb_data_only_df['Facebook']
This was followed by another round of hyperparameter tuning with the new feature included, which I'll omit here. The results are as follows:
xgr = xgb.XGBRegressor(random_state=2, **params)

# Fit the regressor to the training set
xgr.fit(X_train, y_train)

y_pred = xgr.predict(X_test)
mean_squared_error(y_test, y_pred)
314551.59
The MSE is in the same general ballpark as before the stacking step, which mainly tells us that MSE weights outliers heavily. In fact, we can also compute the mean absolute error (MAE) to assess how much influence the big outliers have. In mathematical terms, MAE computes the L1 norm of the residuals (the mean of their absolute values), whereas MSE is based on the L2 norm. We can compare the MAE to the square root of the MSE, also known as the root mean squared error (RMSE).
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred), np.sqrt(mean_squared_error(y_test, y_pred))
(180.528167661507, 560.8489939081992)
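To make the L1/L2 distinction concrete, here is a minimal sketch that recomputes both quantities directly from the residuals; it is only an illustration of the definitions above:

# Sketch: MAE and RMSE computed by hand from the residuals
residuals = y_test - y_pred
mae_manual = np.mean(np.abs(residuals))         # L1: mean absolute residual
rmse_manual = np.sqrt(np.mean(residuals ** 2))  # L2: root mean squared residual
print(mae_manual, rmse_manual)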
The mean absolute error is only about a third of the RMSE, so the model is doing quite a bit better than the MSE alone suggested. As a last step, let's look at how important each feature is to the XGBRegressor:
for feature, importance in zip(list(X.columns), xgr.feature_importances_):
    print('Model weight for feature {}: {}'.format(feature, importance))
It turns out that prob_dud is the most important feature, and our custom StandardizedDays feature is the second most important. (Features 0 through 14 correspond to the reduced headline embedding vectors.)
So even though this round of model stacking didn't improve the overall results, we did succeed in identifying an important source of variation in the data.
If I were to continue this project, I would consider augmenting the data with external sources, including Source as a variable via binning or hashing, running the models on the raw 300-dimensional vectors, and using "sliced" data about each article's popularity at different points in time to predict its eventual popularity. The hashing idea for the Source column is sketched below.
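A minimal sketch of that hashing idea, using scikit-learn's FeatureHasher; the number of hash features is an arbitrary illustrative choice, not something tested in this analysis:

# Sketch: hash the Source column into a small set of numeric features
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=16, input_type='string')
# FeatureHasher expects an iterable of token sequences, so wrap each source name in a list
source_hashed = hasher.transform([[str(s)] for s in main_data['Source']]).toarray()
source_hashed.shape  # (n_articles, 16)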
The original dataset used in this article is available here:
https://archive.ics.uci.edu/ml/datasets/News+Popularity+in+Multiple+Social+Media+Platforms