To start, here is an article in which IBM Watson runs text analysis on the inaugural addresses of past US presidents: http://36kr.com/p/5062661.html
At noon on January 20, 2017, Donald Trump was sworn in in Washington, D.C. and officially became the 45th President of the United States, completing a remarkable leap straight from real estate mogul to the presidency. Today we'll run a text analysis on his inaugural address and see whether anything interesting turns up.
First, I found the transcript and video of Trump's inaugural address on CNN: https://edition.cnn.com/2017/01/20/politics/trump-inaugural-address/index.html
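For this post I simply pasted the transcript into a string by hand. If you would rather fetch it programmatically, here is a rough sketch using requests and BeautifulSoup; the assumption that the transcript sits in ordinary <p> tags is mine, so the selector may need adjusting and the result may include page boilerplate worth trimming.
import requests
from bs4 import BeautifulSoup
url='https://edition.cnn.com/2017/01/20/politics/trump-inaugural-address/index.html'
html=requests.get(url).text
soup=BeautifulSoup(html,'html.parser')
speech_text=' '.join(p.get_text() for p in soup.find_all('p')) #assumption: the transcript is contained in plain <p> tags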
The code is as follows:
speech_text='''(paste the full text of the speech here)'''
speech=speech_text.lower().split() #convert the speech to lowercase and split it into individual words
dic={} #create an empty dictionary to store the words from the speech
for word in speech:
    if word in dic:
        dic[word]+=1
    else:
        dic[word]=1
dic #display the dictionary; next we process it. It looks like this (only part is copied here):
{'"how': 1, '--': 4, '17:17': 1, '2017,': 1, '20th': 1, 'a': 15, 'about': 2, 'accept': 1, 'across': 5, 'action': 1, 'action.': 1, 'address': 1, 'administration': 1, 'affairs,': 1, 'again,': 1}
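Notice that split() only breaks on whitespace, so tokens such as '"how', 'action.' and 'again,' keep their punctuation and get counted separately from the clean forms of the same words. An optional cleanup step, sketched here with the standard string module (not part of the original run), would merge those duplicates:
import string
speech=[w.strip(string.punctuation) for w in speech_text.lower().split()] #strip leading/trailing punctuation from every token
speech=[w for w in speech if w] #drop tokens that were pure punctuation, such as '--'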
import operator
swd=sorted(dic.items(),key=operator.itemgetter(1),reverse=True) #sort the dictionary entries by value (the counts), in descending order
swd #display swd; the entries are sorted from highest to lowest count (only part is copied here due to length):
[('and', 73), ('the', 71), ('of', 48), ('our', 48), ('we', 45), ('will', 40), ('to', 37), ('is', 21), ('a', 15), ('for', 15), ('are', 14), ('in', 14), ('but', 13), ('all', 12), ('from', 12), ('be', 12), ('their', 11), ('american', 11), ('your', 11), ('not', 10), ('america', 9), ('this', 9), ('it', 9), ('that', 8), ('again.', 8), ('with', 8), ('every', 7), ('one', 7), ('you', 7), ('people', 6), ('great', 6), ('country', 6), ('on', 6), ('has', 6), ('back', 6), ('while', 6), ('by', 6), ('no', 6), ('new', 6), ('same', 6), ('president', 5), ('they', 5), ('have', 5), ('across', 5), ('right', 5), ('never', 5), ('at', 5), ('make', 5), ('you.', 4), ('america,', 4), ('world', 4), ('been', 4), ('today', 4), ('or', 4), ('--', 4), ('everyone', 4), ('which', 4), ('as', 4), ('nation', 4), ('other', 4), ('bring', 4), ('now', 3), ('its', 3), ('people.', 3), ('together,', 3), ('these', 3), ('too', 3), ("nation's", 3), ('factories', 3), ('protected', 3), ('there', 3), ('here', 3), ('america.', 3), ('whether', 3), ('millions', 3), ('many', 3), ('an', 3), ('so', 3), ('i', 3), ("we've", 3), ('foreign', 3), ('countries', 3), ('must', 3), ('let', 3), ('do', 3), ('when', 3), ('heart', 3), ('entire', 2), ('americans,', 2), ('thank', 2), ('citizens', 2), ('national', 2), ('face', 2), ('get', 2), ('done.', 2), ('obama', 2), ('very', 2), ('because', 2), ('transferring', 2), ('power', 2), ('party', 2), ('small', 2), ('government', 2), ('share', 2), ('wealth.', 2), ('politicians', 2), ('jobs', 2), ('country.', 2), ('capital,', 2), ('land.', 2), ('moment', 2), ('belongs', 2), ('united', 2), ('states', 2), ('day', 2), ('forgotten', 2), ('men', 2), ('women', 2), ('now.', 2), ('movement', 2), ('before.', 2), ('safe', 2), ('good', 2), ('like', 2), ...]
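As an aside, the counting and sorting above can also be done in a couple of lines with collections.Counter from the standard library; this is just an equivalent shortcut and does not change the analysis:
from collections import Counter
counts=Counter(speech) #same word -> frequency mapping as dic above
counts.most_common(20) #the 20 most frequent words, already sorted in descending order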
#As you can see, many of these are words with no analytical value, so we remove them by importing a list of stop words.
import nltk #the stopwords corpus may need to be downloaded first with nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words=stopwords.words('english')
stop_words #view which stop words were imported (only part is pasted here due to length):
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their']
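If you like, you can also extend this list with junk tokens specific to this transcript, for example the '--' dashes or the '17:17' timestamp that came along when the text was copied (these two additions are just illustrative):
stop_words=stop_words+['--','17:17'] #append transcript-specific junk tokens to the NLTK list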
for k,v in swd: #filter the stop words out of swd and print what remains
    if k not in stop_words:
        print(k,v)
The word counts we end up with are as follows: