How to do it...
- Initialize a new Python file by importing the following packages:
import numpy as np
from nltk.corpus import brown
from chunking import splitter
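The splitter helper is imported from the chunking module built in an earlier recipe. If that file is not at hand, the following is a minimal sketch of what it needs to do (an assumption, not the book's exact implementation): split a text into chunks containing a fixed number of words each.

# chunking.py -- hypothetical minimal version of the splitter helper imported above.
# It splits the input text into pieces of roughly num_of_words words each.
def splitter(content, num_of_words):
    words = content.split(' ')
    chunks = []
    current_words = []
    for word in words:
        current_words.append(word)
        if len(current_words) == num_of_words:
            chunks.append(' '.join(current_words))
            current_words = []
    if current_words:
        chunks.append(' '.join(current_words))
    return chunks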
- Define the main function and read the input data from the Brown corpus:
if __name__ == '__main__':
    content = ' '.join(brown.words()[:10000])
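If the Brown corpus has not been downloaded yet, the brown.words() call raises a LookupError; it can be fetched once beforehand:

import nltk
nltk.download('brown')  # one-time download of the Brown corpus used above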
- Split the text content into chunks:
    num_of_words = 2000
    num_chunks = []
    count = 0
    texts_chunk = splitter(content, num_of_words)
- Build a vocabulary based on these text chunks:
    for text in texts_chunk:
        num_chunk = {'index': count, 'text': text}
        num_chunks.append(num_chunk)
        count += 1
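With 10,000 words split into chunks of 2,000 words each, this loop should produce five chunks, which is why five chunk labels are used when the matrix is printed later. A quick check confirms it:

    # Optional sanity check: 10,000 words / 2,000 words per chunk = 5 chunks
    print(len(num_chunks))   # expected to print 5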
- Extract a document-term matrix, which effectively counts the number of occurrences of each word in each document:
from sklearn.feature_extraction.text import CountVectorizer
- Build the document-term matrix:
    vectorizer = CountVectorizer(min_df=5, max_df=.95)
    matrix = vectorizer.fit_transform([num_chunk['text'] for num_chunk in num_chunks])
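Here, min_df=5 keeps only words that occur in at least five of the chunks, while max_df=.95 is intended to drop words whose document frequency is too high. Printing the shape of the sparse matrix is a quick way to verify the result; with five chunks, the first dimension should be 5:

    # The document-term matrix is a SciPy sparse matrix with one row per chunk
    # and one column per retained vocabulary word, i.e. (5, number_of_kept_words).
    print(matrix.shape)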
- Extract the vocabulary and print it:
    vocabulary = np.array(vectorizer.get_feature_names())
    print("\nVocabulary:")
    print(vocabulary)
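Recent scikit-learn releases have replaced get_feature_names() with get_feature_names_out(); if the call above fails with an AttributeError, the following fallback keeps the recipe working on both old and new versions:

    # Prefer get_feature_names_out() (newer scikit-learn), fall back to
    # get_feature_names() on older versions.
    try:
        vocabulary = np.array(vectorizer.get_feature_names_out())
    except AttributeError:
        vocabulary = np.array(vectorizer.get_feature_names())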
- Print the document term matrix:
print "nDocument term matrix:" chunks_name = ['Chunk-0', 'Chunk-1', 'Chunk-2', 'Chunk-3', 'Chunk-4'] formatted_row = '{:>12}' * (len(chunks_name) + 1) print 'n', formatted_row.format('Word', *chunks_name), 'n'
- Iterate over the words and print the number of occurrences of each word in the various chunks:
    for word, item in zip(vocabulary, matrix.T):
        # 'item' is a sparse row holding the counts of this word across the chunks
        result = [str(x) for x in item.data]
        print(formatted_row.format(word, *result))
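This works because min_df=5 forces every surviving word to appear in all five chunks, so item.data always holds exactly five counts. For settings where a word can be missing from some chunks, a dense view of the matrix prints the zero counts as well and keeps the columns aligned (a sketch, not part of the original recipe):

    # Alternative: convert the sparse matrix to a dense array so that zero
    # counts are printed too and the columns stay aligned with the chunk names.
    dense_matrix = matrix.toarray()
    for word, counts in zip(vocabulary, dense_matrix.T):
        print(formatted_row.format(word, *[str(x) for x in counts]))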
- The result obtained after executing the bag-of-words model is shown as follows:
To understand how the bag-of-words model works on a given sentence, refer to the following:
- Introduction to Sentiment Analysis, explained here: https://blog.algorithmia.com/introduction-sentiment-analysis/