I have a string "Hello I am going to I with hello am". I want to find how many times a word occurs in the string. For example, hello occurs 2 times. I tried this approach:
def countSub(pat, string):
    result = 0
    for i in range(len(string) - len(pat) + 1):
        for j in range(len(pat)):
            if string[i + j] != pat[j]:
                break
        else:
            result += 1
    return result
If you want to find the count of an individual word, just use count:
input_string.count("Hello")
Use collections.Counter and split() to tally up all the words:
from collections import Counter
words = input_string.split()
wordCount = Counter(words)
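As a rough illustration of the Counter approach, assuming input_string holds the sentence from the question (the value is filled in here just for the example), a single word's tally can be looked up like a dictionary entry. Note that split() keeps the original case, so "Hello" and "hello" end up as separate keys:

```python
from collections import Counter

# Assumed input: the sentence from the question.
input_string = "Hello I am going to I with hello am"

words = input_string.split()   # split on whitespace
wordCount = Counter(words)     # maps each word to its number of occurrences

print(wordCount["am"])         # 2
print(wordCount["Hello"])      # 1 (case-sensitive: "hello" is a separate key)
print(wordCount["missing"])    # 0 (Counter returns 0 for unseen keys)
```

Lowercasing the string before splitting would merge "Hello" and "hello" into one key.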
You can split the string into words and count them, which gives the total number of words:
count = len(my_string.split())
Counter from collections is your friend:
>>> from collections import Counter
>>> counts = Counter(sentence.lower().split())
Here is an alternative, case-insensitive approach:
sum(1 for w in s.lower().split() if w == 'Hello'.lower())
2
It matches by converting both the string and the target to lower case.
PS: This also takes care of the "am ham".count("am") == 2 problem with str.count(), pointed out by @DSM below too :)
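To make the substring issue concrete, here is a small sketch comparing str.count() with whole-word matching. The helper name count_word is made up for this example, and the regex variant is just one possible alternative, not something taken from the answers above:

```python
import re

def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of `word` in `text`.

    Hypothetical helper, for illustration only.
    """
    # \b matches a word boundary, so "am" does not match inside "ham".
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

print("am ham".count("am"))        # 2: str.count() also matches inside "ham"
print(count_word("am ham", "am"))  # 1: only the standalone word
print(count_word("Hello I am going to I with hello am", "hello"))  # 2
```

Unlike split(), the word-boundary pattern also copes with punctuation attached to a word, such as "hello," or "am.".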
The vector of occurrence counts of words is called bag-of-words.
Scikit-learn provides a nice module to compute it, sklearn.feature_extraction.text.CountVectorizer. Example:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             min_df=1,  # the default; keeps every word
                             max_features=50)
text = ["Hello I am going to I with hello am"]
# Count
train_data_features = vectorizer.fit_transform(text)
# get_feature_names_out() replaces get_feature_names() in recent scikit-learn releases
vocab = vectorizer.get_feature_names_out()
# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features.toarray(), axis=0)
# For each, print the vocabulary word and the number of times it
# appears in the training set
for tag, count in zip(vocab, dist):
    print(count, tag)
Output:
2 am
1 going
2 hello
1 to
1 with
Part of the code was taken from this Kaggle tutorial on bag-of-words.
FYI: How to use sklearn's CountVectorizerand() to get ngrams that include any punctuation as separate tokens?
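A note on the CountVectorizer output above: the word "I" is missing because, with default settings, CountVectorizer lowercases the text and keeps only tokens of two or more word characters (its default token_pattern is r"(?u)\b\w\w+\b"). The following standard-library sketch mimics that behaviour for this one sentence. It is an approximation for illustration, not how scikit-learn is implemented:

```python
import re
from collections import Counter

text = "Hello I am going to I with hello am"

# Lowercase, then keep only tokens with at least two word characters,
# mirroring CountVectorizer's default token pattern.
tokens = re.findall(r"\b\w\w+\b", text.lower())
bag = Counter(tokens)

# Print "count word" pairs in alphabetical order of the word.
for word in sorted(bag):
    print(bag[word], word)
```

To keep single-character words such as "I" in scikit-learn itself, a different token_pattern can be passed to CountVectorizer.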