Find Most Common Words from a Website in Python 3 [closed]


Question


I need to find and copy the words that appear over 5 times on a given website using Python 3 code, and I'm not sure how to do it. I've looked through the archives here on Stack Overflow, but other solutions rely on Python 2 code. Here's the measly code I have so far:

   from urllib.request import urlopen
   website = urlopen("http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart")

Does anyone have any advice on what to do? I have NLTK installed, and I've looked into Beautiful Soup, but for the life of me I have no idea how to install it correctly (I'm very Python-green)! As I am learning, any explanation would also be very much appreciated. Thank you :)


Answer 1:


This is not perfect, but it should give you an idea of how to get started using requests, BeautifulSoup and collections.Counter:

import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation

r = requests.get("http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart")

soup = BeautifulSoup(r.content, "html.parser")

# join the visible text of every <p> paragraph into one string per paragraph
text = (''.join(s.findAll(text=True)) for s in soup.findAll('p'))

c = Counter((x.rstrip(punctuation).lower() for y in text for x in y.split()))
print(c.most_common())  # prints the most common words, starting with the most common

[('the', 279), ('and', 192), ('in', 175), ('of', 168), ('his', 140), ('a', 124), ('to', 103), ('mozart', 82), ('was', 77), ('he', 70), ('with', 53), ('as', 50), ('for', 40), ("mozart's", 39), ('on', 35), ('from', 34), ('at', 31), ('by', 31), ('that', 26), ('is', 23), ('k.', 21), ('an', 20), ('had', 20), ('were', 20), ('but', 19), ('which',.............

print([x for x in c if c.get(x) > 5])  # words appearing more than 5 times

['there', 'but', 'both', 'wife', 'for', 'musical', 'salzburg', 'it', 'more', 'first', 'this', 'symphony', 'wrote', 'one', 'during', 'mozart', 'vienna', 'joseph', 'in', 'later', 'salzburg,', 'other', 'such', 'last', 'needed]', 'only', 'their', 'including', 'by', 'music,', 'at', "mozart's", 'mannheim,', 'composer', 'and', 'are', 'became', 'four', 'premiered', 'time', 'did', 'the', 'not', 'often', 'is', 'have', 'began', 'some', 'success', 'court', 'that', 'performed', 'work', 'him', 'leopold', 'these', 'while', 'been', 'new', 'most', 'were', 'father', 'opera', 'as', 'who', 'classical', 'k.', 'to', 'of', 'has', 'many', 'was', 'works', 'which', 'early', 'three', 'family', 'on', 'a', 'when', 'had', 'december', 'after', 'he', 'no.', 'year', 'from', 'great', 'period', 'music', 'with', 'his', 'composed', 'minor', 'two', 'number', '1782', 'an', 'piano']
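
If you also want to keep the counts next to those words (a small variation, not part of the original answer), the two prints above can be combined:

# words appearing more than 5 times, together with their counts,
# ordered from most to least frequent
frequent = [(word, count) for word, count in c.most_common() if count > 5]
print(frequent)  # e.g. [('the', 279), ('and', 192), ('in', 175), ...] for the Mozart page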



Answer 2:


So, this is coming from a newbie, but if you just need a quick answer, I think this might work. Please note that with this method you cannot just feed the program a URL; you have to manually paste the text into the code (sorry).

text = '''INSERT TEXT HERE'''.split()  # where you see "INSERT TEXT HERE", that's where the text goes
# Note the .split() method at the end: it converts the text into a list, splitting on the spaces.
# For example, "red dog food".split() would be ['red', 'dog', 'food'].
overusedwords = []  # the words that are used 5 or more times will be collected here
for i in text:  # iterate through every single word of the text
    if text.count(i) >= 5 and overusedwords.count(i) == 0:  # (1. see the explanation below)
        overusedwords.append(i)  # add the word to the list of words used 5 or more times
if len(overusedwords) > 0:  # if there are no words used 5 or more times, don't print anything useless
    print('The overused words are:')
    for i in overusedwords:
        print(i)
else:
    print('No words used 5 or more times.')  # just in case there are no words used 5 or more times

For an explanation of the text.count(i) >= 5 part: on every pass through the for loop, it checks whether that specific word appears five or more times in the text. The and overusedwords.count(i) == 0 part just makes sure the same word isn't added to the list of overused words twice. Hope I helped. I suspect you wanted a method where you could get this information straight from typing in the URL, but this might help other beginners who have a similar question.
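
As a quick illustration of how this snippet behaves, here is a hypothetical run on a made-up piece of text (the text and the resulting words are invented for the example):

text = ("the quick fox and the lazy dog and the cat and the bird "
        "and the fish and the mouse").split()
overusedwords = []
for i in text:
    if text.count(i) >= 5 and overusedwords.count(i) == 0:
        overusedwords.append(i)
if len(overusedwords) > 0:
    print('The overused words are:')
    for i in overusedwords:
        print(i)  # prints 'the' (6 occurrences) and then 'and' (5 occurrences)
else:
    print('No words used 5 or more times.')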




Answer 3:


I'd do it like this (a combined sketch follows the steps):

  • Install BeautifulSoup, which is explained here.
  • You need these imports:

    from bs4 import BeautifulSoup
    import re
    from collections import Counter
    
  • Grab the visible text on the site with BeautifulSoup, which is explained on Stack Overflow here.

  • Get a list lst of words from the visible text with

    re.findall(r'\b\w+', visible_text_string)
    
  • Convert every word to lower-case

    lst = [x.lower() for x in lst]
    
  • Count the occurrences of each word and make a list of (word, count) tuples for the words that appear more than 5 times.

    counter = Counter(lst)
    occs = [(word,count) for word,count in counter.items() if count > 5]
    
  • Sort occs by occurrence:

    occs.sort(key=lambda x:x[1])
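
The steps above can be combined into a rough end-to-end sketch. This is only a sketch, not part of the original answer: it assumes the requests package for fetching the page and approximates the "visible text" step by simply stripping <script> and <style> elements before calling get_text().

    import re
    from collections import Counter

    import requests
    from bs4 import BeautifulSoup

    url = "http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart"
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    # crude approximation of "visible text": drop script/style elements first
    for tag in soup(["script", "style"]):
        tag.decompose()
    visible_text_string = soup.get_text(separator=" ")

    # list of words, converted to lower-case
    lst = [x.lower() for x in re.findall(r'\b\w+', visible_text_string)]

    # (word, count) tuples for words appearing more than 5 times, sorted by count
    counter = Counter(lst)
    occs = [(word, count) for word, count in counter.items() if count > 5]
    occs.sort(key=lambda x: x[1])
    print(occs[-10:])  # the ten most frequent words are at the end of the list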
    



Answer 4:


scrapy, urllib and BeautifulSoup are your friends when it comes to munging data off websites.

It depends on the individual site and where the author(s) of the site put the text on the page. Mostly you will be able to find the text in <p>...</p>.

For example in this site (http://www.yoursingapore.com/content/traveller/en/browse/see-and-do/nightlife/dance-clubs/zouk.html), the text you want is:

If you only have time for one club in Singapore, then it simply has to be Zouk. Probably Singapore’s only nightspot of international repute, Zouk remains both an institution and a rite of passage for young people in the city-state.

It has spawned several other clubs in neighbouring countries like Malaysia, and even has its own dance festival – Sentosa’s ZoukOut. Zouk is made up of three clubs and a wine bar, with the main room showcasing techno and house music. Velvet Underground is more relaxed and exclusive, while Phuture is experimental and racier than the rest, just as its name suggests.

Zouk’s global reputation means it’s home to all manner of leading world DJs, from Carl Cox and Paul Oakenfold to the Chemical Brothers and Primal Scream. Zouk also holds its famous Mambo Jambo retro nights on Wednesdays, another reason why a night at Zouk is one to savour.

There is other text on the page, but normally you only want the main text and not the navigation bars and the boilerplate on the page.

You can get it simply by:

>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup as bsoup
>>> url = "http://www.yoursingapore.com/content/traveller/en/browse/see-and-do/nightlife/dance-clubs/zouk.html"
>>> page = urlopen(url).read()
>>> for i in bsoup(page, "html.parser").find_all('p'):
...     print(i.text.strip())
... 

If you only have time for one club in Singapore, then it simply has to be Zouk. Probably Singapore’s only nightspot of international repute, Zouk remains both an institution and a rite of passage for young people in the city-state.
It has spawned several other clubs in neighbouring countries like Malaysia, and even has its own dance festival – Sentosa’s ZoukOut. Zouk is made up of three clubs and a wine bar, with the main room showcasing techno and house music. Velvet Underground is more relaxed and exclusive, while Phuture is experimental and racier than the rest, just as its name suggests.
Zouk’s global reputation means it’s home to all manner of leading world DJs, from Carl Cox and Paul Oakenfold to the Chemical Brothers and Primal Scream. Zouk also holds its famous Mambo Jambo retro nights on Wednesdays, another reason why a night at Zouk is one to savour.
Find us on       Facebook      Twitter      Youtube      Wikipedia     Singapore Reviews

Copyright © 2013 Singapore Tourism Board. Website Terms of Use   |   Privacy Statement   |   Photo Credits

You'll notice that you got more than you really need, so you can sift the bsoup(page).find_all() even further by selecting the <div class="paragraph section">...</div> before accessing the paragraphs inside it:

>>> for i in bsoup(page, "html.parser").find_all(attrs={'class':'paragraph section'}):
...     print(i.text.strip())
... 
If you only have time for one club in Singapore, then it simply has to be Zouk. Probably Singapore’s only nightspot of international repute, Zouk remains both an institution and a rite of passage for young people in the city-state. 
It has spawned several other clubs in neighbouring countries like Malaysia, and even has its own dance festival – Sentosa’s ZoukOut. Zouk is made up of three clubs and a wine bar, with the main room showcasing techno and house music. Velvet Underground is more relaxed and exclusive, while Phuture is experimental and racier than the rest, just as its name suggests.
Zouk’s global reputation means it’s home to all manner of leading world DJs, from Carl Cox and Paul Oakenfold to the Chemical Brothers and Primal Scream. Zouk also holds its famous Mambo Jambo retro nights on Wednesdays, another reason why a night at Zouk is one to savour.

And voila, there you have the text. But as said before, how to munge the main text from the page depends on how the page is written.

Here's the full code:

>>> from urllib.request import urlopen
>>> from collections import Counter
>>> from nltk import word_tokenize
>>> from bs4 import BeautifulSoup as bsoup
>>> url = "http://www.yoursingapore.com/content/traveller/en/browse/see-and-do/nightlife/dance-clubs/zouk.html"
>>> page = urlopen(url).read()
>>> text = " ".join([i.text.strip() for i in bsoup(page, "html.parser").find_all(attrs={'class':'paragraph section'})])
>>> word_freq = Counter(word_tokenize(text))
>>> word_freq['Zouk']
4
>>> word_freq.most_common()
[(',', 8), ('and', 8), ('to', 4), ('of', 4), ('Zouk', 4), ('is', 4), ('the', 4), ('its', 3), ('has', 3), ('in', 3), ('a', 3), ('only', 2), ('for', 2), ('one', 2), ('clubs', 2), ('exclusive', 1), ('all', 1), ('Velvet', 1), ('just', 1), ('dance', 1), ('global', 1), ('rest', 1), ('Chemical', 1), ('Oakenfold', 1), ('it’s', 1), ('young', 1), ('passage', 1), ('main', 1), ('neighbouring', 1), ('then', 1), ('than', 1), ('means', 1), ('famous', 1), ('made', 1), ('world', 1), ('like', 1), ('DJs', 1), ('bar', 1), ('name', 1), ('countries', 1), ('night', 1), ('showcasing', 1), ('Paul', 1), ('people', 1), ('house', 1), ('ZoukOut.', 1), ('up', 1), ('–', 1), ('Underground', 1), ('home', 1), ('even', 1), ('Singapore', 1), ('city-state.', 1), ('retro', 1), ('international', 1), ('rite', 1), ('be', 1), ('institution', 1), ('reason', 1), ('techno', 1), ('both', 1), ('nightspot', 1), ('festival', 1), ('experimental', 1), ('Singapore’s', 1), ('own', 1), ('savour', 1), ('suggests.', 1), ('Zouk’s', 1), ('simply', 1), ('another', 1), ('Probably', 1), ('Jambo', 1), ('spawned', 1), ('from', 1), ('Brothers', 1), ('remains', 1), ('leading', 1), ('.', 1), ('Phuture', 1), ('Carl', 1), ('more', 1), ('on', 1), ('club', 1), ('relaxed', 1), ('If', 1), ('with', 1), ('Wednesdays', 1), ('room', 1), ('Primal', 1), ('while', 1), ('three', 1), ('at', 1), ('racier', 1), ('it', 1), ('an', 1), ('Zouk.', 1), ('as', 1), ('manner', 1), ('have', 1), ('nights', 1), ('Malaysia', 1), ('holds', 1), ('also', 1), ('other', 1), ('repute', 1), ('you', 1), ('several', 1), ('Sentosa’s', 1), ('Cox', 1), ('Mambo', 1), ('why', 1), ('It', 1), ('reputation', 1), ('time', 1), ('Scream.', 1), ('music.', 1), ('wine', 1)]
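
To tie this back to the original question, the same Counter can then be filtered for the words used more than 5 times (a small addition, not part of the original answer; on this short page only the comma token and 'and' qualify):

>>> [word for word, count in word_freq.items() if count > 5]
[',', 'and']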

The above example comes from:

Liling Tan and Francis Bond. 2011. Building and annotating the linguistically diverse NTU-MC (NTU-multilingual corpus). In Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation (PACLIC 25). Singapore.



Source: https://stackoverflow.com/questions/24396406/find-most-common-words-from-a-website-in-python-3
