Focusing in on specific results while scraping Twitter with Python and Beautiful Soup 4?

孤独总比滥情好

This is a follow-up to my post "Using Python to Scrape Nested Divs and Spans in Twitter?".

I'm not using the Twitter API because it doesn't look at the tweets by ha

2 Answers

故里飘歌

2021-01-14 18:06
Use the dictionary-like access to the Tag's attributes.

For example, to get the href attribute value:
```
links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = links[0]["href"]
```
Or, if you need to get the href values for every link found:
```
links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
urls = [link["href"] for link in links]
```
As a side note, you don't need to specify the complete class value to locate elements. class is a special multi-valued attribute and you can just use one of the classes (if this is enough to narrow down the search for the desired elements). For example, instead of:
```
soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
```
You may use:
```
soup('a', {'class': 'tweet-timestamp'})
```
Or, a CSS selector:
```
soup.select("a.tweet-timestamp")
```
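To see the single-class match and the CSS selector side by side, here is a minimal sketch. The HTML snippet is made up for illustration (only the class names come from the code above), and it assumes the built-in html.parser is good enough for your pages:
```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; real Twitter HTML will differ.
html = """
<div class="tweet">
  <a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/user/status/1">1h</a>
</div>
<div class="tweet">
  <a class="tweet-timestamp js-permalink" href="/user/status/2">2h</a>
</div>
<a class="other-link" href="/about">About</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Matching on one class finds both timestamp links,
# even though their full class strings differ.
by_class = [a["href"] for a in soup("a", {"class": "tweet-timestamp"})]

# The CSS selector picks out the same elements.
by_selector = [a["href"] for a in soup.select("a.tweet-timestamp")]

print(by_class)
print(by_selector)
```
Both lists should contain the two status URLs and skip the unrelated link.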
抹茶落季

2021-01-14 18:22
Alecxe already explained how to use the 'href' key to get the value.

So I'm going to answer the other part of your question:

Similarly, the retweets and favorites commands return large chunks of html, when all I really need is the numerical value that is displayed for each one.

.contents returns a list of all of a tag's children. Since the 'buttons' you're finding have several children, you can pick out the ones you're interested in by indexing into the parsed contents list:
```
retweetcount = retweets[0].contents[3].contents[1].contents[1].string
```
This will return the value 4.
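The odd-looking indexes (3, 1, 1) are most likely there because whitespace between tags is parsed into text nodes, which also count as children. A small sketch with made-up HTML (not the real tweet markup) shows the effect:
```python
from bs4 import BeautifulSoup

# Hypothetical markup: two spans separated by newlines.
html = "<div>\n<span>a</span>\n<span>b</span>\n</div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .contents includes the newline text nodes between the tags,
# so the spans sit at indexes 1 and 3, not 0 and 1.
children = div.contents
second = children[3].string

# Searching direct children by tag name skips the text nodes.
tags_only = div.find_all("span", recursive=False)

print(len(children), second, len(tags_only))
```
This is why index chains like the one above break as soon as the page's whitespace or nesting changes.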

If you want a rather more readable approach, try this:
```
retweetcount = retweets[0].find_all('span', class_='ProfileTweet-actionCountForPresentation')[0].string

favcount = favorites[0].find_all('span', {'class': 'ProfileTweet-actionCountForPresentation'})[0].string
```
This returns 4 and 2 respectively. This works because we take the tag element out of the ResultSet returned by soup/find_all (using [0]) and then search across all its descendants with another find_all().

Now you can loop across each tweet and extract this information rather easily.
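
A rough sketch of such a loop follows. The container class "tweet", the button classes "js-actionRetweet" and "js-actionFavorite", and the helper count_from are assumptions made up for this example; only tweet-timestamp and ProfileTweet-actionCountForPresentation come from the snippets above, so adjust the names to whatever your page actually uses:
```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; container and button classes are assumed.
html = """
<div class="tweet">
  <a class="tweet-timestamp" href="/user/status/1">1h</a>
  <button class="js-actionRetweet">
    <span class="ProfileTweet-actionCountForPresentation">4</span>
  </button>
  <button class="js-actionFavorite">
    <span class="ProfileTweet-actionCountForPresentation">2</span>
  </button>
</div>
<div class="tweet">
  <a class="tweet-timestamp" href="/user/status/2">2h</a>
  <button class="js-actionRetweet">
    <span class="ProfileTweet-actionCountForPresentation"></span>
  </button>
  <button class="js-actionFavorite">
    <span class="ProfileTweet-actionCountForPresentation">7</span>
  </button>
</div>
"""


def count_from(button):
    """Return the displayed count inside a button, or 0 if it is missing or empty."""
    if button is None:
        return 0
    span = button.find("span", class_="ProfileTweet-actionCountForPresentation")
    text = span.get_text(strip=True) if span else ""
    return int(text) if text.isdigit() else 0


soup = BeautifulSoup(html, "html.parser")

results = []
for tweet in soup.find_all("div", class_="tweet"):
    link = tweet.find("a", class_="tweet-timestamp")
    results.append({
        "url": link["href"] if link else None,
        "retweets": count_from(tweet.find("button", class_="js-actionRetweet")),
        "favorites": count_from(tweet.find("button", class_="js-actionFavorite")),
    })

for row in results:
    print(row)
```
Searching inside each tweet element keeps the counts paired with the right permalink, and the helper treats an empty counter as 0 instead of raising an error.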