Focusing in on specific results while scraping Twitter with Python and Beautiful Soup 4?

孤独总比滥情好 2021-01-14 17:30

This is a follow-up to my post "Using Python to Scrape Nested Divs and Spans in Twitter?".

I'm not using the Twitter API because it doesn't look at the tweets by ha

2 answers
  • 2021-01-14 18:06

    Use the dictionary-like access to the Tag's attributes.

    For example, to get the href attribute value:

    links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
    url = links[0]["href"]
    

    Or, if you need to get the href values for every link found:

    links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
    urls = [link["href"] for link in links]
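
    As a minimal, self-contained sketch of the attribute access described above: the HTML below is hypothetical sample markup that only reuses the class names from this answer, not real Twitter output. It also shows `.get()`, which returns `None` instead of raising `KeyError` when an attribute is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (assumption): only the class names come from the answer.
html = """
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/user/status/1">1h</a>
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/user/status/2">2h</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Calling the soup object directly is shorthand for soup.find_all(...).
links = soup("a", {"class": "tweet-timestamp js-permalink js-nav js-tooltip"})

# Dictionary-style access: raises KeyError if the attribute is absent.
first_url = links[0]["href"]

# .get() is the tolerant variant: it returns None for a missing attribute.
urls = [link.get("href") for link in links]
missing = links[0].get("data-not-there")

print(first_url, urls, missing)
```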
    

    As a side note, you don't need to specify the complete class value to locate elements. class is a special multi-valued attribute and you can just use one of the classes (if this is enough to narrow down the search for the desired elements). For example, instead of:

    soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
    

    You may use:

    soup('a', {'class': 'tweet-timestamp'})
    

    Or, a CSS selector:

    soup.select("a.tweet-timestamp")
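
    The following sketch compares these lookups on hypothetical sample markup (an assumption, not real Twitter HTML). It also illustrates one caveat of the full class string: it is matched as an exact string, so the same classes in a different order find nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (assumption): one permalink and one unrelated link.
html = """
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/user/status/1">1h</a>
<a class="js-nav other-link" href="/about">About</a>
"""
soup = BeautifulSoup(html, "html.parser")

by_full = soup("a", {"class": "tweet-timestamp js-permalink js-nav js-tooltip"})
by_one = soup("a", {"class": "tweet-timestamp"})
by_kwarg = soup.find_all("a", class_="tweet-timestamp")
by_css = soup.select("a.tweet-timestamp")

# All four lookups find the same single permalink.
hrefs = [[a["href"] for a in found] for found in (by_full, by_one, by_kwarg, by_css)]

# The full string is order-sensitive; a single class name is not affected by order.
reordered = soup("a", {"class": "js-permalink tweet-timestamp js-nav js-tooltip"})

print(hrefs, reordered)
```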
    
  • 2021-01-14 18:22

    Alecxe already explained how to use the 'href' key to get the value.

    So I'm going to answer the other part of your question:

    Similarly, the retweets and favorites commands return large chunks of html, when all I really need is the numerical value that is displayed for each one.

    .contents returns a list of all of a tag's direct children. Since the 'buttons' you are finding each have several nested children, you can reach the count by indexing into successive .contents lists:

    retweetcount = retweets[0].contents[3].contents[1].contents[1].string
    

    This will return the value 4.
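
    To see why the indices in such a chain look arbitrary, here is a small sketch on hypothetical markup (the `actions`, `label` and `count` class names are made up for illustration). Whitespace between tags is parsed as text nodes, and those count as children too, so the position of a tag depends on the page's formatting.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (assumption): two spans separated by newlines and indentation.
html = """<div class="actions">
  <span class="label">Retweets</span>
  <span class="count">4</span>
</div>"""
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

# .contents holds every direct child, including whitespace-only strings.
children = div.contents
count_by_index = children[3].string

# Restricting the search to direct child tags skips the text nodes.
tags_only = div.find_all(True, recursive=False)

print(len(children), count_by_index, [t["class"] for t in tags_only])
```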

    If you want a rather more readable approach, try this:

    retweetcount = retweets[0].find_all('span', class_='ProfileTweet-actionCountForPresentation')[0].string
    
    favcount = favorites[0].find_all('span', {'class': 'ProfileTweet-actionCountForPresentation'})[0].string
    

    These return 4 and 2 respectively. This works because indexing the ResultSet returned by soup/find_all with [0] gives a single Tag, and calling find_all() on that Tag searches all of its descendants recursively.

    Now you can loop across each tweet and extract this information rather easily.
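
    A rough sketch of such a loop follows. Only `tweet-timestamp` and `ProfileTweet-actionCountForPresentation` come from this thread; the `tweet`, `ProfileTweet-action--retweet` and `ProfileTweet-action--favorite` class names, the sample markup, and the `count_in` helper are assumptions made for illustration, so adjust them to the page you actually receive.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup (assumption): two tweets, one with an empty count.
html = """
<div class="tweet">
  <a class="tweet-timestamp js-permalink" href="/user/status/1">1h</a>
  <div class="ProfileTweet-action--retweet">
    <span class="ProfileTweet-actionCountForPresentation">4</span>
  </div>
  <div class="ProfileTweet-action--favorite">
    <span class="ProfileTweet-actionCountForPresentation">2</span>
  </div>
</div>
<div class="tweet">
  <a class="tweet-timestamp js-permalink" href="/user/status/2">2h</a>
  <div class="ProfileTweet-action--retweet">
    <span class="ProfileTweet-actionCountForPresentation"></span>
  </div>
  <div class="ProfileTweet-action--favorite">
    <span class="ProfileTweet-actionCountForPresentation">1</span>
  </div>
</div>
"""

COUNT_CLASS = "ProfileTweet-actionCountForPresentation"


def count_in(container):
    """Hypothetical helper: the displayed count as an int, or None if it is
    missing, empty, or not made of plain digits."""
    if container is None:
        return None
    span = container.find("span", class_=COUNT_CLASS)
    text = span.get_text(strip=True) if span else ""
    return int(text) if text.isdigit() else None


soup = BeautifulSoup(html, "html.parser")
results = []
for tweet in soup.find_all("div", class_="tweet"):
    link = tweet.find("a", class_="tweet-timestamp")
    results.append({
        "url": link.get("href") if link else None,
        "retweets": count_in(tweet.find("div", class_="ProfileTweet-action--retweet")),
        "favorites": count_in(tweet.find("div", class_="ProfileTweet-action--favorite")),
    })

for row in results:
    print(row)
```

    Searching from each tweet's own container, rather than from the whole soup, keeps every count paired with the right permalink.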
