Get Instagram followers

醉梦人生 2020-12-19 23:58

I want to parse a website's follower count with BeautifulSoup. This is what I have so far:

    import requests
    from bs4 import BeautifulSoup

    username_extract = 'lazada_my'

    url = 'https://www.instagram.com/' + username_extract
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    followers = soup.find('head', attrs={'class': 'count'})
    print(followers)

5 Answers
  • 2020-12-20 00:19

    I think you can use the re module to search the page source for the count directly.

    import requests
    import re

    username_extract = 'lazada_my'

    url = 'https://www.instagram.com/' + username_extract
    r = requests.get(url)
    # The follower count is embedded in the page source as "followed_by":{"count":N}
    m = re.search(r'"followed_by":\{"count":([0-9]+)\}', str(r.content))
    print(m.group(1))
    
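    One caveat worth noting (not from the original answer): `re.search` returns `None` when the pattern isn't found, for example if the profile is unavailable or Instagram changes the embedded JSON keys, so a small guard before calling `.group()` avoids an `AttributeError`. Continuing from `m` in the snippet above:

        # Hedged sketch: only read the capture group if the pattern actually matched
        if m is not None:
            print(int(m.group(1)))      # follower count as an integer
        else:
            print('follower count not found in the page source')
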
  • 2020-12-20 00:19

    You have to look through the script tags and check whether 'window._sharedData' exists in each one. If it exists, run the regular expression against it.

    import re
    import requests
    from bs4 import BeautifulSoup

    username_extract = 'lazada_my'
    url = 'https://www.instagram.com/' + username_extract
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    s = re.compile(r'"followed_by":\{"count":\d*\}')
    for i in soup.find_all('script'):
        # The follower count lives in the script that assigns window._sharedData
        if 'window._sharedData' in str(i):
            print(s.search(str(i.contents)).group())
    

    Result:

    "followed_by":{"count":407426}
    
  • 2020-12-20 00:27

    Thank you all, I ended up using William's solution. In case anybody has a similar project in the future, here is my complete code for scraping a batch of URLs for their follower counts:

    import requests
    import csv
    import pandas as pd
    import re

    insta = pd.read_csv('Instagram.csv')

    username = []

    bad_urls = []

    # Pull the username out of each profile URL
    for lines in insta['Instagram'][0:250]:
        lines = lines.split("/")
        username.append(lines[3])

    with open('insta_output.csv', 'w') as csvfile:
        t = csv.writer(csvfile, delimiter=',')      # ----> comma separated
        for user in username:
            try:
                url = 'https://www.instagram.com/' + user
                r = requests.get(url)
                m = re.search(r'"followed_by":\{"count":([0-9]+)\}', str(r.content))
                num_followers = m.group(1)
                t.writerow([user, num_followers])   # ----> adding rows
            except Exception:
                bad_urls.append(url)
    
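    For context, the loop above assumes Instagram.csv has an 'Instagram' column of full profile URLs, so that splitting on "/" leaves the username at index 3. A small sketch of that assumption (the example rows are hypothetical, not from the original post):

        import pandas as pd

        # Hypothetical input file: one 'Instagram' column of full profile URLs
        pd.DataFrame({'Instagram': [
            'https://www.instagram.com/lazada_my/',
            'https://www.instagram.com/some_other_profile/',
        ]}).to_csv('Instagram.csv', index=False)

        # 'https://www.instagram.com/lazada_my/'.split('/') gives
        # ['https:', '', 'www.instagram.com', 'lazada_my', ''], so index 3 is the username
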
  • 2020-12-20 00:30

    soup.find('head', attrs={'class':'count'}) searches for something that looks like <head class="count">, which doesn't exist anywhere in the HTML. The data you're after is contained in the <script> tag that starts with window._sharedData:

    script = soup.find('script', text=lambda t: t.startswith('window._sharedData'))
    

    From there, you can just strip off the variable assignment and the semicolon to get valid JSON:

    # <script>window._sharedData = ...;</script>
    #                              ^^^
    #                              JSON
    
    page_json = script.text.split(' = ', 1)[1].rstrip(';')
    

    Parse it and everything you need is contained in the object:

    import json
    
    data = json.loads(page_json)
    follower_count = data['entry_data']['ProfilePage'][0]['user']['followed_by']['count']
    
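    Putting those steps together, a minimal end-to-end sketch could look like this. It assumes the profile page still embeds the window._sharedData blob with the entry_data layout shown above; if Instagram changes that structure, the key path will need adjusting:

        import json
        import requests
        from bs4 import BeautifulSoup

        username_extract = 'lazada_my'
        r = requests.get('https://www.instagram.com/' + username_extract)
        soup = BeautifulSoup(r.content, 'lxml')

        # Locate the script that assigns window._sharedData and strip the JS wrapper
        script = soup.find('script', text=lambda t: t and t.startswith('window._sharedData'))
        page_json = script.text.split(' = ', 1)[1].rstrip(';')

        data = json.loads(page_json)
        print(data['entry_data']['ProfilePage'][0]['user']['followed_by']['count'])
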
  • 2020-12-20 00:31

    Most of the content is dynamically generated with JS. That's the reason you're getting empty results.

    But the follower count is present in the page source. The only catch is that it is not directly available in the form you want. You can see it here:

    <meta content="407.4k Followers, 27 Following, 2,740 Posts - See Instagram photos and videos from Lazada Malaysia (@lazada_my)" name="description" />
    

    If you want to scrape the follower count without regex, you can use this:

    >>> followers = soup.find('meta', {'name': 'description'})['content']
    >>> followers
    '407.4k Followers, 27 Following, 2,740 Posts - See Instagram photos and videos from Lazada Malaysia (@lazada_my)'
    >>> followers_count = followers.split('Followers')[0]
    >>> followers_count
    '407.4k '
    
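    If you need an actual number rather than the abbreviated string, the k/m suffix can be converted by hand. A hedged sketch (the helper is my own, not from the answer), which is only approximate because the meta description rounds the real count:

        def approx_count(text):
            """Turn strings like '407.4k ' or '2,740' into an approximate integer."""
            text = text.strip().lower().replace(',', '')
            multiplier = 1
            if text.endswith('k'):
                multiplier, text = 1_000, text[:-1]
            elif text.endswith('m'):
                multiplier, text = 1_000_000, text[:-1]
            return round(float(text) * multiplier)

        print(approx_count('407.4k '))   # 407400 (approximate; the real count was 407426)
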