Webscraping Instagram follower count BeautifulSoup

后端未结

关注

 5  1998

I\'m just starting to learn how to web scrape using BeautifulSoup and want to write a simple program that will get the follower count for a given Instagram

相关标签:

5条回答

生来不讨喜

2021-01-18 21:59

Here is my approach ( the html source code has a json object that has all the data of the profile )

import json
import urllib.request, urllib.parse
from bs4 import BeautifulSoup   

req      = urllib.request.Request(myurl)
req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36')
html     = urllib.request.urlopen(req).read()
response = BeautifulSoup(html, 'html.parser')
jsonObject = response.select("body > script:nth-of-type(1)")[0].text.replace('window._sharedData =','').replace(';','')
data      = json.loads(jsonObject)
following = data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_follow']['count']
followed  = data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_followed_by']['count']
posts     = data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['count']
username  = data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges'][0]['node']['owner']['username']

0 讨论(0)

走了就别回头了

2021-01-18 22:01
Instagram always responds with JSON data, making it a usually cleaner option to obtain metadata from the JSON, rather than parsing the HTML response with BeautifulSoup. Given that using BeatifulSoup is not a constraint, there are at least two clean options to get the follower count of an Instagram profile:
1. Obtain the profile page, search the JSON and parse it:
```
import json
import re
import requests

response = requests.get('https://www.instagram.com/' + PROFILE)
json_match = re.search(r'window\._sharedData = (.*);</script>', response.text)
profile_json = json.loads(json_match.group(1))['entry_data']['ProfilePage'][0]['graphql']['user']

print(profile_json['edge_followed_by']['count'])
```
  Then, profile_json variable contains the profile's metadata, not only the follower count.
2. Use a library, leaving changes of Instagram's responses the upstream's problem. There is Instaloader, which can be used liked this:
```
from instaloader import Instaloader, Profile

L = Instaloader()
profile = Profile.from_username(L.context, PROFILE)

print(profile.followers)
```
  It also supports logging in, allowing to access private profiles as well.
  
  (disclaimer: I am authoring this tool)
Either way, you obtain a structure containing the profile's metadata, without needing to do strange things to the html response.
0 讨论(0)
发布评论:

提交评论
- 加载中...
悲&欢浪女

2021-01-18 22:07

Although this is not really a general question on programming, you should find that the exact follower count is the title property of the span element containing the formatted follower count. You can query this property.

0 讨论(0)
发布评论:

提交评论
- 加载中...
深忆病人

2021-01-18 22:08

The easist method to do this would be to dump the page html into a text editor and do a text search for the exact number of followers the person has. You can then zero into the element which contains the number.

0 讨论(0)
发布评论:

提交评论
- 加载中...
孤城傲影

2021-01-18 22:15
Use the API is the easiest way, but I also found a very hacky way to do it:
```
import requests

username = "espn"
url = 'https://www.instagram.com/' + username
r = requests.get(url).text

start = '"edge_followed_by":{"count":'
end = '},"followed_by_viewer"'
followers= r[r.find(start)+len(start):r.rfind(end)]

start = '"edge_follow":{"count":'
end = '},"follows_viewer"'
following= r[r.find(start)+len(start):r.rfind(end)]

print(followers, following)
```
If you look through the response requests gives, theres a line of Javascript that contains the real follower count:

...edge_followed_by":{"count":10770969},"followed_by_viewer":{...

So I just extracted the number by finding the substring before and after.
0 讨论(0)
发布评论:

提交评论
- 加载中...