I'm just starting to learn web scraping with BeautifulSoup and want to write a simple program that gets the follower count for a given Instagram profile.
Here is my approach (the HTML source contains a JSON object that holds all of the profile's data):
import json
import urllib.request, urllib.parse
from bs4 import BeautifulSoup
req = urllib.request.Request(myurl)
req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36')
html = urllib.request.urlopen(req).read()
response = BeautifulSoup(html, 'html.parser')
jsonObject = response.select("body > script:nth-of-type(1)")[0].text.replace('window._sharedData =', '').strip().rstrip(';')
data = json.loads(jsonObject)
following = data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_follow']['count']
followed = data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_followed_by']['count']
posts = data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['count']
username = data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges'][0]['node']['owner']['username']
Instagram always responds with JSON data, so it is usually cleaner to read the metadata from that JSON than to parse the HTML response with BeautifulSoup. Given that using BeautifulSoup is not a constraint, there are at least two clean options to get the follower count of an Instagram profile:
Obtain the profile page, search the JSON and parse it:
import json
import re
import requests
response = requests.get('https://www.instagram.com/' + PROFILE)
json_match = re.search(r'window\._sharedData = (.*);</script>', response.text)
profile_json = json.loads(json_match.group(1))['entry_data']['ProfilePage'][0]['graphql']['user']
print(profile_json['edge_followed_by']['count'])
The profile_json variable then contains all of the profile's metadata, not only the follower count.
Use a library, leaving changes to Instagram's responses as the upstream's problem. There is Instaloader, which can be used like this:
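The regex step above can be wrapped in a small function that fails loudly when the page no longer contains the expected script. This is only a minimal sketch: `extract_profile` and `sample_html` are hypothetical names, the sample is a heavily trimmed stand-in for a real page, and the pattern is made non-greedy here (an assumption, not part of the code above) so it stops at the first `;</script>`.

```python
import json
import re

# Non-greedy so the match ends at the first ';</script>' after the JSON.
SHARED_DATA_RE = re.compile(r'window\._sharedData = (.*?);</script>', re.DOTALL)


def extract_profile(html):
    """Return the 'user' dict embedded in a profile page, or raise ValueError."""
    match = SHARED_DATA_RE.search(html)
    if match is None:
        raise ValueError('window._sharedData not found; the page layout may have changed')
    shared_data = json.loads(match.group(1))
    return shared_data['entry_data']['ProfilePage'][0]['graphql']['user']


# Hypothetical, trimmed stand-in for a real profile page (not real data).
sample_html = (
    '<html><body><script type="text/javascript">'
    'window._sharedData = {"entry_data": {"ProfilePage": [{"graphql": {"user": '
    '{"edge_followed_by": {"count": 123}, "edge_follow": {"count": 45}}}}]}};'
    '</script></body></html>'
)

profile = extract_profile(sample_html)
print(profile['edge_followed_by']['count'])
```

With a real page you would pass `response.text` instead of `sample_html`; the explicit `ValueError` makes it obvious when Instagram changes its markup, instead of failing with an `AttributeError` on `None`.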
from instaloader import Instaloader, Profile
L = Instaloader()
profile = Profile.from_username(L.context, PROFILE)
print(profile.followers)
It also supports logging in, which gives access to private profiles as well.
(Disclaimer: I am the author of this tool.)
Either way, you obtain a structure containing the profile's metadata without needing to do strange things to the HTML response.
Although this is not really a general programming question, you should find that the exact follower count is the title attribute of the span element containing the formatted follower count. You can query this attribute.
The easiest way to find it is to dump the page HTML into a text editor and search for the exact number of followers the person has. You can then zero in on the element that contains the number.
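Reading a title from a span can be sketched as follows. This is only an illustration under stated assumptions: the markup in `sample_html` is hypothetical (the real page structure may differ), `TitleSpanParser` is a made-up helper, and the standard library's `html.parser` is used here instead of BeautifulSoup so the example has no third-party dependencies.

```python
from html.parser import HTMLParser


class TitleSpanParser(HTMLParser):
    """Collect the title attribute of every <span> that has one."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            title = dict(attrs).get('title')
            if title is not None:
                self.titles.append(title)


# Hypothetical markup: formatted count as text, exact count in the title.
sample_html = (
    '<a href="/espn/followers/">'
    '<span title="10,770,969">10.7m</span> followers</a>'
)

parser = TitleSpanParser()
parser.feed(sample_html)

# Keep only titles that are plain numbers once the thousands separators are removed.
exact_counts = [
    int(title.replace(',', ''))
    for title in parser.titles
    if title.replace(',', '').isdigit()
]
print(exact_counts)
```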
Using the API is the easiest way, but I also found a very hacky way to do it:
import requests
username = "espn"
url = 'https://www.instagram.com/' + username
r = requests.get(url).text
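# Cut out the text between each pair of markers, searching for the
# end marker only after the start marker.
start = '"edge_followed_by":{"count":'
end = '},"followed_by_viewer"'
begin = r.find(start) + len(start)
followers = r[begin:r.find(end, begin)]
start = '"edge_follow":{"count":'
end = '},"follows_viewer"'
begin = r.find(start) + len(start)
following = r[begin:r.find(end, begin)]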
print(followers, following)
If you look through the response that requests returns, there's a line of JavaScript that contains the real follower count:
...edge_followed_by":{"count":10770969},"followed_by_viewer":{
...
So I just extracted the number by finding the substrings before and after it.
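The same extraction can be sketched with a regular expression instead of `find`/`rfind`, which returns `None` rather than a garbage slice when the marker is missing. `extract_count` is a hypothetical helper, and `sample` is a made-up fragment modelled on the line shown above (the `edge_follow` value of 512 is invented for illustration).

```python
import re


def extract_count(page_text, key):
    """Return the integer that follows '"<key>":{"count":' in page_text, or None."""
    pattern = r'"%s":\{"count":(\d+)\}' % re.escape(key)
    match = re.search(pattern, page_text)
    return int(match.group(1)) if match else None


# Made-up fragment shaped like the JavaScript line in the response.
sample = (
    '..."edge_followed_by":{"count":10770969},"followed_by_viewer":false,'
    '"edge_follow":{"count":512},"follows_viewer":false...'
)

followers = extract_count(sample, 'edge_followed_by')
following = extract_count(sample, 'edge_follow')
print(followers, following)
```

Because the key is matched together with its surrounding quotes, `edge_follow` does not accidentally match inside `edge_followed_by`.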