Scraping Instagram with BeautifulSoup

前端未结

关注

 1  1917

I\'m trying to get a particular string from the \"search by tag\" in Instagram. I\'d like to get the url img from here:


                      
              相关标签:


      
      
        
          1条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  借酒劲吻你        
                
              
                            
                2021-01-29 05:53
              
            
            
                                                                       
The reason you can't see any output is that the images are added dynamically to the page source using JavaScript. So, the HTML that you've provided isn't available in the page source. Easiest way to overcome this is to use Selenium.

But, there's one more way to scrape that. Looking at the page source, the data you're after, is available in a <script> tag in the form of JSON. The relevant data is in the form of:

"thumbnail_resources": [
    {
        "src": "https://instagram.fpnq3-1.fna.fbcdn.net/vp/a3ed0ee1af581f1c1fe6170b8c080e7c/5B2CA660/t51.2885-15/s150x150/e35/28433503_571483933190064_5347634166450094080_n.jpg",
         "config_width": 150,
         "config_height": 150
     },
     {
         "src": "https://instagram.fpnq3-1.fna.fbcdn.net/vp/7a0bb4fb1b5d5e3b179c58a2b9472b9f/5B2C535F/t51.2885-15/s240x240/e35/28433503_571483933190064_5347634166450094080_n.jpg",
         "config_width": 240,
         "config_height": 240
     },


To get the JSON, you can use this (code taken from this answer):

script = soup.find('script', text=lambda t: t.startswith('window._sharedData'))
page_json = script.text.split(' = ', 1)[1].rstrip(';')
data = json.loads(page_json)


Code to get image link for all the images:

import json
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.instagram.com/explore/tags/nature/')
soup = BeautifulSoup(r.text, 'lxml')

script = soup.find('script', text=lambda t: t.startswith('window._sharedData'))
page_json = script.text.split(' = ', 1)[1].rstrip(';')
data = json.loads(page_json)

for post in data['entry_data']['TagPage'][0]['graphql']['hashtag']['edge_hashtag_to_media']['edges']:
    image_src = post['node']['thumbnail_resources'][1]['src']
    print(image_src)


Partial output:

https://instagram.fpnq3-1.fna.fbcdn.net/vp/e8a78407fb61de834cad7f10eca830fc/5A9DC375/t51.2885-15/s240x240/e15/c0.80.640.640/28766397_174603559842180_1092148752455565312_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/3a20f36647c86c2196f259b5d14ebf82/5A9D5BC9/t51.2885-15/s240x240/e15/28433802_283862648812409_3322859933120069632_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/82216be4596dd9da862ba267cdeab517/5B144226/t51.2885-15/s240x240/e35/c0.135.1080.1080/28157436_941679549319762_5605299824451649536_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/e50eab90b2e0951d67922e49b495e1fc/5B3EC9B8/t51.2885-15/s240x240/e35/c135.0.810.810/28754107_179533402825352_1137703808411893760_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/d3a13e7b81a65421b4318b57fb8ee24e/5B4D9EFF/t51.2885-15/s240x240/e35/28433583_375555202918683_1951892035636035584_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/1b0aeea1b9be983498192d350e039aa0/5B43C583/t51.2885-15/s240x240/e35/28156427_154249191953160_9219472301039288320_n.jpg
...


Note: The [1] in the line image_src = post['node']['thumbnail_resources'][1]['src'] is for 240w. You can use 0, 1, 2, 3 or 4 for 150w, 240w, 320w, 480w or 640w respectively. Also, if you want any other data regarding any image, like, number of likes, comments, caption, etc; everything is available in this JSON (data variable).
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复