Scraping Instagram with BeautifulSoup

春和景丽 2021-01-29 05:18

I'm trying to get a particular string from the "search by tag" page on Instagram. I'd like to get the image URL from here:

"#yeşil

        
1 answer
  • 2021-01-29 05:53

    You see no output because the images are added to the page dynamically by JavaScript, so the HTML you posted is not present in the raw page source. The easiest way around this is to use Selenium.

    There is another way, though. The data you are after is available in the page source as JSON inside a <script> tag. The relevant part looks like this:

    "thumbnail_resources": [
        {
            "src": "https://instagram.fpnq3-1.fna.fbcdn.net/vp/a3ed0ee1af581f1c1fe6170b8c080e7c/5B2CA660/t51.2885-15/s150x150/e35/28433503_571483933190064_5347634166450094080_n.jpg",
            "config_width": 150,
            "config_height": 150
        },
        {
            "src": "https://instagram.fpnq3-1.fna.fbcdn.net/vp/7a0bb4fb1b5d5e3b179c58a2b9472b9f/5B2C535F/t51.2885-15/s240x240/e35/28433503_571483933190064_5347634166450094080_n.jpg",
            "config_width": 240,
            "config_height": 240
        },
    

    To get the JSON, you can use this (code taken from this answer):

    # `t and ...` skips <script> tags with no inline text, for which t is None.
    script = soup.find('script', string=lambda t: t and t.startswith('window._sharedData'))
    page_json = script.string.split(' = ', 1)[1].rstrip(';')
    data = json.loads(page_json)
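
    To make the split(' = ', 1) and rstrip(';') steps concrete, here is a minimal sketch that runs them on a plain string. The script_text value is a made-up, heavily shortened stand-in for the text of the matching <script> tag, not real Instagram output.

```python
import json

# Hypothetical, heavily shortened stand-in for the text of the matching <script> tag.
script_text = 'window._sharedData = {"entry_data": {"TagPage": [{"note": "a = b;"}]}};'

# Split only on the first ' = ' so that any ' = ' inside the JSON is left alone.
page_json = script_text.split(' = ', 1)[1]
# Drop the JavaScript statement terminator that follows the object literal.
page_json = page_json.rstrip(';')

data = json.loads(page_json)
print(data['entry_data']['TagPage'][0]['note'])
```

    The maxsplit argument of 1 matters here: without it, a ' = ' inside a caption or any other string value would cut the JSON into more pieces.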
    

    Code to get image link for all the images:

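    import json
    import requests
    from bs4 import BeautifulSoup
    
    r = requests.get('https://www.instagram.com/explore/tags/nature/')
    r.raise_for_status()  # stop early on an HTTP error response
    soup = BeautifulSoup(r.text, 'lxml')
    
    # Find the inline <script> that assigns window._sharedData.
    # `t and ...` skips <script> tags with no inline text, for which t is None.
    script = soup.find('script', string=lambda t: t and t.startswith('window._sharedData'))
    # Keep what follows the first ' = ' and drop the trailing semicolon.
    page_json = script.string.split(' = ', 1)[1].rstrip(';')
    data = json.loads(page_json)
    
    # Each edge is one post; index 1 of thumbnail_resources is the 240px-wide thumbnail.
    for post in data['entry_data']['TagPage'][0]['graphql']['hashtag']['edge_hashtag_to_media']['edges']:
        image_src = post['node']['thumbnail_resources'][1]['src']
        print(image_src)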
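
    The long chain of lookups in the loop raises KeyError or IndexError as soon as one level is missing, for example when the response is not a tag page. A minimal sketch of a more forgiving traversal follows; dig and EDGES_PATH are hypothetical names introduced only for this illustration, and sample is a made-up document with the same nesting as the JSON above.

```python
def dig(obj, *path, default=None):
    """Follow keys and list indexes in path; return default if any step is missing."""
    for step in path:
        try:
            obj = obj[step]
        except (KeyError, IndexError, TypeError):
            return default
    return obj

# The same path the loop above walks, written as a tuple of keys and indexes.
EDGES_PATH = ('entry_data', 'TagPage', 0, 'graphql', 'hashtag',
              'edge_hashtag_to_media', 'edges')

# Hypothetical minimal document with the same nesting as the answer's JSON.
sample = {'entry_data': {'TagPage': [{'graphql': {'hashtag': {
    'edge_hashtag_to_media': {'edges': [{'node': {'id': 'a'}}]}}}}]}}

edges = dig(sample, *EDGES_PATH, default=[])
print(len(edges))

# A document without a TagPage entry yields the default instead of raising.
print(dig({'entry_data': {}}, *EDGES_PATH, default=[]))
```

    With an empty list as the default, a for loop over the result simply does nothing when the structure is not there.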
    

    Partial output:

    https://instagram.fpnq3-1.fna.fbcdn.net/vp/e8a78407fb61de834cad7f10eca830fc/5A9DC375/t51.2885-15/s240x240/e15/c0.80.640.640/28766397_174603559842180_1092148752455565312_n.jpg
    https://instagram.fpnq3-1.fna.fbcdn.net/vp/3a20f36647c86c2196f259b5d14ebf82/5A9D5BC9/t51.2885-15/s240x240/e15/28433802_283862648812409_3322859933120069632_n.jpg
    https://instagram.fpnq3-1.fna.fbcdn.net/vp/82216be4596dd9da862ba267cdeab517/5B144226/t51.2885-15/s240x240/e35/c0.135.1080.1080/28157436_941679549319762_5605299824451649536_n.jpg
    https://instagram.fpnq3-1.fna.fbcdn.net/vp/e50eab90b2e0951d67922e49b495e1fc/5B3EC9B8/t51.2885-15/s240x240/e35/c135.0.810.810/28754107_179533402825352_1137703808411893760_n.jpg
    https://instagram.fpnq3-1.fna.fbcdn.net/vp/d3a13e7b81a65421b4318b57fb8ee24e/5B4D9EFF/t51.2885-15/s240x240/e35/28433583_375555202918683_1951892035636035584_n.jpg
    https://instagram.fpnq3-1.fna.fbcdn.net/vp/1b0aeea1b9be983498192d350e039aa0/5B43C583/t51.2885-15/s240x240/e35/28156427_154249191953160_9219472301039288320_n.jpg
    ...
    

    Note: The [1] in the line image_src = post['node']['thumbnail_resources'][1]['src'] selects the 240px-wide thumbnail. Use index 0, 1, 2, 3 or 4 for a width of 150, 240, 320, 480 or 640 pixels respectively. Any other data about an image, such as the number of likes, the comments or the caption, is also available in this JSON (the data variable).
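
    Instead of relying on the position in the list, you could also pick a thumbnail by its config_width. This is only a sketch: thumbnail_for_width is a hypothetical helper, and sample_node uses placeholder URLs in the shape of the JSON excerpt shown earlier.

```python
def thumbnail_for_width(node, width):
    """Return the src of the thumbnail whose config_width equals width, or None."""
    for resource in node.get('thumbnail_resources', []):
        if resource.get('config_width') == width:
            return resource.get('src')
    return None

# Hypothetical node with placeholder URLs, shaped like the JSON excerpt in this answer.
sample_node = {
    'thumbnail_resources': [
        {'src': 'https://example.com/s150x150.jpg', 'config_width': 150, 'config_height': 150},
        {'src': 'https://example.com/s240x240.jpg', 'config_width': 240, 'config_height': 240},
    ]
}

print(thumbnail_for_width(sample_node, 240))
print(thumbnail_for_width(sample_node, 640))  # no such width in the sample
```

    In the loop from the answer, this would be called as thumbnail_for_width(post['node'], 240), which keeps working if the order or number of entries in the list ever differs.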
