Extracting data from script tag using BeautifulSoup in Python

小蘑菇 2020-12-21 07:09

I want to extract the "SNG_TITLE" and "ART_NAME" values from the code inside a "script" tag using BeautifulSoup in Python. (The whole script is too long to paste.)



        
2 Answers
  • 2020-12-21 07:38

    If I understand correctly, you want only the script element that contains "SNG_TITLE".

    You can use re to select only the script element that contains the fields you are interested in:

    import requests
    from bs4 import BeautifulSoup
    import re
    
    base_url = 'https://www.deezer.com/en/profile/1589856782/loved'
    
    r = requests.get(base_url)
    
    soup = BeautifulSoup(r.text, 'html.parser')
    
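    user_name = soup.find(class_='user-name')
    if user_name is not None:  # the element may be missing from the static HTML
        print(user_name.text)
    
    # string= is the current name of the older text= argument (Beautiful Soup 4.4.0+)
    for script in soup.find_all(string=re.compile(r'SNG_TITLE')):
        print(script.parent)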
    

    EDIT:

    @furas's answer is the complete solution: it uses json to get 'SNG_TITLE' and 'ART_NAME'. My answer only helps you find the script that contains 'SNG_TITLE'. You can combine both to get better code.
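
    As a rough illustration of combining the two answers, the sketch below finds the script by its content and then parses the JSON in it. The inline SAMPLE_HTML and the helper name extract_tracks are made up for this example (the sample only imitates the structure shown in the other answer), and it assumes the JSON object starts at the first '{' and runs to the end of the script.

```python
import json
import re

from bs4 import BeautifulSoup

# Hypothetical sample standing in for the downloaded page (r.text).
SAMPLE_HTML = """
<html><body>
<script>var unrelated = 1;</script>
<script>window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[
{"SNG_TITLE":"Heathens","ART_NAME":"Twenty One Pilots"},
{"SNG_TITLE":"Numb","ART_NAME":"Linkin Park"}]}}}</script>
</body></html>
"""


def extract_tracks(html):
    """Return (ART_NAME, SNG_TITLE) pairs from the script that mentions SNG_TITLE."""
    soup = BeautifulSoup(html, 'html.parser')
    # Locate the script by its content instead of by its position.
    script = soup.find('script', string=re.compile(r'SNG_TITLE'))
    if script is None:
        return []
    text = script.string
    # Assumption: everything from the first '{' onward is the JSON object.
    data = json.loads(text[text.index('{'):])
    return [(item['ART_NAME'], item['SNG_TITLE'])
            for item in data['TAB']['loved']['data']]


for artist, title in extract_tracks(SAMPLE_HTML):
    print(artist, '-', title)
```

    With the real page you would pass r.text instead of SAMPLE_HTML; whether the live page still has this structure is not guaranteed.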

  • 2020-12-21 07:53

    The scripts keep the same order in the page source (as long as the page layout does not change), so you can count them and use an index to get the right one.

    all_scripts[6]
    

    A script's content is an ordinary string, so you can also use standard string operations, e.g.

    if '{"loved"' in script.text:
    

    Code with both methods. I use [:100] to display only the beginning of the string.

    import requests
    from bs4 import BeautifulSoup
    
    base_url = 'https://www.deezer.com/en/profile/1589856782/loved'
    
    r = requests.get(base_url)
    
    soup = BeautifulSoup(r.text, 'html.parser')
    
    all_scripts = soup.find_all('script')
    
    print('--- first method ---')
    print(all_scripts[6].text[:100])
    
    print('--- second method ---')
    for number, script in enumerate(all_scripts):
        if '{"loved"' in script.text:
            print(number, script.text[:100])
    

    Result:

    --- first method ---
    window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[{"SNG_ID":"126884459","PRODUCT_TRACK_ID":"360276
    --- second method ---
    6 window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":[{"SNG_ID":"126884459","PRODUCT_TRACK_ID":"360276
    

    EDIT: Once you have the correct script, you can use slicing to keep only the JSON string, convert it to a Python dictionary with the json module, and then read the data from it.

    import requests
    from bs4 import BeautifulSoup
    import json
    
    base_url = 'https://www.deezer.com/en/profile/1589856782/loved'
    
    r = requests.get(base_url)
    
    soup = BeautifulSoup(r.text, 'html.parser')
    
    all_scripts = soup.find_all('script')
    
    # skip the 27-character prefix 'window.__DZR_APP_STATE__ = ' to keep only the JSON
    data = json.loads(all_scripts[6].get_text()[27:])
    
    print('key:', data.keys())
    print('key:', data['TAB'].keys())
    print('key:', data['DATA'].keys())
    print('---')
    
    for item in data['TAB']['loved']['data']:
        print('ART_NAME:', item['ART_NAME'])
        print('SNG_TITLE:', item['SNG_TITLE'])
        print('---')
    

    Result:

    key: dict_keys(['TAB', 'DATA'])
    key: dict_keys(['loved'])
    key: dict_keys(['USER', 'FOLLOW', 'FOLLOWING', 'HAS_BLOCKED', 'IS_BLOCKED', 'IS_PUBLIC', 'CURATOR', 'IS_PERSONNAL', 'NB_FOLLOWER', 'NB_FOLLOWING'])
    ---
    ART_NAME: Twenty One Pilots
    SNG_TITLE: Heathens
    ---
    ART_NAME: Twenty One Pilots
    SNG_TITLE: Stressed Out
    ---
    ART_NAME: Linkin Park
    SNG_TITLE: Numb
    ---
    ART_NAME: Three Days Grace
    SNG_TITLE: Animal I Have Become
    ---
    ART_NAME: Three Days Grace
    SNG_TITLE: Painkiller
    ---
    ART_NAME: Slipknot
    SNG_TITLE: Before I Forget
    ---
    ART_NAME: Slipknot
    SNG_TITLE: Duality
    ---
    ART_NAME: Skrillex
    SNG_TITLE: Make It Bun Dem
    ---
    ART_NAME: Skrillex
    SNG_TITLE: Bangarang (feat. Sirah)
    ---
    ART_NAME: Limp Bizkit
    SNG_TITLE: Break Stuff
    ---
    ART_NAME: Three Days Grace
    SNG_TITLE: I Hate Everything About You
    ---
    ART_NAME: Three Days Grace
    SNG_TITLE: Time of Dying
    ---
    ART_NAME: Three Days Grace
    SNG_TITLE: I Am Machine
    ---
    ART_NAME: Three Days Grace
    SNG_TITLE: Riot
    ---
    ART_NAME: Three Days Grace
    SNG_TITLE: So What
    ---
    ART_NAME: Three Days Grace
    SNG_TITLE: Pain
    ---
    ART_NAME: Three Days Grace
    SNG_TITLE: Tell Me Why
    ---
    ART_NAME: Three Days Grace
    SNG_TITLE: Chalk Outline
    ---
    ART_NAME: Three Days Grace
    SNG_TITLE: Gone Forever
    ---
    ART_NAME: Slipknot
    SNG_TITLE: The Devil In I
    ---
    ART_NAME: Linkin Park
    SNG_TITLE: No More Sorrow
    ---
    ART_NAME: Linkin Park
    SNG_TITLE: Bleed It Out
    ---
    ART_NAME: The Doors
    SNG_TITLE: Roadhouse Blues
    ---
    ART_NAME: The Doors
    SNG_TITLE: Riders On The Storm
    ---
    ART_NAME: The Doors
    SNG_TITLE: Break On Through (To The Other Side)
    ---
    ART_NAME: The Doors
    SNG_TITLE: Alabama Song (Whisky Bar)
    ---
    ART_NAME: The Doors
    SNG_TITLE: People Are Strange
    ---
    ART_NAME: My Chemical Romance
    SNG_TITLE: Welcome to the Black Parade
    ---
    ART_NAME: My Chemical Romance
    SNG_TITLE: Teenagers
    ---
    ART_NAME: My Chemical Romance
    SNG_TITLE: Na Na Na [Na Na Na Na Na Na Na Na Na]
    ---
    ART_NAME: My Chemical Romance
    SNG_TITLE: Famous Last Words
    ---
    ART_NAME: The Doors
    SNG_TITLE: Soul Kitchen
    ---
    ART_NAME: The Black Keys
    SNG_TITLE: Lonely Boy
    ---
    ART_NAME: Katy Perry
    SNG_TITLE: I Kissed a Girl
    ---
    ART_NAME: Katy Perry
    SNG_TITLE: Hot N Cold
    ---
    ART_NAME: Katy Perry
    SNG_TITLE: E.T.
    ---
    ART_NAME: Linkin Park
    SNG_TITLE: Given Up
    ---
    ART_NAME: My Chemical Romance
    SNG_TITLE: Dead!
    ---
    ART_NAME: My Chemical Romance
    SNG_TITLE: Mama
    ---
    ART_NAME: My Chemical Romance
    SNG_TITLE: The Sharpest Lives
    ---
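
    The fixed slice [27:] only works while the prefix stays exactly 'window.__DZR_APP_STATE__ = '. A possible alternative, sketched here with a made-up SCRIPT_TEXT and a hypothetical helper parse_assigned_json, is to search for the first '{' and let json.JSONDecoder().raw_decode stop at the end of the object, so that anything after it (such as a trailing ';') is ignored. It assumes the first '{' in the script opens the JSON object.

```python
import json

# Hypothetical script text; the trailing ';' is added to show that
# raw_decode tolerates content after the JSON object.
SCRIPT_TEXT = (
    'window.__DZR_APP_STATE__ = {"TAB":{"loved":{"data":['
    '{"SNG_TITLE":"Heathens","ART_NAME":"Twenty One Pilots"}]}}};'
)


def parse_assigned_json(script_text):
    """Decode the first JSON object that appears in the script text."""
    # Assumption: the first '{' is the start of the JSON object.
    start = script_text.index('{')
    # raw_decode returns the decoded object and the index where it ended.
    data, _end = json.JSONDecoder().raw_decode(script_text, start)
    return data


data = parse_assigned_json(SCRIPT_TEXT)
for item in data['TAB']['loved']['data']:
    print('ART_NAME:', item['ART_NAME'])
    print('SNG_TITLE:', item['SNG_TITLE'])
```

    In the code above you would call it as parse_assigned_json(all_scripts[6].get_text()).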
    