How to extract a JSON object defined in a JavaScript block of an HTML page using Python?

名媛妹妹 2020-12-03 01:51

I am downloading HTML pages that have data defined in them in the following way:

... 

        
3 Answers
  • 2020-12-03 02:08

    I had a similar issue and ended up using Selenium with PhantomJS. It's a little hacky, and I couldn't quite figure out the correct "wait until" condition, but the implicit wait has worked fine for me so far.

    from selenium import webdriver
    import json
    import re
    
    url = "http..."
    # webdriver.PhantomJS is only available in Selenium 3.x and earlier
    driver = webdriver.PhantomJS(service_args=['--load-images=no'])
    driver.set_window_size(1120, 550)
    driver.get(url)
    driver.implicitly_wait(1)
    # DOTALL lets the match span several lines; the non-greedy .*? stops at the first closing script tag
    script_text = re.search(r'window\.blog\.data\s*=.*?</script>', driver.page_source, re.DOTALL).group(0)
    
    # split on the first equals sign and cut off the closing script tag
    json_text = script_text.split('=', 1)[1].rsplit('</script>', 1)[0].strip()
    # only care about the first piece of JSON; assumes the object is terminated by "};"
    json_text = json_text.split("};")[0] + "}"
    data = json.loads(json_text)
    
    driver.quit()
    

  • 2020-12-03 02:30

    BeautifulSoup is an HTML parser; you also need a JavaScript parser here. Note that some JavaScript object literals are not valid JSON (though the literal in your example is).
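    To illustrate the point about object literals, here is a small sketch. The sample literals and the helper name `is_valid_json` are made up for illustration and do not come from the question's page. It shows that `json.loads` accepts a literal with double-quoted keys but rejects unquoted keys, single-quoted strings, and trailing commas, all of which JavaScript allows.

```python
import json

# Hypothetical sample literals, not taken from the question's page.
valid_json = '{"activity": {"type": "read"}}'
js_only_literals = [
    "{activity: {type: 'read'}}",       # unquoted keys, single-quoted string
    '{"activity": {"type": "read"},}',  # trailing comma
]


def is_valid_json(text):
    """Return True if json.loads accepts the text, False otherwise."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


print(is_valid_json(valid_json))
print([is_valid_json(text) for text in js_only_literals])
```

    Literals of the second kind are the ones that need a real JavaScript parser rather than the json module.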

    In simple cases you could:

    1. extract <script>'s text using an html parser
    2. assume that window.blog... is a single line or there is no ';' inside the object and extract the javascript object literal using simple string manipulations or a regex
    3. assume that the string is a valid json and parse it using json module

    Example:

    #!/usr/bin/env python
    html = """<!doctype html>
    <title>extract javascript object as json</title>
    <script>
    // ..
    window.blog.data = {"activity":{"type":"read"}};
    // ..
    </script>
    <p>some other html here
    """
    import json
    import re
    from bs4 import BeautifulSoup  # $ pip install beautifulsoup4
    soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
    script = soup.find('script', string=re.compile(r'window\.blog\.data'))
    json_text = re.search(r'^\s*window\.blog\.data\s*=\s*({.*?})\s*;\s*$',
                          script.string, flags=re.DOTALL | re.MULTILINE).group(1)
    data = json.loads(json_text)
    assert data['activity']['type'] == 'read'
    

    If the assumptions are incorrect then the code fails.

    To relax the second assumption, a javascript parser could be used instead of a regex e.g., slimit (suggested by @approximatenumber):

    from slimit import ast  # $ pip install slimit
    from slimit.parser import Parser as JavascriptParser
    from slimit.visitors import nodevisitor
    
    soup = BeautifulSoup(html, 'html.parser')
    tree = JavascriptParser().parse(soup.script.string)
    obj = next(node.right for node in nodevisitor.visit(tree)
               if (isinstance(node, ast.Assign) and
                   node.left.to_ecma() == 'window.blog.data'))
    # HACK: easy way to parse the javascript object literal
    data = json.loads(obj.to_ecma())  # NOTE: json format may be slightly different
    assert data['activity']['type'] == 'read'
    

    There is no need to treat the object literal (obj) as a JSON object. To get the necessary info, obj can be visited recursively like any other AST node. That would support arbitrary JavaScript code (as long as slimit can parse it).

  • 2020-12-03 02:31

    Something like this may work:

    import re
    
    HTML = """ 
    <html>
        <head>
        ...
        <script type= "text/javascript"> 
    window.blog.data = {"activity":
        {"type":"read"}
        };
        ...
        </script> 
        </head>
        <body>
        ...
        </body>
        </html>
    """
    
    # escape the dots so they match literally; DOTALL lets the object span several lines
    JSON = re.compile(r'window\.blog\.data\s*=\s*({.*?});', re.DOTALL)
    
    matches = JSON.search(HTML)
    
    print(matches.group(1))
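
    The matched text is still a string. As a minimal sketch of turning it into a Python dict, assuming the literal is valid JSON, one option is to locate the assignment with a regex and let `json.JSONDecoder.raw_decode` parse the first JSON value that starts at the opening brace. Unlike a non-greedy match up to `};`, this should not break when `};` appears inside a string value. The helper name `extract_blog_data` and the sample page are hypothetical.

```python
import json
import re

# Hypothetical sample page; note the "};" inside a string value.
SAMPLE_HTML = """
<script type="text/javascript">
window.blog.data = {"activity": {"type": "read", "note": "a }; b"}};
window.blog.other = {"x": 1};
</script>
"""

# Matches the assignment up to (but not including) the object literal.
ASSIGNMENT = re.compile(r'window\.blog\.data\s*=\s*')


def extract_blog_data(html):
    """Parse the JSON object assigned to window.blog.data, or return None."""
    match = ASSIGNMENT.search(html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the semicolon, other statements, ...).
    data, _end = json.JSONDecoder().raw_decode(html, match.end())
    return data


data = extract_blog_data(SAMPLE_HTML)
print(data["activity"]["type"])
```

    If the literal is not valid JSON, `raw_decode` raises `json.JSONDecodeError`, so a JavaScript parser is still the fallback for those pages.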
    