Extracting text from script tag using BeautifulSoup in Python

自闭症患者 2020-11-27 22:17

Could you please help me with this? I am looking to extract the email, phone and name values from the code below, which sits in a SCRIPT tag (not in the body), using BeautifulSoup (Python).

2 answers
  • 2020-11-27 22:41

    As an alternative to the regex-based approach, you can parse the JavaScript code with the slimit module, which builds an abstract syntax tree and lets you collect all assignments into a dictionary:

    from bs4 import BeautifulSoup
    from slimit import ast
    from slimit.parser import Parser
    from slimit.visitors import nodevisitor
    
    
    data = """
    <html>
        <head>
            <title>My Sample Page</title>
            <script>
            $.ajax({
                type: "POST",
                url: 'http://www.example.com',
                data: {
                    email: 'abc@g.com',
                    phone: '9999999999',
                    name: 'XYZ'
                }
            });
            </script>
        </head>
        <body>
            <h1>What a wonderful world</h1>
        </body>
    </html>
    """
    
    # get the script tag contents from the html
    soup = BeautifulSoup(data, 'html.parser')
    script = soup.find('script')
    
    # parse js
    parser = Parser()
    tree = parser.parse(script.text)
    fields = {getattr(node.left, 'value', ''): getattr(node.right, 'value', '')
              for node in nodevisitor.visit(tree)
              if isinstance(node, ast.Assign)}
    
    print(fields)
    

    Prints (under Python 2, hence the u prefixes; key order may vary):

    {u'name': u"'XYZ'", u'url': u"'http://www.example.com'", u'type': u'"POST"', u'phone': u"'9999999999'", u'data': '', u'email': u"'abc@g.com'"}
    

    Among other fields, there are email, name and phone that you are interested in.
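
    Note that the values in the output above still carry their JavaScript quotes. A minimal sketch of cleaning them up, assuming the dictionary looks like the one printed above (the helper name strip_js_quotes is made up for illustration and is not part of slimit):

```python
# Dictionary copied from the printed output above (u prefixes dropped).
fields = {
    'name': "'XYZ'",
    'url': "'http://www.example.com'",
    'type': '"POST"',
    'phone': "'9999999999'",
    'data': '',
    'email': "'abc@g.com'",
}


def strip_js_quotes(value):
    """Drop one pair of matching surrounding quotes, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in ('"', "'"):
        return value[1:-1]
    return value


# Clean every value, then keep only the three fields of interest.
cleaned = {key: strip_js_quotes(value) for key, value in fields.items()}
wanted = {key: cleaned[key] for key in ('email', 'phone', 'name')}
print(wanted)
```

    The empty 'data' entry is left untouched, since its right-hand side is an object rather than a string literal.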

    Hope that helps.

  • 2020-11-27 22:51

    You can get the script tag contents via BeautifulSoup and then apply a regex to get the desired data.

    Working example (based on what you've described in the question):

    import re
    from bs4 import BeautifulSoup
    
    data = """
    <html>
        <head>
            <title>My Sample Page</title>
            <script>
            $.ajax({
                type: "POST",
                url: 'http://www.example.com',
                data: {
                    email: 'abc@g.com',
                    phone: '9999999999',
                    name: 'XYZ'
                }
            });
            </script>
        </head>
        <body>
            <h1>What a wonderful world</h1>
        </body>
    </html>
    """
    
    soup = BeautifulSoup(data, 'html.parser')
    script = soup.find('script')
    
    # match pairs like  key: 'value'  (single-quoted values only)
    pattern = re.compile(r"(\w+): '(.*?)'")
    fields = dict(pattern.findall(script.text))
    print(fields['email'], fields['phone'], fields['name'])
    

    Prints:

    abc@g.com 9999999999 XYZ
    

    I don't really like this solution, since the regex approach is fragile: a change in quoting style, whitespace or nesting would break it. I still think there is a better solution and that we are missing the bigger picture here. A link to the specific site would help a lot, but it is what it is.
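
    One way to reduce (not remove) that fragility is to isolate the data object first and accept either quote style. This is only a sketch under stated assumptions: the script text is inlined here instead of coming from BeautifulSoup, the email value is switched to double quotes purely for illustration, and the data object is assumed to contain no nested braces.

```python
import re

# Stand-in for script.text; the double-quoted email is an illustrative change.
script_text = """
$.ajax({
    type: "POST",
    url: 'http://www.example.com',
    data: {
        email: "abc@g.com",
        phone: '9999999999',
        name: 'XYZ'
    }
});
"""

# Step 1: isolate the object literal that follows "data:".
# Assumption: no nested braces inside it, so the first "}" closes it.
block = re.search(r"data\s*:\s*\{(.*?)\}", script_text, re.DOTALL)

# Step 2: match key/value pairs; the backreference \2 requires the closing
# quote to be the same character as the opening one.
pair = re.compile(r"""(\w+)\s*:\s*(['"])(.*?)\2""")

fields = {}
if block:
    fields = {key: value for key, _quote, value in pair.findall(block.group(1))}

print(fields)
```

    Restricting the search to the data block keeps unrelated keys such as url and type out of the result. A real JavaScript parser remains the safer choice for anything more complex.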


    UPD (fixing the code OP provided):

    soup = BeautifulSoup(data, 'html.parser')
    # pick the <script> that follows <html> and contains $(document).ready
    script = soup.html.find_next_sibling('script', string=re.compile(r"\$\(document\)\.ready"))
    
    pattern = re.compile(r"(\w+): '(.*?)'")
    fields = dict(pattern.findall(script.text))
    print(fields['email'], fields['phone'], fields['name'])
    

    Prints:

    abcd@gmail.com 9999999999 Shamita Shetty
    