Django: Parse HTML (containing form) to dictionary

Question

I create an HTML form on the server side.

<form action="." method="POST">
 <input type="text" name="foo" value="bar">
 <textarea name="area">long text</textarea>
 <select name="your-choice">
  <option value="a" selected>A</option>
  <option value="b">B</option>
 </select>
</form>

Desired result:

{
 "foo": "bar",
 "area": "long text",
 "your-choice": "a",
}

The method (parse_form()) I am looking for could be used like this:

response = client.get('/foo/')

# response contains <form> ...</form>

data = parse_form(response.content)

data['my-input'] = 'bar'

response = client.post('/foo/', data)

How to implement parse_form() in Python?

This is not really specific to Django. Nevertheless, there was a feature request for it in Django, which was rejected several years ago: https://code.djangoproject.com/ticket/11797

Answer 1:

Why not just this?

def parse_form(content):
    import lxml.html
    tree = lxml.html.fromstring(content)
    return dict(tree.forms[0].fields)

I couldn't guess the reason for using a UserDict.

One small caveat: when the form contains a <select> and no option is marked as selected, lxml returns the value of the first option, whereas the BeautifulSoup-based solution I gave in my other answer returns None.

Answer 2:

This is not related to Django, just to HTML parsing. The standard tool for that is the BeautifulSoup (bs4) library.

It parses arbitrary HTML and is often used in web scrapers (including my own). The question "Python beautiful soup form input parsing" covers parsing HTML forms, and pretty much everything else you'll need is answered on here somewhere :)

from bs4 import BeautifulSoup

def selected_option(select):
    option = select.find("option", selected=True)
    if option: 
        return option['value']

# tag name => how to extract its value
tags = {  
    "input": lambda t: t.get('value'),  # None if the input has no value attribute
    "textarea": lambda t: t.text,
    "select": selected_option
}


def parse_form(html):
    soup = BeautifulSoup(html, 'html.parser')
    form = soup.find("form")
    return {
        e['name']: tags[e.name](e)
        for e in form.find_all(tags.keys())
    }

This produces the following output for your input:

{
    "foo": "bar",
    "area": "long text",
    "your-choice": "a"
}

For production, you will want to add plenty of error checking: form not found, inputs without a name, and so on. How much you add depends on what exactly is needed.

Answer 3:

First of all, consider using response.context instead of response.content. As the Django test client documentation describes, it holds the template context that was used to render response.content. The form attributes you need (name and value) might be in there if you passed them to the renderer as context variables.

If you must use response.content, then I don't think Django provides a way to parse the HTML response. You can use an HTML parser such as BeautifulSoup, or do it with regular expressions.
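
If pulling in a third-party parser is not an option, the standard library's html.parser module can do a similar job. The following is only a minimal sketch under a few assumptions: the class name FormFieldParser is made up for this illustration, only the first <form> is read, bytes input is assumed to be UTF-8, and a <select> without a selected option yields None. Checkboxes, radio buttons and multiple selects are not treated specially.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect name/value pairs from the first <form> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_form = False
        self._done = False       # True once the first form has been closed
        self._textarea = None    # name of the <textarea> currently being read
        self._select = None      # name of the <select> currently being read
        self._chunks = []        # text pieces of the current <textarea>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
        if not self._in_form:
            return
        name = attrs.get("name")
        if tag == "input" and name:
            self.fields[name] = attrs.get("value", "")
        elif tag == "textarea" and name:
            self._textarea, self._chunks = name, []
        elif tag == "select" and name:
            self._select = name
            self.fields[name] = None  # stays None if nothing is selected
        elif tag == "option" and self._select and "selected" in attrs:
            self.fields[self._select] = attrs.get("value")

    def handle_data(self, data):
        if self._textarea is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "textarea" and self._textarea is not None:
            self.fields[self._textarea] = "".join(self._chunks)
            self._textarea = None
        elif tag == "select":
            self._select = None
        elif tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def parse_form(content):
    # Django's response.content is bytes; UTF-8 is assumed here.
    if isinstance(content, bytes):
        content = content.decode("utf-8")
    parser = FormFieldParser()
    parser.feed(content)
    parser.close()
    return parser.fields
```

Calling parse_form() on the form from the question should give a dict with the keys foo, area and your-choice. Regular expressions tend to break on attribute order and quoting, which is why this sketch uses an event-based parser instead.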

Answer 4:

from collections import UserDict

class FormData(UserDict):
    def __init__(self, *args, **kwargs):
        self.frozen = False
        super().__init__(*args, **kwargs)
        self.frozen = True
        
    def __setitem__(self, key, value):
        if self.frozen and key not in self:
            raise ValueError('Key %s is not in the dict. Available: %s' % (
                key, self.keys()
            ))
        super().__setitem__(key, value)

def parse_form(content):
    """
    Parse the first form in the html in content.
    """
    
    import lxml.html
    tree = lxml.html.fromstring(content)
    return FormData(tree.forms[0].fields)

Example usage:


from django.urls import reverse


def test_foo_form(user_client):
    # user_client is assumed to be a pytest fixture providing a logged-in test client
    url = reverse('foo')
    response = user_client.get(url)
    assert response.status_code == 200
    data = parse_form(response.content)
    response = user_client.post(url, data)
    assert response.status_code == 302
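
The answer does not say why the fields are wrapped in a UserDict. Reading the code, the apparent intent is that after construction only keys that already exist in the form may be assigned, so a misspelled field name in a test fails immediately instead of being posted silently. A small sketch of that behaviour follows; the class is repeated so the snippet runs on its own, and a plain dict stands in for tree.forms[0].fields, which is an assumption made only to avoid the lxml dependency here.

```python
from collections import UserDict


class FormData(UserDict):
    """Same idea as the class above, repeated to keep the snippet standalone."""

    def __init__(self, *args, **kwargs):
        self.frozen = False   # allow new keys while the initial data is loaded
        super().__init__(*args, **kwargs)
        self.frozen = True    # from now on only existing keys may be changed

    def __setitem__(self, key, value):
        if self.frozen and key not in self:
            raise ValueError('Key %s is not in the dict. Available: %s' % (
                key, list(self.keys())))
        super().__setitem__(key, value)


# Stand-in for the parsed form fields (illustrative data only).
data = FormData({"foo": "bar", "area": "long text"})

data["foo"] = "changed"       # fine: the form has a field named "foo"

try:
    data["my-input"] = "bar"  # no such field in the form
except ValueError as exc:
    error = str(exc)          # the message lists the available field names
```

With a plain dict, the second assignment would succeed and the test would post a field the form never contained.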

Answer 5:

Just for fun, I tried to replicate the solution proposed by guettli with BeautifulSoup.

Here's what I came up with:

from bs4 import BeautifulSoup


def parse_form(content):
    data = {}
    html = BeautifulSoup(content, features="lxml")
    form = html.find('form', recursive=True)
    fields = form.find_all(('input', 'select', 'textarea'))
    for field in fields:
        name = field.get('name')
        if name:
            if field.name == 'input':
                value = field.get('value')
            elif field.name == 'select':
                try:
                    value = field.find_all('option', selected=True)[0].get('value')
                except IndexError:  # no option is marked as selected
                    value = None
            elif field.name == 'textarea':
                value = field.text
            else:
                # checkbox ? radiobutton ? file ? 
                continue
            data[name] = value
    return data

Is this a better result?

Honestly, I don't think so. On the other hand, if you already use BeautifulSoup to parse the response content in other ways, this might be an option.

Source: https://stackoverflow.com/questions/65570418/django-parse-html-containing-form-to-dictionary

Tags

python

django

html-parsing