I want to scrape the contents of a webpage. The contents are produced after a form on that site has been filled in and submitted.
I\'ve read on how to scrape the en
You can do it with javascript. If the form is something like:
<form name='myform' ...
Then you can do this in javascript:
<script language="JavaScript">
function submitform()
{
document.myform.submit();
}
</script>
You can use the "onClick" attribute of links or buttons to invoke this code. To invoke it automatically when a page is loaded, use the "onLoad" attribute of the element:
<body onLoad="submitform()" ...>
you'll need to generate a HTTP request containing the data for the form.
The form will look something like:
<form action="submit.php" method="POST"> ... </form>
This tells you the url to request is www.example.com/submit.php and your request should be a POST.
In the form will be several input items, eg:
<input type="text" name="itemnumber"> ... </input>
you need to create a string of all these input name=value pairs encoded for a URL appended to the end of your requested URL, which now becomes www.example.com/submit.php?itemnumber=5234&otherinput=othervalue etc... This will work fine for GET. POST is a little trickier.
</motivation>
Just follow S.Lott's links for some much easier to use library support :P
Using python, I think it takes the following steps:
this explains form elements in html file
From a similar question - options-for-html-scraping - you can learn that with Python you can use Beautiful Soup.
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:
- Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
- Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
- Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
The unusual name caught the attention of our host, November 12, 2008.