Web scraping to fill out search forms and retrieve the results?

孤独总比滥情好 2021-01-02 17:49

I was wondering if it is possible to "automate" the task of typing entries into search forms and extracting matches from the results. For instance, I have a list of journ

4 answers
  •  一生所求
    2021-01-02 17:59

    There are many tools for web scraping. A good one is iMacros, a Firefox plugin that needs no programming knowledge at all. The free version can be downloaded here: https://addons.mozilla.org/en-US/firefox/addon/imacros-for-firefox/ The best thing about iMacros is that it can get you started in minutes. It can also be launched from the bash command line and called from within bash scripts.

    A more advanced option is Selenium WebDriver. I chose Selenium because its documentation is well suited to beginners. Reading just the following page:

    would get you up and running in no time. Selenium has bindings for Java, Python, PHP, and C#, so if you know any of these languages, the commands will look familiar. I prefer the WebDriver variant of Selenium because it opens a real browser, so you can check the fields and outputs as the script runs. Once the script works, you can switch the browser to headless mode so that it runs without a visible window.

    To install the Selenium Python package, run the following command (pip has replaced the deprecated easy_install):

    pip install selenium
    

    This installs the Selenium Python package along with its dependencies.
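A quick way to confirm that the install worked is to ask Python which version it can see. This is only a sketch: the helper name `installed_version` is made up for illustration, and it relies on the standard library's `importlib.metadata` (Python 3.8 or later).

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if Selenium is installed, otherwise None.
print(installed_version("selenium"))
```

If this prints `None`, the package was probably installed for a different Python interpreter than the one you are running.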

    To run commands interactively, open a terminal and type

    python
    

    You will see the Python prompt, >>>, where you can type commands.

    Here is sample code that searches Google for "Cheese!". Note that this sample is written in Java, so it has to be compiled and run as a Java program rather than pasted at the Python prompt:

    package org.openqa.selenium.example;
    
    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.firefox.FirefoxDriver;
    import org.openqa.selenium.support.ui.ExpectedCondition;
    import org.openqa.selenium.support.ui.WebDriverWait;
    
    public class Selenium2Example  {
        public static void main(String[] args) {
            // Create a new instance of the Firefox driver
            // Notice that the remainder of the code relies on the interface, 
            // not the implementation.
            WebDriver driver = new FirefoxDriver();
    
            // And now use this to visit Google
            driver.get("http://www.google.com");
            // Alternatively the same thing can be done like this
            // driver.navigate().to("http://www.google.com");
    
            // Find the text input element by its name
            WebElement element = driver.findElement(By.name("q"));
    
            // Enter something to search for
            element.sendKeys("Cheese!");
    
            // Now submit the form. WebDriver will find the form for us from the element
            element.submit();
    
            // Check the title of the page
            System.out.println("Page title is: " + driver.getTitle());
    
            // Google's search is rendered dynamically with JavaScript.
            // Wait for the page to load, timeout after 10 seconds
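            // (Newer Selenium releases take java.time.Duration.ofSeconds(10) instead of the plain 10.)
            (new WebDriverWait(driver, 10)).until(new ExpectedCondition<Boolean>() {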
                public Boolean apply(WebDriver d) {
                    return d.getTitle().toLowerCase().startsWith("cheese!");
                }
            });
    
            // Should see: "cheese! - Google Search"
            System.out.println("Page title is: " + driver.getTitle());
    
            //Close the browser
            driver.quit();
        }
    }
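Since the steps above start a Python prompt, a Python counterpart of the Java sample may be easier to try. The following is a minimal sketch, not the author's code: the helper names `open_firefox` and `search_titles` are made up for illustration, the hand-written polling loop stands in for Selenium's `WebDriverWait`, and it assumes that Google's search box is still named `q` and that the result page title starts with the query, as the Java sample does. It loops over a list of search terms, which is closer to the use case in the question.

```python
import time


def open_firefox(headless=False):
    """Start Firefox via Selenium (needs the selenium package, Firefox and its driver)."""
    # Imported lazily so search_titles() below can be tried without a browser.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    if headless:
        options.add_argument("-headless")  # run without a visible window
    return webdriver.Firefox(options=options)


def search_titles(driver, queries, url="https://www.google.com", timeout=10.0):
    """Submit each query through the search form and return {query: page title}."""
    titles = {}
    for query in queries:
        driver.get(url)
        # "name" is the locator strategy that selenium exposes as By.NAME.
        box = driver.find_element("name", "q")
        box.send_keys(query)
        box.submit()
        # Poll until the title reflects the query, or give up after `timeout` seconds.
        deadline = time.monotonic() + timeout
        while not driver.title.lower().startswith(query.lower()):
            if time.monotonic() > deadline:
                raise TimeoutError(f"no result page for {query!r}")
            time.sleep(0.2)
        titles[query] = driver.title
    return titles
```

With a real browser this could be used as `driver = open_firefox(headless=True)`, then `search_titles(driver, ["Cheese!"])`, and finally `driver.quit()`. To extract the matches themselves rather than the page title, you would locate the result elements with `driver.find_elements` using a selector that fits the target site.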
    

    I hope that this can give you a head start.

    Cheers :)
