Can Jsoup simulate a button press?

情话喂你 2020-12-01 17:09

Can you use Jsoup to submit a search to Google, but instead of sending your request via "Google Search" use "I'm Feeling Lucky"? I would like to capture the name of th

3 Answers
  • 2020-12-01 17:18

    I'd try HtmlUnit for navigating through a site, and Jsoup for scraping.

  • 2020-12-01 17:31

    Yes, it can, if you are able to figure out how Google search queries are made. But Google does not allow this, even if you succeed. You should use their official API to make automated search queries.

    http://code.google.com/intl/en-US/apis/customsearch/v1/overview.html

  • 2020-12-01 17:34

    According to the HTML source of http://google.com, the "I'm Feeling Lucky" button has the name btnI:

    <input value="I'm Feeling Lucky" name="btnI" type="submit" onclick="..." />
    

    So, just adding the btnI parameter to the query string should do (the value doesn't matter):

    http://www.google.com/search?hl=en&btnI=1&q=your+search+term

    So, this Jsoup code should do:

    String url = "http://www.google.com/search?hl=en&btnI=1&q=balusc";
    Document document = Jsoup.connect(url).get();
    System.out.println(document.title());
    

    However, this gave a 403 (Forbidden) error.

    Exception in thread "main" java.io.IOException: 403 error loading URL http://www.google.com/search?hl=en&btnI=1&q=balusc
        at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:387)
        at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:364)
        at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:143)
        at org.jsoup.helper.HttpConnection.get(HttpConnection.java:132)
        at test.Test.main(Test.java:17)
    

    Perhaps Google was sniffing the user agent and detecting that it was Java. So, I changed it:

    String url = "http://www.google.com/search?hl=en&btnI=1&q=balusc";
    Document document = Jsoup.connect(url).userAgent("Mozilla").get();
    System.out.println(document.title());
    

    This yields (as expected):

    The BalusC Code

    The 403 is, however, an indication that Google isn't necessarily happy with bots like that. You might get (temporarily) IP-banned if you do this too often.
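
As a minimal sketch of the query-string approach in the last answer, the btnI URL can be built from an arbitrary search term instead of being hard-coded. The class name `LuckyUrl` and the method `build` are hypothetical helpers introduced here for illustration, not part of Jsoup; the base URL and its parameters are the ones quoted in the answer. The sketch assumes Java 10 or later for `URLEncoder.encode(String, Charset)`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Jsoup): builds the "I'm Feeling Lucky" URL.
class LuckyUrl {
    // Base URL and parameters as quoted in the answer; btnI's value doesn't matter.
    private static final String BASE = "http://www.google.com/search?hl=en&btnI=1&q=";

    static String build(String searchTerm) {
        // URLEncoder turns spaces into '+' and escapes characters such as '&' and '+',
        // so the term cannot break out of the q parameter.
        return BASE + URLEncoder.encode(searchTerm, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(build("your search term"));
        System.out.println(build("C++ & Java"));
    }
}
```

The resulting string can then be passed to `Jsoup.connect(url).userAgent("Mozilla").get()` exactly as in the answer. Encoding the term matters because an unescaped `&` in the search text would otherwise be read as the start of a new query parameter.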
