I am learning to use Python Selenium and BeautifulSoup for web scraping. Currently, I am trying to scrape the hot searches on Google search trends http://www.google.com/trends/h
Users add more content to the page (from previous dates) by clicking the element with id="moreLink" at the bottom of the page.
So to get your desired content, you could use Selenium to click the id="moreLink" element, or execute some JavaScript to call control.moreData(); in a loop.
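The click-based option might be sketched like this. It is only a rough illustration: click_more is a hypothetical helper name, browser is assumed to be a Selenium WebDriver instance, and the "moreLink" id is taken from the page as described above.

```python
def click_more(browser, element_id="moreLink"):
    """Click the element that loads older entries; return True if it was clicked."""
    # "id" is the locator strategy string that Selenium's By.ID constant stands for.
    elements = browser.find_elements("id", element_id)
    if not elements:
        # The element is not on the page (yet), so there is nothing to click.
        return False
    elements[0].click()
    return True
```

Using find_elements rather than find_element means a missing link gives you False instead of an exception, which is easier to handle inside a loop.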
For example, if you want to get all content as far back as Friday, February 15, 2013 (a string in this format appears to exist for every date in the loaded content), your Python might look something like this:
content = browser.page_source
# Keep loading older entries until the target date shows up in the page source.
while "Friday, February 15, 2013" not in content:
    # Ask the page's own JavaScript to load the next batch of dates.
    browser.execute_script("control.moreData();")
    content = browser.page_source
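One caveat with the loop above: if the date string never appears, it runs forever. A bounded variant could look like the sketch below. The names load_until, max_rounds and pause are made up for illustration (they are not part of Selenium), and browser is assumed to be a WebDriver instance on a page that defines control.moreData().

```python
import time


def load_until(browser, marker, max_rounds=30, pause=1.0):
    """Call control.moreData() until `marker` appears in the page source.

    Returns the page source once the marker is found, or None if it is
    still missing after `max_rounds` attempts.
    """
    content = browser.page_source
    for _ in range(max_rounds):
        if marker in content:
            return content
        # Ask the page to load the next batch of older entries.
        browser.execute_script("control.moreData();")
        # Give the asynchronous request a moment to finish.
        time.sleep(pause)
        content = browser.page_source
    return content if marker in content else None
```

You would call it as content = load_until(browser, "Friday, February 15, 2013") and check for None before parsing.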
EDIT:
If you disable JavaScript in your browser and reload the page, you will see that there is no "trends" content at all. That tells me those items are loaded dynamically: they are not part of the HTML document that is downloaded when you open the page. Selenium's .get() waits for the HTML document to load, but not for all JavaScript to complete. There is no telling whether asynchronous JavaScript will finish before or after any other event; it finishes when it is ready, and the timing can differ on every run. That would explain why you sometimes get all, some, or none of that content when you call browser.page_source: it depends on how fast the asynchronous JavaScript happens to be running at that moment.
So, after opening the page, you might try waiting a few seconds before getting the source - giving the JS which loads the content time to complete.
import time

browser.get(googleURL)
# Give the page's JavaScript a few seconds to load the trends content.
time.sleep(3)
content = browser.page_source
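A fixed sleep is a guess: it may be too short on a slow connection and wasted time on a fast one. A polling approach is one alternative, sketched below. wait_for_source and its parameters are hypothetical names for illustration, and marker is whatever text you expect to see once the content has loaded. Selenium also ships WebDriverWait (in selenium.webdriver.support.ui) for this kind of wait, which may suit you better.

```python
import time


def wait_for_source(browser, marker, timeout=10.0, poll=0.5):
    """Poll browser.page_source until `marker` appears or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        content = browser.page_source
        if marker in content:
            return content
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not found within {timeout} seconds")
        # Wait briefly before checking the page again.
        time.sleep(poll)
```

After browser.get(googleURL), something like content = wait_for_source(browser, "some text you expect") returns as soon as the content is there instead of always waiting the full three seconds.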