I'm new to web scraping and have been using BeautifulSoup to scrape daily mortgage rates. However, a lot of the servicer sites that I am trying to scrape return 'none' or
As rubik said, the rates are loaded dynamically using JS. Luckily, the structure of the content is relatively simple. Here is how I analyzed it:
Open a new tab in Chrome (or another browser), right-click, and choose Inspect to open the developer tools. Switch to the Network tab and check the Preserve log option.
Now open the website https://www.popular.com/en/mortgages/. The loaded resources are listed in the left panel.
Check each item and look at its Preview content until you find the one you want to scrape. Here is what I found: the 2.75% matches the mortgage rate shown on the website.
Now switch to the Headers tab and check the Request URL. This is the final request sent to the server.
The next step is to analyze the Request URL: https://apps.popular.com/navs/rates_wm_modx.php?id_rates=1&textcolor=3784D2&backgroundcolor=ffffff&t=1
I guessed that textcolor and backgroundcolor carry CSS information, so I removed them and found that the URL is still valid.
Now we have a simpler URL: https://apps.popular.com/navs/rates_wm_modx.php?id_rates=1&t=1
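As a rough sketch of that simplification, the presentation-only parameters can also be stripped with the standard library instead of by hand. The helper name strip_params is hypothetical and only there for illustration; the URL is the one observed in the Network tab.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Request URL as observed in the Network tab
ORIGINAL_URL = (
    "https://apps.popular.com/navs/rates_wm_modx.php"
    "?id_rates=1&textcolor=3784D2&backgroundcolor=ffffff&t=1"
)

def strip_params(url, drop):
    """Return url without the query parameters named in drop (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep every (name, value) pair whose name is not in the drop set
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in drop]
    return urlunsplit(parts._replace(query=urlencode(kept)))

simple_url = strip_params(ORIGINAL_URL, {"textcolor", "backgroundcolor"})
print(simple_url)
```

Whether the server keeps accepting the shorter URL is something you have to check by requesting it, as I did above.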
Even without further analysis, it is fairly clear that id_rates indicates the position of the mortgage rate in the list. The question is: what does t mean?
This can be answered by analyzing the other Preview contents to find the pattern. I will skip that process here and just give the conclusion.
t=1 indicates Annual interest, t=2 indicates APR, t=6 indicates P&I Payment, etc.
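To make that mapping concrete, here is a minimal sketch that builds one request URL per field. It covers only the three t values listed above; the names FIELDS and rate_url are my own, and any other t value would have to be worked out the same way from the Preview contents.

```python
from urllib.parse import urlencode

BASE_URL = "https://apps.popular.com/navs/rates_wm_modx.php"

# Only the three values found above; other values of t are not covered here
FIELDS = {"Annual interest": 1, "APR": 2, "P&I Payment": 6}

def rate_url(field, id_rates=1):
    """Build the request URL for one field of one mortgage product (hypothetical helper)."""
    return BASE_URL + "?" + urlencode({"id_rates": id_rates, "t": FIELDS[field]})

for name in FIELDS:
    print(name, "->", rate_url(name))
```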
With that done, you can scrape the content from the Request URL directly:
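from urllib.request import urlopen  # Python 3; urllib2 exists only in Python 2
import re

url = 'https://apps.popular.com/navs/rates_wm_modx.php?id_rates=1&t=1'
# The response is a small JavaScript snippet; decode the bytes before searching
body = urlopen(url).read().decode('utf-8', errors='replace')
annual_interest = re.findall(r"\d+\.\d+", body)[0]
# the annual interest is 2.75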
Use pip install html5lib. Note that pip install bs4 installs Beautiful Soup itself but not html5lib, which is an optional parser, so install it separately. If you are using PyCharm like me, run pip install bs4 on the command line, then open PyCharm, go to the interpreter settings, and add Beautiful Soup and html5lib. html5lib is a parser, an alternative to the built-in html.parser. For more information, see the Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
If you check the page source (for example via view-source: in Chrome or Firefox, or by writing your html string to a file), you'll see that the element you are looking for is not there. In fact, the rates are loaded dynamically:
<td>
<span class="text-md text-popular-medium-blue">
<script type="text/javascript" src = "https://apps.popular.com/navs/rates_wm_modx.php?id_rates=1&textcolor=3784D2&backgroundcolor=ffffff&t=1"></script>
</span>
</td>
You can follow the script URL and you'll see that the response is something like the following:
document.write('<div>2.75%</div>')
This response is probably regular enough to be able to use regexes on it.
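A minimal sketch of that regex approach, run against the sample response shown above rather than a live request. The function name extract_rate is hypothetical, and a real response may be formatted differently, so treat the pattern as a starting point.

```python
import re

# Sample response copied from the example above; a live response may differ
sample = "document.write('<div>2.75%</div>')"

def extract_rate(js_text):
    """Return the first percentage found in js_text as a float, or None if there is none."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", js_text)
    return float(match.group(1)) if match else None

print(extract_rate(sample))
```

Returning None when nothing matches makes it easier to notice when the site changes its output, instead of failing with an IndexError.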
To get the data you are after, you can use Selenium together with Python, something like the following:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.popular.com/en/mortgages/')
soup = BeautifulSoup(driver.page_source,"lxml")
item = soup.select('.table-responsive')[0].select("span div")[0].text
print(item)
driver.quit()
Result:
2.75%