I'm new to web scraping and have been using BeautifulSoup to scrape daily mortgage rates. However, a lot of the servicer sites that I am trying to scrape return 'none' or
As rubik said, the rates are loaded dynamically using JS. Luckily, the structure of the content is relatively simple. Here is how I analyzed it:
Open a new tab in Chrome (or another browser), right-click, and choose Inspect to open the developer tools. Switch to the Network tab and check the Preserve log option.
Now open the website https://www.popular.com/en/mortgages/. The loaded requests are listed in the left panel.
Check each item and look at its Preview content until you find the one you want to scrape. Here is what I found: the 2.75% matches the mortgage rate shown on the website.
Now switch to the Headers tab and check the Request URL. This is the actual request sent to the server.
The next step is to analyze the Request URL: https://apps.popular.com/navs/rates_wm_modx.php?id_rates=1&textcolor=3784D2&backgroundcolor=ffffff&t=1
I guessed that textcolor and backgroundcolor carry CSS information, so I removed them and found that the URL is still valid.
Now we have a simpler URL: https://apps.popular.com/navs/rates_wm_modx.php?id_rates=1&t=1
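If you would rather not edit the query string by hand, the same trimming can be sketched with the standard library. The helper name drop_params is my own, and which parameters are safe to drop is only the guess described above:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_params(url, names):
    """Return url with the query parameters listed in names removed."""
    parts = urlsplit(url)
    # Keep every (key, value) pair whose key is not in the drop list.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in names]
    return urlunsplit(parts._replace(query=urlencode(kept)))


full_url = ("https://apps.popular.com/navs/rates_wm_modx.php"
            "?id_rates=1&textcolor=3784D2&backgroundcolor=ffffff&t=1")

# textcolor and backgroundcolor look like styling only, so drop them.
simple_url = drop_params(full_url, {"textcolor", "backgroundcolor"})
print(simple_url)
```

You still need to request the shortened URL once to confirm the server accepts it, as I did above.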
Even without further analysis, it is fairly clear that id_rates indicates the order of the mortgage rates. The question is: what does t mean?
This can be answered by comparing the Preview content of other requests to find the pattern. I will skip that process here and just give the conclusion: t=1 indicates the annual interest, t=2 indicates the APR, t=6 indicates the P&I payment, and so on.
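To keep these codes in one place, a small lookup table can build the Request URL for each field. This is only a sketch: the dictionary keys and the rate_url helper are names I made up, and only the three t values listed above are included, since those are the ones I checked.

```python
from urllib.parse import urlencode

BASE_URL = "https://apps.popular.com/navs/rates_wm_modx.php"

# t codes found by comparing Preview contents; other values are not listed here.
FIELDS = {
    "annual_interest": 1,
    "apr": 2,
    "p_and_i_payment": 6,
}


def rate_url(id_rates, field):
    """Build the Request URL for one mortgage (id_rates) and one field (t)."""
    query = urlencode({"id_rates": id_rates, "t": FIELDS[field]})
    return BASE_URL + "?" + query


for name in FIELDS:
    print(name, rate_url(1, name))
```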
After doing this, you can scrape the content from the Request URL directly:
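from urllib.request import urlopen  # Python 3; in Python 2 this was urllib2
import re

url = 'https://apps.popular.com/navs/rates_wm_modx.php?id_rates=1&t=1'
# Download the response body and decode it to text
with urlopen(url) as response:
    body = response.read().decode('utf-8', errors='replace')
# The first decimal number in the body is the rate
annual_interest = re.findall(r"\d+\.\d+", body)[0]
# the annual interest is 2.75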
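Indexing [0] raises an IndexError if the response ever contains no decimal number, so here is a slightly more defensive sketch, assuming Python 3. The function names and the sample string are hypothetical; the real response markup may look different.

```python
import re
from urllib.request import urlopen

RATE_PATTERN = re.compile(r"\d+\.\d+")


def extract_rate(text):
    """Return the first decimal number in text as a float, or None if absent."""
    match = RATE_PATTERN.search(text)
    return float(match.group()) if match else None


def fetch_rate(url, timeout=10):
    """Download url and extract the first decimal number from its body."""
    with urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
    return extract_rate(body)


# Offline check with a made-up response body, so no request is sent here.
sample = "<span>2.75%</span>"
print(extract_rate(sample))
print(extract_rate("n/a"))
```

Calling fetch_rate with the Request URL from above would then return the rate as a float, or None if the page layout has changed.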