BeautifulSoup returning none when element definitely exists

野性不改 2021-01-17 00:58

I'm new to web scraping and have been using BeautifulSoup to scrape daily mortgage rates. However, a lot of the servicer sites that I am trying to scrape return 'none' or

4 Answers

广开言路 (OP)

2021-01-17 01:13
As rubik said, the rates are loaded dynamically using JS, so they are not in the HTML that BeautifulSoup receives. Luckily, the structure of the content is relatively simple. Here is how I analyzed it:

Open a new tab in Chrome (or another browser), right-click, and choose Inspect to open the developer tools. Switch to the Network tab and check the Preserve log option.

Now open the website https://www.popular.com/en/mortgages/. The requests made by the page are listed in the left panel.

Check each item and look at its Preview content until you find the one you want to scrape. Here is what I found: the 2.75% matches the mortgage rate shown on the website.

Now switch to the Headers tab and check the Request URL. This is the final request sent to the server.

The next step is to analyze the Request URL https://apps.popular.com/navs/rates_wm_modx.php?id_rates=1&textcolor=3784D2&backgroundcolor=ffffff&t=1

I guessed that textcolor and backgroundcolor carry CSS information, so I removed them and found that the URL is still valid.

Now we have a simpler URL: https://apps.popular.com/navs/rates_wm_modx.php?id_rates=1&t=1
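If you prefer to strip the parameters in code rather than by hand, here is a minimal sketch using only the standard library. The helper name `drop_query_params` is my own and is not part of the site or of any library.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def drop_query_params(url, names_to_drop):
    """Return url with the named query parameters removed, keeping the rest in order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names_to_drop
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# The full Request URL copied from the Headers tab.
full_url = (
    "https://apps.popular.com/navs/rates_wm_modx.php"
    "?id_rates=1&textcolor=3784D2&backgroundcolor=ffffff&t=1"
)

# Drop the two parameters that only seem to affect styling.
simple_url = drop_query_params(full_url, {"textcolor", "backgroundcolor"})
print(simple_url)
```

This only rewrites the URL string. Whether the server still answers without those parameters has to be checked by actually requesting the shorter URL, as I did in the browser.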

Even without further analysis, it is fairly clear that id_rates indicates the position of the rate in the list of mortgage rates. The question is: what does t mean?

This can be answered by comparing the other Preview contents to find the pattern. I will skip that process here and just give the conclusion.

t=1 indicates the annual interest, t=2 the APR, t=6 the P&I payment, and so on.
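To keep those codes straight, you could build the URLs from a small lookup table. This is only a sketch: the dictionary keys and the function name `build_rate_url` are names I made up, and only the three values of t mentioned above are included.

```python
from urllib.parse import urlencode

BASE_URL = "https://apps.popular.com/navs/rates_wm_modx.php"

# Values of t found by inspecting the Preview contents.
# Other values exist but are not listed here.
FIELD_CODES = {
    "annual_interest": 1,
    "apr": 2,
    "p_and_i_payment": 6,
}


def build_rate_url(id_rates, field):
    """Build the request URL for one mortgage rate entry and one field."""
    query = urlencode({"id_rates": id_rates, "t": FIELD_CODES[field]})
    return BASE_URL + "?" + query


# One URL per known field for the first rate entry.
urls = {field: build_rate_url(1, field) for field in FIELD_CODES}
for field, url in urls.items():
    print(field, url)
```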

After doing this, you can scrape the content from the Request URL directly:
```
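from urllib.request import urlopen  # Python 3 replacement for urllib2
import re

# Request the endpoint that the page's JavaScript calls for the rate value.
url = 'https://apps.popular.com/navs/rates_wm_modx.php?id_rates=1&t=1'
with urlopen(url) as response:
    body = response.read().decode('utf-8', errors='replace')

# Take the first decimal number found in the response body.
annual_interest = re.findall(r"\d+\.\d+", body)[0]
# the annual interest was 2.75 at the time of writing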
```
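One caveat with the snippet above: indexing `[0]` raises an `IndexError` if the response contains no decimal number, for example when the endpoint changes. A slightly more defensive sketch of the extraction step follows. The function name `first_decimal` and the sample response body are my own assumptions, since the exact markup returned by the endpoint is not shown here.

```python
import re

NUMBER_PATTERN = re.compile(r"\d+\.\d+")


def first_decimal(text):
    """Return the first decimal number in text as a float, or None if there is none."""
    match = NUMBER_PATTERN.search(text)
    return float(match.group()) if match else None


# Hypothetical response body, used only to exercise the helper offline.
sample_body = "<span>2.75%</span>"
print(first_decimal(sample_body))

# No decimal number present, so the helper returns None instead of raising.
print(first_decimal("no rate here"))
```

Returning None keeps the failure explicit, so the caller can decide whether to retry, log, or skip that rate.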