How to unit test a web scraping service with PHPUnit

Posted by こ雲淡風輕ζ on 2021-01-29 22:56:28

Question


I am currently developing a project in PHP + Laravel that needs to scrape data from two different websites. I am using the Goutte scraping library. I have 10 integration tests, where I use the Crawler object that Goutte's Client provides to get the specific data I want to scrape from each website.

The tests work just fine (I even used the Infection library for mutation testing). But I think there could be a way to unit test all the functions, so that the tests would run faster.

The approach I tried is to scrape the full HTML from each of the two websites and assert that the scraped HTML equals an HTML file stored locally in my project, which should be a copy of the same page. If the local HTML and the scraped HTML are identical, I could then pass the local HTML to the functions that target specific HTML tags to retrieve the info I want. I hope this makes sense.

I hope my code can enlighten you guys a bit more:

My test class looks like this:

    private $html;

    protected function setUp(): void
    {
        // Load the locally saved copy of the page to compare against.
        $this->html = file_get_contents("path\myLocal.html");
    }

    public function test_webScrapping_returns_html()
    {
        $scrapper = new WebScraping();
        $url = "www.the-url-I-wanna-scrape.com";

        $scrappedHtml = $scrapper->getHtml($url);

        // Strict comparison: fails on any byte difference, including line endings.
        $this->assertSame($this->html, $scrappedHtml);
    }

And the getHtml() function of my WebScraping model looks like this:

    public function getHtml(string $url): string
    {
        // I know that I should not instantiate the Goutte Client here (inject it in __construct instead?)
        $client = new Client();
        $html = $client->request('GET', $url)->html();

        return $html;
    }
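
To make the constructor-injection idea from the comment in getHtml() concrete, here is a minimal sketch. It is only an illustration under stated assumptions: `HtmlFetcher`, `FixtureFetcher`, and `ScrapingService` are hypothetical names, not part of Goutte, and the title lookup uses PHP's built-in `DOMDocument` instead of Goutte's Crawler so that the block runs without any dependencies.

```php
<?php
// Hypothetical abstraction over "something that downloads HTML".
// A production implementation would wrap the real HTTP client.
interface HtmlFetcher
{
    public function fetch(string $url): string;
}

// Test double: returns canned HTML and never touches the network.
final class FixtureFetcher implements HtmlFetcher
{
    private string $html;

    public function __construct(string $html)
    {
        $this->html = $html;
    }

    public function fetch(string $url): string
    {
        return $this->html;
    }
}

// Hypothetical service: the fetcher is injected instead of created inside.
final class ScrapingService
{
    private HtmlFetcher $fetcher;

    public function __construct(HtmlFetcher $fetcher)
    {
        $this->fetcher = $fetcher;
    }

    public function getTitle(string $url): string
    {
        $dom = new DOMDocument();
        // Suppress the warnings libxml emits for imperfect real-world markup.
        @$dom->loadHTML($this->fetcher->fetch($url));
        $nodes = $dom->getElementsByTagName('title');

        return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : '';
    }
}

// Usage: the parsing logic is exercised against fixed HTML, with no HTTP request.
$service = new ScrapingService(
    new FixtureFetcher('<html><head><title> Example page </title></head><body></body></html>')
);
echo $service->getTitle('www.the-url-I-wanna-scrape.com'), PHP_EOL;
```

With this shape, the slow network call would live only in the real fetcher, while the tag-extraction functions could be tested against fixture HTML.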

The thing is that if I dd($this->html) or dd($scrappedHtml), the content is pretty much the same; the only difference is that one has \n and \r interspersed and the other doesn't. So both HTML strings have the same content, but I cannot assert that they are equal. What am I missing? Am I on the right path, or would you follow a totally different approach?
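
The \n and \r mismatch described above can be reproduced with two short strings. This is a sketch that assumes line endings are the only difference between the two documents; `normalizeLineEndings` is a hypothetical helper, not something from Goutte or PHPUnit.

```php
<?php
// Hypothetical helper: convert Windows (\r\n) and bare \r line endings to \n.
function normalizeLineEndings(string $text): string
{
    // Replace "\r\n" first so it does not turn into two newlines.
    return str_replace(["\r\n", "\r"], "\n", $text);
}

$local   = "<html>\r\n<body>Hi</body>\r\n</html>"; // e.g. a file saved with CRLF
$scraped = "<html>\n<body>Hi</body>\n</html>";     // e.g. a response using LF

var_dump($local === $scraped);                                             // bool(false)
var_dump(normalizeLineEndings($local) === normalizeLineEndings($scraped)); // bool(true)
```

If the two documents differ in more than line endings (for example whitespace changed when a parser re-serialized the HTML), this normalization alone would not make them equal.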

Source: https://stackoverflow.com/questions/63459024/how-to-unit-test-a-web-scraping-service-php-unit
