Scrape title by only downloading relevant part of webpage

Front-end · open · 6 answers · 1713 views

深忆病人
Asked 2021-02-05 10:45

I would like to scrape just the title of a webpage using Python. I need to do this for thousands of sites, so it has to be fast. I've seen previous questions like retrieving jus…

6 Answers

不思量自难忘°
2021-02-05 11:29

I don't think the kind of thing you want can be done: the way the web is set up, you get the response to a request before anything is parsed, and there usually isn't a streaming "if you encounter `</title>`, stop giving me data" flag. If there is, I'd love to see it. But there is something that may help you: a `Range` header. Keep in mind that not all sites respect it, so some will force you to download the entire page source before you can act on it, but a lot of them will let you specify a byte range. A requests example:

```python
import requests

targeturl = "http://www.urbandictionary.com/define.php?term=Blarg&page=2"
rangeheader = {"Range": "bytes=0-150"}
response = requests.get(targeturl, headers=rangeheader)
response.text
```

and you get:

```
'<!DOCTYPE html>\n<html lang="en-US" prefix="og: http://ogp.me/ns#'
```

Of course, there are problems with this. What if the range you specify is too short to contain the title of the page? What's a good range to aim for (a combination of speed and assurance of accuracy)? And what happens if the page doesn't respect `Range`? (Most of the time you just get the whole response you would have gotten without it.)

I don't know if this helps (I hope so), but I've done similar things, fetching only file headers for download checking.

EDIT4:

I thought of another somewhat hacky thing that might help. Nearly every site has a 404 "page not found" page, and we may be able to use that to our advantage: instead of requesting the regular page, request something like

```
http://www.urbandictionary.com/nothing.php
```

The regular page will have tons of information, links, and data. The 404 page is nothing more than a message and (in this case) a video; usually there is no video, just some text. But notice that the title still appears here.
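(As an aside on the streaming point above, and not part of the original approach: requests can get fairly close to a "stop at `</title>`" switch with `stream=True`, reading the body in chunks and bailing out once the closing tag shows up. A rough sketch; the chunk size, byte cap, and helper names are my own arbitrary choices:)

```python
import re
import requests

def extract_title(html_bytes):
    """Pull the <title> text out of a (possibly partial) HTML byte string, if present."""
    match = re.search(rb'<title[^>]*>(.*?)</title>', html_bytes, re.S | re.I)
    return match.group(1).decode('utf-8', 'replace').strip() if match else None

def stream_title(url, max_bytes=65536, chunk_size=1024):
    """Read the response in chunks and stop downloading once </title> appears."""
    buf = b''
    with requests.get(url, stream=True, timeout=10) as response:
        for chunk in response.iter_content(chunk_size=chunk_size):
            buf += chunk
            if b'</title>' in buf.lower() or len(buf) >= max_bytes:
                break  # stop reading; the connection is closed on exiting the with-block
    return extract_title(buf)
```

This still depends on the server sending the title early in the body, and tearing down the connection has some cost, but it avoids relying on `Range` support entirely. Back to the 404 idea: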
So perhaps we can request something we know does not exist on any site, like

```
X5ijsuUJSoisjHJFk948.php
```

and get a 404 from each page. That way you only download a very small, minimal page and nothing more, which significantly reduces the amount of data you download, increasing speed and efficiency.

Here's the problem with this method: you need to check somehow whether the site supplies its own 404 page. Most sites have one, because it looks good with the site and it's standard practice to include one, but not all of them do, so make sure to handle that case. Still, I think it's worth trying out: over the course of thousands of sites, it would save many milliseconds of download time for each page.

EDIT5:

As we discussed, since you are interested in URLs that redirect, we can make use of an HTTP HEAD request, which won't fetch the site content, just the headers. So in this case:

```python
response = requests.head('http://myshortenedurl.com/5b2su2')
```

(replace myshortenedurl with a tinyurl link to follow along)

```
>>> response
<Response [301]>
```

Nice, so we know this redirects to something.

```
>>> response.headers['Location']
'http://stackoverflow.com'
```

Now we know where the URL redirects without actually following it or downloading any page source, and we can apply any of the other techniques previously discussed.

Here's an example using the requests and lxml modules and the 404-page idea.
(Be aware, I had to replace bit.ly with bit'ly so Stack Overflow doesn't get mad.)

```python
#!/usr/bin/python3
import requests
from lxml.html import fromstring

links = ["http://bit'ly/MW2qgH",
         "http://bit'ly/1x0885j",
         "http://bit'ly/IFHzvO",
         "http://bit'ly/1PwR9xM"]

for link in links:
    # follow the redirect chain using HEAD requests only
    redirect = link
    while True:
        response = requests.head(redirect)
        try:
            redirect = response.headers['Location']
        except KeyError:
            break  # no Location header, so no further redirect
    # request a page that (hopefully) doesn't exist to get the small 404 page
    fakepage = redirect + 'X5ijsuUJSoisjHJFk948.php'
    scrapetarget = requests.get(fakepage)
    tree = fromstring(scrapetarget.text)
    print(tree.findtext('.//title'))
```

So here we get the 404 pages, and it will follow any number of redirects. Here's the output:

```
Urban Dictionary error
Page Not Found - Stack Overflow
Error 404 (Not Found)!!1
Kijiji: Page Not Found
```

As you can see, we did indeed get our titles, but the method has problems: some sites add things to the 404 title, and some just don't have a good title at all. We could, however, try the range method too. Its benefit is that the title would be correct, but sometimes we might miss it, and sometimes we have to download the whole page source to get it, increasing the required time.

Credit to alecxe for this part of my quick and dirty script:

```python
tree = fromstring(scrapetarget.text)
print(tree.findtext('.//title'))
```

For an example with the range method: in the `for link in links:` loop, replace the code after the `while` loop with this:

```python
rangeheader = {"Range": "bytes=0-500"}
scrapetargetsection = requests.get(redirect, headers=rangeheader)
tree = fromstring(scrapetargetsection.text)
print(tree.findtext('.//title'))
```

The output is:

```
None
Stack Overflow
Google
Kijiji: Free Classifieds in...
```

Here we see Urban Dictionary has no title, or I've missed it in the bytes returned. Every one of these methods has tradeoffs; the only way to get close to total accuracy would be to download the entire source of each page, I think.
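(One more sketch of my own, not from the answer above, combining the tradeoffs it discusses: start with a small `Range` request and double it until a complete title turns up or a cap is hit, falling back to giving up if the server ignores `Range`. The byte limits and helper names are arbitrary guesses:)

```python
import re
import requests

def find_title(html_text):
    """Return the <title> text from a (possibly partial) HTML string, else None."""
    match = re.search(r'<title[^>]*>(.*?)</title>', html_text, re.S | re.I)
    return match.group(1).strip() if match else None

def title_with_growing_range(url, start=256, cap=8192):
    """Request increasingly large byte ranges until a complete title is seen."""
    nbytes = start
    while nbytes <= cap:
        headers = {'Range': 'bytes=0-%d' % (nbytes - 1)}
        response = requests.get(url, headers=headers, timeout=10)
        title = find_title(response.text)
        if title is not None:
            return title
        if response.status_code != 206:
            # the server ignored Range and sent the full body; no title was found
            return None
        nbytes *= 2  # partial body had no complete title yet, so ask for more
    return None
```

This keeps the small-download benefit of the range method while handling the "range too short" case the answer raises, at the cost of extra round trips on title-heavy pages.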