Extract elements from an HTML page

Submitted by ぐ巨炮叔叔 on 2019-12-24 08:58:41

Question


I downloaded some YouTube comment pages and want to extract the username (or user display name) and the link from code blocks like the following:

 <p class="metadata">
      <span class="author ">
        <a href="/channel/UCuoJ_C5xNTrdnc4motXPHIA" class="yt-uix-sessionlink yt-user-name " data-sessionlink="ei=CKG174zFqbQCFZmaIQodtmyE0A%3D%3D" dir="ltr">Sabil Muhammad</a>
      </span>
        <span class="time" dir="ltr">
          <a dir="ltr" href="http://www.youtube.com/comment?lc=S2ZH2gSPYaef43vTRkLDxUzo2fYicVUc3SFvmYq2jrs">
            il y a 1 jour
          </a>
        </span>
    </p>

I want to extract /channel/UCuoJ_C5xNTrdnc4motXPHIA and Sabil Muhammad.

There are of course many other lines in the HTML page, but I only want to focus on blocks like the one above, extract all usernames and their corresponding links, and write them to a log file.

Are there any good scripts for this? I know bash and C/C++.

Thanks!


Answer 1:


You could use jQuery to accomplish this by iterating through all elements with the 'metadata' class and pulling out the contents you need:

// After including jQuery on the page
$(document).ready(function () {
    // Iterate over each element with the 'metadata' class
    $('.metadata').each(function () {
        // The author anchor holds both the display name and the channel link
        var author = $('.yt-user-name', this);
        // Pull the username
        var username = author.text().trim();
        // Pull the channel link (the '.time a' anchor is the comment permalink)
        var link = author.attr('href');
        // Process each pair as needed
        console.log(username + ': ' + link);
    });
});





Answer 2:


If you use jQuery, it's quite easy. However, if you're doing it in bash or C/C++, you'll need to retrieve the content of the page and parse it for the elements you are interested in. If the markup is well-formed, you could treat the elements as XML and parse out the attributes fairly easily.

You could also use a regex, or simple text matching with substrings.
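As a rough illustration of the regex route, here is a minimal sketch in plain JavaScript (runnable with Node.js, no jQuery). The function name extractAuthors and the variable samplePage are made up for this example, and the pattern assumes that href comes before class inside the tag, exactly as in the question's snippet. Markup that orders the attributes differently would need a looser pattern or a real HTML parser.

```javascript
// Hypothetical helper: returns {link, name} pairs for every anchor whose
// class list contains "yt-user-name".
// Assumption: href precedes class inside the tag, as in the question's snippet.
function extractAuthors(html) {
  const pattern =
    /<a\s+href="([^"]*)"\s+class="[^"]*\byt-user-name\b[^"]*"[^>]*>([^<]*)<\/a>/g;
  const results = [];
  for (const match of html.matchAll(pattern)) {
    results.push({ link: match[1], name: match[2].trim() });
  }
  return results;
}

// Sample input copied from the question.
const samplePage = `<p class="metadata">
  <span class="author ">
    <a href="/channel/UCuoJ_C5xNTrdnc4motXPHIA" class="yt-uix-sessionlink yt-user-name " data-sessionlink="ei=CKG174zFqbQCFZmaIQodtmyE0A%3D%3D" dir="ltr">Sabil Muhammad</a>
  </span>
</p>`;

// One tab-separated line per author: name, then channel link.
for (const { link, name } of extractAuthors(samplePage)) {
  console.log(`${name}\t${link}`);
}
```

Redirecting the script's standard output to a file is one simple way to get the log file the question asks for.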




Answer 3:


With awk (if you are comfortable with bash), you can read the page line by line, start copying when a line matches <p class="metadata">, and stop copying when you reach </p>.

Then work on that extracted part, and so on.
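The same start/stop filter can be sketched outside awk. The version below uses plain JavaScript (Node.js) only to keep one language across the examples in this thread; it is not the awk script this answer has in mind. The names collectMetadataBlocks, parseAuthor, and pageLines are hypothetical, and the sketch assumes the opening <p class="metadata"> tag and the author anchor each sit on a single line, as in the question.

```javascript
// Hypothetical helper mirroring the awk idea: copy lines from
// <p class="metadata"> up to and including the next </p>.
function collectMetadataBlocks(html) {
  const blocks = [];
  let current = null; // null means we are outside a metadata block
  for (const line of html.split(/\r?\n/)) {
    if (current === null && line.includes('<p class="metadata">')) {
      current = []; // start copying
    }
    if (current !== null) {
      current.push(line);
      if (line.includes('</p>')) {
        blocks.push(current.join('\n')); // stop copying
        current = null;
      }
    }
  }
  return blocks;
}

// Hypothetical helper: work on one extracted block and pull the href and
// link text from the line that carries the yt-user-name class.
function parseAuthor(block) {
  const authorLine = block.split('\n').find((l) => l.includes('yt-user-name'));
  if (!authorLine) return null;
  const href = authorLine.match(/href="([^"]*)"/);
  const text = authorLine.match(/>([^<]*)<\/a>/);
  if (!href || !text) return null;
  return { link: href[1], name: text[1].trim() };
}

// Sample page: the question's block surrounded by unrelated lines.
const pageLines = [
  '<div>unrelated</div>',
  ' <p class="metadata">',
  '   <span class="author ">',
  '     <a href="/channel/UCuoJ_C5xNTrdnc4motXPHIA" class="yt-uix-sessionlink yt-user-name " dir="ltr">Sabil Muhammad</a>',
  '   </span>',
  '   <span class="time" dir="ltr">',
  '     <a dir="ltr" href="http://www.youtube.com/comment?lc=S2ZH2gSPYaef43vTRkLDxUzo2fYicVUc3SFvmYq2jrs">',
  '       il y a 1 jour',
  '     </a>',
  '   </span>',
  ' </p>',
  '<p>another unrelated paragraph</p>',
].join('\n');

for (const block of collectMetadataBlocks(pageLines)) {
  const author = parseAuthor(block);
  if (author) console.log(`${author.name}\t${author.link}`);
}
```

Splitting the work into a block filter and a per-block parser follows the two steps described above, so the comment permalink inside the .time span never gets mistaken for the channel link.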



Source: https://stackoverflow.com/questions/13978021/extract-elements-from-a-html-page
