Performing image scraping using YQL with the lowest possible resource usage, i.e. the lowest number of queries

Submitted by 孤人 on 2019-12-24 01:36:07

Question


I am trying to build an image scraping tool that lets the user scrape all the images contained in a given page using XPath, then processes the scraped images to find which have an alt attribute and which don't, and returns the result as two separate JSON lists,

i.e. {alted:["",""],nonAlted:["",""]}

Now comes my problem: although I am able to scrape the page, retrieve all the images, and separate them into the alted and nonAlted categories, I can't put them in the response object.

To clarify my issue, here is the code I use in the execute block of my YQL table:

var query = "select * from html where url='http://www.mysite.com/page-path' and xpath='//li'";
var result = y.query(query);

y.log(result.results..img.(@alt));

var querieselement = <urls/>; 
querieselement.query = result.results..img.(@alt);

response.object = querieselement;

So my question is: how can I set the response object to contain the processed list of images? Note that after running the query, the result doesn't show any data, although the log shows the list. I hope someone can point me to the cause of that problem.


P.S. The reason I mentioned "resource usage" in the title is that I am aware I could perform two separate calls, one for each image category, but that means scraping the same page twice, which I think is inefficient.
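To illustrate the single-fetch idea outside of YQL, here is a minimal sketch in plain JavaScript (not E4X, and not the YQL execute API). It assumes the images from one fetch are already available as objects with a `src` and an optional `alt`; the function name `partitionByAlt` and that input shape are hypothetical, chosen only for the example.

```javascript
// Sketch under stated assumptions: `images` is a hypothetical array with one
// object per <img>, where `alt` is simply absent if the tag has no alt attribute.
function partitionByAlt(images) {
  const result = { alted: [], nonAlted: [] };
  for (const img of images) {
    // An empty alt="" still counts as "has alt"; only a missing attribute
    // sends the image to nonAlted.
    const bucket = img.alt != null ? "alted" : "nonAlted";
    result[bucket].push(img.src);
  }
  return result;
}

// Example call with made-up data.
const sample = [
  { src: "/a.png", alt: "Logo" },
  { src: "/b.png" },
  { src: "/c.png", alt: "" },
];
console.log(JSON.stringify(partitionByAlt(sample)));
```

The page data is walked once and both lists are filled in the same loop, so no second query is needed to obtain the other category.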


P.S. I would also be glad if someone could help me understand the meaning of these two lines:

querieselement = <urls/>;
querieselement.query = result.results..img.(@alt);

Why "<urls/>" and why "querieselement.query"? I don't know what they are supposed to do, yet they seem to be doing a critical job, since changing them breaks the code.

Thanks.


Answer 1:


So my question is: how can I set the response object to contain the processed list of images?

Use a stylesheet rather than an XPath selector:

 select * from xslt where url="http://www.mysite.com/page-path" and stylesheet="http://www.mysite.com/page-path.xsl"

Define the stylesheet as follows:

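  <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Images that have an alt attribute: emit each alt value.
         Values are emitted as quoted strings; a value containing a
         double quote would still need escaping. -->
    <xsl:template match="img[@alt]">
      <xsl:for-each select="@alt">
        <script>
          alt.push("<xsl:value-of select="."/>");
        </script>
      </xsl:for-each>
    </xsl:template>

    <!-- Images without an alt attribute: emit each src value. -->
    <xsl:template match="img[not(@alt)]">
      <xsl:for-each select="@src">
        <script>
          noalt.push("<xsl:value-of select="."/>");
        </script>
      </xsl:for-each>
    </xsl:template>

  </xsl:stylesheet>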
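The split made by the two match patterns, `img[@alt]` and `img[not(@alt)]`, can be tried locally with a rough sketch in plain JavaScript. This is not how YQL or an XSLT processor works internally: the regex scan is an assumption made to keep the example dependency-free, it is not a robust HTML parser, and the function name `splitImgTags` is hypothetical. Unlike the stylesheet, the sketch collects `src` for both groups, an assumption based on the output shape the question asks for.

```javascript
// Sketch only: scans a markup string for <img> tags and splits them by the
// presence of an alt attribute, mirroring img[@alt] versus img[not(@alt)].
function splitImgTags(html) {
  const alted = [];
  const nonAlted = [];
  // Assumes no ">" appears inside attribute values.
  const tags = html.match(/<img\b[^>]*>/gi) || [];
  for (const tag of tags) {
    const src = /\ssrc\s*=\s*["']([^"']*)["']/i.exec(tag);
    if (!src) continue; // skip images without a quoted src
    // Requiring whitespace before "alt" avoids matching data-alt and similar.
    const hasAlt = /\salt\s*=/i.test(tag);
    (hasAlt ? alted : nonAlted).push(src[1]);
  }
  return { alted, nonAlted };
}

// Example call with made-up markup.
const page = '<ul><li><img src="/a.png" alt="Logo"></li><li><img src="/b.png"></li></ul>';
console.log(JSON.stringify(splitImgTags(page)));
```

Both lists come out of a single pass over the same markup, which matches the goal of not fetching the page twice.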


Source: https://stackoverflow.com/questions/13461474/performing-image-scrapping-using-yql-with-lowest-resources-usage-possible-i-e-l
