How to get plain text between multiple HTML tags using Scrapy

Submitted by 大城市里の小女人 on 2019-12-12 03:19:15

Question


I am trying to grab all the text from multiple tags at a given URL using Scrapy. I am new to Scrapy and don't have much idea how to achieve this; I am learning through examples and other people's experience on Stack Overflow. Here is the list of tags that I am targeting:

<div class="TabsMenu fl coloropa2 fontreg"><p>root div<p>
<a class="sub_h" id="mtongue" href="#">Mother tongue</a>
<a class="sub_h" id="caste" href="#">Caste</a>

<a class="sub_h" id="scases" href="#">My name is nand </a> </div>
<div class="BrowseContent fl">
<figure style="display: block;" class="mtongue_h">
<figcaption>
<div class="fullwidth clearfix pl10">Div string for test</div>
<ul>
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ul>
<div>

<select>
  <option value="volvo">Volvo</option>
  <option value="saab">Saab</option>

</select>

</div>
<li><a title="Hindi UP Matrimony" href="/hindi-up-matrimony-matrimonials"> Hindi-UP </a></li>

The expected output would be:

root div
Mother tongue
Caste
My name is nand
Div string for test
Coffee
Tea
Milk
Volvo
Saab
Hindi-UP

I was trying to get it through XPath. Here is the spider code snippet:

    def parse(self, response):
        for sel in response.xpath('//body'):
            # <li> elements in the description tab that contain text
            lit = sel.xpath('//*[@id="tab_description"]/ul/li[descendant-or-self::text()]').extract()
            print(lit)
            # extract() returns str on Python 3, so no .encode('utf-8') before strip()
            string1 = ''.join(lit).strip('\r\t\n')
            print(string1)
            # Direct text children of each tag type
            para = sel.xpath('//p/text()').extract()
            span = sel.xpath('//span/text()').extract()
            div = sel.xpath('//div/text()').extract()
            strong = sel.xpath('//span/strong/text()').extract()
            link = sel.xpath('//a/text()').extract()
            string2 = ''.join(para).strip('\r\t\n')
            string3 = ''.join(span).strip('\r\t\n')
            string4 = ''.join(div).strip('\r\t\n')
            string5 = ''.join(strong).strip('\r\t\n')
            string6 = ''.join(link).strip('\r\t\n')
            # Joining with '' is what runs the phrases together in the output
            string = string6 + string5 + string4 + string3 + string2
            print(string)

Code snippet for the item:

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
    para = scrapy.Field()
    strong = scrapy.Field()
    span = scrapy.Field()
    div = scrapy.Field()

Here is the output:

BROWSE PROFILES BYMother tongueCasteReligionCityOccupationStateNRISpecial Cases Hindi-Delhi  Marathi  Hindi-UP  Punjabi  Telugu  Bengali  Tamil  Gujarati  Malayalam  Kannada  Hindi-MP  Bihari RajasthaniOriyaKonkaniHimachaliHaryanviAssameseKashmiriSikkim/NepaliHindi Brahmin  Sunni  Kayastha  Rajput  Maratha  Khatri  Aggarwal  Arora  Kshatriya  Shwetamber  Yadav  Sindhi  Bania Scheduled CasteNairLingayatJatCatholic - RomanPatelDigamberSikh-JatGuptaCatholicTeliVishwakarmaBrahmin IyerVaishnavJaiswalGujjarSyrianAdi DravidaArya VysyaBalija NaiduBhandariBillavaAnavilGoswamiBrahmin HavyakaKumaoniMadhwaNagarSmarthaVaidikiViswaBuntChambharChaurasiaChettiarDevangaDhangarEzhavasGoudGowda Brahmin IyengarMarwariJatavKammaKapuKhandayatKoliKoshtiKunbiKurubaKushwahaLeva PatidarLohanaMaheshwariMahisyaMaliMauryaMenonMudaliarMudaliar ArcotMogaveeraNadarNaiduNambiarNepaliPadmashaliPatilPillaiPrajapatiReddySadgopeShimpiSomvanshiSonarSutarSwarnkarThevarThiyyaVaishVaishyaVanniyarVarshneyVeerashaivaVellalarVysyaGursikhRamgarhiaSainiMallahShahDhobi-KalarKambojKashmiri PanditRigvediVokkaligaBhavasar KshatriyaAgnikula Audichya Baidya Baishya Bhumihar Bohra Chamar Chasa Chaudhary Chhetri Dhiman Garhwali Gudia Havyaka Kammavar Karana Khandelwal Knanaya Kumbhar Mahajan Mukkulathor Pareek Sourashtra Tanti Thakur Vanjari Vokkaliga Daivadnya Kashyap Kutchi OBC Hindu  Muslim  Christian  Sikh  Jain  Buddhist  Parsi  Jewish  New Delhi  Mumbai  Bangalore  Pune  Hyderabad  Kolkata  Chennai  Lucknow  Ahmedabad  Chandigarh  Nagpur JaipurGurgaonBhopalNoidaIndorePatnaBhubaneshwarGhaziabadKanpurFaridabadLudhianaThaneAlabamaArizonaArkansasCaliforniaColoradoConnecticutDelawareDistrict ColumbiaFloridaIndianaIowaKansasKentuckyMassachusettsMichiganMinnesotaMississippiNew JerseyNew YorkNorth CarolinaNorth DakotaOhioOklahomaOregonPennsylvaniaSouth CarolinaTennesseeTexasVirginiaWashingtonMangalorean  IT Software  Teacher  CA/Accountant  Businessman  Doctors/Nurse  Govt. 
Services  Lawyers  Defence  IAS  Maharashtra  Uttar Pradesh 

This snippet returns all the text, but everything runs together without spaces. Is it possible to get each phrase on its own line, with spaces between the words? Is there an efficient way to do this with Scrapy? Later I want to save the text to a file. Can someone guide me with a code snippet?


Answer 1:


@paultrmbrth suggested this solution, and it works for me:

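def parse_item(self, response):
    # `text` is the output file path; it is not defined in this snippet
    with open(text, 'wb') as f:
        # Join every text node under <body>, skipping <script> and <style> elements
        f.write("".join(response.xpath('//body//*[not(self::script or self::style)]/text()').extract()).encode('utf-8'))

    item = DmozItem()
    yield item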
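
The XPath above returns a list of text nodes, and joining them with "" is what makes the phrases run together. A minimal sketch of one way to put each phrase on its own line follows. The helper names `clean_text_nodes` and `to_lines` are hypothetical, and `sample_nodes` is made-up data that only imitates what `.extract()` might return for the HTML in the question.

```python
def clean_text_nodes(nodes):
    """Normalize whitespace in each text node and drop empty ones."""
    cleaned = []
    for node in nodes:
        # split() with no argument splits on any whitespace run,
        # so this collapses "\r\t\n" and repeated spaces to one space
        text = " ".join(node.split())
        if text:  # skip nodes that held only whitespace
            cleaned.append(text)
    return cleaned


def to_lines(nodes):
    """Return the cleaned text nodes as one string, one phrase per line."""
    return "\n".join(clean_text_nodes(nodes))


# Assumption: illustrative input, not real spider output
sample_nodes = [
    "root div", "\n", "Mother tongue", "\n", "Caste", "\n\n",
    "My name is nand ", " ", " Hindi-UP ",
]

print(to_lines(sample_nodes))
```

In the spider, the list passed to `to_lines` would be the result of `response.xpath('//body//*[not(self::script or self::style)]/text()').extract()`, and the returned string could then be written to the output file in place of the `"".join(...)` result.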


Source: https://stackoverflow.com/questions/37815366/how-to-get-plain-text-in-between-multiple-html-tag-using-scrapy
