Question
I am trying to grab all the text from multiple tags at a given URL using Scrapy. I am new to Scrapy and don't have much idea how to achieve this; I am learning through examples and other people's experience on Stack Overflow. Here is the list of tags that I am targeting:
<div class="TabsMenu fl coloropa2 fontreg"><p>root div<p>
<a class="sub_h" id="mtongue" href="#">Mother tongue</a>
<a class="sub_h" id="caste" href="#">Caste</a>
<a class="sub_h" id="scases" href="#">My name is nand </a> </div>
<div class="BrowseContent fl">
<figure style="display: block;" class="mtongue_h">
<figcaption>
<div class="fullwidth clearfix pl10">Div string for test</div>
<ul>
<li>Coffee</li>
<li>Tea</li>
<li>Milk</li>
</ul>
<div>
<select>
<option value="volvo">Volvo</option>
<option value="saab">Saab</option>
</select>
</div>
<li><a title="Hindi UP Matrimony" href="/hindi-up-matrimony-matrimonials"> Hindi-UP </a></li>
The expected output would be:
root div
Mother tongue
Caste
My name is nand
Div string for test
Coffee
Tea
Milk
Volvo
Saab
Hindi-UP
I was trying to get it through XPath. Here is the spider code snippet:
def parse(self, response):
    for sel in response.xpath('//body'):
        lit = sel.xpath('//*[@id="tab_description"]/ul/li[descendant-or-self::text()]').extract()
        print lit
        string1 = ''.join(lit).encode('utf-8').strip('\r\t\n')
        print string1
        para = sel.xpath('//p/text()').extract()
        span = sel.xpath('//span/text()').extract()
        div = sel.xpath('//div/text()').extract()
        strong = sel.xpath('//span/strong/text()').extract()
        link = sel.xpath('//a/text()').extract()
        string2 = ''.join(para).encode('utf-8').strip('\r\t\n')
        string3 = ''.join(span).encode('utf-8').strip('\r\t\n')
        string4 = ''.join(div).encode('utf-8').strip('\r\t\n')
        string5 = ''.join(strong).encode('utf-8').strip('\r\t\n')
        string6 = ''.join(link).encode('utf-8').strip('\r\t\n')
        string = string6 + string5 + string4 + string3 + string2
        print string
Code snippet for the items:
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
    para = scrapy.Field()
    strong = scrapy.Field()
    span = scrapy.Field()
    div = scrapy.Field()
Here is the output:
BROWSE PROFILES BYMother tongueCasteReligionCityOccupationStateNRISpecial Cases Hindi-Delhi Marathi Hindi-UP Punjabi Telugu Bengali Tamil Gujarati Malayalam Kannada Hindi-MP Bihari RajasthaniOriyaKonkaniHimachaliHaryanviAssameseKashmiriSikkim/NepaliHindi Brahmin Sunni Kayastha Rajput Maratha Khatri Aggarwal Arora Kshatriya Shwetamber Yadav Sindhi Bania Scheduled CasteNairLingayatJatCatholic - RomanPatelDigamberSikh-JatGuptaCatholicTeliVishwakarmaBrahmin IyerVaishnavJaiswalGujjarSyrianAdi DravidaArya VysyaBalija NaiduBhandariBillavaAnavilGoswamiBrahmin HavyakaKumaoniMadhwaNagarSmarthaVaidikiViswaBuntChambharChaurasiaChettiarDevangaDhangarEzhavasGoudGowda Brahmin IyengarMarwariJatavKammaKapuKhandayatKoliKoshtiKunbiKurubaKushwahaLeva PatidarLohanaMaheshwariMahisyaMaliMauryaMenonMudaliarMudaliar ArcotMogaveeraNadarNaiduNambiarNepaliPadmashaliPatilPillaiPrajapatiReddySadgopeShimpiSomvanshiSonarSutarSwarnkarThevarThiyyaVaishVaishyaVanniyarVarshneyVeerashaivaVellalarVysyaGursikhRamgarhiaSainiMallahShahDhobi-KalarKambojKashmiri PanditRigvediVokkaligaBhavasar KshatriyaAgnikula Audichya Baidya Baishya Bhumihar Bohra Chamar Chasa Chaudhary Chhetri Dhiman Garhwali Gudia Havyaka Kammavar Karana Khandelwal Knanaya Kumbhar Mahajan Mukkulathor Pareek Sourashtra Tanti Thakur Vanjari Vokkaliga Daivadnya Kashyap Kutchi OBC Hindu Muslim Christian Sikh Jain Buddhist Parsi Jewish New Delhi Mumbai Bangalore Pune Hyderabad Kolkata Chennai Lucknow Ahmedabad Chandigarh Nagpur JaipurGurgaonBhopalNoidaIndorePatnaBhubaneshwarGhaziabadKanpurFaridabadLudhianaThaneAlabamaArizonaArkansasCaliforniaColoradoConnecticutDelawareDistrict ColumbiaFloridaIndianaIowaKansasKentuckyMassachusettsMichiganMinnesotaMississippiNew JerseyNew YorkNorth CarolinaNorth DakotaOhioOklahomaOregonPennsylvaniaSouth CarolinaTennesseeTexasVirginiaWashingtonMangalorean IT Software Teacher CA/Accountant Businessman Doctors/Nurse Govt. Services Lawyers Defence IAS Maharashtra Uttar Pradesh
This code snippet returns all the text strings, but they come out run together without spaces. Is it possible to get each phrase on a new line, with spaces between the words? Is there an efficient way to do this with Scrapy? Later I want to save the text to a file. Can someone guide me with a code snippet?
Answer 1:
@paultrmbrth suggested this solution, and it works for me:
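def parse_item(self, response):
    # `text` is assumed to hold the output file path, defined elsewhere.
    with open(text, 'wb') as f:
        # Select the text nodes of every element under <body> except <script> and <style>.
        f.write("".join(response.xpath('//body//*[not(self::script or self::style)]/text()').extract()).encode('utf-8'))
    item = DmozItem()
    yield item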
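The XPath above returns one string per text node, so joining the list with an empty string is what makes the phrases run together. A minimal sketch of one way to get each phrase on its own line follows. It is not part of the original answer: the helper name text_nodes_to_lines and the sample list are hypothetical, and the list only imitates what .extract() might return for the HTML in the question.

```python
def text_nodes_to_lines(nodes):
    """Turn a list of raw text nodes into newline-separated phrases.

    `nodes` is assumed to be a list of strings, such as the list
    returned by calling .extract() on a '.../text()' XPath query.
    """
    lines = []
    for node in nodes:
        # Collapse runs of whitespace (spaces, \r, \n, \t) into single
        # spaces and trim both ends of the phrase.
        cleaned = " ".join(node.split())
        # Skip nodes that held nothing but whitespace.
        if cleaned:
            lines.append(cleaned)
    return "\n".join(lines)


# Hypothetical sample imitating extracted text nodes from the question's HTML.
sample_nodes = [
    "root div",
    "\n",
    "Mother tongue",
    "Caste",
    "My name is nand ",
    "\r\n\t",
    " Hindi-UP ",
]

result = text_nodes_to_lines(sample_nodes)
print(result)
```

Inside the spider callback, the same helper could be given the list from response.xpath('//body//*[not(self::script or self::style)]/text()').extract(), and the returned string written to a file opened with io.open(path, 'w', encoding='utf-8'), where path is whatever output file name you choose.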
Source: https://stackoverflow.com/questions/37815366/how-to-get-plain-text-in-between-multiple-html-tag-using-scrapy