Load web page in Python AFTER JavaScript executes

Submitted by 浪尽此生 on 2019-12-23 05:29:29

Question


I am trying to get the definition of words in Spanish (like a dictionary) based on what the user inputs. The idea would be:

>>> hola
'1. interj. U. como salutación familiar.'

I first tried with urllib2, but since the definition only appears after the JavaScript has executed (makes sense, duh) it didn't work. I also tried selenium, but from what I understood it has to open a browser window, right? I need it to be like urllib2, invisible.

If you want to try, the page where I search the definition is http://lema.rae.es/drae/?val=word where word is the word the user inputs.

Any thoughts, anyone?


Answer 1:


I might do it like alecxe suggested, but I'd use the URL that loads the definition itself. For instance, searching for azul:

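from selenium import webdriver

# PhantomJS is a headless browser, so no window is opened.
driver = webdriver.PhantomJS()
# Load the definition's own URL rather than the page that wraps it in an iframe.
driver.get('http://lema.rae.es/drae/srv/search?val=azul')
# The whole entry is contained in the first div child of body.
print(driver.find_element_by_css_selector("body>div").text)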

The URL that appears in the question loads a page that then loads the definition's URL in an iframe element. Loading the definition directly with the URL I show above saves some work and some complexity: the entire definition is contained in the first div child of body. Unfortunately, it does not remove the need for JavaScript.

Running the code above produces:

azul.
(Quizá alterac. del ár. hisp. lazawárd, este del ár. lāzaward, este del persa laǧvard o lažvard, y este del sánscr. rājāvarta, rizo del rey).
1. adj. Del color del cielo sin nubes. Es el quinto color del espectro solar. U. t. c. s.
2. m. El cielo, el espacio. U. m. en leng. poét.
3. m. Méx. Miembro del cuerpo de Policía.
~ de cobalto.
[... etc ...]
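
The question asks for a single definition line rather than the whole entry. A minimal sketch of pulling the first numbered sense out of text like the output above, assuming each sense sits on its own line and starts with a number followed by a period. The helper name `first_sense` is hypothetical (not part of Selenium or any library), and the code is written for Python 3:

```python
import re


def first_sense(entry_text):
    """Return the first numbered sense in a dictionary entry, or None."""
    for line in entry_text.splitlines():
        line = line.strip()
        # Assumption: senses look like "1. adj. ..." with one sense per line.
        if re.match(r"\d+\.\s", line):
            return line
    return None


# A few lines copied from the output above, used as sample input.
sample = (
    "azul.\n"
    "1. adj. Del color del cielo sin nubes. Es el quinto color del espectro solar. U. t. c. s.\n"
    "2. m. El cielo, el espacio. U. m. en leng. poét.\n"
)

print(first_sense(sample))
```

The headword line and the etymology line are skipped because neither starts with a digit; if the site formats senses differently for other words, the pattern would need adjusting.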

Note that I have not found any need for a wait mechanism to detect that the content of the page is ready. Looking at the page in a debugger, a) I did not see any Ajax request and b) judging from the JavaScript and the page itself, it looks like what is served is an obfuscated page that the JavaScript then deobfuscates synchronously. So by the time driver.get returns, the content should be ready to be used.
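
If that assumption ever stops holding (say, the site starts loading the definition asynchronously), an explicit wait would be the usual fallback. Selenium provides `WebDriverWait` for this; the sketch below is a library-independent stand-in that only illustrates the polling idea. The names `wait_for_text`, `timeout` and `poll` are illustrative choices, not Selenium's API, and the code is written for Python 3:

```python
import time


def wait_for_text(get_text, timeout=10.0, poll=0.5):
    """Call get_text() until it returns non-empty text or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        text = get_text()
        if text:
            return text
        if time.monotonic() >= deadline:
            raise TimeoutError("no content after %.1f seconds" % timeout)
        # Pause briefly before asking the page again.
        time.sleep(poll)
```

With the driver from the snippet above, a call might look like `wait_for_text(lambda: driver.find_element_by_css_selector("body>div").text)`. A lookup that raises while the element is still missing would need its own try/except inside the callable.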




Answer 2:


You can automate a headless PhantomJS browser through selenium:

>>> from selenium import webdriver
>>>
>>> driver = webdriver.PhantomJS()
>>> driver.get('http://lema.rae.es/drae/?val=word')
>>>
>>> description = driver.find_element_by_css_selector('div.field-content p.azul')
>>> print(description.text)
El Diccionario de la lengua española (DRAE) es la obra de referencia de la Academia. La última edición es la 23.ª, publicada en octubre de 2014. Mientras se trabaja en la edición digital, que estará disponible próximamente, esta versión electrónica permite acceder al contenido de la 22.ª edición y las enmiendas incorporadas hasta 2012.
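
The session above requests the literal placeholder `word` from the question, which may be why it prints the dictionary's general description rather than a definition. A small sketch of substituting the user's input instead, assuming the site accepts UTF-8 percent-encoded query values (not verified here). `lookup_url` and `BASE_URL` are hypothetical names, and the code is written for Python 3:

```python
from urllib.parse import urlencode

BASE_URL = "http://lema.rae.es/drae/"  # page URL given in the question


def lookup_url(word):
    """Build the lookup URL for a word typed by the user."""
    # urlencode percent-encodes accented characters as UTF-8 by default.
    return BASE_URL + "?" + urlencode({"val": word.strip()})


print(lookup_url("hola"))
```

The resulting string could then be passed to `driver.get` in place of the hard-coded URL.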


Source: https://stackoverflow.com/questions/29145054/load-web-page-in-python-after-javascripts-executes
