Question
I am trying to get the definition of words in Spanish (like a dictionary) based on what the user inputs. The idea would be:
>>> hola
'1. interj. U. como salutación familiar.'
I first tried urllib2, but since the definition only appears after the page's JavaScript runs, it didn't work. I also tried selenium, but from what I understood it has to open a browser window, right? I need it to be invisible, like urllib2.
If you want to try, the page where I search the definition is http://lema.rae.es/drae/?val=word where word is the word the user inputs.
Any thoughts, anyone?
Answer 1:
I might do it as alecxe suggested, but I'd use the URL that loads the definition itself. For instance, searching for azul:
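from selenium import webdriver

# PhantomJS is headless, so no browser window opens.
driver = webdriver.PhantomJS()
driver.get('http://lema.rae.es/drae/srv/search?val=azul')
# The whole definition sits in the first div child of body.
print(driver.find_element_by_css_selector("body>div").text)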
The URL that appears in the question loads a page that then loads the definition's URL in an iframe element. Loading the definition directly with the URL I show above saves some work and some complexity: the entire definition is contained in the first div child of body. Unfortunately, it does not remove the need for JavaScript.
Running the code above produces:
azul.
(Quizá alterac. del ár. hisp. lazawárd, este del ár. lāzaward, este del persa laǧvard o lažvard, y este del sánscr. rājāvarta, rizo del rey).
1. adj. Del color del cielo sin nubes. Es el quinto color del espectro solar. U. t. c. s.
2. m. El cielo, el espacio. U. m. en leng. poét.
3. m. Méx. Miembro del cuerpo de Policía.
~ de cobalto.
[... etc ...]
Note that I did not find any need for a wait mechanism to detect that the page content is ready. Looking at the page in a debugger, (a) I did not see any Ajax request, and (b) judging from the JavaScript and the page itself, what is served appears to be an obfuscated page that the JavaScript then deobfuscates synchronously. So by the time driver.get returns, the content should be ready to be used.
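To connect this back to the question's goal of looking up whatever word the user types, here is a minimal sketch that wraps the direct URL in two small helpers. The names SEARCH_URL, definition_url and fetch_definition are made up for illustration, the code is written for Python 3, and it assumes the site accepts UTF-8 percent-encoded words (not verified here). It takes an already-created WebDriver as an argument, so it does not depend on which headless browser is used.

```python
from urllib.parse import quote

# Direct definition URL taken from the answer above.
SEARCH_URL = 'http://lema.rae.es/drae/srv/search?val='


def definition_url(word):
    """Build the direct definition URL for a user-supplied word (hypothetical helper)."""
    # Percent-encode the word so accented letters and n-tilde survive in the query string.
    # Assumption: the site expects UTF-8 percent-encoding.
    return SEARCH_URL + quote(word.strip())


def fetch_definition(driver, word):
    """Load the definition page with the given WebDriver and return its text."""
    driver.get(definition_url(word))
    # Same selector as in the answer: the first div child of body.
    # 'css selector' is the locator-strategy string passed to the generic find_element call.
    return driver.find_element('css selector', 'body>div').text
```

Usage would look something like fetch_definition(driver, 'azul'), with driver created as in the snippet earlier in this answer. Keeping the URL construction separate makes it easy to check the encoding of words such as "año" without starting a browser.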
Answer 2:
You can automate a headless PhantomJS browser through selenium:
>>> from selenium import webdriver
>>>
>>> driver = webdriver.PhantomJS()
>>> driver.get('http://lema.rae.es/drae/?val=word')
>>>
>>> description = driver.find_element_by_css_selector('div.field-content p.azul')
>>> print(description.text)
El Diccionario de la lengua española (DRAE) es la obra de referencia de la Academia. La última edición es la 23.ª, publicada en octubre de 2014. Mientras se trabaja en la edición digital, que estará disponible próximamente, esta versión electrónica permite acceder al contenido de la 22.ª edición y las enmiendas incorporadas hasta 2012.
Source: https://stackoverflow.com/questions/29145054/load-web-page-in-python-after-javascripts-executes