Is there a way to extract IMDb reviews using IMDbPY?

Submitted by 十年热恋 on 2021-01-01 07:21:07

Question


I do not need the dataset; that is available on Kaggle. I want to extract movie reviews from IMDb using IMDbPY or any other scraping method.

https://imdbpy.github.io/


Answer 1:


Although it is not obvious from the IMDbPY docs, you can always inspect what a movie object holds by checking its keys. Not all of the information you are looking for is available immediately when you fetch a movie with IMDbPY. In your case you want the reviews, so you have to add them. The list of information sets shows three different kinds of reviews: 'reviews', 'external reviews' and 'critic reviews'. The keys associated with them are not present until you request them. The example below shows how it is done.

from imdb import IMDb

# create an instance of the IMDb class
ia = IMDb()

the_matrix = ia.get_movie('0133093')
print(sorted(the_matrix.keys()))

# show all information sets that can be fetched for a movie
print(ia.get_movie_infoset())

# fetch the three review-related information sets; each adds its own keys
ia.update(the_matrix, ['external reviews'])
ia.update(the_matrix, ['reviews'])
ia.update(the_matrix, ['critic reviews'])

# show which keys were added by each information set
print(the_matrix.infoset2keys['external reviews'])  # no external reviews here, so no key is added
print(the_matrix.infoset2keys['reviews'])  # many reviews; adds the key 'reviews'
print(the_matrix.infoset2keys['critic reviews'])  # adds the keys 'metascore' and 'metacritic url'
# print(the_matrix['reviews'])
print(sorted(the_matrix.keys()))  # check the new keys that were added
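
To see exactly which keys an update contributed, one option is to snapshot the keys before the call and compare them with the keys afterwards. The sketch below is only an illustration: `new_keys` is a hypothetical helper (not part of IMDbPY), and a plain dictionary stands in for the movie object so that it runs without network access.

```python
def new_keys(before, after):
    """Return the keys present in `after` but not in `before`, sorted."""
    return sorted(set(after) - set(before))


# Stand-in for a movie object: a plain dict, NOT a real IMDbPY Movie.
movie = {"title": "The Matrix", "year": 1999}
keys_before = list(movie.keys())

# Simulates the effect of ia.update(movie, ['reviews']): a new key appears.
movie["reviews"] = [{"author": "ur0540275", "rating": 1}]

print(new_keys(keys_before, movie.keys()))  # ['reviews']
```

With a real movie object, the same idea would be to store `list(the_matrix.keys())` before calling `ia.update(...)` and pass `the_matrix.keys()` as the second argument afterwards.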



Answer 2:


Yes, you can extract reviews using IMDbPY.

# install the IMDbPY library (included for completeness)
pip install imdbpy

The key thing to understand about IMDbPY is that it retrieves data from IMDb for objects such as movies, people and companies using the methods get_movie, get_person and get_company respectively. Because there is a lot of information available, retrieving everything at once would waste time and bandwidth. The data is therefore grouped into smaller parts known as "information sets".

Code to retrieve the movie "The Matrix (1999)".
(Note: "0133093" is the IMDb title ID without the leading 'tt'; for example, https://www.imdb.com/title/tt0133093/)

from imdb import IMDb
ia = IMDb()
theMatrix = ia.get_movie('0133093')

By default, a movie object has the information sets 'main', 'plot' and 'synopsis'; you can check this with .current_info. As the output shows, the 'reviews' information set is not retrieved by default.

theMatrix.current_info

#output:
['main', 'plot', 'synopsis']

If you know which information sets you want, you can pass them through the optional "info" argument of the get_movie method. In this case, that is 'reviews'.

theMatrix = ia.get_movie('0133093',['reviews'])
theMatrix.current_info

#output:
['reviews']

theMatrix['reviews']

#output:
[{'author': 'ur0540275',
  'content': "The story of a reluctant Christ-like protagonist...",
  'date': '19 September 2000',
  'helpful': 0,
  'not_helpful': 0,
  'rating': 1,
  'title': ''},
 {'author': 'ur15794099',
  'content': '** May contain spoilers **There aren\'t many movies...',
  'date': '26 July 2014',
...
...
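
Once the reviews are available as a list of dictionaries, ordinary Python is enough to work with them. The following is a minimal sketch that assumes each review has the keys shown in the output above; the second sample review is made up for illustration, and treating a missing rating as `None` is an assumption rather than documented behaviour. `parse_review_date` and `summarize` are hypothetical helper names.

```python
from datetime import datetime

# First entry mirrors the output above; the second is invented sample data.
sample_reviews = [
    {"author": "ur0540275",
     "content": "The story of a reluctant Christ-like protagonist...",
     "date": "19 September 2000", "helpful": 0, "not_helpful": 0,
     "rating": 1, "title": ""},
    {"author": "ur0000000",  # hypothetical author ID
     "content": "Hypothetical review text.",
     "date": "1 January 2020", "helpful": 3, "not_helpful": 1,
     "rating": 8, "title": "Hypothetical title"},
]


def parse_review_date(text):
    """Parse a date such as '19 September 2000' into a datetime.date.

    Relies on English month names, i.e. the default C/English locale.
    """
    return datetime.strptime(text, "%d %B %Y").date()


def summarize(reviews):
    """Return (number of reviews, mean rating over reviews that have one)."""
    ratings = [r["rating"] for r in reviews if r.get("rating") is not None]
    mean = sum(ratings) / len(ratings) if ratings else None
    return len(reviews), mean


print(parse_review_date(sample_reviews[0]["date"]))  # 2000-09-19
print(summarize(sample_reviews))                     # (2, 4.5)
```

With live data, `sample_reviews` would be replaced by `theMatrix['reviews']`.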

If you have already retrieved a movie object and want to add a few more information sets without fetching the whole object again, use the update method.

theMatrix = ia.get_movie('0133093')
theMatrix.current_info

#output
['main', 'plot', 'synopsis']

ia.update(theMatrix,['reviews'])
theMatrix.current_info

#output
['main', 'plot', 'synopsis', 'reviews']
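
The update pattern above can be wrapped in a small helper that requests only the information sets a movie object does not have yet and reports which ones it asked for. This is a sketch under stated assumptions: `fetch_missing_info` is a hypothetical helper, and `FakeMovie` and `FakeIMDb` are offline stand-ins that mimic only `current_info` and `update`, not the real IMDbPY classes.

```python
def fetch_missing_info(ia, movie, wanted):
    """Fetch only the information sets in `wanted` that `movie` lacks.

    Returns the list of information sets that were actually requested.
    """
    missing = [name for name in wanted if name not in movie.current_info]
    if missing:
        ia.update(movie, missing)
    return missing


# --- Offline stand-ins so the sketch runs without network access ---
class FakeMovie:
    def __init__(self):
        self.current_info = ["main", "plot", "synopsis"]


class FakeIMDb:
    def update(self, movie, info):
        # Mimics only the bookkeeping side of a real update.
        movie.current_info.extend(info)


fake_ia = FakeIMDb()
fake_movie = FakeMovie()
print(fetch_missing_info(fake_ia, fake_movie, ["plot", "reviews"]))  # ['reviews']
print(fake_movie.current_info)  # ['main', 'plot', 'synopsis', 'reviews']
```

With the real library, the call would look like `fetch_missing_info(ia, theMatrix, ['reviews'])`.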

Both approaches work not only for "reviews" but for any other information set you want to retrieve. You do, however, need to know which information sets each object type (movie, person or company) supports. For this, use the ia.get_movie_infoset, ia.get_person_infoset or ia.get_company_infoset methods respectively.

sorted(ia.get_movie_infoset())

#output:
['airing',
 'akas',
 'alternate versions',
 'awards',
 'connections',
 'crazy credits',
 'critic reviews',
 'episodes',
 'external reviews',
 ...
 ...
 'release dates',
 'release info',
 'reviews',
 'sound clips',
 'soundtrack',
 'synopsis',
 'taglines',
 'technical',
 'trivia',
 'tv schedule',
 'video clips',
 'vote details']
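
Because information set names are plain strings, a typo is easy to make. One possible safeguard is to check the requested names against the available ones before fetching anything. The sketch below uses a hard-coded subset of the list above so that it runs offline; `split_valid_infosets` is a hypothetical helper, and 'user reviews' is a deliberately invalid name chosen for illustration.

```python
def split_valid_infosets(requested, available):
    """Split `requested` into (valid, unknown) names, preserving order."""
    available = set(available)
    valid = [name for name in requested if name in available]
    unknown = [name for name in requested if name not in available]
    return valid, unknown


# A few names taken from the list above; with a live connection this could be
# the result of ia.get_movie_infoset() instead.
available = ["critic reviews", "external reviews", "reviews", "synopsis", "trivia"]

valid, unknown = split_valid_infosets(["reviews", "user reviews"], available)
print(valid)    # ['reviews']
print(unknown)  # ['user reviews']
```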

With that background on how IMDbPY works, here is the one-liner for getting the movie reviews :)

ia.get_movie_reviews('0133093')

#output:
[{'author': 'ur0540275',
  'content': "The story of a reluctant Christ-like protagonist...",
  'date': '19 September 2000',
  'helpful': 0,
  'not_helpful': 0,
  'rating': 1,
  'title': ''},
 {'author': 'ur15794099',
  'content': '** May contain spoilers **There aren\'t many movies...',
  'date': '26 July 2014',
...
...
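
To keep the reviews for later use, they can be written out as CSV. The sketch below is defensive about the shape of the result: it accepts a bare list as shown above, and it also accepts a dictionary that wraps the list, which is an assumption about what some versions might return rather than something stated here. `as_review_list` and `reviews_to_csv` are hypothetical helper names.

```python
import csv
import io


def as_review_list(result):
    """Return a list of review dicts from either a bare list or a wrapped dict."""
    if isinstance(result, dict):
        # Assumed wrapper shapes: {'data': {'reviews': [...]}} or {'reviews': [...]}.
        inner = result.get("data", result)
        if isinstance(inner, dict):
            return list(inner.get("reviews", []))
        return list(inner)
    return list(result)


def reviews_to_csv(reviews, fields=("author", "date", "rating", "title", "content")):
    """Serialize reviews to CSV text, ignoring keys not listed in `fields`."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(reviews)
    return buffer.getvalue()


# Offline example using one review in the shape shown above.
example = [{"author": "ur0540275", "date": "19 September 2000", "rating": 1,
            "title": "", "content": "The story of a reluctant Christ-like protagonist...",
            "helpful": 0, "not_helpful": 0}]
print(reviews_to_csv(as_review_list(example)))
```

With live data, `example` would be replaced by the value returned from `ia.get_movie_reviews('0133093')`, and the CSV text could be written to a file with an ordinary `open(..., 'w', newline='')` call.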


Source: https://stackoverflow.com/questions/59969327/is-there-a-way-to-extract-imdb-reviews-using-imdbpy
