How to scrape a table and its links

蓝咒 提交于 2021-01-28 12:00:32

问题


What I want to do is to take thw following website

  • https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html
  • view-source:https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html

And pick the year of execution, enter the Last Statement Link, and retrieve the statement... perhaps I would be creating 2 dictionaries, both with the execution number as key.

Afterwards, I would classify the statements by length, besides "flagging" the refusals to give it or if it was just not given.

Finally, all would be compiled in a SQLite database, and I would display a graph that shows how many messages, clustered by type, have been given each year.

Beautiful Soup seems to be the path to follow, I'm already having troubles with just printing the year of execution... Of course, I'm not ultimately interested in printing the years of execution, but it seems like a good way of checking if at least my code is properly locating the tags I want.

tags = soup('td')
for tag in tags:
    print(tag.get('href', None))

Why does the previous code only print None?

Thanks beforehand.


回答1:


Use pandas to get and manipulate the table. The links are static and by that I mean they can be easily recreated with offender's first and last name.

Then, you can use requests and BeautifulSoup to scrape for offender's last statement, which are quite moving.

Here's how:

import requests
import pandas as pd

def clean(first_and_last_name: list) -> str:
    name = "".join(first_and_last_name).replace(" ", "").lower()
    return name.replace(", Jr.", "").replace(", Sr.", "").replace("'", "")


base_url = "https://www.tdcj.texas.gov/death_row"
response = requests.get(f"{base_url}/dr_executed_offenders.html")

df = pd.read_html(response.text, flavor="bs4")
df = pd.concat(df)
df.rename(columns={'Link': "Offender Information", "Link.1": "Last Statement URL"}, inplace=True)

df["Offender Information"] = df[
    ["Last Name", 'First Name']
].apply(lambda x: f"{base_url}/dr_info/{clean(x)}.html", axis=1)

df["Last Statement URL"] = df[
    ["Last Name", 'First Name']
].apply(lambda x: f"{base_url}/dr_info/{clean(x)}last.html", axis=1)

df.to_csv("offenders.csv", index=False)

This gets you:

EDIT:

I actually went ahead and added the code that fetches all offenders' last statements.

import random
import time

import pandas as pd
import requests
from lxml import html

base_url = "https://www.tdcj.texas.gov/death_row"
response = requests.get(f"{base_url}/dr_executed_offenders.html")
statement_xpath = '//*[@id="content_right"]/p[6]/text()'


def clean(first_and_last_name: list) -> str:
    name = "".join(first_and_last_name).replace(" ", "").lower()
    return name.replace(", Jr.", "").replace(", Sr.", "").replace("'", "")


def get_last_statement(statement_url: str) -> str:
    page = requests.get(statement_url).text
    statement = html.fromstring(page).xpath(statement_xpath)
    text = next(iter(statement), "")
    return " ".join(text.split())


df = pd.read_html(response.text, flavor="bs4")
df = pd.concat(df)

df.rename(
    columns={'Link': "Offender Information", "Link.1": "Last Statement URL"},
    inplace=True,
)

df["Offender Information"] = df[
    ["Last Name", 'First Name']
].apply(lambda x: f"{base_url}/dr_info/{clean(x)}.html", axis=1)

df["Last Statement URL"] = df[
    ["Last Name", 'First Name']
].apply(lambda x: f"{base_url}/dr_info/{clean(x)}last.html", axis=1)

offender_data = list(
    zip(
        df["First Name"],
        df["Last Name"],
        df["Last Statement URL"],
    )
)

statements = []
for item in offender_data:
    *names, url = item
    print(f"Fetching statement for {' '.join(names)}...")
    statements.append(get_last_statement(statement_url=url))
    time.sleep(random.randint(1, 4))

df["Last Statement"] = statements
df.to_csv("offenders_data.csv", index=False)

This will take a couple of minutes because the code "sleeps" for anywhere between 1 to 4 seconds, more or less, so the server doesn't get abused.

Once this gets done, you'll end up with a .csv file with all offenders' data and their statements, if there was one.



来源:https://stackoverflow.com/questions/64868937/how-to-scrape-a-table-and-its-links

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!