Creating a dataframe with text from a website

核能气质少年 submitted on 2021-02-11 06:38:10

Question


I've been asked to create a data frame in R using information copied from a website; the data is not contained in a file. The full data list is at:

https://www.npr.org/2012/12/07/166400760/hollywood-heights-the-ups-downs-and-in-betweens

Here is a portion of the data:

Leading Men (Average American male: 5 feet 9.5 inches)

Dolph Lundgren — 6 feet 5 inches
John Cleese — 6 feet 5 inches
Michael Clarke Duncan — 6 feet 5 inches
Vince Vaughn — 6 feet 5 inches
Clint Eastwood — 6 feet 4 inches
Jimmy Stewart — 6 feet 3 inches
Bill Murray — 6 feet 1.5 inches

Leading Ladies (Average American female: 5 feet 4 inches)

Uma Thurman — 6 feet 0 inches
Brooke Shields — 6 feet 0 inches
Jane Lynch — 6 feet 0 inches

I am supposed to use R to create the data frame, where one column is Name, the second is Height (in cm), and the third column is Gender.

I have copied and pasted all data into Notepad, manually made three different columns, and converted height to cm by hand. But this is manually creating the data frame.

Is there a way to make a data frame in R using the data as given?


Answer 1:


You can copy that whole list and then use read.delim to read the text on your clipboard into R. Then, using a regex, you can extract the gender from the header of each section, fill it down to the rows below, and separate the first column into name and height. See below:

web.lines <- read.delim("clipboard", header = FALSE) # reading data from the clipboard (on macOS, use pipe("pbpaste") instead of "clipboard")

library(tidyverse)

web.lines %>% 
  mutate(gender = str_extract(V1, "Leading\\s+\\b(\\w+)\\b")) %>% # extracting gender from headers
  fill(gender , .direction = "down") %>% # filling the gender for all rows
  group_by(gender) %>% 
  slice(-1) %>% # removing the headers
  separate(V1, into = c("Name", "Height"), sep = " — ") # separating name and height


#> # A tibble: 59 x 3
#> # Groups:   gender [2]
#>    Name                  Height             gender        
#>    <chr>                 <chr>              <chr>         
#> 1  Uma Thurman           6 feet 0 inches    Leading Ladies
#> 2  Brooke Shields        6 feet 0 inches    Leading Ladies
#> 3  Jane Lynch            6 feet 0 inches    Leading Ladies
#> 4  Nicole Kidman         5 feet 11 inches   Leading Ladies
#> 5  Tilda Swinton         5 feet 10.5 inches Leading Ladies
#> ...
#> 28 Dolph Lundgren        6 feet 5 inches    Leading Men   
#> 29 John Cleese           6 feet 5 inches    Leading Men   
#> 30 Michael Clarke Duncan 6 feet 5 inches    Leading Men   
#> 31 Vince Vaughn          6 feet 5 inches    Leading Men   
#> 32 Clint Eastwood        6 feet 4 inches    Leading Men  
#> ...
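
The question also asks for Height in cm and a Gender column, while the pipeline above leaves Height as text and keeps the section header as the gender label. A minimal base-R sketch of those remaining steps follows. It assumes every entry has the form "6 feet 5 inches", that section headers start with "Leading", and that "Leading Men" maps to "Male" and any other header to "Female"; the names raw, heights, and the label values are illustrative and not part of the original answer. A few lines from the question stand in for the clipboard contents.

```r
# Sample lines copied from the question; the full list is on the NPR page.
raw <- c(
  "Leading Men (Average American male: 5 feet 9.5 inches)",
  "Dolph Lundgren — 6 feet 5 inches",
  "Bill Murray — 6 feet 1.5 inches",
  "Leading Ladies (Average American female: 5 feet 4 inches)",
  "Uma Thurman — 6 feet 0 inches"
)

# Header lines start with "Leading"; each one opens a new section.
is_header <- grepl("^Leading", raw)
section   <- cumsum(is_header)  # section index for every line

# Assumed mapping from header text to a gender label.
labels <- ifelse(grepl("^Leading Men", raw[is_header]), "Male", "Female")

# Split the remaining lines into name and height text.
body  <- raw[!is_header]
parts <- strsplit(body, " — ", fixed = TRUE)
height_txt <- vapply(parts, `[`, character(1), 2)

# Pull feet and inches out of strings such as "6 feet 1.5 inches".
feet   <- as.numeric(sub("^([0-9]+) feet.*$", "\\1", height_txt))
inches <- as.numeric(sub("^.* feet ([0-9.]+) inches$", "\\1", height_txt))

heights <- data.frame(
  Name   = vapply(parts, `[`, character(1), 1),
  Height = round((feet * 12 + inches) * 2.54, 1),  # 1 inch = 2.54 cm
  Gender = labels[section[!is_header]],
  stringsAsFactors = FALSE
)

print(heights)
```

With the real data, raw could be filled from the copied text (for example with readLines on a file the list was pasted into); entries that do not match the assumed height pattern would come out as NA and need separate handling.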


Source: https://stackoverflow.com/questions/64376566/creating-a-dataframe-with-text-from-a-website
