Parsing of PMCID table row to column form

一个人想着一个人 提交于 2021-02-10 14:17:12

问题


dput(t1)
structure(list(PMCID = c("PMC7809753", "PMC7809753", "PMC7809753", 
"PMC7809753", "PMC7809753", "PMC7790830", "PMC7790830", "PMC7790830", 
"PMC7790830", "PMC7790830"), table = c("Table 1", "Table 1", 
"Table 1", "Table 1", "Table 1", "Table 1", "Table 1", "Table 1", 
"Table 1", "Table 1"), row = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 
4L, 5L), text = c("Drug=Cytarabine (Ara-C); Target=DNA polymerases; Influx=ENT1, CNT3, OCTN1; Metabolisma=Activation: dCK, dCMPK, NDK. Inactivation: CDA, dCMPD, PN-I.; Efflux=MRP4,7,8; Refs.=[14, 30–33, 78–80]", 
"Drug=Daunorubicin (DNR); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1,7, BCRP; Refs.=[44, 51, 81–84]", 
"Drug=Mitoxantrone (MX); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1, BCRP; Refs.=[44, 85–90]", 
"Drug=Etoposide (VP-16); Target=Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1-3,6, BCRP; Refs.=[16, 91, 92]", 
"Drug=Methotrexate (MTX); Target=DHFR, TS, AICARFT; Influx=RFC, PCFT; Metabolisma=Aldehyde oxidase, FPGS (polyglutamylation); Efflux=P-gp, MRP1-5, BCRP; Refs.=[16, 93, 94]", 
"Patients no.=1; Age (years)=45; Gender=M; FAB subtype=M2; Cell count(×109/l): WBC=30.1; Cell count(×109/l): HB=87; Cell count(×109/l): PLT=9; BM Blast (%)=70.5; Karyotype=46,XX,t(8,21)(q22;q22)", 
"Patients no.=2; Age (years)=41; Gender=F; FAB subtype=M5; Cell count(×109/l): WBC=14.58; Cell count(×109/l): HB=103; Cell count(×109/l): PLT=62; BM Blast (%)=60.4; Karyotype=46,XX", 
"Patients no.=3; Age (years)=49; Gender=M; FAB subtype=M4; Cell count(×109/l): WBC=4.84; Cell count(×109/l): HB=69; Cell count(×109/l): PLT=100; BM Blast (%)=88; Karyotype=45,XY,-7", 
"Patients no.=4; Age (years)=65; Gender=M; FAB subtype=M5; Cell count(×109/l): WBC=220; Cell count(×109/l): HB=85; Cell count(×109/l): PLT=52; BM Blast (%)=86.8; Karyotype=46,XY", 
"Patients no.=5; Age (years)=61; Gender=F; FAB subtype=M5; Cell count(×109/l): WBC=4.61; Cell count(×109/l): HB=71; Cell count(×109/l): PLT=197; BM Blast (%)=32.4; Karyotype=46,XX"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))

The above one is my sample data frame which looks like this

head(t1)
# A tibble: 6 x 4
  PMCID      table    row text                                                                                                                
  <chr>      <chr>  <int> <chr>                                                                                                               
1 PMC7809753 Table…     1 Drug=Cytarabine (Ara-C); Target=DNA polymerases; Influx=ENT1, CNT3, OCTN1; Metabolisma=Activation: dCK, dCMPK, NDK.…
2 PMC7809753 Table…     2 Drug=Daunorubicin (DNR); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1,7, BCRP; Refs.=[…
3 PMC7809753 Table…     3 Drug=Mitoxantrone (MX); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1, BCRP; Refs.=[44,…
4 PMC7809753 Table…     4 Drug=Etoposide (VP-16); Target=Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1-3,6, BCRP; Refs.=[16, …
5 PMC7809753 Table…     5 Drug=Methotrexate (MTX); Target=DHFR, TS, AICARFT; Influx=RFC, PCFT; Metabolisma=Aldehyde oxidase, FPGS (polyglutam…
6 PMC7790830 Table…     1 Patients no.=1; Age (years)=45; Gender=M; FAB subtype=M2; Cell count(×109/l): WBC=30.1; Cell count(×109/l): HB=87; …

For example this paper PMC7809753 paper whose output is above. In paper the First table is "Properties of the chemotherapeutic drugs used in AML" looks like this. In my data frame the Table 1 of PMC7809753 ID is repeated 5 times which corresponds to the above pic i have attached.

Now the The issue is how do i parse each table of particular PMCID into a tabular or column like structure as shown in the paper.

UPDATE Based on my PMCID I can split each of the row into a list.

aa <- split(t1, f = t1$PMCID) 

which gives me this

$PMC7790830
# A tibble: 5 x 4
  PMCID      table    row text                                                                                                                
  <chr>      <chr>  <int> <chr>                                                                                                               
1 PMC7790830 Table…     1 Patients no.=1; Age (years)=45; Gender=M; FAB subtype=M2; Cell count(×109/l): WBC=30.1; Cell count(×109/l): HB=87; …
2 PMC7790830 Table…     2 Patients no.=2; Age (years)=41; Gender=F; FAB subtype=M5; Cell count(×109/l): WBC=14.58; Cell count(×109/l): HB=103…
3 PMC7790830 Table…     3 Patients no.=3; Age (years)=49; Gender=M; FAB subtype=M4; Cell count(×109/l): WBC=4.84; Cell count(×109/l): HB=69; …
4 PMC7790830 Table…     4 Patients no.=4; Age (years)=65; Gender=M; FAB subtype=M5; Cell count(×109/l): WBC=220; Cell count(×109/l): HB=85; C…
5 PMC7790830 Table…     5 Patients no.=5; Age (years)=61; Gender=F; FAB subtype=M5; Cell count(×109/l): WBC=4.61; Cell count(×109/l): HB=71; …

$PMC7809753
# A tibble: 5 x 4
  PMCID      table    row text                                                                                                                
  <chr>      <chr>  <int> <chr>                                                                                                               
1 PMC7809753 Table…     1 Drug=Cytarabine (Ara-C); Target=DNA polymerases; Influx=ENT1, CNT3, OCTN1; Metabolisma=Activation: dCK, dCMPK, NDK.…
2 PMC7809753 Table…     2 Drug=Daunorubicin (DNR); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1,7, BCRP; Refs.=[…
3 PMC7809753 Table…     3 Drug=Mitoxantrone (MX); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1, BCRP; Refs.=[44,…
4 PMC7809753 Table…     4 Drug=Etoposide (VP-16); Target=Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1-3,6, BCRP; Refs.=[16, …
5 PMC7809753 Table…     5 Drug=Methotrexate (MTX); Target=DHFR, TS, AICARFT; Influx=RFC, PCFT; Metabolisma=Aldehyde oxidase, FPGS (polyglutam…

UPDATE v2

I tried to segregate the same PMCID rows into one based on the below solution.

Convert duplicate rows to separate columns in R

library(splitstackshape)
library(data.table)
DT <- setDT(t1)[, do.call(paste, c(.SD, list(collapse=', '))) , PMCID]
DT1 <- cSplit(DT, 'V1', sep='[ ,]+', fixed=FALSE, stripWhite=TRUE)
setnames(DT1, 2:ncol(DT1), rep(names(t1)[-1], 41))
DT1

So still the problem remains as above how do i separate and segregate those rows corresponding to the list into column or some tabular form as shown in the pic.


回答1:


I think it may be helpful to use tidypmc package with your europepmc output. Here is an example of extracting the first table from your PMC article using pmc_table. This also uses map from purrr in tidyverse.

library(tidypmc)
library(tidyverse)
library(europepmc)

doc <- map("PMC7809753", epmc_ftxt)
tbls <- pmc_table(doc[[1]])
tbls[[1]]

Output

# A tibble: 7 x 6
  Drug                Target           Influx            Metabolisma                                 Efflux         Refs.        
  <chr>               <chr>            <chr>             <chr>                                       <chr>          <chr>        
1 Cytarabine (Ara-C)  DNA polymerases  ENT1, CNT3, OCTN1 "Activation: dCK, dCMPK, NDK. Inactivation… MRP4,7,8       [14, 30–33, …
2 Daunorubicin (DNR)  DNA, Topoisomer… Passive diffusion ""                                          P-gp, MRP1,7,… [44, 51, 81–…
3 Mitoxantrone (MX)   DNA, Topoisomer… Passive diffusion ""                                          P-gp, MRP1, B… [44, 85–90]  
4 Etoposide (VP-16)   Topoisomerase II Passive diffusion ""                                          P-gp, MRP1-3,… [16, 91, 92] 
5 Methotrexate (MTX)  DHFR, TS, AICAR… RFC, PCFT         "Aldehyde oxidase, FPGS (polyglutamylation… P-gp, MRP1-5,… [16, 93, 94] 
6 Venetoclax (VEN)    Bcl-2            Passive diffusion ""                                          P-gp           [72, 95]     
7 Gemtuzumab Ozogami… DNA              Ab-mediated endo… "Lysosomal Calicheamicin cleavage from Ab,… P-gp, MRP1     [73, 77]     

Edit (1/30/21): To automate this process for multiple articles (and based on your other question and approach), consider the following.

You can have a vector containing your pmcids, and use that with map. This will create docs containing all the xml for all the pmcids articles.

Then you can use map again to store all the tables in my_tables, which would be a list.

b <-epmc_search(query = 'cytarabine aml OPEN_ACCESS:Y',limit = 6)
pmcids <- b$pmcid[b$isOpenAccess=="Y"]
docs <- map(pmcids, epmc_ftxt)
my_tables <- map(docs, pmc_table)

You can then access, for example, article 2 table 1 by:

my_tables[[2]][[1]]

Edit (1/31/21): To set the names of each article to the PMCID, you can use set_names, and chain using %>% with map. set_names will add names to your vector. When you call this function, but don't provide additional names, it will use the vector elements as the names. For example:

docs <- pmcids %>%
  set_names() %>%
  map(., epmc_ftxt)

You can call separately my_tables <- map(docs, pmc_table) afterwards, or even add this to the chain (storing the whole thing as my_tables) if only interested in tables, and not the full documents.

Ultimately, you could then access individual tables using the PMCID like this:

my_tables[["PMC7806552"]][[1]]


来源:https://stackoverflow.com/questions/65952017/parsing-of-pmcid-table-row-to-column-form

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!