Question
dput(t1)
structure(list(PMCID = c("PMC7809753", "PMC7809753", "PMC7809753",
"PMC7809753", "PMC7809753", "PMC7790830", "PMC7790830", "PMC7790830",
"PMC7790830", "PMC7790830"), table = c("Table 1", "Table 1",
"Table 1", "Table 1", "Table 1", "Table 1", "Table 1", "Table 1",
"Table 1", "Table 1"), row = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L,
4L, 5L), text = c("Drug=Cytarabine (Ara-C); Target=DNA polymerases; Influx=ENT1, CNT3, OCTN1; Metabolisma=Activation: dCK, dCMPK, NDK. Inactivation: CDA, dCMPD, PN-I.; Efflux=MRP4,7,8; Refs.=[14, 30–33, 78–80]",
"Drug=Daunorubicin (DNR); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1,7, BCRP; Refs.=[44, 51, 81–84]",
"Drug=Mitoxantrone (MX); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1, BCRP; Refs.=[44, 85–90]",
"Drug=Etoposide (VP-16); Target=Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1-3,6, BCRP; Refs.=[16, 91, 92]",
"Drug=Methotrexate (MTX); Target=DHFR, TS, AICARFT; Influx=RFC, PCFT; Metabolisma=Aldehyde oxidase, FPGS (polyglutamylation); Efflux=P-gp, MRP1-5, BCRP; Refs.=[16, 93, 94]",
"Patients no.=1; Age (years)=45; Gender=M; FAB subtype=M2; Cell count(×109/l): WBC=30.1; Cell count(×109/l): HB=87; Cell count(×109/l): PLT=9; BM Blast (%)=70.5; Karyotype=46,XX,t(8,21)(q22;q22)",
"Patients no.=2; Age (years)=41; Gender=F; FAB subtype=M5; Cell count(×109/l): WBC=14.58; Cell count(×109/l): HB=103; Cell count(×109/l): PLT=62; BM Blast (%)=60.4; Karyotype=46,XX",
"Patients no.=3; Age (years)=49; Gender=M; FAB subtype=M4; Cell count(×109/l): WBC=4.84; Cell count(×109/l): HB=69; Cell count(×109/l): PLT=100; BM Blast (%)=88; Karyotype=45,XY,-7",
"Patients no.=4; Age (years)=65; Gender=M; FAB subtype=M5; Cell count(×109/l): WBC=220; Cell count(×109/l): HB=85; Cell count(×109/l): PLT=52; BM Blast (%)=86.8; Karyotype=46,XY",
"Patients no.=5; Age (years)=61; Gender=F; FAB subtype=M5; Cell count(×109/l): WBC=4.61; Cell count(×109/l): HB=71; Cell count(×109/l): PLT=197; BM Blast (%)=32.4; Karyotype=46,XX"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
The above is my sample data frame, which looks like this:
head(t1)
# A tibble: 6 x 4
PMCID table row text
<chr> <chr> <int> <chr>
1 PMC7809753 Table… 1 Drug=Cytarabine (Ara-C); Target=DNA polymerases; Influx=ENT1, CNT3, OCTN1; Metabolisma=Activation: dCK, dCMPK, NDK.…
2 PMC7809753 Table… 2 Drug=Daunorubicin (DNR); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1,7, BCRP; Refs.=[…
3 PMC7809753 Table… 3 Drug=Mitoxantrone (MX); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1, BCRP; Refs.=[44,…
4 PMC7809753 Table… 4 Drug=Etoposide (VP-16); Target=Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1-3,6, BCRP; Refs.=[16, …
5 PMC7809753 Table… 5 Drug=Methotrexate (MTX); Target=DHFR, TS, AICARFT; Influx=RFC, PCFT; Metabolisma=Aldehyde oxidase, FPGS (polyglutam…
6 PMC7790830 Table… 1 Patients no.=1; Age (years)=45; Gender=M; FAB subtype=M2; Cell count(×109/l): WBC=30.1; Cell count(×109/l): HB=87; …
For example, take paper PMC7809753, whose output is shown above. In the paper, the first table is "Properties of the chemotherapeutic drugs used in AML". In my data frame, Table 1 of PMC7809753 is repeated 5 times, once for each row of that table in the picture I attached.
Now the issue is: how do I parse each table of a particular PMCID into a tabular, column-like structure as shown in the paper?
UPDATE: Based on the PMCID, I can split the rows into a list.
aa <- split(t1, f = t1$PMCID)
which gives me this
$PMC7790830
# A tibble: 5 x 4
PMCID table row text
<chr> <chr> <int> <chr>
1 PMC7790830 Table… 1 Patients no.=1; Age (years)=45; Gender=M; FAB subtype=M2; Cell count(×109/l): WBC=30.1; Cell count(×109/l): HB=87; …
2 PMC7790830 Table… 2 Patients no.=2; Age (years)=41; Gender=F; FAB subtype=M5; Cell count(×109/l): WBC=14.58; Cell count(×109/l): HB=103…
3 PMC7790830 Table… 3 Patients no.=3; Age (years)=49; Gender=M; FAB subtype=M4; Cell count(×109/l): WBC=4.84; Cell count(×109/l): HB=69; …
4 PMC7790830 Table… 4 Patients no.=4; Age (years)=65; Gender=M; FAB subtype=M5; Cell count(×109/l): WBC=220; Cell count(×109/l): HB=85; C…
5 PMC7790830 Table… 5 Patients no.=5; Age (years)=61; Gender=F; FAB subtype=M5; Cell count(×109/l): WBC=4.61; Cell count(×109/l): HB=71; …
$PMC7809753
# A tibble: 5 x 4
PMCID table row text
<chr> <chr> <int> <chr>
1 PMC7809753 Table… 1 Drug=Cytarabine (Ara-C); Target=DNA polymerases; Influx=ENT1, CNT3, OCTN1; Metabolisma=Activation: dCK, dCMPK, NDK.…
2 PMC7809753 Table… 2 Drug=Daunorubicin (DNR); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1,7, BCRP; Refs.=[…
3 PMC7809753 Table… 3 Drug=Mitoxantrone (MX); Target=DNA, Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1, BCRP; Refs.=[44,…
4 PMC7809753 Table… 4 Drug=Etoposide (VP-16); Target=Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1-3,6, BCRP; Refs.=[16, …
5 PMC7809753 Table… 5 Drug=Methotrexate (MTX); Target=DHFR, TS, AICARFT; Influx=RFC, PCFT; Metabolisma=Aldehyde oxidase, FPGS (polyglutam…
UPDATE v2
I tried to combine the rows that share a PMCID into one, based on the solution in "Convert duplicate rows to separate columns in R":
library(splitstackshape)
library(data.table)
DT <- setDT(t1)[, do.call(paste, c(.SD, list(collapse=', '))) , PMCID]
DT1 <- cSplit(DT, 'V1', sep='[ ,]+', fixed=FALSE, stripWhite=TRUE)
setnames(DT1, 2:ncol(DT1), rep(names(t1)[-1], 41))
DT1
So the problem remains the same: how do I separate the rows belonging to each list element into columns, or some tabular form, as shown in the picture?
Answer 1:
I think it may be helpful to use the tidypmc package with your europepmc output. Here is an example of extracting the first table from your PMC article using pmc_table. This also uses map from purrr, which is part of the tidyverse.
library(tidypmc)
library(tidyverse)
library(europepmc)
doc <- map("PMC7809753", epmc_ftxt)
tbls <- pmc_table(doc[[1]])
tbls[[1]]
Output
# A tibble: 7 x 6
Drug Target Influx Metabolisma Efflux Refs.
<chr> <chr> <chr> <chr> <chr> <chr>
1 Cytarabine (Ara-C) DNA polymerases ENT1, CNT3, OCTN1 "Activation: dCK, dCMPK, NDK. Inactivation… MRP4,7,8 [14, 30–33, …
2 Daunorubicin (DNR) DNA, Topoisomer… Passive diffusion "" P-gp, MRP1,7,… [44, 51, 81–…
3 Mitoxantrone (MX) DNA, Topoisomer… Passive diffusion "" P-gp, MRP1, B… [44, 85–90]
4 Etoposide (VP-16) Topoisomerase II Passive diffusion "" P-gp, MRP1-3,… [16, 91, 92]
5 Methotrexate (MTX) DHFR, TS, AICAR… RFC, PCFT "Aldehyde oxidase, FPGS (polyglutamylation… P-gp, MRP1-5,… [16, 93, 94]
6 Venetoclax (VEN) Bcl-2 Passive diffusion "" P-gp [72, 95]
7 Gemtuzumab Ozogami… DNA Ab-mediated endo… "Lysosomal Calicheamicin cleavage from Ab,… P-gp, MRP1 [73, 77]
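If fetching the article again is not an option, the text column from the question can also be reshaped directly, since each row is a "key=value; key=value" string. The following is only a minimal base R sketch under that assumption; parse_row and rows_to_table are hypothetical helper names, not part of tidypmc or europepmc, and the two sample strings are copied from the question's data.

```r
# Hypothetical helper: turn one "key=value; key=value" string into a named vector.
parse_row <- function(x) {
  parts <- strsplit(x, "; ", fixed = TRUE)[[1]]
  keys <- sub("=.*$", "", parts)     # everything before the first "="
  vals <- sub("^[^=]*=", "", parts)  # everything after the first "="
  setNames(vals, keys)
}

# Hypothetical helper: stack the parsed rows of one table into a data frame.
# Keys that are missing from a row become NA in that row.
rows_to_table <- function(text) {
  parsed <- lapply(text, parse_row)
  cols <- unique(unlist(lapply(parsed, names)))
  mat <- do.call(rbind, lapply(parsed, function(p) unname(p[cols])))
  df <- as.data.frame(mat, stringsAsFactors = FALSE)
  names(df) <- cols
  df
}

# Two rows taken from the question's sample data
text <- c(
  "Drug=Etoposide (VP-16); Target=Topoisomerase II; Influx=Passive diffusion; Efflux=P-gp, MRP1-3,6, BCRP; Refs.=[16, 91, 92]",
  "Drug=Methotrexate (MTX); Target=DHFR, TS, AICARFT; Influx=RFC, PCFT; Metabolisma=Aldehyde oxidase, FPGS (polyglutamylation); Efflux=P-gp, MRP1-5, BCRP; Refs.=[16, 93, 94]"
)
tbl <- rows_to_table(text)
tbl
```

With the question's t1, something like lapply(split(t1$text, t1$PMCID), rows_to_table) should give one table per PMCID. Note that this relies on "; " never appearing inside a value, which holds for the sample rows but may not hold for every article.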
Edit (1/30/21): To automate this process for multiple articles (and based on your other question and approach), consider the following.
You can have a vector, pmcids, containing your PMCIDs, and use that with map. This will create docs, containing the XML for all the articles in pmcids.
Then you can use map again to store all the tables in my_tables, which will be a list.
b <- epmc_search(query = 'cytarabine aml OPEN_ACCESS:Y', limit = 6)
pmcids <- b$pmcid[b$isOpenAccess=="Y"]
docs <- map(pmcids, epmc_ftxt)
my_tables <- map(docs, pmc_table)
You can then access, for example, article 2 table 1 by:
my_tables[[2]][[1]]
Edit (1/31/21): To set the name of each article to its PMCID, you can use set_names, and chain it with map using %>%. set_names will add names to your vector. When you call this function without providing additional names, it will use the vector elements as the names. For example:
docs <- pmcids %>%
set_names() %>%
map(., epmc_ftxt)
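To see what set_names does here without downloading anything, the same chain can be tried with a stand-in function. This is only an illustrative sketch: nchar takes the place of epmc_ftxt, and the two IDs are the ones from the question.

```r
library(purrr)

ids <- c("PMC7809753", "PMC7790830")

# nchar is a stand-in for epmc_ftxt, so no article is fetched
res <- ids %>%
  set_names() %>%
  map(nchar)

names(res)           # the list elements are named after the IDs
res[["PMC7809753"]]  # so each result can be looked up by PMCID
```

Because map keeps the names of its input, any later map over the result keeps the PMCID names as well.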
You can call my_tables <- map(docs, pmc_table) separately afterwards, or even add this to the chain (storing the whole thing as my_tables) if you are only interested in the tables and not the full documents.
Ultimately, you could then access individual tables using the PMCID like this:
my_tables[["PMC7806552"]][[1]]
Source: https://stackoverflow.com/questions/65952017/parsing-of-pmcid-table-row-to-column-form