subset | 易学教程

Subset variables by significant P value

Question: I'm trying to subset variables by significant P-values. I attempted the following code, but it selects all variables instead of selecting by condition. Could anyone help me correct the problem?

    myvars <- names(summary(backward_lm)$coefficients[,4] < 0.05)
    happiness_reduced <- happiness_nomis[myvars]

Thanks!

Answer 1: An alternative solution to Martin's great answer (in the comments section) using the broom package. Unfortunately, you haven't posted any data, so I'm using the mtcars

Check if a vector is a superset of another vector in R

Question: I have the following list of vectors:

    a <- c(1,2,4,5,6,7,8,9)
    b <- c(1,2,4,5)
    c <- c(1,2,3,10,11,12,13,14)
    d <- c(1,2,3,10,15,16,17,18,19)
    e <- c(1,2,3,10,15,16)
    f <- list(a,b,c,d,e)

Right now, I can do something like this:

    is_subset <- vector()
    for(i in 1:length(f)) {
      is_subset <- c(is_subset, all(unlist(f[i]) %in% unlist(f[-i])))
    }
    f[!is_subset]

and get a list containing every vector that is not a subset of any other vector from the original list:

    [[1]]
    [1] 1 2 4 5 6 7 8 9

    [[2]]
    [1] 1 2 3 10
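The filtering described in the question can also be sketched in Python rather than R. This is only an illustration: the helper name `drop_subsets` is hypothetical, and it compares each vector against every other vector individually, whereas the R loop in the question pools all the other vectors together before testing membership. Two identical vectors would both be dropped by this version.

```python
def drop_subsets(vectors):
    """Keep only the vectors that are not a subset of any single other vector."""
    sets = [set(v) for v in vectors]
    kept = []
    for i, current in enumerate(sets):
        is_subset = any(
            current <= other  # True when every element of current is in other
            for j, other in enumerate(sets)
            if j != i
        )
        if not is_subset:
            kept.append(vectors[i])
    return kept


# Same sample data as in the question, written as Python lists.
a = [1, 2, 4, 5, 6, 7, 8, 9]
b = [1, 2, 4, 5]
c = [1, 2, 3, 10, 11, 12, 13, 14]
d = [1, 2, 3, 10, 15, 16, 17, 18, 19]
e = [1, 2, 3, 10, 15, 16]

result = drop_subsets([a, b, c, d, e])
print(result)
```

With this data, `b` is contained in `a` and `e` is contained in `d`, so those two are the ones filtered out.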

Dynamic subset condition in R

Question: I'm trying to implement a function which takes a dynamic subset based on a list of column names of any length. The static code is:

    s <- c("s0","s1","s2")
    d.subset <- d[ d$s0 > 0 | d$s1 > 0 | d$s2 > 0,]

However, I want to generate the d$s0 > 0 | d$s1 > 0 | d$s2 > 0 part based on s. I tried as.formula() for generating it, but it gave me an "invalid formula" error.

Answer 1: An example data frame:

    d <- data.frame(s0 = c(0,1,0,0), s1 = c(1,1,1,0), s2 = c(0,1,1,0))
    s <- c("s0","s1","s2")

Here is an easy

extract attributes from pandas columns that satisfy a condition

Question: Let's say I have a table of frequencies of 3 different variables, M1, M2 and M3, over different instances P1, ..., P4:

    tupl = [(0.7, 0.2, 0.1), (0,0,1), (0.2,0.6,0.2), (0.6,0.4,0)]
    df_test = pd.DataFrame(tupl, columns = ["M1", "M2", "M3"], index =["P1", "P2", "P3", "P4"])

Now, for each row, I want to be able to extract, as a string, the occurrence of each variable, such that the final output would be something like:

    output = pd.DataFrame([("M1+M2+M3"), ("M3"), ("M1+M2+M3"), ("M1+M2")], columns
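One possible way to get there, as a minimal sketch: it assumes the intended rule is "join the names of the columns whose value is greater than zero", which matches the sample output but is an interpretation, since the question is cut off here. The variable name `labels` is introduced only for this illustration.

```python
import pandas as pd

tupl = [(0.7, 0.2, 0.1), (0, 0, 1), (0.2, 0.6, 0.2), (0.6, 0.4, 0)]
df_test = pd.DataFrame(
    tupl, columns=["M1", "M2", "M3"], index=["P1", "P2", "P3", "P4"]
)

# For each row, keep the column names whose frequency is above zero
# (assumed rule) and join them with "+".
labels = df_test.apply(lambda row: "+".join(row[row > 0].index), axis=1)
print(labels)
```

A row-wise `apply` is not the fastest option on large frames, but it keeps the rule easy to read and to change.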

API pagination: is paginating with offset really efficient?

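When designing and implementing an API, returning every result of a query becomes a challenge once the result set contains thousands of records: it puts unnecessary pressure on the server, the client, and the network. That is why pagination exists. We usually paginate with an offset or a page number, and the API serves requests such as:

    GET /api/products?page=10
    {"items": [...100 products]}

To fetch the data that follows, just change the pagination parameter:

    GET /api/products?page=11
    {"items": [...another 100 products]}

With offsets, the familiar ?offset=1000 and ?offset=1100 form is typical. The server either runs an OFFSET 1000 LIMIT 100 SQL query directly against the database, or uses LIMIT multiplied by page as the query parameter. Either way, "this is a suboptimal solution", because every database has to skip the first 1000 rows named by the offset. Skipping those rows carries extra overhead whether the backend is PostgreSQL, ElasticSearch, or MongoDB: the database must sort them, count them, and then throw away the leading rows it does not need. It is an inefficient approach, but because it is simple to use, people keep reaching for it, that is, directly taking the API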
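The page-to-offset mapping described above can be sketched with Python's built-in sqlite3 module. This is an illustration under stated assumptions, not the article's implementation: the `products` table, its columns, the 1,200 sample rows, and the helper name `fetch_page` are all hypothetical, and `page` is treated as zero-based so that page 10 with a limit of 100 maps to OFFSET 1000.

```python
import sqlite3


def fetch_page(conn, page, limit=100):
    """Return one page of products using LIMIT/OFFSET (page is zero-based here)."""
    offset = page * limit  # "LIMIT multiplied by page"
    rows = conn.execute(
        "SELECT id, name FROM products ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()
    return {"items": [{"id": r[0], "name": r[1]} for r in rows]}


# Hypothetical in-memory table with 1,200 rows, only for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO products (id, name) VALUES (?, ?)",
    [(i, f"product-{i}") for i in range(1, 1201)],
)

page_10 = fetch_page(conn, page=10)  # rows after the first 1000
page_11 = fetch_page(conn, page=11)  # rows after the first 1100
print(page_10["items"][0], page_11["items"][0])
```

Each call asks the database to step past `offset` rows before returning anything, which is the skipping cost the text refers to.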

《大秦赋》 is trending! So I scraped the "related data" with Python and found these secrets......

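Preface

Recently the hottest TV series has been 《大秦赋》. After it premiered on December 1, it earned a good reputation. As more episodes aired, however, the show set off heated discussion online: its reputation dropped sharply, a case of starting high and sliding low, and its rating fell from an initial 8.9 to the current 6.5. I have not watched the show yet, but I am quite interested in what everyone is discussing (mainly because people keep talking about it). So I used Python to scrape data related to 《大秦赋》 and ran some analysis.

Data scraping

As the saying goes, even the cleverest cook cannot make a meal without rice: the most important step before any data analysis is getting the data. I therefore set out to use Python to scrape the short reviews on Douban, along with the comment times and star ratings. The main points about the scraping are as follows.

1) Pagination

Page 1: https://movie.douban.com/subject/26413293/comments?status=P
Page 2: https://movie.douban.com/subject/26413293/comments?start=20&limit=20&status=P&sort=new_score
Page 3: https://movie.douban.com/subject/26413293/comments?start=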
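The page 2 URL carries start=20 and limit=20. Assuming each later page advances start by another 20 (an assumption, since the third URL is cut off in this excerpt), the URLs could be built as in the sketch below. The helper name `comment_page_url` is hypothetical, and for page 1 it emits start=0 rather than the shorter form shown above; treating the two as equivalent is also an assumption. No request is sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/subject/26413293/comments"
PAGE_SIZE = 20  # taken from limit=20 in the page 2 URL


def comment_page_url(page):
    """Build the comments URL for a 1-based page number (assumed pattern)."""
    params = {
        "start": (page - 1) * PAGE_SIZE,  # assumed: start grows by 20 per page
        "limit": PAGE_SIZE,
        "status": "P",
        "sort": "new_score",
    }
    return f"{BASE_URL}?{urlencode(params)}"


for page in (1, 2, 3):
    print(comment_page_url(page))
```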

Make a table showing the 10 largest values of a variable in R?

Question: I want to make a simple table that showcases the largest 10 values for a given variable in my dataset, as well as 4 other variables for each observation, so basically a small subset of my data. It would look something like this:

    Score  District  Age  Group  Gender
    17     B         23   Red    1
    12     A         61   Red    0
    11.7   A         18   Blue   0
    10     B         18   Red    0
    . . etc.

whereby the data is ordered on the Score var. All the data is contained within the same dataframe.

Answer 1: This should do it...

    data <- data[with(data,order(-Score)),]

All ways to partition a string

Question: I'm trying to find an efficient algorithm to get all ways to partition a string, e.g. for a given string 'abcd' =>

    'a' 'bcd'
    'a' 'b' 'cd'
    'a' 'b' 'c' 'd'
    'ab' 'cd'
    'ab' 'c' 'd'
    'abc' 'd'
    'a' 'bc' 'd'

Any language would be appreciated. Thanks in advance!

Answer 1: Problem analysis

Between each pair of adjacent characters, you can decide whether to cut. For a string of size n, there are n-1 positions where you can cut or not, i.e. there are two possibilities at each position. Therefore there are 2^(n-1) partitions for
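The cut-or-not view in the analysis maps directly onto a bitmask over the n-1 gaps between characters. The following is a minimal sketch of that idea, not necessarily the code the answer goes on to give; the function name `partitions` is chosen here for illustration. It also yields the uncut string itself as one of the 2^(n-1) results.

```python
def partitions(s):
    """Yield every way to split s into contiguous, non-empty pieces."""
    n = len(s)
    if n == 0:
        yield []
        return
    # Each bit of mask decides whether to cut in one of the n-1 gaps.
    for mask in range(1 << (n - 1)):
        parts, start = [], 0
        for i in range(n - 1):
            if mask & (1 << i):  # cut between s[i] and s[i + 1]
                parts.append(s[start:i + 1])
                start = i + 1
        parts.append(s[start:])  # the piece after the last cut
        yield parts


all_parts = list(partitions("abcd"))
for parts in all_parts:
    print(parts)
```

Because the output itself has 2^(n-1) entries, any algorithm that lists them all takes exponential time; a generator at least avoids holding every partition in memory at once.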

R language study notes, part 10

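Abstract: These notes only record my process of learning R. Outline: descriptive statistics; t-tests; data transformation; analysis of variance; chi-square tests; regression analysis and model diagnostics; survival analysis; Cox regression.

A note before the main text: as far as the basics are concerned, this is the final installment. The notes come from the 医学方 course and are only for the purpose of learning R.

Main text: descriptive statistics

- How to generate Table 1: use the table() function to tabulate frequencies quickly.
  - Generate a 2x2 table with table(row variable, column variable):

    > table(tips$sex,tips$smoker)
             No Yes
      Female 54  33
      Male   97  60

  - The addmargins() function computes marginal totals for the table produced by table():

    > table(esoph$agegp,esoph$ncases)
             0  1  2  3  4  5  6  8  9 17
      25-34 14  1  0  0  0  0  0  0  0  0
      35-44 10  2  2  1  0  0  0  0  0  0
      45-54  3  2  2  2  3  2  2  0  0  0
      55-64  0  0  2  4  3  2  2  1  2  0
      65-74  1  4  2  2  2  2  1  0  0  1
      75+    1  7  3  0  0  0  0  0  0  0
    > tt <- table(esoph$agegp,esoph$ncases)
    > addmargins(tt,margin = c(1,2)) # margin: 1 means rows, 2 means columns
             0  1  2  3  4  5  6  8  9 17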
