subset

Subset variables by significant P value

帅比萌擦擦* submitted on 2021-01-28 01:32:28
Question: I'm trying to subset variables by significant P-values. I attempted it with the following code, but it selects all variables instead of selecting by the condition. Could anyone help me correct the problem?

    myvars <- names(summary(backward_lm)$coefficients[,4] < 0.05)
    happiness_reduced <- happiness_nomis[myvars]

Thanks!

Answer 1: An alternative solution to Martin's great answer (in the comments section) using the broom package. Unfortunately, you haven't posted any data, so I'm using the mtcars

Check if a vector is a superset of another vector in R

ぃ、小莉子 submitted on 2021-01-27 20:06:18
Question: I have the following list of vectors:

    a <- c(1,2,4,5,6,7,8,9)
    b <- c(1,2,4,5)
    c <- c(1,2,3,10,11,12,13,14)
    d <- c(1,2,3,10,15,16,17,18,19)
    e <- c(1,2,3,10,15,16)
    f <- list(a,b,c,d,e)

Right now, I can do something like this:

    is_subset <- vector()
    for(i in 1:length(f)) {
      is_subset <- c(is_subset, all(unlist(f[i]) %in% unlist(f[-i])))
    }
    f[!is_subset]

and get a list containing every vector that is not a subset of any other vector from the original list:

    [[1]]
    [1] 1 2 4 5 6 7 8 9
    [[2]]
    [1] 1 2 3 10

Dynamic subset condition in R

青春壹個敷衍的年華 submitted on 2021-01-27 19:21:59
Question: I'm trying to implement a function which takes a dynamic subset based on a list of column names of any length. The static code is:

    s <- c("s0","s1","s2")
    d.subset <- d[ d$s0 > 0 | d$s1 > 0 | d$s2 > 0,]

However, I want to generate the d$s0 > 0 | d$s1 > 0 | d$s2 > 0 part based on s. I tried as.formula() for generating it, but it gave me an "invalid formula" error.

Answer 1: An example data frame:

    d <- data.frame(s0 = c(0,1,0,0), s1 = c(1,1,1,0), s2 = c(0,1,1,0))
    s <- c("s0","s1","s2")

Here is an easy

extract attributes from pandas columns that satisfy a condition

女生的网名这么多〃 submitted on 2021-01-27 13:14:27
Question: Let's say I have a table of frequencies of 3 different variables, M1, M2 and M3, over different instances P1, ..., P4:

    tupl = [(0.7, 0.2, 0.1), (0,0,1), (0.2,0.6,0.2), (0.6,0.4,0)]
    df_test = pd.DataFrame(tupl, columns = ["M1", "M2", "M3"], index =["P1", "P2", "P3", "P4"])

Now, for each row, I want to extract as a string the variables that occur in it, such that the final output would be something like:

    output = pd.DataFrame([("M1+M2+M3"), ("M3"), ("M1+M2+M3"), ("M1+M2")], columns
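One possible way to produce the strings described in the question, offered as a sketch rather than the accepted answer. It assumes that a variable "occurs" in a row when its frequency is greater than zero. The helper `present_columns` and the output column name `present` are hypothetical, since the column name in the question is cut off.

```python
import pandas as pd

tupl = [(0.7, 0.2, 0.1), (0, 0, 1), (0.2, 0.6, 0.2), (0.6, 0.4, 0)]
df_test = pd.DataFrame(tupl, columns=["M1", "M2", "M3"],
                       index=["P1", "P2", "P3", "P4"])


def present_columns(row):
    """Join the names of all columns whose value in this row is > 0."""
    return "+".join(name for name, value in row.items() if value > 0)


# Apply the helper row by row (axis=1) and wrap the result in a DataFrame.
# "present" is a hypothetical column name chosen for this sketch.
output = df_test.apply(present_columns, axis=1).to_frame(name="present")
print(output)
```

Row-wise `apply` is easy to read but not the fastest option on large frames; a vectorized boolean mask would scale better.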

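API pagination: is paginating with offset really efficient?

浪尽此生 submitted on 2021-01-25 13:46:55
When designing and implementing an API, returning every result of a query can be a challenge once the result set contains thousands of records: it puts unnecessary pressure on the server, the client, and the network. That is why pagination exists.

Usually we paginate with an offset or a page number, and the API handles a request like:

    GET /api/products?page=10
    {"items": [...100 products]}

To fetch the following data, just change the pagination parameter:

    GET /api/products?page=11
    {"items": [...another 100 products]}

When an offset is used, the familiar form is ?offset=1000 and ?offset=1100. The API either runs a SQL query with OFFSET 1000 LIMIT 100 directly against the database, or uses LIMIT multiplied by page as the query parameter.

Either way, "this is a suboptimal solution", because every database has to skip the first 1000 rows specified by the offset. Skipping those extra rows has a cost whether the store is PostgreSQL, Elasticsearch, or MongoDB: the database has to sort them, count them, and then throw away the leading rows that are not needed.

This is an inefficient approach, but because it is simple to use, people keep repeating it, that is, directly taking the API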
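The mapping from a page number to an OFFSET/LIMIT query described above can be sketched as follows. This is an illustration only: the `products` table, its columns, the `fetch_page` helper, and the in-memory SQLite database are all assumptions, not part of the original article. The page number is multiplied by the page size, so page=10 becomes OFFSET 1000.

```python
import sqlite3

PAGE_SIZE = 100  # assumed page size, matching the 100-item pages in the text


def fetch_page(conn, page, page_size=PAGE_SIZE):
    """Return one page of rows using OFFSET/LIMIT (offset = page * page_size)."""
    offset = page * page_size
    cur = conn.execute(
        "SELECT id, name FROM products ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return cur.fetchall()


# Hypothetical demo data: 1200 products in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO products (id, name) VALUES (?, ?)",
    [(i, f"product-{i}") for i in range(1, 1201)],
)

page_10 = fetch_page(conn, 10)   # skips the first 1000 rows
page_11 = fetch_page(conn, 11)   # skips the first 1100 rows
print(page_10[0], page_11[0])
```

The query stays the same for every page; only the offset grows, and with it the number of rows the database must step over before returning anything.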

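《大秦赋》 is hugely popular right now! So I used Python to scrape the "related data" and discovered these secrets...

给你一囗甜甜゛ submitted on 2021-01-24 13:05:47
Preface

Lately the hottest TV drama has been 《大秦赋》. After it premiered on December 1, it earned a good reputation. As more episodes aired, however, the show sparked heated discussion online: its reputation fell sharply, showing a clear "strong start, weak finish" trend, and its rating dropped from an initial 8.9 to the current 6.5.

I have not watched the new drama yet, but I am quite interested in what everyone is saying about it (mainly because people keep discussing it). So I used Python to scrape data related to 《大秦赋》 and ran an analysis on it.

Data scraping

Even the cleverest cook cannot make a meal without rice: the most important step before any data analysis is "getting the data". So I planned to use Python to scrape the short reviews on Douban, together with the review timestamps and the star ratings.

The main points about the scraping are as follows:

1) Pagination

Page 1: https://movie.douban.com/subject/26413293/comments?status=P
Page 2: https://movie.douban.com/subject/26413293/comments?start=20&limit=20&status=P&sort=new_score
Page 3: https://movie.douban.com/subject/26413293/comments?start=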
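The two complete URLs listed above suggest that each page advances `start` by 20. A minimal sketch of building the page URLs under that assumption is shown below. The `comments_url` helper is hypothetical, and the value of `start` for pages beyond the second is inferred from the pattern rather than taken from the article. No request is sent here.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/subject/26413293/comments"


def comments_url(page, per_page=20):
    """Build the comments URL for a 1-based page number.

    Assumption: `start` grows by `per_page` for each page, as the first
    two URLs in the article suggest.
    """
    if page == 1:
        return f"{BASE_URL}?status=P"
    params = {
        "start": (page - 1) * per_page,
        "limit": per_page,
        "status": "P",
        "sort": "new_score",
    }
    return f"{BASE_URL}?{urlencode(params)}"


for page in (1, 2):
    print(comments_url(page))
```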

Make a table showing the 10 largest values of a variable in R?

心已入冬 submitted on 2021-01-20 16:54:02
Question: I want to make a simple table that showcases the largest 10 values for a given variable in my dataset, as well as 4 other variables for each observation, so basically a small subset of my data. It would look something like this:

    Score  District  Age  Group  Gender
    17     B         23   Red    1
    12     A         61   Red    0
    11.7   A         18   Blue   0
    10     B         18   Red    0
    . . etc.

whereby the data is ordered on the Score var. All the data is contained within the same dataframe.

Answer 1: This should do it...

    data <- data[with(data,order(-Score)),]

Make a table showing the 10 largest values of a variable in R?

喜欢而已 submitted on 2021-01-20 16:53:12
Question: I want to make a simple table that showcases the largest 10 values for a given variable in my dataset, as well as 4 other variables for each observation, so basically a small subset of my data. It would look something like this:

    Score  District  Age  Group  Gender
    17     B         23   Red    1
    12     A         61   Red    0
    11.7   A         18   Blue   0
    10     B         18   Red    0
    . . etc.

whereby the data is ordered on the Score var. All the data is contained within the same dataframe.

Answer 1: This should do it...

    data <- data[with(data,order(-Score)),]

All ways to partition a string

半腔热情 submitted on 2021-01-17 11:11:20
Question: I'm trying to find an efficient algorithm to get all ways to partition a string, e.g. for a given string 'abcd' =>

    'a' 'bcd'
    'a' 'b' 'cd'
    'a' 'b' 'c' 'd'
    'ab' 'cd'
    'ab' 'c' 'd'
    'abc' 'd'
    'a' 'bc' 'd'

Any language would be appreciated. Thanks in advance!

Answer 1: Problem analysis. Between each pair of adjacent characters, you can decide whether to cut. For a string of size n, there are n-1 positions where you can cut or not, i.e. there are two possibilities at each position. Therefore there are 2^(n-1) partitions for
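The cut-or-no-cut argument in the answer can be sketched directly: enumerate every combination of decisions at the n-1 gaps and slice the string accordingly. This is an illustration of that counting argument, not necessarily the code from the original answer, and the function name `partitions` is a hypothetical choice. Unlike the list in the question, it also returns the uncut string itself, which corresponds to choosing "no cut" at every gap.

```python
from itertools import product


def partitions(s):
    """Return every way to split s into contiguous, non-empty pieces."""
    if not s:
        return [[]]
    results = []
    # One boolean per gap between adjacent characters: True means "cut here".
    for cuts in product([False, True], repeat=len(s) - 1):
        parts, start = [], 0
        for position, cut in enumerate(cuts, start=1):
            if cut:
                parts.append(s[start:position])
                start = position
        parts.append(s[start:])  # the final piece after the last cut
        results.append(parts)
    return results


for parts in partitions("abcd"):
    print(parts)
```

Because the output itself has 2^(n-1) entries, no algorithm can do better than exponential time in n; the enumeration above does a linear amount of work per partition.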

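R language study notes, part 10

家住魔仙堡 submitted on 2021-01-06 07:23:35
Abstract: this post only records my process of learning R.

Outline: descriptive statistics; t-test; data transformation; analysis of variance; chi-squared test; regression analysis and model diagnostics; survival analysis; Cox regression.

A note before the main text: as far as the basics are concerned, this is the final installment. The notes come from the 医学方 course and are only meant to document learning R.

Main text:

Descriptive statistics

- How to generate Table 1: use the table() function to quickly summarize frequencies
  - Generate a 2x2 table: table(row variable, column variable)

    > table(tips$sex,tips$smoker)
             No Yes
      Female 54  33
      Male   97  60

  - The addmargins() function: computes margins on the generated table

    > table(esoph$agegp,esoph$ncases)
             0  1  2  3  4  5  6  8  9 17
      25-34 14  1  0  0  0  0  0  0  0  0
      35-44 10  2  2  1  0  0  0  0  0  0
      45-54  3  2  2  2  3  2  2  0  0  0
      55-64  0  0  2  4  3  2  2  1  2  0
      65-74  1  4  2  2  2  2  1  0  0  1
      75+    1  7  3  0  0  0  0  0  0  0
    > tt <- table(esoph$agegp,esoph$ncases)
    > addmargins(tt,margin = c(1,2)) # margin 1 means rows, 2 means columns
             0  1  2  3  4  5  6  8  9 17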