subset

subset by at least two out of multiple conditions

人盡茶涼 submitted on 2020-04-21 05:47:29
Question: I found many questions dealing with subsetting by multiple conditions, but I just couldn't find how to subset by at least two out of more than two conditions. This SO question deals with the same problem, but applies the same condition to all columns: Select rows with at least two conditions from all conditions. My question is: how can I subset rows by at least two out of three different conditions?

id <- c(1,2,3,4,5)
V1 <- c(2,4,4,9,7)
V2 <- c(10,20,20,30,20)
V3 <- c(0.7,0.1,0.5,0.2,0.9)
df <- data.frame(cbind(id, V1, V2, V3))
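
For comparison only (the question is about R), the "at least two of three conditions" idea expressed as a pandas sketch: turn each condition into a boolean column and keep rows whose booleans sum to 2 or more. The cutoffs are illustrative, since the excerpt does not state them.

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'V1': [2, 4, 4, 9, 7],
                   'V2': [10, 20, 20, 30, 20],
                   'V3': [0.7, 0.1, 0.5, 0.2, 0.9]})

# Three illustrative conditions, one per column
conds = pd.concat([df['V1'] > 3, df['V2'] >= 20, df['V3'] < 0.5], axis=1)

# Keep rows where at least two of the three conditions hold
result = df[conds.sum(axis=1) >= 2]
print(result)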

MODIS Series, NDVI (MOD13Q1), Part 4: Single and Batch Data Processing with MRT

╄→尐↘猪︶ㄣ submitted on 2020-04-18 09:05:24
Preface: the starting point of this series is that I worked on related research before and was frustrated by how scarce material on this topic is, so I want to write a series of posts. My own experience is limited, so please forgive any shortcomings; I hope we can learn from each other.
Data preparation: the MODIS data product MOD13Q1, using March, April and May 2010 over Henan Province as the example.
Part 1: Processing a single file in MRT
(1) Working in the GUI
1. Load one raw .hdf file to be processed.
2. Use the left/right selectors to keep the bands you need (MOD13Q1 already provides NDVI, so you only need to keep that band selected). (If you compute NDVI from a product such as MOD12Q1 instead, the MODIS formula is NDVI = (Band2 - Band1) / (Band2 + Band1), so you would extract the Band2 and Band1 bands.)
3. Spatial Subset: choose Input Lat/Long (input latitude/longitude), input line/sample, or the output projection X/Y.
Steps 4 and 5 go together.
4. Choose the path to save the output to.
5. Output data type: the path I save to (it must be in the same folder as the original .hdf data) is F:\MODIS\.tif (note: just append \.tif after the MODIS folder; the generated .tif file gets the same name as the .hdf file in that folder, which I recommend, since with many files recognizable naming matters).
6. Output file type: GEOTIFF (the .tif data file is exactly what we want).
7
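
As an aside on the NDVI formula quoted in step 2 (for products that do not ship NDVI directly), a minimal numpy sketch with made-up reflectance values; in the MODIS convention Band1 is red and Band2 is near infrared:

import numpy as np

def ndvi(band1_red, band2_nir):
    # NDVI = (Band2 - Band1) / (Band2 + Band1), guarding against division by zero
    band1_red = np.asarray(band1_red, dtype=float)
    band2_nir = np.asarray(band2_nir, dtype=float)
    denom = band2_nir + band1_red
    return np.divide(band2_nir - band1_red, denom,
                     out=np.full_like(denom, np.nan), where=denom != 0)

# Made-up reflectance values for illustration
red = np.array([0.05, 0.10, 0.20])
nir = np.array([0.40, 0.35, 0.25])
print(ndvi(red, nir))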

Introduction to Machine Learning - Numeric Features - Applying a Log Transform to the Data

允我心安 submitted on 2020-04-16 08:30:53
For some labels and features the distribution is not normal, yet the computations we want to run assume roughly normally distributed data. We therefore log-transform the feature so that the data comes closer to a normal distribution. The log transform is simply np.log(data + 1); the + 1 prevents values equal to 0, which cannot be log-transformed, from breaking the transform.
Code:
Step 1: load the data.
Step 2: plot a histogram of the income feature and mark the median with a vertical line.
Step 3: log-transform the income feature with np.log(data + 1).
Step 4: plot a histogram of the log-transformed income and mark its median.
Conclusion: the transformed feature is noticeably closer to a normal distribution.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Step 1: load the data
ffc_survey_df = pd.read_csv('datasets/fcc_2016_coder_survey_subset.csv', encoding='utf-8')
# Step 2: histogram of the income column
fig, ax = plt.subplots()
ffc_survey_df['Income'].hist(color='#A9C5D3', bins=30)
plt.axvline(ffc_survey_df['Income'].median(), color='r')  # mark the median
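
A self-contained sketch of the same four steps, assuming the CSV path and the 'Income' column from the post exist as shown; np.log1p(x) is equivalent to np.log(x + 1):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Step 1: load the data (path and column name taken from the post)
df = pd.read_csv('datasets/fcc_2016_coder_survey_subset.csv', encoding='utf-8')

# Step 2: histogram of the raw income, with the median marked
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df['Income'].hist(ax=axes[0], color='#A9C5D3', bins=30)
axes[0].axvline(df['Income'].median(), color='red')
axes[0].set_title('Income')

# Step 3: log transform; log1p(x) == log(x + 1), so zeros are handled
df['Income_log'] = np.log1p(df['Income'])

# Step 4: histogram of the log-transformed income, with its median marked
df['Income_log'].hist(ax=axes[1], color='#A9C5D3', bins=30)
axes[1].axvline(df['Income_log'].median(), color='red')
axes[1].set_title('log(Income + 1)')

plt.tight_layout()
plt.show()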

Commonly Used Python Notes

送分小仙女□ submitted on 2020-04-14 13:30:09
Reference: http://blog.sina.com.cn/s/blog_73b339390102yoio.html
PE (price-to-earnings ratio) = share price / earnings per share
PEG (PE-to-earnings-growth ratio) = PE / (annual earnings growth rate * 100)
PB (price-to-book ratio) = share price / net assets per share
PS (price-to-sales ratio) = share price / revenue per share = total market capitalization / sales revenue
ROE (return on equity) = net profit for the reporting period / net assets at the end of the reporting period
EPS (earnings per share) = earnings / shares outstanding
Beta (the beta coefficient); earnings per share can also be computed as period-end net profit / period-end total share capital

import math
average_annual_return = (pow(final_value / principal, 1 / years) - 1) * 100
principal_plus_returns = pow(1 + expected_annual_return, years) * principal
years_to_target = math.log(final_value / principal) / math.log(1 + expected_annual_return)

Time conversion:
import time
a = '2020-03-06 19:18:00'
a1 = time.strptime(a, '%Y-%m-%d %H:%M:%S')  # parse the string into a time struct
print(time.strftime('%Y%m%d', a1))          # format the time struct back into a string
print(time.asctime(time.localtime(time.time())))  # current local time as a readable string
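
A small runnable version of the three investment formulas above, with made-up numbers for illustration:

import math

principal = 10000.0              # initial investment (illustrative)
final_value = 16105.0            # value at the end of the holding period (illustrative)
years = 5
expected_annual_return = 0.10    # 10% per year (illustrative)

# Average annual return, in percent
avg_annual_return = (pow(final_value / principal, 1 / years) - 1) * 100

# Principal plus returns after `years` at the expected rate
future_value = pow(1 + expected_annual_return, years) * principal

# Years needed to grow `principal` into `final_value` at the expected rate
years_to_target = math.log(final_value / principal) / math.log(1 + expected_annual_return)

print(round(avg_annual_return, 2), round(future_value, 2), round(years_to_target, 2))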

pyspark dataframe deduplication

不羁的心 submitted on 2020-04-10 13:12:13
Deduplicating a PySpark DataFrame. There are two kinds of deduplication: one removes rows that are identical in every column, the other removes rows that are identical in one or more chosen columns.

Deduplicate whole rows:
dataframe1 = dataframe1.distinct()

Deduplicate on one or several columns:
df = df.select("course_id", "user_id", "course_name")
# keyed on a single column
df1 = df.dropDuplicates(subset=[c for c in df.columns if c in ["course_id"]])
# keyed on multiple columns
df2 = df.dropDuplicates(subset=[c for c in df.columns if c in ["course_id", "course_name"]])

Original article: https://blog.csdn.net/weixin_42864239/article/details/99672657
Source: oschina
Link: https://my.oschina.net/u/4342549/blog/3227705
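
A minimal end-to-end sketch using a local SparkSession and made-up rows (column names follow the post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("dedup-demo").getOrCreate()

# Made-up rows for illustration
df = spark.createDataFrame(
    [(1, "u1", "math"), (1, "u2", "math"), (1, "u1", "math"), (2, "u3", "physics")],
    ["course_id", "user_id", "course_name"],
)

df.distinct().show()                            # drop rows identical in every column
df.dropDuplicates(subset=["course_id"]).show()  # keep one row per course_id
df.dropDuplicates(subset=["course_id", "course_name"]).show()  # one row per (course_id, course_name)

spark.stop()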

Data Analysis: Drug Sales Data for a Local Hospital

不打扰是莪最后的温柔 submitted on 2020-04-06 19:20:29
Data Analysis: Drug Sales Data for a Local Hospital. This article uses Chaoyang Hospital's 2018 sales data as an example; the goal is to understand the hospital's 2018 sales through a few business metrics: average number of purchases per month, average purchase amount per month, amount per customer, and the consumption trend. The data analysis workflow is: pose the question → understand the data → clean the data → build a model → visualize the data.

1. Define the business question
Data analysis means applying appropriate statistical methods to a large amount of collected data in order to extract useful information and draw conclusions, studying and summarizing the data in detail. The corresponding basic process includes acquiring the data, cleaning the data, building a model, visualizing the data, and examining the consumption trend.

2. Data overview

# Trend of Chaoyang Hospital's 2018 consumption amounts
import matplotlib.pyplot as plt
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
fileNameStr = 'F:\\Downloads\\朝阳医院2018年销售数据.xlsx'
xls = pd.ExcelFile(fileNameStr)
salesDf = xls.parse('Sheet1', dtype='object')
salesDf.info()

Printed output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex:
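
As an illustration only, a pandas sketch of the three business metrics named above; the column names ('销售时间' for the sale date, '社保卡号' for the card number identifying a customer, '实收金额' for the amount paid) are assumptions and may differ from the actual sheet:

import pandas as pd

# Column names below are assumptions for illustration; adjust them to the actual sheet.
salesDf = pd.read_excel('F:\\Downloads\\朝阳医院2018年销售数据.xlsx', sheet_name='Sheet1')
salesDf['销售时间'] = pd.to_datetime(salesDf['销售时间'], errors='coerce')
salesDf['实收金额'] = pd.to_numeric(salesDf['实收金额'], errors='coerce')
salesDf = salesDf.dropna(subset=['销售时间', '社保卡号', '实收金额'])

n_months = salesDf['销售时间'].dt.to_period('M').nunique()

# Treat each (date, card number) pair as one purchase
purchases = salesDf.drop_duplicates(subset=['销售时间', '社保卡号'])

monthly_avg_purchase_count = len(purchases) / n_months            # average purchases per month
monthly_avg_amount = salesDf['实收金额'].sum() / n_months           # average spend per month
amount_per_customer = salesDf['实收金额'].sum() / len(purchases)    # spend per purchase

print(monthly_avg_purchase_count, monthly_avg_amount, round(amount_per_customer, 2))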

filter rows when all columns are greater than a value

和自甴很熟 submitted on 2020-04-06 11:36:31
Question: I have a data frame and I would like to subset the rows where all column values meet my cutoff. Here is the data frame:

  A B C
1 1 3 5
2 4 3 5
3 2 1 2

What I would like to select is the rows where all columns are greater than 2; the second row is what I want to get:

[1] 4 3 5

Here is my code:

subset_data <- df[which(df[,c(1:ncol(df))] > 2),]

But my code is not applied to all columns. Do you have any idea how I can fix this?

Answer 1: We can create a logical matrix by comparing the entire data frame with 2
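
For comparison only (the question and answer are about R), the same row-wise "all columns above a cutoff" filter as a pandas sketch:

import pandas as pd

df = pd.DataFrame({'A': [1, 4, 2], 'B': [3, 3, 1], 'C': [5, 5, 2]})

# Compare the whole frame with the cutoff, then keep rows where every column passes
subset_data = df[(df > 2).all(axis=1)]
print(subset_data)  # only the row 4, 3, 5 remains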

R use paste function as an object in subset function

不羁岁月 submitted on 2020-03-25 19:06:21
Question: I'm new to R. I've been reading a lot of forums but I can't find a solution, and I don't think it should be this difficult. I want R to read a data file and create a data frame with all the data. Then I want to create a new data frame with a subset of the original one. For one data file it's easy, and the code I use is as follows (datainfo is a vector with the information about the variables):

var1 <- read.fwf("file_var1", widths = datainfo$lenght, col.names = datainfo$names)
var1_5 <- subset(var1, ZONE
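
For comparison only (the question is about R), a pandas sketch of the same pattern: read several fixed-width files in a loop, keep the data frames in a dict instead of pasting variable names together, and filter by a column chosen by name. The file names, widths, column names, and the ZONE value are illustrative assumptions.

import pandas as pd

widths = [2, 5, 3]                    # illustrative field widths
names = ['ZONE', 'VALUE', 'CODE']     # illustrative column names

frames = {}
for i in range(1, 4):
    # e.g. file_var1, file_var2, file_var3 (hypothetical file names)
    frames[f'var{i}'] = pd.read_fwf(f'file_var{i}', widths=widths, names=names)

# Subset each data frame by a condition on a column chosen by its (string) name
subsets = {key: df[df['ZONE'] == 5] for key, df in frames.items()}
print(subsets['var1'].head())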

Subset a reactive dataframe in R

只谈情不闲聊 submitted on 2020-03-24 00:46:13
Question: Hello, I want to find the correlation coefficient of two columns of my dataset. If I use

cor(subset(iris, select=c("Sepal.Length")), subset(iris, select=c("Sepal.Width")))

the correlation is found, but I cannot do the same with my actual dataset, which comes in as a CSV file input and lives in a reactive expression:

cor(subset(rt(), select=c("Sepal.Length")), subset(rt(), select=c("Sepal.Width")))

So how could I subset a data frame of this reactive form?

rt <- reactive({
  req(input$file1)
  csvdata <- read
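
For comparison only (the question concerns R and Shiny), the underlying operation as a pandas sketch: read a CSV into a data frame and correlate two of its columns. The file path and column names are illustrative.

import pandas as pd

# Hypothetical CSV with the iris-style columns used in the question
df = pd.read_csv('iris.csv')

# Pearson correlation between the two selected columns
r = df['Sepal.Length'].corr(df['Sepal.Width'])
print(r)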