data-processing

How to smooth a curve in the right way?

Submitted by 折月煮酒 on 2019-12-17 00:19:24
Question: Let's assume we have a dataset which might be given approximately by

import numpy as np
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x) + np.random.random(100) * 0.2

Therefore we have a variation of 20% in the dataset. My first idea was to use the UnivariateSpline function of scipy, but the problem is that it does not handle the small noise well. If you consider the frequencies, the background is much smaller than the signal, so a spline with only a cutoff might be an idea, but that …
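Since the excerpt frames the problem in terms of frequencies (the background is much smaller than the signal), one common alternative to a spline is a low-pass filter. A minimal sketch with scipy.signal's Butterworth filter, assuming a cutoff of 0.05 (as a fraction of the Nyquist rate) that would need tuning for real data:

```python
import numpy as np
from scipy.signal import butter, filtfilt

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x) + rng.random(100) * 0.2  # the dataset from the question

# Low-pass Butterworth filter: keep the slow signal, drop the fast noise.
# filtfilt applies the filter forwards and backwards, so there is no phase lag.
b, a = butter(3, 0.05)
y_smooth = filtfilt(b, a, y)
```

The order (3) and cutoff (0.05) are guesses for this synthetic signal; for real data they would be chosen from the signal's spectrum.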

Summing up the total based on the random number of inputs of a column

Submitted by 你说的曾经没有我的故事 on 2019-12-13 22:37:49
Question: I need to sum up the "value" column amount for each value of col1 of File1 and export it to an output file. I'm new to Python and need to do this for thousands of records.

File1:
col1 col2 value
559 1 91987224 2400000000
559 0 91987224 100000000
558 0 91987224 100000000
557 2 87978332 500000000
557 1 59966218 2400000000
557 0 64064811 100000000

Desired output:
col1 Sum
559 2500000000
558 1000000000
557 3000000000

Thanks in advance. P.S.: I can't use the pandas library due to permission …

Is there a faster way to update dataframe column values based on conditions?

Submitted by 怎甘沉沦 on 2019-12-12 19:25:27
Question: I am trying to process a dataframe. This includes creating new columns and updating their values based on the values in other columns. More concretely, I have a predefined "source" that I want to classify. This source can fall under three different categories: 'source_dtp', 'source_dtot', and 'source_cash'. I want to add three new columns to the dataframe, comprised of either 1s or 0s based on the original "source" column. I am currently able to do this; it's just really slow …
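Row-by-row updates in pandas are usually the bottleneck; the same indicator columns can be built with vectorized comparisons over the whole column at once. A minimal sketch with hypothetical toy data, assuming the category is determined by exact equality on the "source" column (the real matching rule isn't shown in the excerpt):

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe.
df = pd.DataFrame({'source': ['dtp', 'cash', 'dtot', 'dtp']})

# Vectorized 0/1 indicator columns instead of a row-by-row loop:
# the boolean comparison is evaluated for the whole column at once.
for cat in ['dtp', 'dtot', 'cash']:
    df[f'source_{cat}'] = (df['source'] == cat).astype(int)
```

If the categories are exactly the distinct values of the column, `pd.get_dummies(df['source'])` achieves the same in one call.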

How can I merge two csv files by a common column, in the case of unequal rows?

Submitted by 删除回忆录丶 on 2019-12-11 09:59:03
Question: I have a set of 100 files: 50 contain census information, one for each US state, and the other 50 are geographic data files that need to be merged with the correct census file for each state. For each state, the census file and its corresponding geo file are related by a common variable, LOGRECNO, which is the 10th column in the census file and the 7th column in the geo file. The problem is that the geo file has more rows than the census file; my census data does not cover certain subsets of geographic …
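One way to handle the unequal row counts is an inner join on LOGRECNO, which keeps only the rows present in both files and silently drops the extra geo rows the census data does not cover. A sketch with made-up miniature frames standing in for the real files (which would be loaded with pd.read_csv, with the 10th and 7th columns renamed to LOGRECNO):

```python
import pandas as pd

# Hypothetical miniature stand-ins for one state's geo and census files.
geo = pd.DataFrame({'g1': ['a', 'b', 'c'], 'LOGRECNO': [1, 2, 3]})
census = pd.DataFrame({'c1': ['x', 'y'], 'LOGRECNO': [1, 3]})

# Inner join: only LOGRECNO values present in both files survive,
# so the geo rows with no census counterpart are dropped.
merged = geo.merge(census, on='LOGRECNO', how='inner')
```

If the uncovered geo rows should be kept (with missing census fields as NaN), `how='left'` does that instead; the same merge can then be run in a loop over the 50 state file pairs.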

Remove quotes holding 2 words and remove comma between them

Submitted by 北战南征 on 2019-12-11 08:12:54
Question: Following up on "Python to replace a symbol between 2 words in a quote". Extended input and expected output: I am trying to replace the comma between the two words Durango and PC on the second line with &, and then remove the quotes " as well; same for the third line with Orbis and PC. The 4th line has two word combos in quotes that I would like to process: "AAA - Character Tech, SOF - UPIs","Durango, Orbis, PC". I would like to retain the rest of the lines, using Python. INPUT: 2,SIN-Rendering,Core Tech - Rendering …
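Rather than fighting the quotes with regexes, the csv module can parse the quoted fields first; each field's internal commas can then be rejoined with " & ", after which the quotes are no longer needed. A sketch over hypothetical lines in the shape the excerpt describes (the full input isn't shown):

```python
import csv
import io

# Hypothetical lines shaped like the excerpt's input; the real file isn't fully shown.
raw = '''2,SIN-Rendering,Core Tech - Rendering
3,Some-Title,"Durango, PC"
4,Other,"Orbis, PC"
5,Multi,"AAA - Character Tech, SOF - UPIs","Durango, Orbis, PC"
'''

out_lines = []
for row in csv.reader(io.StringIO(raw)):
    # csv.reader already strips the surrounding quotes; rejoin each field's
    # comma-separated parts with ' & ' so the field needs no quoting anymore.
    fixed = [' & '.join(p.strip() for p in field.split(',')) for field in row]
    out_lines.append(','.join(fixed))
```

Fields without internal commas pass through unchanged, so the rest of the lines are retained as required.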

Apache NiFi: Add column to csv using mapped values

Submitted by 别来无恙 on 2019-12-11 07:13:26
Question: A csv is brought into the NiFi workflow using a GetFile processor. I have a column named "id". Each id stands for a certain string, and there are three ids. For example, if my csv consists of

name,age,id
John,10,Y
Jake,55,N
Finn,23,C

I know that Y means York, N means Old, and C means Cat. I want a new column with the header "nick" holding the corresponding nick for each id:

name,age,id,nick
John,10,Y,York
Jake,55,N,Old
Finn,23,C,Cat

Finally I want a csv with the extra column …

Most efficient way to use a large data set for PyTorch?

Submitted by 泄露秘密 on 2019-12-11 02:45:28
Question: Perhaps this question has been asked before, but I'm having trouble finding relevant info for my situation. I'm using PyTorch to create a CNN for regression on image data. I don't have a formal, academic programming background, so many of my approaches are ad hoc and just terribly inefficient. Many times I can go back through my code and clean things up later, because the inefficiency is not so drastic that performance is significantly affected. However, in this case, my method for using the …
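The usual PyTorch answer to large image sets is a map-style dataset that loads each sample lazily in __getitem__ instead of holding everything in memory. The sketch below uses a plain class with the same __len__/__getitem__ protocol; with PyTorch installed it would subclass torch.utils.data.Dataset and be wrapped in a DataLoader for batching. The loader body is a placeholder, not a real image decoder:

```python
import numpy as np

class LazyImageDataset:
    """Loads one sample at a time on demand.  With PyTorch this would
    subclass torch.utils.data.Dataset; the protocol is identical:
    __len__ plus __getitem__."""

    def __init__(self, paths, targets):
        self.paths = paths      # file paths only; images are never pre-loaded
        self.targets = targets  # regression targets, one per image

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Placeholder: a real version would open self.paths[idx]
        # (e.g. with PIL) and apply transforms here.
        image = np.zeros((3, 32, 32), dtype=np.float32)
        return image, self.targets[idx]

ds = LazyImageDataset(['a.png', 'b.png'], [0.5, 1.5])
```

Because only indices and paths live in memory, a DataLoader with `num_workers > 0` can then read and decode images in parallel while the GPU trains on the previous batch.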

Excel: Send multiple values in “Command text”

Submitted by Deadly on 2019-12-11 02:37:09
Question: Under "Data > Connections > Properties > Definition (tab) > Command text", I have the following: {Call SP_calculo_algo(?)}. Currently the function receives only one value through its single parameter, which, as someone told me, is represented by the question-mark character (?). What I need is to send two values to the function, since my SQL query returns data for a range between two dates. For example: Start Date …
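In ODBC-style command text, each parameter is one question-mark placeholder, so passing a second value is a matter of adding a second ?. A sketch, assuming the stored procedure actually accepts two arguments (e.g. a start and an end date):

```sql
{Call SP_calculo_algo(?, ?)}
```

Depending on the connection type, the Parameters… button on the same Definition tab then lets each placeholder be bound to a prompt, a fixed value, or a worksheet cell, in the order the ? marks appear.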

How to smooth a curve with large noise which is only in certain part?

Submitted by 会有一股神秘感。 on 2019-12-10 15:21:56
Question: I'd like to smooth the scatter plot shown below (the points are very dense), and the data is here. There is large noise in the middle of the curve, and I'd like to smooth the curve; the y value should also monotonically increase. Since there are lots of curves like this, it is hard to know where in each curve the noise is. I tried scipy.signal.savgol_filter, but it didn't work. The code I used is:

from scipy.signal import savgol_filter
from scipy import interpolate
import numpy as np
…
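A Savitzky-Golay filter alone cannot guarantee the monotonicity constraint; one simple way to enforce it afterwards is a running maximum. The sketch below uses synthetic data standing in for the linked file, with noise injected only in the middle as described (isotonic regression, e.g. sklearn's IsotonicRegression, would be a more principled choice than the running maximum):

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic stand-in: an increasing curve with noise only near the middle.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
y = x + rng.normal(0, 2.0, 200) * (np.abs(x - 5) < 1)

# Smooth first; window and order would need tuning per curve.
y_smooth = savgol_filter(y, window_length=31, polyorder=2)

# Enforce "y must monotonically increase" with a running maximum.
y_mono = np.maximum.accumulate(y_smooth)
```

Because the running maximum is applied after smoothing, a localized noise burst cannot pull the curve back down, regardless of where in the curve it occurs.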

How to gracefully fallback to `NaN` value while reading integers from a CSV with Pandas?

Submitted by 白昼怎懂夜的黑 on 2019-12-10 04:22:33
Question: When using read_csv with pandas, if I want a given column converted to a type, a malformed value interrupts the whole operation without any indication of the offending value. For example, running something like:

import pandas as pd
import numpy as np
df = pd.read_csv('my.csv', dtype={'my_column': np.int64})

will lead to a stack trace ending with the error:

ValueError: cannot safely convert passed user dtype of <i8 for object dtyped data in column ...

If I had the row number, …
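A common workaround is to let read_csv load the column as plain objects and then coerce it with pd.to_numeric, which turns malformed values into NaN instead of aborting the read. A minimal sketch with inline data standing in for my.csv:

```python
import io
import pandas as pd

# Inline stand-in for my.csv, with one malformed value ('oops').
csv_text = "my_column,other\n1,a\n2,b\noops,c\n"

# No dtype argument: the column arrives as objects, then errors='coerce'
# converts what it can and falls back to NaN for the rest.
df = pd.read_csv(io.StringIO(csv_text))
df['my_column'] = pd.to_numeric(df['my_column'], errors='coerce')
```

The offending rows are then easy to locate with `df.index[df['my_column'].isna()]`; note the column ends up as float64 (or can be cast to the nullable 'Int64' dtype) because NaN cannot live in a plain int64 column.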