data-processing

Looking for a more efficient way to reorganize a massive CSV in Python

*爱你&永不变心* submitted on 2019-12-31 01:52:34
Question: I've been working on a problem where I have data from a large output .txt file, and now have to parse and reorganize certain values into the form of a .csv. I've already written a script that puts all the data into a .csv in columns based on what kind of data it is (Flight ID, Latitude, Longitude, etc.), but it's not in the correct order. All values are meant to be grouped by Flight ID, ordered from the earliest time stamp to the latest. Fortunately, my .csv has all values in
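A minimal pandas sketch of the reordering step, assuming the intermediate CSV already has columns named FlightID and Timestamp (both names and the file names are hypothetical stand-ins for the question's actual headers):

```python
import pandas as pd

# Hypothetical column and file names; sort by flight, then by time within each flight.
df = pd.read_csv("unsorted_output.csv")
df["Timestamp"] = pd.to_datetime(df["Timestamp"])
df = df.sort_values(["FlightID", "Timestamp"])
df.to_csv("sorted_output.csv", index=False)
```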

How to fill 0 and NaN in a dataframe after groupby in Python?

我与影子孤独终老i submitted on 2019-12-25 08:56:23
Question: I have a dataframe that looks like this after I group it: gb = new_user_log.groupby(['msno', 'date', 'num_25', 'num_50', 'num_75', 'num_985', 'num_100', 'num_unq', 'total_secs', 'days']).days.count() Part of the dataframe is shown here: msno date num_25 num_50 num_75 num_985 num_100 num_unq total_secs days ++orpnUqSevh2M5A97pRRiONA58g5m9DwaNrhD44HY0= 2016-08-14 78 13 3 3 77 84 18987.862 2 1 +0krxpTkQT7hciN95OEp7i2lyKvbXft887VNQGF6xN4= 2016-12-22 27 23 5 2 11 65 5946.577 35 1 98 1 +1eAmfPiXsMG0J+U
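The snippet above is truncated, so the exact goal is unclear; below is a hedged sketch assuming the aim is one activity count per (msno, date) pair, with 0 filled in wherever a user has no rows on a given date. The small sample frame is invented to mirror the snippet:

```python
import pandas as pd

# Invented sample data mimicking the columns in the question.
new_user_log = pd.DataFrame({
    "msno": ["user_a", "user_a", "user_b"],
    "date": ["2016-08-14", "2016-12-22", "2016-08-14"],
    "days": [2, 35, 7],
})

gb = new_user_log.groupby(["msno", "date"]).days.count()

# Spread dates into columns, filling missing combinations with 0 instead of NaN.
filled_wide = gb.unstack("date", fill_value=0)

# Or stay in long format and reindex over every (msno, date) combination.
full_index = pd.MultiIndex.from_product(
    [new_user_log["msno"].unique(), new_user_log["date"].unique()],
    names=["msno", "date"],
)
filled_long = gb.reindex(full_index, fill_value=0)
print(filled_wide)
```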

Regular expression for extracting a numeric dimension

落爺英雄遲暮 submitted on 2019-12-24 09:29:46
Question: I'm using Python regular expressions to extract dimensional information from a database. The entries in that column look like this: 23 cm, 43 1/2 cm, 20cm, 15 cm x 30 cm. What I need from this is only the width of the entry (so for the entries with an 'x', only the first number), but as you can see the values are all over the place. From what I understood in the documentation, you can access the groups in a match by their position, so I was thinking I could determine the type of the entry
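A small sketch of one way to do this with a single anchored group, using the example entries quoted above: the pattern captures only the leading width (a whole number plus an optional fraction) and ignores anything after an 'x'.

```python
import re

# Capture the width before "cm"; the optional group handles fractions like "43 1/2".
pattern = re.compile(r"^\s*(\d+(?:\s+\d+/\d+)?)\s*cm", re.IGNORECASE)

for entry in ["23 cm", "43 1/2 cm", "20cm", "15 cm x 30 cm"]:
    m = pattern.match(entry)
    if m:
        print(entry, "->", m.group(1))
```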

How to log PHP file run with jQuery AJAX in browser console?

北慕城南 submitted on 2019-12-23 17:15:02
Question: I have a PHP file that does data processing. It's run by $.ajax(), sometimes on big files that take a long time to process. I need to log some info about the ongoing process in the browser console as it happens, not just when the PHP file has finished running. From the research I did, I gather there are two difficulties: (1) getting PHP to emit output before it's done, and (2) getting jQuery/JS to display it as it arrives. To address #1, I've tried: echo "started"."<br>"; foreach

What is a “file_like_object” and what is a “file”? pickle.load() and pickle.loads()

我是研究僧i submitted on 2019-12-23 07:01:22
Question: I am figuring out the differences between pickle.load() and pickle.loads(). Somebody said that the kind of object pickle.load() processes is a "file_like_object", while pickle.loads() corresponds to a "file object". Answer 1: Your choice of which function to use depends on the object from which you are loading the pickled data: pickle.loads is used to load pickled data from a bytes string. The "s" in loads refers to the fact that in Python 2, the data was loaded from a string. For example:
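A short illustration of the distinction drawn in the answer: pickle.load() reads from a file-like object, while pickle.loads() reads from a bytes string (the file name below is arbitrary).

```python
import pickle

data = {"flight": 42, "lat": 51.5}

# pickle.dump / pickle.load work with a file-like object opened in binary mode.
with open("data.pkl", "wb") as f:
    pickle.dump(data, f)
with open("data.pkl", "rb") as f:
    restored_from_file = pickle.load(f)

# pickle.dumps / pickle.loads work with a bytes string held in memory.
blob = pickle.dumps(data)
restored_from_bytes = pickle.loads(blob)

assert restored_from_file == restored_from_bytes == data
```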

Python Multiprocessing Loop

非 Y 不嫁゛ submitted on 2019-12-22 18:30:29
Question: I'm hoping to use multiprocessing to speed up a sluggish loop. However, from what I've seen of multiprocessing examples, I'm not sure whether this sort of implementation is good practice, feasible, or even possible. There are broadly two parts to the loop: data ingestion and data processing. I would like the next round of data ingestion to start while processing is going on, so the data is available as soon as possible. Pseudo code: d = get_data(n) for n in range(N): p = process_data(d) d = get
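A hedged sketch of one way to overlap the two phases: a single worker process prefetches the next chunk while the main process handles the current one. get_data() and process_data() here are placeholder stand-ins for the question's own functions.

```python
from multiprocessing import Pool

def get_data(n):
    # Placeholder ingestion: pretend to read chunk n from disk or a feed.
    return list(range(n * 10, n * 10 + 10))

def process_data(d):
    # Placeholder processing step.
    return sum(d)

if __name__ == "__main__":
    N = 5
    with Pool(processes=1) as pool:
        # Kick off ingestion of chunk 0 before the loop starts.
        pending = pool.apply_async(get_data, (0,))
        for n in range(N):
            d = pending.get()  # wait for chunk n to arrive
            if n + 1 < N:
                # Start ingesting chunk n+1 while chunk n is processed below.
                pending = pool.apply_async(get_data, (n + 1,))
            print(n, process_data(d))  # processing overlaps with ingestion
```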

How should I handle duplicate times in time series data with pandas?

﹥>﹥吖頭↗ submitted on 2019-12-22 09:31:46
Question: I have the following returned from an API call as part of a larger dataset: {'Time': datetime.datetime(2017, 5, 21, 18, 18, 1, tzinfo=tzutc()), 'Price': '0.052600'} {'Time': datetime.datetime(2017, 5, 21, 18, 18, 1, tzinfo=tzutc()), 'Price': '0.052500'} Ideally I would use the timestamp as an index on the pandas data frame; however, this appears to fail because of the duplicate when converting to JSON: df = df.set_index(pd.to_datetime(df['Timestamp'])) print(new_df.to_json(orient='index'))
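A hedged sketch of two common ways to cope with the duplicate timestamps, using the two rows quoted above: aggregate the duplicates, or keep only the last observation per timestamp.

```python
import datetime
import pandas as pd
from dateutil.tz import tzutc

rows = [
    {"Time": datetime.datetime(2017, 5, 21, 18, 18, 1, tzinfo=tzutc()), "Price": "0.052600"},
    {"Time": datetime.datetime(2017, 5, 21, 18, 18, 1, tzinfo=tzutc()), "Price": "0.052500"},
]
df = pd.DataFrame(rows)
df["Price"] = df["Price"].astype(float)
df = df.set_index(pd.to_datetime(df["Time"]))

# Option 1: collapse duplicate timestamps by averaging their prices.
deduped = df.groupby(level=0).agg({"Price": "mean"})

# Option 2: keep only the last observation for each timestamp.
last_only = df[~df.index.duplicated(keep="last")]

# Either frame now has a unique index and converts to JSON cleanly.
print(deduped.to_json(orient="index"))
print(last_only.to_json(orient="index"))
```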

What is the difference between mini-batch vs real time streaming in practice (not theory)?

不想你离开。 submitted on 2019-12-20 10:46:44
Question: What is the difference between mini-batch and real-time streaming in practice (not theory)? In theory, I understand that mini-batch collects data within a given time frame, whereas real-time streaming acts on data as it arrives. But my biggest question is: why not use a mini-batch with an epsilon time frame (say one millisecond)? In other words, I would like to understand why one would be a more effective solution than the other. I recently came across one example where mini-batch (Apache

Pandas Dataframe selecting groups with minimal cardinality

谁都会走 submitted on 2019-12-20 03:28:12
Question: I have a problem where I need to take groups of rows from a data frame in which the number of items in a group exceeds a certain number (cutoff). For those groups, I need to take some head rows and the tail row. I am using the code below: train = train[train.groupby('id').id.transform(len) > headRows] groups = pd.concat([train.groupby('id').head(headRows), train.groupby('id').tail(1)]).sort_index() This works, but the first line is very slow :( (30 minutes or more). Is there any way to make the
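A hedged sketch of a usually faster version of the first line: transform('size') uses a built-in aggregation instead of calling len on each group. The toy frame and headRows below are invented stand-ins for the question's data.

```python
import pandas as pd

# Invented sample data; in the question, train is large and headRows is the cutoff.
train = pd.DataFrame({"id": [1, 1, 1, 2, 2, 3], "value": range(6)})
headRows = 2

# Keep only rows whose group has more than headRows members (fast built-in 'size').
group_sizes = train.groupby("id")["id"].transform("size")
train = train[group_sizes > headRows]

# Then take the head rows and the tail row of each remaining group, as in the question.
grouped = train.groupby("id")
groups = pd.concat([grouped.head(headRows), grouped.tail(1)]).sort_index()
print(groups)
```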

Ways to read only select columns from a file into R? (A happy medium between `read.table` and `scan`?) [duplicate]

百般思念 submitted on 2019-12-17 15:16:36
Question: This question already has answers here: Only read selected columns (3 answers). Closed 5 years ago. I have some very big delimited data files and I want to process only certain columns in R without taking the time and memory to create a data.frame for the whole file. The only options I know of are read.table, which is very wasteful when I only want a couple of columns, or scan, which seems too low-level for what I want. Is there a better option, either with pure R or perhaps calling out to some