pandas

Text similarity using Word2Vec

你说的曾经没有我的故事 submitted on 2021-02-19 05:36:05
Question: I would like to use Word2Vec to check the similarity of texts. I am currently using different logic: from fuzzywuzzy import fuzz def sim(name, dataset): matches = dataset.apply(lambda row: ((fuzz.ratio(row['Text'], name) ) = 0.5), axis=1) return (name is my column). To apply this function I do the following: df['Sim']=df.apply(lambda row: sim(row['Text'], df), axis=1) Could you please tell me how to replace fuzz.ratio with Word2Vec in order to compare texts in a dataset? Example of dataset:
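One common way to compare texts with word embeddings is to average the word vectors of each text and take the cosine similarity of the averages. The sketch below illustrates only that idea: `word_vectors` is a hypothetical hand-made lookup standing in for the vectors of a trained Word2Vec model, and the three sample texts are invented, since the example dataset is cut off in the excerpt.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a trained Word2Vec model's word vectors.
# In practice this lookup would come from a trained embedding model.
word_vectors = {
    "red": np.array([1.0, 0.0, 0.0]),
    "apple": np.array([0.9, 0.1, 0.0]),
    "green": np.array([0.0, 1.0, 0.0]),
    "car": np.array([0.0, 0.0, 1.0]),
}


def text_vector(text, vectors, dim=3):
    """Average the vectors of the words in `text` that have an embedding."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


# Invented sample data; the column name 'Text' follows the question.
df = pd.DataFrame({"Text": ["red apple", "green apple", "red car"]})
doc_vecs = [text_vector(t, word_vectors) for t in df["Text"]]

# Pairwise similarity of every text against every other text.
sim = pd.DataFrame(
    [[cosine(u, v) for v in doc_vecs] for u in doc_vecs],
    index=df["Text"],
    columns=df["Text"],
)
print(sim.round(3))
```

Averaging ignores word order, so this is a coarse measure; whether it is good enough depends on the texts being compared.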

Grouping data by id, var1 into consecutive dates in python using pandas

心不动则不痛 submitted on 2021-02-19 05:32:45
Question: I have some data that looks like: df_raw_dates = pd.DataFrame({"id": [102, 102, 102, 103, 103, 103, 104], "var1": ['a', 'b', 'a', 'b', 'b', 'a', 'c'], "val": [9, 2, 4, 7, 6, 3, 2], "dates": [pd.Timestamp(2020, 1, 1), pd.Timestamp(2020, 1, 1), pd.Timestamp(2020, 1, 2), pd.Timestamp(2020, 1, 2), pd.Timestamp(2020, 1, 3), pd.Timestamp(2020, 1, 5), pd.Timestamp(2020, 3, 12)]}) I want to group this data by id and var1 where the dates are consecutive; if a day is missed, I want to start a new record
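A common pattern for this kind of "runs of consecutive days" grouping is to flag every row whose gap to the previous date in its group is not exactly one day, then take a cumulative sum of the flags to label the runs. The sketch below uses the question's data; the output columns (`start`, `end`, `total_val`) and the choice to sum `val` are assumptions, because the desired output is cut off in the excerpt.

```python
import pandas as pd

df_raw_dates = pd.DataFrame({
    "id": [102, 102, 102, 103, 103, 103, 104],
    "var1": ["a", "b", "a", "b", "b", "a", "c"],
    "val": [9, 2, 4, 7, 6, 3, 2],
    "dates": [pd.Timestamp(2020, 1, 1), pd.Timestamp(2020, 1, 1),
              pd.Timestamp(2020, 1, 2), pd.Timestamp(2020, 1, 2),
              pd.Timestamp(2020, 1, 3), pd.Timestamp(2020, 1, 5),
              pd.Timestamp(2020, 3, 12)],
})

df = df_raw_dates.sort_values(["id", "var1", "dates"]).copy()

# Gap to the previous date within the same (id, var1) group; NaT for the first row.
gap = df.groupby(["id", "var1"])["dates"].diff()

# A new run starts wherever the gap is not exactly one day (NaT also counts).
df["run"] = (gap != pd.Timedelta(days=1)).cumsum()

# Assumed summary: first date, last date, and summed val per run.
runs = (
    df.groupby(["id", "var1", "run"], as_index=False)
      .agg(start=("dates", "min"), end=("dates", "max"), total_val=("val", "sum"))
      .drop(columns="run")
)
print(runs)
```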

Grouping data by id, var1 into consecutive dates in python using pandas

与世无争的帅哥 submitted on 2021-02-19 05:32:43
Question: I have some data that looks like: df_raw_dates = pd.DataFrame({"id": [102, 102, 102, 103, 103, 103, 104], "var1": ['a', 'b', 'a', 'b', 'b', 'a', 'c'], "val": [9, 2, 4, 7, 6, 3, 2], "dates": [pd.Timestamp(2020, 1, 1), pd.Timestamp(2020, 1, 1), pd.Timestamp(2020, 1, 2), pd.Timestamp(2020, 1, 2), pd.Timestamp(2020, 1, 3), pd.Timestamp(2020, 1, 5), pd.Timestamp(2020, 3, 12)]}) I want to group this data by id and var1 where the dates are consecutive; if a day is missed, I want to start a new record

Filling missing middle values in pandas dataframe

我与影子孤独终老i submitted on 2021-02-19 05:25:06
Question: I have a pandas dataframe df as

Date  cost  NC
20    5     NaN
21    7     NaN
23    9     78.0
25    6     80.0

Now what I need to do is fill in the missing dates and then fill the columns with a value, say x, only if there is a number in the previous row. That is, I want output like

Date  cost  NC
20    5     NaN
21    7     NaN
22    x     NaN
23    9     78.0
24    x     x
25    6     80.0

Date 22 was missing, and on 21 NC was missing, so on 22 cost is assigned x but NC is assigned NaN. Now setting the Date column to index and reindexing it to missing
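The rule described above (fill a newly inserted date with x only when the previous row holds a number) can be sketched by reindexing to the full date range and using a forward fill as the "previous row has a value" test. The numeric placeholder `x = 0` is an assumption; the question leaves x unspecified.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": [20, 21, 23, 25],
    "cost": [5, 7, 9, 6],
    "NC": [np.nan, np.nan, 78.0, 80.0],
})

x = 0  # hypothetical fill value; the question only calls it "x"

# Insert the missing dates; new rows start out as NaN in every column.
full = df.set_index("Date").reindex(range(df["Date"].min(), df["Date"].max() + 1))

# Rows that were not present in the original frame.
is_new = ~full.index.isin(df["Date"])

filled = full.copy()
for col in full.columns:
    # After a forward fill, a value is present only if an earlier row had one.
    prev_known = full[col].ffill().notna()
    filled.loc[prev_known & is_new, col] = x

print(filled)
```

With this data, Date 22 gets `cost = x` but keeps `NC = NaN`, while Date 24 gets x in both columns, matching the desired output shown in the question.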

Python: Numpy and Pandas Transforming timestamp/data into one-hot-encoding

China☆狼群 submitted on 2021-02-19 05:20:52
Question: I have a column of a dataframe that looks like this:

    time
0   2017-03-01 15:30:00
1   2017-03-01 16:00:00
2   2017-03-01 16:30:00
3   2017-03-01 17:00:00
4   2017-03-01 17:30:00
5   2017-03-01 18:00:00
6   2017-03-01 18:30:00
7   2017-03-01 19:00:00
8   2017-03-01 19:30:00
9   2017-03-01 20:00:00
10  2017-03-01 20:30:00
11  2017-03-01 21:00:00
12  2017-03-01 21:30:00
13  2017-03-01 22:00:00
. . .

I want to "encode" the time of day. I want to do this by first assigning each half-hour an integer. Starting
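A minimal sketch of the two steps described above: map each timestamp to a half-hour slot number, then one-hot encode the slots. The numbering (slot 0 is 00:00 to 00:29, slot 47 is 23:30 to 23:59) is an assumption, because the excerpt is cut off before it states where the numbering starts. The column names `slot` and `slot_N` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime([
        "2017-03-01 15:30:00",
        "2017-03-01 16:00:00",
        "2017-03-01 16:30:00",
    ])
})

# Assumed numbering: two slots per hour, counted from midnight.
df["slot"] = df["time"].dt.hour * 2 + df["time"].dt.minute // 30

# Declaring all 48 categories guarantees one column per slot,
# including slots that never occur in the data.
slot_dtype = pd.CategoricalDtype(categories=list(range(48)))
one_hot = pd.get_dummies(df["slot"].astype(slot_dtype), prefix="slot", dtype=int)

print(df["slot"].tolist())
print(one_hot.shape)
```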

Pandas Dataframe nan values not replacing

痴心易碎 submitted on 2021-02-19 05:20:22
Question: I am trying to replace values in my data frame which are listed as 'nan' (note, not 'NaN'). I've read in an Excel file, then tried to replace the nan values like this: All_items_df = ALL_df[df_items].fillna(' ') Finally I get an output that still contains 'nan':

All_items_df ['Colour'].head(10)
Out[]:
7     nan
8     nan
9     nan
10    nan
13    nan
14    nan
15    nan
16    nan
18    nan
19    nan
Name: Colour, dtype: object

Checking the nan values using isna() or isnull().value.all() gives me False for the above values. Why is
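One plausible explanation for the symptoms above (an object column that prints nan while isna() returns False) is that the cells hold the literal string 'nan' rather than a missing value. `fillna` only touches real missing values, so a string has to be replaced explicitly. The sketch below builds such a column by hand; the sample values are invented, and whether this is the actual cause in the asker's data is an assumption.

```python
import pandas as pd

# Invented data: the text 'nan' stored as an ordinary string.
colour = pd.Series(["nan", "red", "nan"], name="Colour")

# Neither call sees a missing value, because 'nan' is just text here.
print(colour.isna().any())
print(colour.fillna(" ").tolist())

# Replacing the literal string does change the cells.
cleaned = colour.replace("nan", " ")
print(cleaned.tolist())
```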

How to handle ValueError: Index contains duplicate entries using df.pivot or pd.pivot_table?

本秂侑毒 submitted on 2021-02-19 04:09:17
Question: I've got a table showing the accumulated number of hours (dataframe values) different specialists (ID) have taken to complete a sequence of four tasks ['Task1', 'Task2', 'Task3', 'Task4'] like this:

Input:

   ID  Task1  Task2  Task3  Task4
0  10      1      3      4      6
1  11      1      3      4      5
2  12      1      3      4      6

Now I'd like to reshape that dataframe so that I can easily find out which task each specialist was working on after 1 hour, 2 hours, and so on. So the desired output looks like this:

Desired output:

value  1  3  4  5  6
ID
10
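`df.pivot` raises "Index contains duplicate entries" when the same (index, column) pair occurs more than once, because it has no rule for choosing among the candidates. A hedged sketch of one way around that: melt to long form, then use `pivot_table` with an explicit `aggfunc`. Keeping the first task per (ID, value) pair is an assumption about what the asker wants, and the names `task` and `value` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [10, 11, 12],
    "Task1": [1, 1, 1],
    "Task2": [3, 3, 3],
    "Task3": [4, 4, 4],
    "Task4": [6, 5, 6],
})

# Long form: one row per (ID, task) with its accumulated hours.
long_df = df.melt(id_vars="ID", var_name="task", value_name="value")

# pivot_table tolerates repeated (ID, value) pairs because aggfunc
# decides which task name to keep; 'first' is an assumed choice.
wide = long_df.pivot_table(index="ID", columns="value", values="task", aggfunc="first")

print(wide)
```

Hour values that never occur for a given ID, such as 5 for ID 10, come out as missing values in the result.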

How to handle ValueError: Index contains duplicate entries using df.pivot or pd.pivot_table?

我与影子孤独终老i submitted on 2021-02-19 04:08:31
Question: I've got a table showing the accumulated number of hours (dataframe values) different specialists (ID) have taken to complete a sequence of four tasks ['Task1', 'Task2', 'Task3', 'Task4'] like this:

Input:

   ID  Task1  Task2  Task3  Task4
0  10      1      3      4      6
1  11      1      3      4      5
2  12      1      3      4      6

Now I'd like to reshape that dataframe so that I can easily find out which task each specialist was working on after 1 hour, 2 hours, and so on. So the desired output looks like this:

Desired output:

value  1  3  4  5  6
ID
10

Faking whether an object is an Instance of a Class in Python

浪尽此生 submitted on 2021-02-19 03:57:07
Question: Suppose I have a class FakePerson which imitates all the attributes and functionality of a base class RealPerson without extending it. In Python 3, is it possible to fake isinstance() in order to recognise FakePerson as a RealPerson object by only modifying the FakePerson class? For example:

class RealPerson():
    def __init__(self, age):
        self.age = age

    def are_you_real(self):
        return 'Yes, I can confirm I am a real person'

    def do_something(self):
        return 'I did something'

    # Complicated
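One technique that fits the constraint of modifying only FakePerson is to override `__class__` with a property. In CPython, `isinstance()` first checks `type(obj)` and, if that fails, falls back to `obj.__class__`, which is also how `unittest.mock` objects created with a `spec` pass isinstance checks. The sketch below is a trimmed-down version of the question's classes, not the asker's full code.

```python
class RealPerson:
    def __init__(self, age):
        self.age = age

    def are_you_real(self):
        return 'Yes, I can confirm I am a real person'


class FakePerson:
    """Imitates RealPerson without inheriting from it."""

    def __init__(self, age):
        self.age = age

    @property
    def __class__(self):
        # isinstance() consults this attribute when type(self) does not match.
        return RealPerson

    def are_you_real(self):
        return 'Yes, I can confirm I am a real person'


fake = FakePerson(30)
print(isinstance(fake, RealPerson))  # the __class__ fallback makes this True
print(type(fake) is FakePerson)      # type() still reports the real class
```

Code that checks `type(obj) is RealPerson` rather than `isinstance` is not fooled by this.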

Decode one-hot dataframe in Pandas

家住魔仙堡 submitted on 2021-02-19 03:56:46
Question: I have 2 dataframes with the data as below:

df1:

id  name  age  likes
0   A     21   rose
1   B     22   apple
2   C     30   grapes
4   D     21   lily

df2:

category  Fruit  Flower
orange    1      0
apple     1      0
rose      0      1
lily      0      1
grapes    1      0

What I am trying to do is add another column to df1 which would contain the word 'Fruit' or 'Flower' depending on the one-hot encoding in df2 for that entry. I am looking for a purely pandas/numpy implementation. Any help would be appreciated.
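A minimal pandas-only sketch of the decoding step: for each row of df2, `idxmax(axis=1)` returns the name of the column holding the 1, which gives a category-to-label lookup that can be mapped onto `df1['likes']`. The new column name `kind` is an assumption, and the approach presumes exactly one 1 per row in df2.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [0, 1, 2, 4],
    "name": ["A", "B", "C", "D"],
    "age": [21, 22, 30, 21],
    "likes": ["rose", "apple", "grapes", "lily"],
})

df2 = pd.DataFrame({
    "category": ["orange", "apple", "rose", "lily", "grapes"],
    "Fruit": [1, 1, 0, 0, 1],
    "Flower": [0, 0, 1, 1, 0],
})

# For each category, the name of the column that holds the 1.
label = df2.set_index("category")[["Fruit", "Flower"]].idxmax(axis=1)

# Look up each liked item; items absent from df2 would become NaN.
df1["kind"] = df1["likes"].map(label)

print(df1)
```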