Question
I have a dataframe of trades in which timestamps are duplicated and each buy or sell order is split over several rows. In my example, the total order amount is the sum over all rows with the same timestamp for that particular stock. I have created a simplified dataframe to show what the data looks like. I would like to end up with a dataframe containing the results of the trades and a trade ID for each trade. All trades are long positions, i.e. buy and try to sell at a higher price. The ID column for the desired output df2 is answered in this thread: Create ID column in a pandas dataframe
import pandas as pd
from datetime import datetime
import numpy as np
string_date =['2018-01-01 01:00:00',
'2018-01-01 01:00:00',
'2018-01-01 01:00:00',
'2018-01-01 01:00:00',
'2018-01-01 02:00:00',
'2018-01-01 03:00:00',
'2018-01-01 03:00:00',
'2018-01-01 03:00:00',
'2018-01-01 04:00:00',
'2018-01-01 04:00:00',
'2018-01-01 04:00:00',
'2018-01-01 07:00:00',
'2018-01-01 07:00:00',
'2018-01-01 07:00:00',
'2018-01-01 08:00:00',
'2018-01-01 08:00:00',
'2018-01-01 08:00:00',
'2018-02-01 12:00:00',
]
data ={'stock': ['A','A','A','A','B','A','A','A','C','C','C','B','B','B','C','C','C','B'],
'deal': ['buy', 'buy', 'buy','buy','buy','sell','sell','sell','buy','buy','buy','sell','sell','sell','sell','sell','sell','buy'],
'amount':[1,2,3,4,10,8,1,1,3,2,5,2,2,6,3,3,4,5],
'price':[10,10,10,10,2,20,20,20,3,3,3,1,1,1,2,2,2,11]}
df = pd.DataFrame(data, index =string_date)
df
Out[245]:
stock deal amount price
2018-01-01 01:00:00 A buy 1 10
2018-01-01 01:00:00 A buy 2 10
2018-01-01 01:00:00 A buy 3 10
2018-01-01 01:00:00 A buy 4 10
2018-01-01 02:00:00 B buy 10 2
2018-01-01 03:00:00 A sell 8 20
2018-01-01 03:00:00 A sell 1 20
2018-01-01 03:00:00 A sell 1 20
2018-01-01 04:00:00 C buy 3 3
2018-01-01 04:00:00 C buy 2 3
2018-01-01 04:00:00 C buy 5 3
2018-01-01 07:00:00 B sell 2 1
2018-01-01 07:00:00 B sell 2 1
2018-01-01 07:00:00 B sell 6 1
2018-01-01 08:00:00 C sell 3 2
2018-01-01 08:00:00 C sell 3 2
2018-01-01 08:00:00 C sell 4 2
2018-02-01 12:00:00 B buy 5 11
One desired output:
string_date2 =['2018-01-01 01:00:00',
'2018-01-01 02:00:00',
'2018-01-01 03:00:00',
'2018-01-01 04:00:00',
'2018-01-01 07:00:00',
'2018-01-01 08:00:00',
'2018-02-01 12:00:00',
]
data2 ={'stock': ['A','B', 'A', 'C', 'B','C','B'],
'deal': ['buy', 'buy','sell','buy','sell','sell','buy'],
'amount':[10,10,10,10,10,10,5],
'price':[10,2,20,3,1,2,11],
'ID': ['1', '2','1','3','2','3','4']
}
df2 = pd.DataFrame(data2, index =string_date2)
df2
Out[226]:
stock deal amount price ID
2018-01-01 01:00:00 A buy 10 10 1
2018-01-01 02:00:00 B buy 10 2 2
2018-01-01 03:00:00 A sell 10 20 1
2018-01-01 04:00:00 C buy 10 3 3
2018-01-01 07:00:00 B sell 10 1 2
2018-01-01 08:00:00 C sell 10 2 3
2018-02-01 12:00:00 B buy 5 11 4
Any ideas?
Answer 1:
This solution assumes a 'Long Only' portfolio where short sales are not allowed. Once a position is opened for a given stock, the transaction is assigned a new trade ID. Increasing the position in that stock results in the same trade ID, as well as any sell transactions reducing the size of the position (including the final sale where the position quantity is reduced to zero). A subsequent buy transaction in that same stock results in a new trade ID.
In order to maintain consistent trade identifiers with a growing log of transactions, I created a class, TradeTracker, to track and assign trade identifiers for each transaction.
import numpy as np
import pandas as pd
# Create sample dataframe.
dates = [
'2018-01-01 01:00:00',
'2018-01-01 01:01:00',
'2018-01-01 01:02:00',
'2018-01-01 01:03:00',
'2018-01-01 02:00:00',
'2018-01-01 03:00:00',
'2018-01-01 03:01:00',
'2018-01-01 03:03:00',
'2018-01-01 04:00:00',
'2018-01-01 04:01:00',
'2018-01-01 04:02:00',
'2018-01-01 07:00:00',
'2018-01-01 07:01:00',
'2018-01-01 07:02:00',
'2018-01-01 08:00:00',
'2018-01-01 08:01:00',
'2018-01-01 08:02:00',
'2018-02-01 12:00:00',
'2018-03-01 12:00:00',
]
data = {
'stock': ['A','A','A','A','B','A','A','A','C','C','C','B','B','B','C','C','C','B','A'],
'deal': ['buy', 'buy', 'buy', 'buy', 'buy', 'sell', 'sell', 'sell', 'buy', 'buy', 'buy',
'sell', 'sell', 'sell', 'sell', 'sell', 'sell', 'buy', 'buy'],
'amount': [1, 2, 3, 4, 10, 8, 1, 1, 3, 2, 5, 2, 2, 6, 3, 3, 4, 5, 10],
'price': [10, 10, 10, 10, 2, 20, 20, 20, 3, 3, 3, 1, 1, 1, 2, 2, 2, 11, 15]
}
df = pd.DataFrame(data, index=pd.to_datetime(dates))
>>> df
stock deal amount price
2018-01-01 01:00:00 A buy 1 10
2018-01-01 01:01:00 A buy 2 10
2018-01-01 01:02:00 A buy 3 10
2018-01-01 01:03:00 A buy 4 10
2018-01-01 02:00:00 B buy 10 2
2018-01-01 03:00:00 A sell 8 20
2018-01-01 03:01:00 A sell 1 20
2018-01-01 03:03:00 A sell 1 20
2018-01-01 04:00:00 C buy 3 3
2018-01-01 04:01:00 C buy 2 3
2018-01-01 04:02:00 C buy 5 3
2018-01-01 07:00:00 B sell 2 1
2018-01-01 07:01:00 B sell 2 1
2018-01-01 07:02:00 B sell 6 1
2018-01-01 08:00:00 C sell 3 2
2018-01-01 08:01:00 C sell 3 2
2018-01-01 08:02:00 C sell 4 2
2018-02-01 12:00:00 B buy 5 11
2018-03-01 12:00:00 A buy 10 15
# Add `position` column representing the cumulative buys and sells for a given stock.
df['position'] = (
    df
    .assign(temp_amount=np.where(df['deal'].eq('buy'), df['amount'], -df['amount']))
    .groupby(['stock'])['temp_amount']
    .cumsum()
)
# Create a class to track trade identifiers and instantiate it.
class TradeTracker():
    def __init__(self):
        self.trade_counter = 0
        self.trade_ids = {}

    def get_trade_id(self, stock, position):
        if position == 0:
            # Position fully closed: return the open trade's ID and forget it.
            trade_id = self.trade_ids.pop(stock)
        elif stock not in self.trade_ids:
            # No open trade for this stock: start a new trade ID.
            self.trade_counter += 1
            self.trade_ids[stock] = trade_id = self.trade_counter
        else:
            # Position still open: reuse the existing trade ID.
            trade_id = self.trade_ids[stock]
        return trade_id

trade_tracker = TradeTracker()
# Add a `trade_id` column using our custom class in a list comprehension.
df['trade_id'] = [trade_tracker.get_trade_id(stock, position)
for stock, position in df[['stock', 'position']].to_numpy()]
>>> df
stock deal amount price position trade_id
2018-01-01 01:00:00 A buy 1 10 1 1
2018-01-01 01:01:00 A buy 2 10 3 1
2018-01-01 01:02:00 A buy 3 10 6 1
2018-01-01 01:03:00 A buy 4 10 10 1
2018-01-01 02:00:00 B buy 10 2 10 2
2018-01-01 03:00:00 A sell 8 20 2 1
2018-01-01 03:01:00 A sell 1 20 1 1
2018-01-01 03:03:00 A sell 1 20 0 1
2018-01-01 04:00:00 C buy 3 3 3 3
2018-01-01 04:01:00 C buy 2 3 5 3
2018-01-01 04:02:00 C buy 5 3 10 3
2018-01-01 07:00:00 B sell 2 1 8 2
2018-01-01 07:01:00 B sell 2 1 6 2
2018-01-01 07:02:00 B sell 6 1 0 2
2018-01-01 08:00:00 C sell 3 2 7 3
2018-01-01 08:01:00 C sell 3 2 4 3
2018-01-01 08:02:00 C sell 4 2 0 3
2018-02-01 12:00:00 B buy 5 11 5 4
2018-03-01 12:00:00 A buy 10 15 10 5
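Once every row carries a trade_id, the result of each trade can be summarised with one more groupby. The following is only a minimal sketch, not part of the original answer: it rebuilds a few rows shaped like the output above, and the names trades, cash_flow, results and pnl are hypothetical. It assumes the result of a trade is simply sell proceeds minus buy cost, with no fees.

```python
import numpy as np
import pandas as pd

# A few rows shaped like the output above, with trade_id already assigned.
trades = pd.DataFrame({
    'stock':    ['A', 'A', 'B', 'A', 'B'],
    'deal':     ['buy', 'buy', 'buy', 'sell', 'sell'],
    'amount':   [4, 6, 10, 10, 10],
    'price':    [10, 10, 2, 20, 1],
    'trade_id': [1, 1, 2, 1, 2],
})

# Signed cash flow: buys cost money (negative), sells bring money in (positive).
sign = np.where(trades['deal'].eq('buy'), -1, 1)
trades['cash_flow'] = sign * trades['amount'] * trades['price']

# One row per trade: the stock it belongs to and its net cash result.
results = trades.groupby('trade_id').agg(
    stock=('stock', 'first'),
    pnl=('cash_flow', 'sum'),
)
print(results)
```

For a trade that is still open (IDs 4 and 5 in the output above), this sum would contain only the purchase cost, so it is meaningful as a result only for trades whose position has returned to zero.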
Answer 2:
I changed your string_date to this:
In [2295]: string_date =['2018-01-01 01:00:00',
...: '2018-01-01 01:00:00',
...: '2018-01-01 01:00:00',
...: '2018-01-01 01:00:00',
...: '2018-01-01 02:00:00',
...: '2018-01-01 03:00:00',
...: '2018-01-01 03:00:00',
...: '2018-01-01 03:00:00',
...: '2018-01-01 04:00:00',
...: '2018-01-01 04:00:00',
...: '2018-01-01 04:00:00',
...: '2018-01-01 07:00:00',
...: '2018-01-01 07:00:00',
...: '2018-01-01 07:00:00',
...: '2018-01-01 08:00:00',
...: '2018-01-01 08:00:00',
...: '2018-01-01 08:00:00',
...: '2018-02-01 12:00:00',
...: ]
...:
So df now is:
In [2297]: df
Out[2297]:
stock deal amount price
2018-01-01 01:00:00 A buy 1 10
2018-01-01 01:00:00 A buy 2 10
2018-01-01 01:00:00 A buy 3 10
2018-01-01 01:00:00 A buy 4 10
2018-01-01 02:00:00 B buy 10 2
2018-01-01 03:00:00 A sell 8 20
2018-01-01 03:00:00 A sell 1 20
2018-01-01 03:00:00 A sell 1 20
2018-01-01 04:00:00 C buy 3 3
2018-01-01 04:00:00 C buy 2 3
2018-01-01 04:00:00 C buy 5 3
2018-01-01 07:00:00 B sell 2 1
2018-01-01 07:00:00 B sell 2 1
2018-01-01 07:00:00 B sell 6 1
2018-01-01 08:00:00 C sell 3 2
2018-01-01 08:00:00 C sell 3 2
2018-01-01 08:00:00 C sell 4 2
2018-02-01 12:00:00 B buy 5 11
You can use Groupby.agg:
In [2302]: x = df.reset_index().groupby(['index', 'stock', 'deal'], as_index=False).agg({'amount': 'sum', 'price': 'max'}).set_index('index')
In [2303]: m = x['deal'] == 'buy'
In [2305]: x['ID'] = m.cumsum().where(m)
In [2307]: x['ID'] = x.groupby('stock')['ID'].ffill()
In [2308]: x
Out[2308]:
stock deal amount price ID
index
2018-01-01 01:00:00 A buy 10 10 1.0
2018-01-01 02:00:00 B buy 10 2 2.0
2018-01-01 03:00:00 A sell 10 20 1.0
2018-01-01 04:00:00 C buy 10 3 3.0
2018-01-01 07:00:00 B sell 10 1 2.0
2018-01-01 08:00:00 C sell 10 2 3.0
2018-02-01 12:00:00 B buy 5 11 4.0
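The ID step above can be read as "number the buy rows, then carry each number forward within its stock". The sketch below replays that step on the aggregated rows as a standalone script; the variable names is_buy and ids are my own, and the cast to the nullable Int64 dtype is an addition to avoid the float IDs (1.0, 2.0, ...) shown above.

```python
import pandas as pd

# Aggregated rows as in the output above: one row per timestamp/stock/deal.
x = pd.DataFrame(
    {
        'stock': ['A', 'B', 'A', 'C', 'B', 'C', 'B'],
        'deal': ['buy', 'buy', 'sell', 'buy', 'sell', 'sell', 'buy'],
        'amount': [10, 10, 10, 10, 10, 10, 5],
        'price': [10, 2, 20, 3, 1, 2, 11],
    },
    index=pd.to_datetime([
        '2018-01-01 01:00:00', '2018-01-01 02:00:00', '2018-01-01 03:00:00',
        '2018-01-01 04:00:00', '2018-01-01 07:00:00', '2018-01-01 08:00:00',
        '2018-02-01 12:00:00',
    ]),
)

is_buy = x['deal'].eq('buy')
# Number the buy rows 1, 2, 3, ...; sell rows are left as NaN for now.
ids = is_buy.cumsum().where(is_buy)
# Within each stock, carry the latest buy's number forward onto later sells.
# Int64 is pandas' nullable integer dtype, so a sell with no earlier buy
# would stay missing (<NA>) instead of forcing the column to float.
x['ID'] = ids.groupby(x['stock']).ffill().astype('Int64')
print(x)
```

Note that this numbering starts a new ID at every aggregated buy row. If a position were built up through buys at several different timestamps, each of those buys would get its own ID, unlike the position-based tracking in Answer 1.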
Source: https://stackoverflow.com/questions/64872407/pandas-calculate-result-dataframe-from-a-dataframe-of-multiple-trades-at-same-ti