Question
Consider three N-by-1 vectors describing N financial transactions: tickers, dates, and volumes. The source for these vectors is a table like this:
Tickers Dates Volumes
------- ----- -------
TICKER1 1 200
TICKER1 1 400
TICKER1 2 100
TICKER2 1 300
... ... ...
The source table is sorted first by ticker and then by date.
I would like to consolidate all transactions that happened on a given day for a given company: rows that share the same ticker and the same date should collapse into a single entry whose volume is the sum of the volumes of all the collapsed rows. The final output should look like this:
Tickers Dates Volumes
------- ----- -------
TICKER1 1 600
TICKER1 2 100
TICKER2 1 300
Note that the Dates vector alone still contains non-unique entries because different companies (here TICKER1 and TICKER2) can trade on the same day (here 1); similarly, the Tickers vector still contains non-unique entries because the same company (here TICKER1) can trade on different days (here 1 and 2). The kind of uniqueness I am looking to achieve is only defined with respect to the combined "key" of Tickers and Dates.
My idea so far has been to proceed as follows:
- Identify the coordinates of all coefficients in volumes for which the corresponding ticker and the corresponding date are non-unique.
- Sum over all coefficients in volumes that belong to this series of non-unique entries and save the sum as the first entry in the series.
- Delete all subsequent entries that belong to this non-unique series, along with their corresponding entries in dates and tickers.
So far I have been experimenting with [~,idx] = unique(), but without much success: this function returns only the coordinate of the first of any series of non-unique entries.
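For reference, the limitation described above applies to the second output of unique; the third output labels every row with the group it belongs to, which is what a sum per group needs. The following is a minimal sketch of that idea on the three separate arrays from the example table, combined with the built-in accumarray. The variable names (tickerId, repIdx, groupId, sumVolumes, outTickers, outDates) are illustrative assumptions and do not come from the original post.

```matlab
% Example data copied from the table in the question (column vectors).
tickers = {'TICKER1'; 'TICKER1'; 'TICKER1'; 'TICKER2'};
dates   = [1; 1; 2; 1];
volumes = [200; 400; 100; 300];

% Give every distinct ticker an integer ID (third output of unique).
[~, ~, tickerId] = unique(tickers);

% Label every distinct (ticker, date) pair: groupId(k) is the group of row k,
% repIdx holds one representative row index per group.
[~, repIdx, groupId] = unique([tickerId(:) dates], 'rows');

% Add up the volumes of all rows that share a group label.
sumVolumes = accumarray(groupId, volumes);

% Keep one ticker and one date per group.
outTickers = tickers(repIdx);
outDates   = dates(repIdx);
```

For the example data this leaves three entries, (TICKER1, 1, 600), (TICKER1, 2, 100) and (TICKER2, 1, 300), matching the desired output shown earlier. Because every row in a group has the same ticker and date, it does not matter which row of the group repIdx points to.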
My question is two-fold: (1) Given my objective, is the above "pseudocode" logically correct? If not, how would it have to be corrected in order to behave as desired? (2) How can this be implemented in MATLAB?
Note that I displayed the vectors as one table variable for easier presentation. I am working with three separate arrays and prefer the most low-level solution possible.
Any suggestions would be greatly appreciated!
Answer 1:
You can simply map your tickers to numbers first by using a containers.Map. Then use the mapping to construct a numeric matrix with your data. You can then use the unique combinations of ticker ID and date to aggregate the sums. Finally, you reconstruct a new table and remap the ticker IDs back into ticker names. The following code is heavily commented to guide you through the process.
You will need my super useful custom rows2cell.m function.
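The rows2cell.m file itself is not reproduced here. Judging only from how it is used in the script, it appears to split a matrix into a column cell array with one row vector per cell. Under that assumption, a hypothetical stand-in can be defined with the built-in num2cell before running the script:

```matlab
% Hypothetical stand-in for rows2cell (an assumption, not the author's file):
% split a matrix into an m-by-1 cell array, one 1-by-n row vector per cell.
rows2cell = @(A) num2cell(A, 2);

% Small check on a 3-by-2 matrix.
C = rows2cell([1 1; 1 2; 2 1]);   % {[1 1]; [1 2]; [2 1]}
```

Defined as a function handle like this, rows2cell( U ) in the script calls the stand-in directly; the same one-liner could also be saved as a function file.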
% Dummy Data
T = table({'a','a','a','b'}',[1 1 2 1]', [1 1 1 1]' , [1 1 1 1]'*10);
% Find unique ticker name
C = unique( table2cell( T(:,1)));
% Create map of ticker name to num
M = containers.Map( C, 1:length(C) );
I = 1:length(C);
% Transform Table to Array
F = [cellfun( @(x) M(x), table2cell( T(:,1) ) ) table2array( T(:,2:end) )];
% Find unique combinations of ticker/day
U = unique(F(:,1:2),'rows');
% Aggregate by ticker and date
T = array2table( cell2mat( cellfun(@(x) [x sum( F( F(:,1) == x(1) & F(:,2) == x(2), 3:4 ), 1 )], rows2cell( U ), 'UniformOutput', false ) ) );
% Remap number to ticker name
T.Var1 = C(table2array( T(:,1) ) );
The aggregation line (under the comment "Aggregate by ticker and date") is the powerhouse of the script:
T = array2table( cell2mat( cellfun(@(x) [x sum( F( F(:,1) == x(1) & F(:,2) == x(2), 3:4 ), 1 )], rows2cell( U ), 'UniformOutput', false ) ) );
We have the unique combinations of ticker/day as cells using:
rows2cell( U )
In each cell, x(1) is the ticker ID and x(2) is the date. We want to run something that will aggregate on these two parameters. Assuming this form, the following logical mask selects all the data that corresponds to this ticker/day combination:
F(:,1) == x(1) & F(:,2) == x(2)
Using this index, we can pull the 3rd and 4th columns using this:
F( F(:,1) == x(1) & F(:,2) == x(2), 3:4 )
Then sum them along the first dimension (down the rows) using:
sum( F( F(:,1) == x(1) & F(:,2) == x(2), 3:4 ), 1 )
Since we want to construct the row of our new table by concatenating our input (ticker/day) and our data (col 3/4), we can use this anonymous function in cellfun:
@(x) [x sum( F( F(:,1) == x(1) & F(:,2) == x(2), 3:4 ), 1 )]
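To make the steps above concrete, here is a small worked sketch for a single ticker/day combination. The matrix F is written out by hand as the numeric array the script builds from the dummy data (ticker 'a' mapped to 1 and 'b' to 2); the names mask, colSums and newRow are illustrative only.

```matlab
% Numeric array built from the dummy data:
% columns are ticker ID, date, and the two data columns.
F = [1 1 1 10;
     1 1 1 10;
     1 2 1 10;
     2 1 1 10];

x = [1 1];                                 % ticker ID 1 ('a'), date 1

% Rows that belong to this ticker/day combination.
mask = F(:,1) == x(1) & F(:,2) == x(2);    % [true; true; false; false]

% Sum columns 3 and 4 over the selected rows.
colSums = sum(F(mask, 3:4), 1);            % [2 20]

% Key followed by the aggregated data: one row of the output.
newRow = [x colSums];                      % [1 1 2 20]
```

The resulting row corresponds to the first row of the output table, ('a', 1, 2, 20), once the ticker ID is mapped back to its name.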
Since cellfun returns a cell array with one cell per output row, we need to convert it to a matrix using cell2mat and then from a matrix to a table using array2table, as follows:
array2table( cell2mat( ... ) )
Edit:
Here's the result. Input Table:
Var1 Var2 Var3 Var4
____ ____ ____ ____
'a' 1 1 10
'a' 1 1 10
'a' 2 1 10
'b' 1 1 10
Output Table:
Var1 Var2 Var3 Var4
____ ____ ____ ____
'a' 1 2 20
'a' 2 1 10
'b' 1 1 10
Source: https://stackoverflow.com/questions/29761613/aggregate-duplicate-combinations-in-table-for-financial-ticker-and-dates-in-matl