问题
I have a data set containing bid and ask quotes for 3 days and a stock. Following is the portion of the dataset. I have also given a link to the sample data set to illustrate the pecularity of the matter.
> dput(head(q,30))
structure(list(Date = structure(c(1471424400, 1471424400, 1471424400,
1471424401, 1471424401, 1471424406, 1471424407, 1471424415, 1471424417,
1471424514, 1471424527, 1471424567, 1471424576, 1471424606, 1471424607,
1471424621, 1471424621, 1471424621, 1471424641, 1471424642, 1471424715,
1471424715, 1471424717, 1471424717, 1471424741, 1471424741, 1471424742,
1471424793, 1471424800, 1471424807), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Type = c("ASK", "BID", "ASK", "BID", "ASK",
"ASK", "BID", "BID", "BID", "ASK", "ASK", "BID", "BID", "BID",
"BID", "BID", "ASK", "BID", "BID", "BID", "BID", "BID", "BID",
"BID", "BID", "ASK", "BID", "BID", "BID", "BID"), Price = c(1749.95,
1611, 1683.9, 1653, 1672, 1683.9, 1653, 1654.2, 1663, 1682, 1663,
1664.75, 1663, 1664.75, 1663, 1664.75, 1662.9, 1663, 1664.75,
1663, 1664.75, 1663, 1664.75, 1663, 1664.75, 1662.9, 1663, 1664.75,
1663, 1664.75)), .Names = c("Date", "Type", "Price"), class = c("tbl_df",
"data.frame"), row.names = c(NA, -30L))
Date Type Price
(time) (chr) (dbl)
1 2016-08-17 09:00:00 ASK 1749.95
2 2016-08-17 09:00:00 BID 1611.00
3 2016-08-17 09:00:00 ASK 1683.90
4 2016-08-17 09:00:01 BID 1653.00
5 2016-08-17 09:00:01 ASK 1672.00
6 2016-08-17 09:00:06 ASK 1683.90
7 2016-08-17 09:00:07 BID 1653.00
8 2016-08-17 09:00:15 BID 1654.20
9 2016-08-17 09:00:17 BID 1663.00
10 2016-08-17 09:01:54 ASK 1682.00
.. ... ... ...
It has multiple quotes for some of the time stamps. I want to estimate best bid and best ask out of these quotes using following steps:
Best bid process
- Estimate the highest bid for a time stamp.
- Compare it with the prevailing best bid quote.
- If it is greater than the prevailing best bid then it is the best bid, otherwise the prevailing best bid is the best bid.
Best ask process
Estimate the lowest ask for a time stamp.
Compare it with the prevailing best ask quote.
If it is lower than the prevailing best ask then it is the best ask, otherwise the prevailing best ask is the best bid. The appendices of this document contains SAS code for this purpose.
The Required Output structure:
The dataframe should have 3 columns (date, Best Ask, Best Bid). The rows should not have duplicate time stamps and So, both best bid and best ask should be in the same row for a particular time stamp rather than in two adjacent rows.
What I have tried
I don't know how to compare the minimum ask for the current time stamp with the minimum ask of previous time stamp. So, following code only gives minimum ask for a time stamp.
library(dplyr)
library(tidyr)
library(reshape2)
m.q<- head(q,30) %>%
gather(variable, value, -(Date:Type)) %>%
unite(temp,Type, variable) %>%
group_by(Date) %>% dcast(Date ~ temp, fun=min)
I get the following:
> dput(m.q)
structure(list(Date = structure(c(1471424400, 1471424401, 1471424406,
1471424407, 1471424415, 1471424417, 1471424514, 1471424527, 1471424567,
1471424576, 1471424606, 1471424607, 1471424621, 1471424641, 1471424642,
1471424715, 1471424717, 1471424741, 1471424742, 1471424793, 1471424800,
1471424807), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
ASK_Price = c(1683.9, 1672, 1683.9, Inf, Inf, Inf, 1682,
1663, Inf, Inf, Inf, Inf, 1662.9, Inf, Inf, Inf, Inf, 1662.9,
Inf, Inf, Inf, Inf), BID_Price = c(1611, 1653, Inf, 1653,
1654.2, 1663, Inf, Inf, 1664.75, 1663, 1664.75, 1663, 1663,
1664.75, 1663, 1663, 1663, 1664.75, 1663, 1664.75, 1663,
1664.75)), .Names = c("Date", "ASK_Price", "BID_Price"), row.names = c(NA,
-22L), class = "data.frame")
Date ASK_Price BID_Price
1 2016-08-17 09:00:00 1683.9 1611.0
2 2016-08-17 09:00:01 1672.0 1653.0
3 2016-08-17 09:00:06 1683.9 Inf
4 2016-08-17 09:00:07 Inf 1653.0
5 2016-08-17 09:00:15 Inf 1654.2
6 2016-08-17 09:00:17 Inf 1663.0
[...]
Please note the following issues:
- Both bid and ask have values for some of the timestamps but they are shown in adjacent rows.
- The function min() is getting applied to bid becuase I do not how perform min only for ask and max only for bid simultaneously in the dcast function.
- Instead of "Inf" in the cells it should be NA in case there is no bid or ask value for the time stamp.
This is my data set.
This is question involves R coding. I do not know SAS and a beginner in R. (Thanks in advance for constructive comments and answers, feel free to improve my post.)
回答1:
This can be accomplished with the dplyr
and zoo
packages.
Use dplyr
to find either a minimum or maximum price per timestamp (depending on whether we are looking at BID or ASK values), then group by "Type" and arrange the summarized values by Date. From here, we use zoo::rollapply
to compute either a rolling minimum or rolling maximum, depending on whether the Type is ASK or BID.
library(dplyr)
library(zoo)
data.new <- group_by(data.stock, Date, Type) %>%
summarize(price = ifelse(Type[1] == 'ASK', min(Price), max(Price))) %>%
ungroup() %>%
group_by(Type) %>%
arrange(Type, Date) %>%
mutate(change = ifelse(Type == 'ASK',
rollapply(price, 1:length(price), min, align = 'right'),
rollapply(price, 1:length(price), max, align = 'right')
))
Output:
Date Type price change
<time> <chr> <dbl> <dbl>
1 2016-08-17 09:00:00 ASK 1683.9 1683.9
2 2016-08-17 09:00:01 ASK 1672.0 1672.0
3 2016-08-17 09:00:06 ASK 1683.9 1672.0
4 2016-08-17 09:01:54 ASK 1682.0 1672.0
5 2016-08-17 09:02:07 ASK 1663.0 1663.0
6 2016-08-17 09:03:41 ASK 1662.9 1662.9
7 2016-08-17 09:05:41 ASK 1662.9 1662.9
8 2016-08-17 09:00:00 BID 1611.0 1611.0
9 2016-08-17 09:00:01 BID 1653.0 1653.0
10 2016-08-17 09:00:07 BID 1653.0 1653.0
# ... with 16 more rows
The rolling minimum (or maximum) is thus preserved as we move down the timestamps. As an added bonus, this will never produce Inf
or NA
values.
Here is a line-by-line breakdown of what dplyr
is doing:
group_by(data.stock, Date, Type) %>%
This tells dplyr
to take the stock price data and cut it into subgroups according to combinations of Date and Type. We will then apply subsequent operations to these subgroups (see the "Split-Apply-Combine" philosophy). All subsequent operations are performed on the subgroups as if they are separate. The %>%
symbol "pipes" the output of the group_by
command. to the next command (see the magrittr package).
summarize(price = ifelse(Type[1] == 'ASK', min(Price), max(Price))) %>%
This will compute a summary for each unique combination of Date and Type (established in the group_by
line above). If Type is "ASK", we should calculate the minimum of the prices. Otherwise we calculate the maximum.
ungroup() %>%
group_by(Type) %>%
These lines tell dplyr
to reset the grouping variables, since we no longer want to group by individual Dates.
arrange(Type, Date) %>%
To be on the safe side, I arrange
the data frame by Type and Date, to ensure that all values are in chronological order.
mutate(change = ifelse(Type == 'ASK',
rollapply(price, 1:length(price), min, align = 'right'),
rollapply(price, 1:length(price), max, align = 'right')
))
Finally, the most complicated line. Here we are using dplyr::mutate
to create a column that stores the result of a rolling operation. If the Type is "ASK", we apply a rolling minimum to the price column. If it is not, we apply a rolling maximum. The result of the rolling min/max is stored in a new column called "change". The zoo
package makes these types of "rolling" operations much easier to perform.
来源:https://stackoverflow.com/questions/39086065/how-to-take-minimum-value-out-of-several-ask-quotes-and-maximum-value-out-of-sev