I have some data that is sampled at a very high rate (on the order of hundreds of times per second). This results in a sequence length that is huge (~90,000 samples) on avera
Three years later, we have what seems to be the start of solutions for this type of problem: sparse transformers.
See
https://arxiv.org/abs/1904.10509
https://openai.com/blog/sparse-transformer/
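
To give a rough idea of why this helps with long sequences, here is a minimal sketch of a strided attention mask in the spirit of the paper linked above. It is not the authors' implementation: the function name `strided_sparse_mask`, the choice to merge the local and strided patterns into a single mask, and the toy sizes are all assumptions made for illustration. Each position attends only to the previous `stride` positions plus every `stride`-th position before that, instead of to everything that came before it.

```python
import numpy as np

def strided_sparse_mask(seq_len, stride):
    """Boolean mask where mask[i, j] is True if position i may attend to j.

    Hypothetical helper for illustration only; it combines a local window
    and a strided pattern into one causal mask.
    """
    i = np.arange(seq_len)[:, None]  # query positions (rows)
    j = np.arange(seq_len)[None, :]  # key positions (columns)
    causal = j <= i                  # never attend to future positions
    local = (i - j) < stride         # the most recent `stride` positions
    strided = (i - j) % stride == 0  # every `stride`-th earlier position
    return causal & (local | strided)

# Toy sizes: a dense mask is only practical for short sequences.
seq_len, stride = 1024, 32
mask = strided_sparse_mask(seq_len, stride)

dense_pairs = seq_len * (seq_len + 1) // 2  # full causal attention
sparse_pairs = int(mask.sum())              # pairs kept by the sparse pattern
print(f"dense: {dense_pairs}, sparse: {sparse_pairs}")
```

With `stride` chosen near the square root of the sequence length, each position attends to roughly `2 * stride` others, so the total work grows like n * sqrt(n) rather than n^2. For a sequence of ~90,000 samples that would mean on the order of 600 attended positions per step instead of up to 90,000. Note that a real implementation would not materialize the full mask at that length (it would be about 8.1 billion entries); the dense array here is only to make the pattern visible.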