I trained a binary offensive-speech classifier on a stream of 2000 posts with some class imbalance (38% minority class) and obtained an F1 score of 1.0, but when it was tested