I am working with a subset of the Signal Media One-Million News Articles Dataset in TSV format, with the labels in the first column and the text data in the second (cleaned pre-