Question
I am trying to classify given texts as news, clickbait, or other. The texts I have for training are long; the distribution of their lengths is shown here. Now, the question is: should I trim each text in the middle to make it 512 tokens long? But I even have documents of circa 10,000 words, so won't I lose the gist by truncation? Or should I split each text into sub-texts of 512 tokens? If so, a sub-text of one document may be similar to a sub-text of another document while their labels differ. Doesn't that produce noisy data? Or should I just use bidirectional LSTMs here and pad to max_len?
Answer 1:
The answer to a question similar to yours can be found in the paper here.
Why do you think chunks of the same document will have different labels if you are classifying texts as news or clickbait? You can chunk the text and follow the truncation approach proposed in How to Fine-Tune BERT for Text Classification?. The authors show that head+tail truncation delivers high accuracy. I used it several times thanks to the GitHub page and documentation, and got good results.
You can choose the truncation method with the flag --trunc_medium, which takes the following options (a minimal sketch of the head+tail idea follows the list):
- -2 means head-only (keep the first 512 tokens),
- -1 means tail-only (keep the last 512 tokens),
- 0 means head-half + tail-half (e.g. head-256 + tail-256),
- any other natural number k means head-k + tail-rest (e.g. head-k + tail-(512-k)).
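For illustration, here is a minimal sketch of the head+tail idea on a plain token list. The function name `truncate_head_tail` and its parameters are mine for illustration, not from the repo (and in practice you would also reserve room for the [CLS] and [SEP] tokens):

```python
def truncate_head_tail(tokens, max_len=512, head_len=256):
    """Keep the first `head_len` and the last `max_len - head_len` tokens.

    Mirrors the head-half + tail-half option (--trunc_medium 0) described
    above; an illustrative sketch, not the repository's actual code.
    """
    if len(tokens) <= max_len:
        return tokens
    tail_len = max_len - head_len
    return tokens[:head_len] + tokens[-tail_len:]

# Example: a 10,000-token document reduced to 512 tokens.
doc = [f"tok{i}" for i in range(10_000)]
short = truncate_head_tail(doc, max_len=512, head_len=256)
assert len(short) == 512
```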
Then you may pool the per-chunk results to create a pooled embedding for each long text.
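One possible implementation of this pooling, as a sketch assuming the Hugging Face transformers library (the function `pooled_document_embedding` and the choice of mean-pooling the per-chunk [CLS] vectors are mine, not from the paper):

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def pooled_document_embedding(text, chunk_len=510):
    # Tokenize once without special tokens, then split into chunks of 510
    # so that [CLS] and [SEP] fit within BERT's 512-token limit.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + chunk_len] for i in range(0, len(ids), chunk_len)]
    cls_vectors = []
    with torch.no_grad():
        for chunk in chunks:
            input_ids = torch.tensor(
                [[tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]]
            )
            out = model(input_ids)
            cls_vectors.append(out.last_hidden_state[0, 0])  # [CLS] vector
    # Mean-pool the per-chunk [CLS] vectors into one document embedding,
    # which can then be fed to a classifier head.
    return torch.stack(cls_vectors).mean(dim=0)
```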
Here I will also continue the discussion of state-of-the-art approaches to classifying long texts with BERT, referring to BigBird (see the article). The researchers from Google build on the ideas of Longformer and Extended Transformer Construction (ETC). Basically, they propose combining Longformer-style attention with random attention, which reduces the quadratic dependency on sequence length to linear. You can try it even on texts of roughly 10,000 words. The approach is interesting; however, it requires an architecture with more layers.
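If you want to try this route, the Hugging Face transformers library ships a BigBird implementation; below is a minimal classification sketch. The checkpoint name and the three-label setup are assumptions on my part, and the classifier head here is untrained, so the prediction is only a smoke test:

```python
import torch
from transformers import BigBirdForSequenceClassification, BigBirdTokenizerFast

tokenizer = BigBirdTokenizerFast.from_pretrained("google/bigbird-roberta-base")
model = BigBirdForSequenceClassification.from_pretrained(
    "google/bigbird-roberta-base", num_labels=3  # news / clickbait / other
)
model.eval()

# A long document, far beyond BERT's 512-token limit but within BigBird's 4096.
long_text = " ".join(["word"] * 4000)
inputs = tokenizer(long_text, return_tensors="pt",
                   truncation=True, max_length=4096)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1))  # predicted class id (head not yet fine-tuned)
```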
Please also check the Stack Overflow question.
Source: https://stackoverflow.com/questions/63671085/how-to-use-bert-for-long-sentences