I have an approach that worked for most of the images.
Removing lines helps because some e-papers use lines as article separators. Further processing of the images can improve the results. For example, heuristics such as average width, average height, and average area can be applied to the contours that remain in the image after the steps above.
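As a rough illustration of such a heuristic, here is a minimal sketch that drops contours much smaller than the average. It assumes the contours have already been reduced to `(x, y, w, h)` bounding boxes (for example via OpenCV's `cv2.boundingRect`); the function name `filter_boxes_by_average` and the `min_ratio` parameter are made up for this example and would need tuning per newspaper.

```python
from statistics import mean


def filter_boxes_by_average(boxes, min_ratio=0.5):
    """Keep boxes whose height and area are at least min_ratio times the average.

    boxes: list of (x, y, w, h) bounding boxes of the remaining contours.
    min_ratio: hypothetical tuning knob, not a value taken from the answer.
    """
    if not boxes:
        return []
    avg_height = mean(h for _, _, _, h in boxes)
    avg_area = mean(w * h for _, _, w, h in boxes)
    return [
        (x, y, w, h)
        for x, y, w, h in boxes
        if h >= min_ratio * avg_height and w * h >= min_ratio * avg_area
    ]


# Two text-sized boxes and one tiny speck of noise.
boxes = [(0, 0, 100, 20), (0, 30, 120, 22), (5, 60, 3, 2)]
kept = filter_boxes_by_average(boxes)
```

Here the tiny third box falls below both averages and is discarded, while the two text-sized boxes are kept. The same pattern works for average width.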
Coming to the question above: the articles always have a white background. Regions without a white background are clearly ads, pictures, or miscellaneous content. Removing pictures, as in the 4 steps mentioned above, solves this issue.
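One simple way to test the white-background assumption is to measure how much of a candidate region is near-white. The sketch below is only an illustration: it takes a grayscale region as a list of pixel rows (0 to 255), and the names `white_fraction` and `looks_like_article`, as well as both thresholds, are assumptions of mine rather than values from the original approach.

```python
def white_fraction(gray_region, white_threshold=200):
    """Return the fraction of pixels at or above white_threshold."""
    total = 0
    white = 0
    for row in gray_region:
        for pixel in row:
            total += 1
            if pixel >= white_threshold:
                white += 1
    return white / total if total else 0.0


def looks_like_article(gray_region, min_white_fraction=0.6, white_threshold=200):
    """Guess whether a region is article text on a white background.

    Both thresholds are illustrative and need tuning per publication.
    """
    return white_fraction(gray_region, white_threshold) >= min_white_fraction


# Mostly white region with a few dark text pixels.
article_region = [[255, 255, 0, 255], [255, 250, 255, 30]]
# Mostly mid-tone or dark region, as in a photo or a coloured ad.
picture_region = [[10, 80, 120, 200], [90, 60, 30, 255]]

article_guess = looks_like_article(article_region)
picture_guess = looks_like_article(picture_region)
```

With these toy inputs, the first region is 75% near-white and is treated as an article, while the second is 25% near-white and is rejected.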
PS: Choosing values for the horizontal and vertical RLSA thresholds is always a mystery, since the gap between articles varies from edition to edition.
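To make the role of those two values concrete, here is a small pure-Python sketch of run-length smoothing (RLSA). It assumes a binary image stored as a list of rows with 0 for ink and 255 for background, and it fills any background run between two ink pixels whose length is at most `max_gap`. This is a simplified variant written for illustration (background runs touching the image border are left alone), not the exact code behind the approach above.

```python
def rlsa_row(row, max_gap):
    """Fill background runs of length <= max_gap that lie between two ink pixels."""
    out = list(row)
    last_ink = None
    for i, value in enumerate(row):
        if value == 0:
            if last_ink is not None and 0 < i - last_ink - 1 <= max_gap:
                for j in range(last_ink + 1, i):
                    out[j] = 0
            last_ink = i
    return out


def rlsa_horizontal(image, max_gap):
    """Apply the smoothing along every row."""
    return [rlsa_row(row, max_gap) for row in image]


def rlsa_vertical(image, max_gap):
    """Apply the smoothing along every column by transposing the image."""
    columns = [list(col) for col in zip(*image)]
    smeared = [rlsa_row(col, max_gap) for col in columns]
    return [list(row) for row in zip(*smeared)]


row = [0, 255, 255, 0, 255, 255, 255, 255, 0]
small_gap = rlsa_row(row, 2)  # only the 2-pixel gap is closed
large_gap = rlsa_row(row, 4)  # the 4-pixel gap is closed as well
```

A small value merges only characters and words, while a larger one starts merging neighbouring columns or articles, which is why the right value depends on the gaps used in a given edition.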
Edit:
The problem above basically comes down to applying heuristics. Read through this:
https://medium.com/@vasista/extract-title-from-the-image-documents-in-python-application-of-rlsa-58f91237901f