How to remove all portrait pictures from a document

I am working on OCRing a document image. I want to detect all the pictures and remove them from the document image, while retaining the tables. Once the pictures are removed, I want to run OCR on the result. I tried finding contours and detecting all the larger areas, but unfortunately that also detects the tables. Also, how can I remove those objects while keeping the rest of the data in the document image? I am using OpenCV and Python.

This is my document image.

Here's my code:

import os
import cv2
import numpy as np
from PIL import Image
import pytesseract

img = cv2.imread('block2.jpg' , 0)
mask = np.ones(img.shape[:2], dtype="uint8") * 255

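ret, thresh1 = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)
# [-2] picks the contour list on both OpenCV 3.x (3 return values) and 4.x (2 return values)
contours = cv2.findContours(thresh1, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]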

areacontainer = []

for cnt in contours:
    area = cv2.contourArea(cnt)
    areacontainer.append(area)  # collect every contour area for the average below

avgArea = sum(areacontainer)/len(areacontainer)

for c in contours:  # average-area heuristic: mask out contours much larger than average
    if cv2.contourArea(c) > 6 * avgArea:
        cv2.drawContours(mask, [c], -1, 0, -1)

binary = cv2.bitwise_and(img, img, mask=mask)  # zero out the pixels covered by the masked contours
cv2.imwrite("bin.jpg", binary)
cv2.imwrite("mask.jpg", mask)


Here's an approach:

  • Convert the image to grayscale and apply a Gaussian blur
  • Perform Canny edge detection
  • Perform a morphological close to join broken edges
  • Find contours and filter them using a minimum/maximum area threshold
  • Remove the portrait images by filling their bounding boxes with white

Here are the detected portraits, highlighted in green:

Now that we have the bounding box ROIs, we can remove the pictures by filling each box with white. Here's the result, followed by the code:

import cv2

image = cv2.imread('1.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3,3), 0)
canny = cv2.Canny(blur, 120, 255, 1)

kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
close = cv2.morphologyEx(canny, cv2.MORPH_CLOSE, kernel, iterations=2)

cnts = cv2.findContours(close, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]

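for c in cnts:
    area = cv2.contourArea(c)
    # Area range chosen for the portraits in this image; adjust it for other documents
    if 15000 < area < 35000:
        x,y,w,h = cv2.boundingRect(c)
        # Fill the bounding box with white to erase the picture
        cv2.rectangle(image, (x, y), (x + w, y + h), (255,255,255), -1)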

cv2.imshow('image', image)
cv2.waitKey()  # keep the window open until a key is pressed

