I wrote code to scrape car info (title, make, model, transmission, year, price) from ebay.com and save it in MySQL. I want to skip the insert when every field of a row (title, make, model, ...) matches an existing row. The insert should be skipped only when all fields match, because some rows share just the title, or just the model, and so on.
code :
import requests
from bs4 import BeautifulSoup
import re
import mysql.connector
conn = mysql.connector.connect(user='root', password='******',
host='', database='web_scraping')
cursor = conn.cursor()
url = 'https://www.ebay.com/b/Cars-Trucks/6001?_fsrp=0&_sacat=6001&LH_BIN=1&LH_ItemCondition=3000%7C1000%7C2500&rt=nc&_stpos=95125&Model%2520Year=2020%7C2019%7C2018%7C2017%7C2016%7C2015'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
ebay_cars = soup.find_all('li', class_='s-item')
for car_info in ebay_cars:
    title_div = car_info.find('div', class_='s-item__wrapper clearfix')
    title_sub_div = title_div.find('div', class_='s-item__info clearfix')
    title_p = title_sub_div.find('span', class_='s-item__price')
    title_tag = title_sub_div.find('a', class_='s-item__link')
    title_maker = title_sub_div.find('span', class_='s-item__dynamic s-
    title_model = title_sub_div.find('span', class_='s-item__dynamic s-
    title_trans = title_sub_div.find('span', class_='s-item__dynamic s-
    # Strip the 4-digit year from the title and the labels from the other fields
    name_of_car = re.sub(r'\d{4}', '', title_tag.text)
    maker_of_car = re.sub(r'Make: ', '', title_maker.text)
    model_of_car = re.sub(r'Model: ', '', title_model.text)
    try:
        if title_trans.text.startswith('Transmission: '):
            trans_of_car = re.sub(r'Transmission: ', '', title_trans.text)
        else:
            trans_of_car = ''
    except AttributeError:
        # The transmission span is missing for this listing
        trans_of_car = ''
    year_of_car = re.findall(r'\d{4}', title_tag.text)
    year_of_car = ''.join(str(x) for x in year_of_car)
    price_of_car = title_p.text
    print(name_of_car, trans_of_car)
    sql = ('INSERT INTO car_info(Title, Maker, Model, Transmission, Year, Price) '
           'VALUES (%s, %s, %s, %s, %s, %s)')
    cursor.execute(sql, (name_of_car, maker_of_car, model_of_car, trans_of_car,
                         year_of_car, price_of_car))
conn.commit()  # persist the inserted rows
One option uses not exists:
insert into car_info (title, maker, model, transmission, year, price)
select v.*
from (select %s title, %s maker, %s model, %s transmission, %s year, %s price) v
where not exists (
    select 1
    from car_info c
    where (c.title, c.maker, c.model, c.transmission, c.year, c.price)
        = (v.title, v.maker, v.model, v.transmission, v.year, v.price)
);
But it would be simpler to create a unique key on all columns of the table, like:
create unique index idx_car_info_uniq
on car_info(title, maker, model, transmission, year, price);
This prevents any process from inserting duplicates in the table. You can elegantly ignore the errors that would otherwise have been raised with the on duplicate key syntax:
insert into car_info (title, maker, model, transmission, year, price)
values (%s, %s, %s, %s, %s, %s)
on duplicate key update title = values(title);
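From the Python side, this statement can be run through the same kind of cursor the question already uses. The following is only a minimal sketch, assuming the unique index above has been created; the names `INSERT_SQL` and `insert_car` are hypothetical helpers, not part of the original code.

```python
# Hypothetical helper: the SQL text is the statement shown above.
INSERT_SQL = (
    "insert into car_info (title, maker, model, transmission, year, price) "
    "values (%s, %s, %s, %s, %s, %s) "
    "on duplicate key update title = values(title)"
)


def insert_car(cursor, car):
    """Insert one car row, relying on the unique index to ignore exact duplicates.

    `cursor` is any DB-API style cursor (for example one from mysql.connector).
    `car` is a 6-item sequence: title, maker, model, transmission, year, price.
    """
    car = tuple(car)
    if len(car) != 6:
        raise ValueError(
            "expected (title, maker, model, transmission, year, price)"
        )
    cursor.execute(INSERT_SQL, car)
```

Inside the scraping loop this would replace the plain insert, for example `insert_car(cursor, (name_of_car, maker_of_car, model_of_car, trans_of_car, year_of_car, price_of_car))`, followed by `conn.commit()` once the loop finishes.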
You could save the result of this query into a variable:
SELECT COUNT(*) FROM car_info WHERE Title = <titleValue> AND Maker = <makerValue> AND Model = <modelValue> AND Transmission = <transmissionValue> AND Year = <yearValue> AND Price = <priceValue>
and then, if the value of the variable is
- 1, you skip the INSERT because you already have this entry in the table
- 0, you make the INSERT because you do not have that entry in the table
It's just one way of doing this.
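The count-then-insert steps above could be sketched in Python roughly as follows. This assumes a DB-API style cursor such as the one from mysql.connector; `COUNT_SQL`, `INSERT_SQL` and `insert_if_absent` are hypothetical names introduced for illustration.

```python
# Hypothetical helpers; placeholders are bound by the driver, not formatted by hand.
COUNT_SQL = (
    "SELECT COUNT(*) FROM car_info "
    "WHERE Title = %s AND Maker = %s AND Model = %s "
    "AND Transmission = %s AND Year = %s AND Price = %s"
)
INSERT_SQL = (
    "INSERT INTO car_info (Title, Maker, Model, Transmission, Year, Price) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)


def insert_if_absent(cursor, car):
    """Insert `car` unless an identical row exists.

    Returns True if a row was inserted, False if it was skipped.
    `car` is a 6-item sequence in the column order used above.
    """
    car = tuple(car)
    cursor.execute(COUNT_SQL, car)
    (count,) = cursor.fetchone()
    if count > 0:
        return False  # an identical row is already in the table
    cursor.execute(INSERT_SQL, car)
    return True
```

Note that the check and the insert are two separate statements, so another process could insert the same row in between them; a unique index on the table does not have that gap.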
Declare the primary key as all the columns in the table. See: https://www.mysqltutorial.org/mysql-primary-key/