How to remove extra commas from data in Python

问题

I have a CSV file through which I am trying to load data into my SQL table containing 2 columns. I have 2 columns and the data is separated by commas, which identify the next field. The second column contains text and some commas in that text. Because of the extra commas I am not able to load data into my SQL table as it looks like it has extra columns. I have millions of rows of data. How can I remove these extra commas?

Data:

Number Address
"12345" , "123 abc street, Unit 345"
"67893" , "567 xyz lane"
"65432" , "789 unit, mno street"

I would like to remove the extra commas in the addresses in random rows.

回答1:

If all your data will be in the same format, as Number Address "000" , "000 abc street, Unit 000", you can split the list, remove the comma, and put the list back together, making it a string again. For example using the data you gave:

ori_addr = "Number Address \"12345\" , \"123 abc street, Unit 345\""
addr = ori_addr.split()
addr[6] = addr[6].replace(",", "")
together_addr = " ".join(addr)

together_addr is equal to "Number Address "12345" , "123 abc street Unit 345" note that there is no comma between "street" and "Unit."

回答2:

Edits:

Following user's comments, added a failing address to this test. This address loads to the database without issue.
Added code to store CSV addresses into MySQL.

Answer:

The code below performs the following actions:

MySQL database engine (connection) created.
Address data (number, address) read from CSV file.
Non-field separating commas replaced from source data, and extra whitespace removed.
Edited data fed into a DataFrame
DataFrame used to store data into MySQL.

    import csv
    import pandas as pd
    from sqlalchemy import create_engine

    # Set database credentials.
    creds = {'usr': 'admin',
             'pwd': '1tsaSecr3t',
             'hst': '127.0.0.1',
             'prt': 3306,
             'dbn': 'playground'}
    # MySQL conection string.
    connstr = 'mysql+mysqlconnector://{usr}:{pwd}@{hst}:{prt}/{dbn}'
    # Create sqlalchemy engine for MySQL connection.
    engine = create_engine(connstr.format(**creds))

    # Read addresses from mCSV file.
    text = list(csv.reader(open('comma_test.csv'), skipinitialspace=True))

    # Replace all commas which are not used as field separators.
    # Remove additional whitespace.
    for idx, row in enumerate(text):
        text[idx] = [i.strip().replace(',', '') for i in row]

    # Store data into a DataFrame.
    df = pd.DataFrame(data=text, columns=['number', 'address'])
    # Write DataFrame to MySQL using the engine (connection) created above.
    df.to_sql(name='commatest', con=engine, if_exists='append', index=False)

Source File (`comma_test.csv`):

"12345" , "123 abc street, Unit 345"
"10101" , "111 abc street, Unit 111"
"20202" , "222 abc street, Unit 222"
"30303" , "333 abc street, Unit 333"
"40404" , "444 abc street, Unit 444"
"50505" , "abc DR, UNIT# 123 UNIT 123"

Unedited Data:

['12345 ', '123 abc street, Unit 345']
['10101 ', '111 abc street, Unit 111']
['20202 ', '222 abc street, Unit 222']
['30303 ', '333 abc street, Unit 333']
['40404 ', '444 abc street, Unit 444']
['50505 ', 'abc DR, UNIT# 123 UNIT 123']

Edited Data:

['12345', '123 abc street Unit 345']
['10101', '111 abc street Unit 111']
['20202', '222 abc street Unit 222']
['30303', '333 abc street Unit 333']
['40404', '444 abc street Unit 444']
['50505', 'abc DR UNIT# 123 UNIT 123']

Queried from MySQL:

number  address
12345   123 abc street Unit 345
10101   111 abc street Unit 111
20202   222 abc street Unit 222
30303   333 abc street Unit 333
40404   444 abc street Unit 444
50505   abc DR UNIT# 123 UNIT 123