Drop non-JSON object rows from a Python dataframe column

孤城傲影 2021-01-22 23:43

I have a dataframe in which one column contains both JSON objects and strings. I want to get rid of the rows that do not contain JSON objects.

Below is how my dataframe

3 answers
  • 2021-01-23 00:13
    df[df.applymap(np.isreal).sum(1).gt(0)]
    Out[794]: 
                                A
    2    {'a': 5, 'b': 6, 'c': 8}
    5  {'d': 9, 'e': 10, 'f': 11}
    
  • 2021-01-23 00:16

    I think I would prefer to use an isinstance check:

    In [11]: df.loc[df.A.apply(lambda d: isinstance(d, dict))]
    Out[11]:
                                A
    2    {'a': 5, 'b': 6, 'c': 8}
    5  {'d': 9, 'e': 10, 'f': 11}
    

    If you want to include numbers too, you can do:

    In [12]: df.loc[df.A.apply(lambda d: isinstance(d, (dict, np.number)))]
    Out[12]:
                                A
    2    {'a': 5, 'b': 6, 'c': 8}
    5  {'d': 9, 'e': 10, 'f': 11}
    

    Adjust this to whichever types you want to include...
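
As a rough sketch of how the type check can be adjusted, the filter can be wrapped in a small helper that takes the accepted types as an argument. The helper name `keep_types` and the sample strings are assumptions made for illustration; only the two dicts at index 2 and 5 come from the output shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the strings are placeholders, and the dicts sit at
# index 2 and 5 to mirror the output shown in the answer.
df = pd.DataFrame({"A": ["hello", "world", {"a": 5, "b": 6, "c": 8},
                         "usa", "india", {"d": 9, "e": 10, "f": 11}]})


def keep_types(frame, column, types):
    """Return the rows of `frame` whose `column` value is an instance of `types`."""
    mask = frame[column].apply(lambda value: isinstance(value, types))
    return frame.loc[mask]


dicts_only = keep_types(df, "A", dict)
dicts_or_numbers = keep_types(df, "A", (dict, np.number))
print(dicts_only)
```

Note that `np.number` only matches NumPy scalar types; if the column may hold plain Python `int` or `float` values, something like `numbers.Number` from the standard library would probably be the type to pass instead.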


    The last step: json_normalize takes a list of JSON objects. For whatever reason a Series is no good (and gives the KeyError), but you can make this a list and you're good to go:

    In [21]: df1 = df.loc[df.A.apply(lambda d: isinstance(d, (dict, np.number)))]
    
    In [22]: json_normalize(list(df1["A"]))
    Out[22]:
         a    b    c    d     e     f
    0  5.0  6.0  8.0  NaN   NaN   NaN
    1  NaN  NaN  NaN  9.0  10.0  11.0
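
For reference, here is a self-contained sketch of the same two steps (filter, then flatten) using the top-level `pd.json_normalize`, which has been available since pandas 1.0. The sample strings are assumptions; the dicts match the ones in the output above.

```python
import pandas as pd

# Hypothetical sample data; only the two dicts come from the thread.
df = pd.DataFrame({"A": ["hello", "world", {"a": 5, "b": 6, "c": 8},
                         "usa", "india", {"d": 9, "e": 10, "f": 11}]})

# Step 1: keep only the rows whose value is a dict.
df1 = df.loc[df["A"].apply(lambda d: isinstance(d, dict))]

# Step 2: pass a plain list of dicts to json_normalize.
flat = pd.json_normalize(list(df1["A"]))
print(flat)
```

Keys missing from a given dict show up as NaN in the flattened frame, which is why the integer values are displayed as floats.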
    
  • 2021-01-23 00:16

    If you want an ugly solution that also works... here's a function I created that finds rows that contain only strings and returns the df minus those rows. (Since your df has only one column, you'll just get a dataframe containing one column with all dicts.) Then, from there, you'll want to use df = json_normalize(df['A'].values) instead of just df = json_normalize(df['A']).

    For a single column dataframe...

    import pandas as pd
    import numpy as np
    def delete_strings(df):
        nrows = df.shape[0]
        rows_to_keep = []
        for row in np.arange(nrows):
            if isinstance(df.iloc[row, 0], dict):
                rows_to_keep.append(row) #add the row position to the list of rows
                                         #to keep if the row contains a dict
        #select column 0 as a list so a DataFrame (not a Series) is returned;
        #otherwise df['A'] below raises a KeyError
        return df.iloc[rows_to_keep, [0]] #return only rows with dicts
    df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india",
                             {"a":9,"b":10,"c":11}]})
    df = delete_strings(df)
    #pd.json_normalize (pandas >= 1.0) replaces the deprecated
    #pandas.io.json.json_normalize import
    df = pd.json_normalize(df['A'].values)
    print(df)
    #   a   b   c
    #0  5   6   8
    #1  9  10  11
    

    For a multi-column df (also works with a single column df):

    def delete_rows_of_strings(df):
        rows = df.shape[0] #number of rows in df
        cols = df.shape[1] #number of columns in df
        rows_to_keep = [] #list to track rows to keep
        for row in np.arange(rows): #for every row in the dataframe
            #num_string will count the number of strings in the row
            num_string = 0
            for col in np.arange(cols):  #for each column in the row...
                #if the value is a string, add one to num_string
                if isinstance(df.iloc[row, col], str):
                    num_string += 1
            #if num_string, the number of strings in the row,
            #isn't equal to the number of columns in the row...
            if num_string != cols: #...add that row number to the list of rows to keep
                rows_to_keep.append(row)
        #return the df with rows containing at least one non-string
        return df.iloc[rows_to_keep, :]
    
    
    df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india"],
                       'B': ['hi',{"a":5,"b":6,"c":8},'sup','america','china']})
    #                          A                         B
    #0                     hello                        hi
    #1                     world  {'a': 5, 'b': 6, 'c': 8}
    #2  {'a': 5, 'b': 6, 'c': 8}                       sup
    #3                       usa                   america
    #4                     india                     china
    print(delete_rows_of_strings(df))
    #                          A                         B
    #1                     world  {'a': 5, 'b': 6, 'c': 8}
    #2  {'a': 5, 'b': 6, 'c': 8}                       sup
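
The nested loops above could probably be replaced with a boolean mask. The following is only a sketch of that alternative, not the answer author's code: it marks every cell that is a string and drops the rows where all cells are marked. It uses `Series.map` per column rather than `DataFrame.applymap`, since the latter is deprecated in recent pandas releases.

```python
import pandas as pd

# Same sample data as in the multi-column example above.
df = pd.DataFrame({
    "A": ["hello", "world", {"a": 5, "b": 6, "c": 8}, "usa", "india"],
    "B": ["hi", {"a": 5, "b": 6, "c": 8}, "sup", "america", "china"],
})

# True wherever a cell holds a string.
is_str = df.apply(lambda col: col.map(lambda v: isinstance(v, str)))

# Keep the rows that have at least one non-string cell.
result = df.loc[~is_str.all(axis=1)]
print(result)
```

This keeps rows 1 and 2, matching the output of `delete_rows_of_strings(df)`.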
    