How can I remove sharp jumps in data?

别跟我提以往 2021-02-09 06:35

I have some skin temperature data (collected at 1Hz) which I intend to analyse.

However, the sensors were not always in contact with the skin. So I have a challenge of

2 answers
  •  自闭症患者
    2021-02-09 07:27

    Here's a suggestion that addresses two of your points:

    1. [...]an approach where I use the first order differential of the temp and then use another set of thresholds to get rid of the data I'm not interested in.

    2. [..]I don't know how to now use this index list to delete the non-skin data in df. How is best to do this?

    It uses stats.zscore() and pandas.merge().

    As it stands, it does not fully resolve your concern about

    [...]left with some residual artefacts from the data jumps near the edges[...]

    But we'll get to that later.

    First, here's a snippet to produce a dataframe that shares some of the challenges with your dataset:

    # Imports
    import matplotlib.pyplot as plt
    import pandas as pd
    import numpy as np
    from scipy import stats

    np.random.seed(22)

    # A function for noisy data with a trend element
    def sample():

        base = 100
        nsample = 50
        sigma = 10

        # Basic df with trend and sinus seasonality
        trend1 = np.linspace(0, 1, nsample)
        y1 = np.sin(trend1)
        # pd.datetime has been removed from pandas; a date string works directly
        dates = pd.date_range('2016-01-01', periods=nsample)
        df = pd.DataFrame({'trend1': trend1, 'y1': y1}, index=dates)
        df.index.name = 'dates'

        # Gaussian noise with amplitude sigma
        df['y2'] = sigma * np.random.normal(size=nsample)
        df['y3'] = df['y2'] + base + np.sin(trend1)
        df['trend2'] = 1/(np.cos(trend1)/1.05)
        df['y4'] = df['y3'] * df['trend2']

        df = df['y4'].to_frame()
        df.columns = ['Temp']

        # Insert missing values (positional .iloc avoids chained assignment)
        df.iloc[20:31, 0] = np.nan

        # Insert one spike on each side of the gap
        df.iloc[19, 0] = df.iloc[39, 0]/4000
        df.iloc[31, 0] = df.iloc[15, 0]/4000

        return df

    # Dataframe with random data
    df_raw = sample()
    df_raw.plot()
    

    As you can see, there are two distinct spikes with missing values between them. It is really the missing values that cause problems if you want to isolate observations where the first difference is large. The first spike is not a problem, since its difference is taken between a very small number and a preceding number that resembles the rest of the data.

    But for the second spike, the difference is taken between a very small number and a missing value, so the result is itself missing. The extreme difference you end up flagging is the one between your outlier and the next observation, so it is that next observation that gets removed.
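
A toy series makes the effect concrete. The values below are made up for illustration and are not the question's data; the point is only where `diff()` returns NaN and where the large difference lands.

```python
import numpy as np
import pandas as pd

# Toy series: normal values, a spike, a gap of NaNs, a second spike, normal values
s = pd.Series([30.0, 31.0, 0.01, np.nan, np.nan, 0.01, 31.0, 30.0])

d = s.diff()
print(d)

# Position 2 (first spike): large negative difference, so the spike itself is flagged.
# Position 5 (second spike): the previous value is NaN, so the difference is NaN.
# Position 6 (a normal value): large positive difference, so this row is flagged instead.
```

Once `dropna()` is applied to the differences, position 5 disappears together with the rest of the gap, and position 6 is the row that stands out.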

    This is not a huge problem for one single observation. You could just fill it right back in there. But for larger data sets that would not be a very viable solution. Anyway, if you can manage without that particular value, the code below should solve your problem. You will have a similar problem with your very first observation, but I think it is far more trivial to decide whether or not to keep that one value.

    The steps:

    # 1. Get some info about the original data:
    firstVal = df_raw[:1]
    colName = df_raw.columns

    # 2. Take the first difference
    df_diff = df_raw.diff()

    # 3. Remove missing values
    df_clean = df_diff.dropna()

    # 4. Select a level for a Z-score to identify and remove outliers
    level = 3
    df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
    ix_keep = df_Z.index

    # 5. Subset the raw dataframe with the indexes you'd like to keep
    df_keep = df_raw.loc[ix_keep]

    # 6. Restore the dropped indexes.
    # df_keep will be missing some indexes.
    # Do the following if you'd like to keep those indexes
    # and, for example, fill missing values with the previous values
    df_out = pd.merge(df_keep, df_raw, how='outer', left_index=True, right_index=True)

    # 7. Keep only the first column (.ix has been removed from pandas; use .iloc)
    df_out = df_out.iloc[:, 0].to_frame()

    # 8. Fill missing values (fillna(method='ffill') is deprecated; use ffill())
    df_complete = df_out.ffill()

    # 9. Reset column names
    df_complete.columns = colName

    # 10. Replace first value (done after the rename so the labels match)
    df_complete.iloc[0] = firstVal.iloc[0]

    # Result
    df_complete.plot()
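
As an aside on question 2 (using the index list to delete rows): a boolean mask built with `Index.isin` could stand in for the merge in steps 5 to 8. This is only a minimal sketch, not part of the steps above. `mask_spikes` and the demo data are made-up names and values, and the z-score is computed with plain NumPy, which should match `stats.zscore` with its default `ddof=0`.

```python
import numpy as np
import pandas as pd

def mask_spikes(df, level=3):
    """Hypothetical helper: blank out flagged rows, then forward-fill."""
    diff = df.diff().dropna()
    d = diff.to_numpy()
    # Z-score of the first differences, column by column (population std)
    z = np.abs((d - d.mean(axis=0)) / d.std(axis=0))
    keep_idx = diff.index[(z < level).all(axis=1)]

    # True for rows to keep, False for rows to blank out
    keep = df.index.isin(keep_idx)

    out = df.copy()
    out.loc[~keep] = np.nan               # remove the flagged rows
    out = out.ffill()                     # fill with the previous kept value
    out.iloc[0] = df.iloc[0].to_numpy()   # first row has no difference; restore it
    return out

# Demo data (an assumption for illustration): a smooth signal with one spike
idx = pd.date_range('2016-01-01', periods=40)
temp = 30 + 0.1 * np.sin(np.arange(40))
temp[20] = 0.0
demo = pd.DataFrame({'Temp': temp}, index=idx)

cleaned = mask_spikes(demo, level=3)
print(cleaned.iloc[18:23])
```

As with the merge-based steps, the row right after the spike is replaced too, because its difference is just as extreme as the spike's own.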
    

    Here's the whole thing, with the steps wrapped in a function, for an easy copy-paste:

    # Imports
    import matplotlib.pyplot as plt
    import pandas as pd
    import numpy as np
    from scipy import stats

    np.random.seed(22)

    # A function for noisy data with a trend element
    def sample():

        base = 100
        nsample = 50
        sigma = 10

        # Basic df with trend and sinus seasonality
        trend1 = np.linspace(0, 1, nsample)
        y1 = np.sin(trend1)
        # pd.datetime has been removed from pandas; a date string works directly
        dates = pd.date_range('2016-01-01', periods=nsample)
        df = pd.DataFrame({'trend1': trend1, 'y1': y1}, index=dates)
        df.index.name = 'dates'

        # Gaussian noise with amplitude sigma
        df['y2'] = sigma * np.random.normal(size=nsample)
        df['y3'] = df['y2'] + base + np.sin(trend1)
        df['trend2'] = 1/(np.cos(trend1)/1.05)
        df['y4'] = df['y3'] * df['trend2']

        df = df['y4'].to_frame()
        df.columns = ['Temp']

        # Insert missing values (positional .iloc avoids chained assignment)
        df.iloc[20:31, 0] = np.nan

        # Insert one spike on each side of the gap
        df.iloc[19, 0] = df.iloc[39, 0]/4000
        df.iloc[31, 0] = df.iloc[15, 0]/4000

        return df
    
    # A function for removing outliers
    def noSpikes(df, level, keepFirst):

        # 1. Get some info about the original data:
        firstVal = df[:1]
        colName = df.columns

        # 2. Take the first difference
        df_diff = df.diff()

        # 3. Remove missing values
        df_clean = df_diff.dropna()

        # 4. Use the Z-score level to identify and remove outliers
        df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
        ix_keep = df_Z.index

        # 5. Subset the input dataframe (df, not the global df_raw)
        #    with the indexes you'd like to keep
        df_keep = df.loc[ix_keep]

        # 6. Restore the dropped indexes.
        # df_keep will be missing some indexes.
        # Do the following if you'd like to keep those indexes
        # and, for example, fill missing values with the previous values
        df_out = pd.merge(df_keep, df, how='outer', left_index=True, right_index=True)

        # 7. Keep only the first column (.ix has been removed from pandas; use .iloc)
        df_out = df_out.iloc[:, 0].to_frame()

        # 8. Fill missing values (fillna(method='ffill') is deprecated; use ffill())
        df_complete = df_out.ffill()

        # 9. Reset column names
        df_complete.columns = colName

        # 10. Keep the first value
        if keepFirst:
            df_complete.iloc[0] = firstVal.iloc[0]

        return df_complete

    # Dataframe with random data
    df_raw = sample()
    df_raw.plot()

    # Remove outliers
    df_cleaned = noSpikes(df=df_raw, level=3, keepFirst=True)

    df_cleaned.plot()
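
On the residual artefacts near the edges of a jump: noSpikes does not handle those. One possible approach, which is an assumption on my part and not something tested on your data, is to widen the flagged region by a few samples on each side before filling. The sketch below does that with a centred rolling maximum over a boolean mask. `widen_mask`, `pad`, and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def widen_mask(bad, pad):
    """Extend each run of True in a boolean Series by `pad` samples on both sides."""
    window = 2 * pad + 1
    return (
        bad.astype(float)
           .rolling(window, center=True, min_periods=1)
           .max()
           .astype(bool)
    )

# Toy data: the spike at position 3 has partly affected neighbours (25.0 and 26.0)
temp = pd.Series([30.0, 30.1, 25.0, 0.01, 26.0, 30.2, 30.1, 30.0])
bad = pd.Series([False, False, False, True, False, False, False, False])

wide = widen_mask(bad, pad=1)

# Blank out the widened region and fill with the previous good value
clean = temp.mask(wide).ffill()
print(clean.tolist())
```

How large `pad` should be depends on how long the sensor takes to settle after regaining skin contact, so it would need tuning against the real recordings.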
    
