Question
I'm not sure how to word my problem. But here it is...
I have a huge list of 1s and 0s [Total length = 53820].
An example of what the list looks like:
[0,1,1,1,1,1,1,1,1,0,0,0,1,1,0,0,0,0,0,0,1,1...........]
The visualization is given below.
x-axis: index of the element (from 0 to 53820)
y-axis: value at that index (i.e. 1 or 0)
Input Plot-->
The plot clearly shows 3 dense areas where 1s occur more often. I have drawn on top of the plot to mark the visually dense areas (the ugly black lines). I want to know the index numbers on the x-axis where each dense area starts and ends.
I am extracting the chunks of 1s and saving the start index of each in a new list named 'starts'. The function that does this returns a list of dictionaries like this:
{'start': 0, 'count': 15, 'end': 16}, {'start': 2138, 'count': 3, 'end': 2142}, {'start': 2142, 'count': 3, 'end': 2146}, {'start': 2461, 'count': 1, 'end': 2463}, {'start': 2479, 'count': 45, 'end': 2525}, {'start': 2540, 'count': 2, 'end': 2543}
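For readers who want to reproduce this step, here is a minimal sketch of run extraction. `runs_of_ones` is a hypothetical stand-in, not the `densest` function used in this question, and its `end` is assumed to be the exclusive end index, which may differ from the convention in the output above.

```python
from itertools import groupby

def runs_of_ones(bits):
    """Return one dict per maximal run of 1s (hypothetical helper).

    'start' is the first index of the run, 'count' its length and
    'end' the index just past the run (an assumed convention).
    """
    runs = []
    pos = 0
    for value, group in groupby(bits):
        length = sum(1 for _ in group)
        if value == 1:
            runs.append({'start': pos, 'count': length, 'end': pos + length})
        pos += length
    return runs

bits = [0, 1, 1, 1, 0, 0, 1, 1, 0]
print(runs_of_ones(bits))
# → [{'start': 1, 'count': 3, 'end': 4}, {'start': 6, 'count': 2, 'end': 8}]
```

A list of start indexes can then be built with `[r['start'] for r in runs_of_ones(bits)]`.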
Then, after setting a threshold, I compare adjacent elements of starts, which gives the apparent boundaries of the dense areas.
THR = 2000
results = []
cues = {'start': 0, 'stop': 0}
result, starts = densest(preds)  # Function that returns the list of dictionaries shown above
cuestart = False  # Flag to check if looking for start or stop of dense boundary
for i, j in zip(range(0, len(starts)), range(1, len(starts))):
    now = starts[i]
    nextf = starts[j]
    if(nextf - now > THR):
        if(cuestart == False):
            cues['start'] = nextf
            cues['stop'] = nextf
            cuestart = True
        elif(cuestart == True):  # Cuestart is already set
            cues['stop'] = now
            cuestart = False
            results.append(cues)
            cues = {'start': 0, 'stop': 0}
print('\n', results)
The output and the corresponding plot look like this.
[{'start': 2138, 'stop': 6654}, {'start': 23785, 'stop': 31553}, {'start': 38765, 'stop': 38765}]
Output Plot -->
This method fails to find the last dense region, as seen in the plot, and it also fails on other data of a similar kind.
P.S. I have also tried 'KDE' on this data and 'distplot' using seaborn but that gives me plots directly and I am unable to extract the boundary values from that. The link for that question is here (Getting dense region boundary values from output of KDE plot)
Answer 1:
OK, you need an answer...
First, the imports (we are going to use a LineCollection)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
Next, definition of constants
N = 1001
np.random.seed(20190515)
and generation of fake data
x = np.linspace(0, 1, N)
prob = np.where(x<0.4, 0.02, np.where(x<0.7, 0.95, 0.02))
y = np.where(np.random.rand(N)<prob, 1, 0)
Here we create the line collection; sticks is an N×2×2 array containing the start and end points of our vertical lines
sticks = np.array(list(zip(zip(x, np.zeros(N)), zip(x, y))))
lc = LineCollection(sticks)
Finally, the cumulative sum of y - 0.5, which falls where 0s dominate and rises where 1s dominate; here it is normalized to have the same scale as the vertical lines
cs = (y-0.5).cumsum()
csmin, csmax = min(cs), max(cs)
cs = (cs-csmin)/(csmax-csmin) # normalized to the range 0 to 1
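To see why this curve is useful, here is a tiny worked example on made-up data. It is written in plain Python so that it runs on its own; `np.cumsum` does the same job on arrays. The variable names are chosen for illustration only. Each 0 pulls the sum down by 0.5 and each 1 pushes it up by 0.5, so the curve bottoms out just before a run of 1s and peaks at the end of it.

```python
from itertools import accumulate

# Made-up data: a single run of 1s at indexes 4..9
bits = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

# Cumulative sum of (value - 0.5), not normalized
cs = list(accumulate(b - 0.5 for b in bits))

lowest = cs.index(min(cs))    # last index before the run of 1s
highest = cs.index(max(cs))   # last index of the run of 1s

print(cs)
print(lowest + 1, highest)    # → 4 9
```

With noisy data the curve has many small wiggles, so the global minimum and maximum are only enough when there is a single dense region.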
We just have to plot our results
f, a = plt.subplots()
a.add_collection(lc)
a.plot(x, cs, color='red')
a.grid()
a.autoscale()
Here is the plot
and here is a detail of the stop zone.
You can smooth the cs data and use something from scipy.optimize to spot the positions of the extrema. Should you have a problem with this last step, please ask another question.
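As a rough sketch of that last step, here is one way to pick out the extrema. It swaps in a different technique from the one suggested above: instead of smoothing and scipy.optimize, it applies a simple hysteresis rule to the raw cumulative sum, accepting a minimum (region start) or maximum (region end) only after the curve has moved a set distance away from it. `dense_regions`, `min_swing` and `level` are hypothetical names introduced here, and the default values are assumptions that need tuning for real data.

```python
from itertools import accumulate

def dense_regions(bits, min_swing=20.0, level=0.5):
    """Return (start, end) index pairs of regions where 1s dominate.

    Hypothetical helper. A turning point of the cumulative sum of
    (bit - level) is accepted only once the curve has moved at least
    `min_swing` away from it, which filters out small wiggles.
    """
    cs = accumulate(b - level for b in bits)
    regions = []
    rising = False                  # True while inside a dense region
    start = None
    ext_val, ext_idx = 0.0, -1      # running extreme; the sum is 0 before index 0
    for i, v in enumerate(cs):
        if not rising:
            if v < ext_val:                     # new running minimum
                ext_val, ext_idx = v, i
            elif v - ext_val >= min_swing:      # climbed far enough: minimum confirmed
                start = ext_idx + 1             # region begins right after the minimum
                rising = True
                ext_val, ext_idx = v, i
        else:
            if v > ext_val:                     # new running maximum
                ext_val, ext_idx = v, i
            elif ext_val - v >= min_swing:      # dropped far enough: maximum confirmed
                regions.append((start, ext_idx))
                rising = False
                ext_val, ext_idx = v, i
    if rising:                                  # data ended inside a dense region
        regions.append((start, ext_idx))
    return regions

# Made-up data: dense at indexes 100..149 and 250..309
bits = [0] * 100 + [1] * 50 + [0] * 100 + [1] * 60 + [0] * 40
print(dense_regions(bits))  # → [(100, 149), (250, 309)]
```

A larger `min_swing` ignores more noise but misses short regions. If the dense areas contain fewer than half 1s, `level` would have to be lowered to something between the sparse and the dense density, since the curve only rises where the local fraction of 1s exceeds `level`.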
Source: https://stackoverflow.com/questions/56130596/extracting-boundaries-of-dense-regions-of-1s-in-a-huge-list-of-1s-and-0s