问题
Update 5:
This feature has been released as part of pandas 20.1 (on my birthday :] )
Update 4:
PR has been merged!
Update 3:
The PR has moved here
Update 2:
It seems like this question may have contributed to re-opening the PR for IntervalIndex in pandas.
Update:
I no longer have this problem, since I'm actually now querying for overlapping ranges from A
and B
, not points from B
which fall within ranges in A
, which is a full interval tree problem. I won't delete the question though, because I think it's still a valid question, and I don't have a good answer.
Problem statement
I have two dataframes.
In dataframe A
, two of the integer columns taken together represent an interval.
In dataframe B
, one integer column represents a position.
I'd like to do a sort of join, such that points are assigned to each interval they fall within.
Intervals are rarely but occasionally overlapping. If a point falls within that overlap, it should be assigned to both intervals. About half of points won't fall within an interval, but nearly every interval will have at least one point within its range.
What I've been thinking
I was initially going to dump my data out of pandas, and use intervaltree or banyan or maybe bx-python but then I came across this gist. It turns out that the ideas shoyer has in there never made it into pandas, but it got me thinking -- it might be possible to do this within pandas, and since I want this code to be as fast as python can possibly go, I'd rather not dump my data out of pandas until the very end. I also get the feeling that this is possible with bins
and pandas cut function, but I'm a total newbie to pandas, so I could use some guidance! Thanks!
Notes
Potentially related? Pandas DataFrame groupby overlapping intervals of variable length
回答1:
This feature is was released as part of pandas 20.1
回答2:
Answer using pyranges, which is basically pandas sprinkled with bioinformatics sugar.
Setup:
import numpy as np
np.random.seed(0)
import pyranges as pr
a = pr.random(int(1e6))
# +--------------+-----------+-----------+--------------+
# | Chromosome | Start | End | Strand |
# | (category) | (int32) | (int32) | (category) |
# |--------------+-----------+-----------+--------------|
# | chr1 | 8830650 | 8830750 | + |
# | chr1 | 9564361 | 9564461 | + |
# | chr1 | 44977425 | 44977525 | + |
# | chr1 | 239741543 | 239741643 | + |
# | ... | ... | ... | ... |
# | chrY | 29437476 | 29437576 | - |
# | chrY | 49995298 | 49995398 | - |
# | chrY | 50840129 | 50840229 | - |
# | chrY | 38069647 | 38069747 | - |
# +--------------+-----------+-----------+--------------+
# Stranded PyRanges object has 1,000,000 rows and 4 columns from 25 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
b = pr.random(int(1e6), length=1)
# +--------------+-----------+-----------+--------------+
# | Chromosome | Start | End | Strand |
# | (category) | (int32) | (int32) | (category) |
# |--------------+-----------+-----------+--------------|
# | chr1 | 52110394 | 52110395 | + |
# | chr1 | 122640219 | 122640220 | + |
# | chr1 | 162690565 | 162690566 | + |
# | chr1 | 117198743 | 117198744 | + |
# | ... | ... | ... | ... |
# | chrY | 45169886 | 45169887 | - |
# | chrY | 38863683 | 38863684 | - |
# | chrY | 28592193 | 28592194 | - |
# | chrY | 29441949 | 29441950 | - |
# +--------------+-----------+-----------+--------------+
# Stranded PyRanges object has 1,000,000 rows and 4 columns from 25 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
Execution:
result = a.join(b, strandedness="same")
# +--------------+-----------+-----------+--------------+-----------+-----------+--------------+
# | Chromosome | Start | End | Strand | Start_b | End_b | Strand_b |
# | (category) | (int32) | (int32) | (category) | (int32) | (int32) | (category) |
# |--------------+-----------+-----------+--------------+-----------+-----------+--------------|
# | chr1 | 227348436 | 227348536 | + | 227348516 | 227348517 | + |
# | chr1 | 18901135 | 18901235 | + | 18901191 | 18901192 | + |
# | chr1 | 230131576 | 230131676 | + | 230131636 | 230131637 | + |
# | chr1 | 84829850 | 84829950 | + | 84829903 | 84829904 | + |
# | ... | ... | ... | ... | ... | ... | ... |
# | chrY | 44139791 | 44139891 | - | 44139821 | 44139822 | - |
# | chrY | 51689785 | 51689885 | - | 51689859 | 51689860 | - |
# | chrY | 45379140 | 45379240 | - | 45379215 | 45379216 | - |
# | chrY | 37469479 | 37469579 | - | 37469576 | 37469577 | - |
# +--------------+-----------+-----------+--------------+-----------+-----------+--------------+
# Stranded PyRanges object has 16,153 rows and 7 columns from 24 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
df = result.df
# Chromosome Start End Strand Start_b End_b Strand_b
# 0 chr1 227348436 227348536 + 227348516 227348517 +
# 1 chr1 18901135 18901235 + 18901191 18901192 +
# 2 chr1 230131576 230131676 + 230131636 230131637 +
# 3 chr1 84829850 84829950 + 84829903 84829904 +
# 4 chr1 189088140 189088240 + 189088163 189088164 +
# ... ... ... ... ... ... ... ...
# 16148 chrY 38968068 38968168 - 38968124 38968125 -
# 16149 chrY 44139791 44139891 - 44139821 44139822 -
# 16150 chrY 51689785 51689885 - 51689859 51689860 -
# 16151 chrY 45379140 45379240 - 45379215 45379216 -
# 16152 chrY 37469479 37469579 - 37469576 37469577 -
#
# [16153 rows x 7 columns]
来源:https://stackoverflow.com/questions/41983874/interval-intersection-in-pandas