Question
I solved the following problem in bash, but I feel it's quite inefficient and very slow given the size of the files I need to reduce. I was hoping somebody has an idea how to do the same in Python and hopefully speed things up.
The original problem was to reduce very large text files (50-60 million lines, tab-delimited columns). One of the columns is treated as a key, i.e. we determine how many lines with a unique key are in the file and then randomly select a percentage of them (for example, a quarter of the total number if reducing by 75%) to append to a new file that will keep our results. We continue through the rest of the keys, randomizing and then reducing all lines containing each unique key by the same percentage. In case the reduction can't be done, we simply carry all the lines over to the resulting file.
As I said, my bash script works quite well, but it is slow and strings together various awk and grep constructs. By all accounts, Python should handle this in a much more elegant way and without compromising memory too much (again, we are dealing with 50+ million line files in this case). Any suggestions/tricks would be helpful! Thanks!
Answer 1:
The simple solution would be to sort the file by the key column, e.g., sort tab-separated input by the second column:
#!/bin/bash
printf "a\tz\nb\ty\nc\tx" | sort -k 2 -t $'\t'
Then solve the simpler problem of retrieving 25% of random lines for each unique key, where all lines with equal keys are adjacent, with the constraint that at least one line for each unique key is preserved:
#!/usr/bin/env python
import random
import sys
from itertools import chain, groupby

def choose_random(iterator, fraction, random=random.random):
    """Lazy analog of:

        L = list(iterator)
        k = int(len(L) * fraction + .5) or 1  # get at least one
        result = random.sample(L, k)

    Note: this function doesn't randomize the order of elements;
          that would require keeping the selected elements in memory,
          and the number of output elements is not exactly k.
    """
    # always yield at least one item if input is not empty
    item = next(iterator)
    it = (x for x in chain([item], iterator) if random() < fraction)
    for x in chain([next(it, item)], it):
        yield x

def getkey(line):
    return line.split("\t")[1]  # 2nd column

for key, group in groupby(sys.stdin, key=getkey):
    sys.stdout.writelines(choose_random(group, fraction=0.25))
Note: the last line in the input file should end with a newline; otherwise the output is corrupted if that line is chosen.
The script accepts input sorted by the key column on stdin and prints the reduced output to stdout. It needs to hold only one line in memory at a time and makes a single pass over the data (O(n)).
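To run the whole pipeline, the sorted file is piped into the script, along the lines of sort -k 2 -t $'\t' input.tsv | python reduce.py > output.tsv (file names assumed). And here is a small, hypothetical demonstration of choose_random on a made-up group, assuming the function above is in scope:

# four made-up lines sharing the key "k1" in the 2nd column
lines = ["a\tk1\n", "b\tk1\n", "c\tk1\n", "d\tk1\n"]
picked = list(choose_random(iter(lines), fraction=0.25))
# roughly a quarter of the lines pass the random filter, but the
# chain([next(it, item)], it) trick guarantees at least one survivor
assert 1 <= len(picked) <= len(lines)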
Answer 2:
Because your problem is vague, I will give a high-level solution:
- Do not read the entire file into memory with fileObj.read() or fileObj.readlines(); rather, iterate through the file with for line in fileObj. Why? This is memory friendly.
- Create your own implementation of a Queue based on a list:
from collections.abc import Iterable
from queue import Full  # reuse the stdlib "queue is full" exception

class Queue(object):
    def __init__(self, max_size):
        self.queue = []
        self.max_size = max_size

    def __getitem__(self, index):
        if 0 <= index < self.max_size:
            return self.queue[index]
        else:
            raise IndexError

    def __iter__(self):
        return iter(self.queue)

    def push(self, seq):
        # treat strings as single items, not as iterables of characters
        if isinstance(seq, Iterable) and not isinstance(seq, str):
            if len(self.queue) + len(seq) > self.max_size:
                raise Full
            self.queue.extend(seq)
        else:
            if len(self.queue) + 1 > self.max_size:
                raise Full
            self.queue.append(seq)

    def pop(self):
        if self.queue:
            return self.queue.pop(0)
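A quick hypothetical usage of this Queue (the values are placeholders):

q = Queue(max_size=4)
q.push("line1\n")                    # single item: appended
q.push(["line2\n", "line3\n"])       # list: extends the queue
try:
    q.push(["line4\n", "line5\n"])   # 3 + 2 > 4: raises Full
except Full:
    print("queue full, time to sample")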
Create a dictionary of Queues with max_size = 2 * the percentage of selected items.
Something like:
PCT_SELECTED = 100
MAXSIZE = 2 * PCT_SELECTED
KEY_START = 10
KEY_STOP = 15

import random
from collections import defaultdict

# defaultdict needs a callable factory, not a Queue instance
queue_dict = defaultdict(lambda: Queue(MAXSIZE))
- Put elements in the Queue in a non-blocking fashion.
- If the Queue is full, it will raise the Full exception, in which case you randomly select 50% of the elements from the Queue and discard the rest.
Something like:
with open("your-file") as fin:
    for line in fin:
        key = line[KEY_START:KEY_STOP]
        try:
            queue_dict[key].push(line)
        except Full:
            # keep a random sample, rebuild the Queue, and retry the line
            # (random.sample returns a plain list, so a new Queue is needed)
            survivors = random.sample(list(queue_dict[key]), PCT_SELECTED)
            queue_dict[key] = Queue(MAXSIZE)
            queue_dict[key].push(survivors)
            queue_dict[key].push(line)
Finally, iterate through the dictionary and trim each Queue randomly:
queue_dict = {key: random.sample(list(value), min(PCT_SELECTED, len(value.queue)))
              for key, value in queue_dict.items()}
Now you can iterate through the dictionary and write the lines to a file.
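A minimal sketch of that last step, with the output file name assumed:

with open("reduced-file", "w") as fout:
    for lines in queue_dict.values():
        fout.writelines(lines)  # each value is a list of surviving lines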
Answer 3:
For a large number of items, just selecting 75% can be done by checking a random number for each one:
import random
import sys

with open('input') as f:
    for line in f:
        if random.random() < 0.75:
            sys.stdout.write(line)  # line already ends with '\n'
And if you need to guarantee at least one item from each key (even if it only has two lines):
import random
import sys

keys = set()
with open('input') as f:
    for line in f:
        columns = line.split('\t')
        key = columns[0]
        if key not in keys:
            sys.stdout.write(line)  # always keep the first line of a new key
            keys.add(key)
            continue
        if random.random() < 0.75:
            sys.stdout.write(line)
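Note that this only approximates the target fraction per key. If an exact per-key count matters, a hedged two-pass sketch (the file names, the first column as key, and the keep-25% figure are assumptions) could count lines per key first and then decide exactly which occurrences survive:

import random
from collections import defaultdict

counts = defaultdict(int)
with open('input') as f:                      # pass 1: count lines per key
    for line in f:
        counts[line.split('\t')[0]] += 1

# for each key, pick exactly which occurrence indices to keep (at least one);
# the keep sets hold one integer per surviving line, which is the memory trade-off
keep = {k: set(random.sample(range(n), max(1, int(n * 0.25 + 0.5))))
        for k, n in counts.items()}

seen = defaultdict(int)
with open('input') as f, open('output', 'w') as out:  # pass 2: emit survivors
    for line in f:
        key = line.split('\t')[0]
        if seen[key] in keep[key]:
            out.write(line)
        seen[key] += 1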
Source: https://stackoverflow.com/questions/15109251/text-file-reduction-with-randomization-in-python