问题
Let us say I have the following string:
"my ., .,dog. .jumps. , .and..he. .,is., .a. very .,good, .dog"
1234567890123456789012345678901234567890123456789012345678901 <-- char pos
Now, I have written a regular expression to remove certain elements from the string above, in this example, all whitespace, all periods, and all commas.
I am left with the following transformed string:
"mydogjumpsandheisaverygooddog"
Now, I want to construct k-grams of this string. Let us say I were to take 5-grams of the above string, it would look like:
mydog ydogj dogju ogjum gjump jumps umpsa ...
The problem I have is that for each k-gram, I want to keep track of its original character position in the first source text I listed.
So, "mydog", would have a start position of "0" and an end position of "11". However, I have no mapping between the source text and the modified text. So, I have no idea where a particular k-gram starts and ends in relation to the original, unmodified text. This is important to my program to keep track of.
I am creating a list of k-grams like this:
public class Kgram
{
public int start;
public int end;
public int text;
}
where start
and end
are positions in the source text (top) and the text is that of the k-gram text after the modifications.
Can anyone point me in the right direction for the best way to solve this problem?
回答1:
Don't use a regular expression 'replace' API to do your replacing. Only use regexps to find the places you want to modify, do the mod yourself, and maintain an offset mapping. One form I've used is an array of ints as big as the original string, storing 'n chars deleted' here values, but there are a host of other possibilities.
The basic data structure here is an array of pairs. Each pair contains an offset and a correction. Depending on time/space tradeoffs, you may prefer to spread the information out over a data structure as large as the original string.
回答2:
Here's how I would solve this problem in Haskell:
kgramify k string =
let charsWithPos = zip string [1..] -- attach original position to each char
goodCWP = filter (not o isWhitePeriodOrComma o fst) charsWithPos -- drop nasty chars
groups = takeEveryK k goodCWP -- clump remaining chars in groups of size k
posnOfGroup g = (snd (head g), map fst g) -- position of first char with group
in map posnOfGroup groups
In informal English:
- Tag each character with its position
- Filter out uninteresting (character, position) pairs
- Take the remaining list of pairs and group them into a list of lists of length
k
- For each inner list, take the position of the first character, and pair it with the list of the all the characters (with the positions dropped)
In any functional language like Clean, Haskell, ML, or Scheme, this kind of thing is very easy. In a language with explicit memory allocation (new
) or even worse, malloc
and free
, such a solution would be very tedious.
回答3:
A C solution, to show that as Norman Ramsey says, it's pretty tedious. It takes the filter as a callback with context, just for kicks, but in your case you could pass 0
as the filterdata and not_wspc
as the filter:
int not_wspc(void *, char c) {
if isspace((unsigned char)c) return 0;
if ((c == '.') || (c == ',')) return 0;
return 1;
}
typedef struct {
char c;
int pos;
} charwithpos;
KGram *foo(const char *input, int (*filter)(void *,char), void *filterdata) {
size_t len = strlen(input);
charwithpos *filtered = malloc(len * sizeof(*filtered));
assert(filtered);
// combine Norman's zip and filter steps
charwithpos *current = filtered
for (size_t i = 0; i < len; ++i) {
if (filter(filterdata, input[i])) {
current->c = input[i];
current->pos = i;
++current;
}
}
size_t shortlen = (current - filtered);
// wouldn't normally recommend returning malloced data, but
// illustrates the point.
KGram *result = malloc((shortlen / 5 + 1) * sizeof(*result));
assert(result);
// take each 5 step
KGram *currentgram = result;
current = filtered;
for (size_t i = 0; i < shortlen; ++i) {
currentgram->text[i%5] = current->c;
if ((i % 5) == 0) {
currentgram->start = current->pos;
} else if ((i % 5) == 4) {
currentgram->end = current->pos;
++currentgram;
}
++current;
}
if (shortlen % 5) != 0 {
currentgram->end = filtered[shortlen-1].pos;
currentgram->text[shortlen%5] = 0;
}
free(filtered);
return(result);
}
Or something like that, I can't be actually compiling and testing it. Obviously this has the significant weakness that filtered
sees the chars one at a time, which means it cannot apply backtracking algorithms. You could get around it by passing the whole string into the filter, so that if necessary it can do a lot of work on the first call, and store the results to return on all the rest of the calls. But if you need to apply regular-expression-like logic to arbitrary types, then C is probably not the right language to use.
Here's the beginnings of a C++ solution, without even using <functional>
. Not sure what Norman saying about languages with new
: just because the language has it doesn't mean you have to use it ;-)
template <typename OutputIterator>
struct KGramOutput {
OutputIterator dest;
KGram kgram;
KGramOutput(OutputIterator dest) : dest(dest) {}
void add(char, size_t);
void flush(void);
};
template <typename InputIterator, typename OutputIterator, typename Filter>
void foo(InputIterator first, InputIterator last, OutputIterator dest, Filter filter) {
size_t i = 0;
KGramOutput<OutputIterator> kgram(dest);
while (first != last) {
if (filter(*first)) kgram.add(*first, i);
++first;
++i;
}
kgram.flush();
}
The add
and flush
functions are a bit tedious, they have to bundle up 5 pairs into a KGram struct, and then do *dest++ = kgram
. The user could pass for example a pushback_iterator
over a vector<KGram>
as the output iterator. Btw the '5' and the 'char' could be template parameters too.
回答4:
This can be done in a single pass without needing to construct intermediate character-position pairs:
(defclass k-gram ()
((start :reader start :initarg :start)
(end :accessor end)
(text :accessor text)))
(defmethod initialize-instance :after ((k-gram k-gram) &rest initargs &key k)
(declare (ignorable initargs))
(setf (slot-value k-gram 'text) (make-array k :element-type 'character)))
(defun k-gramify (string k ignore-string)
"Builds the list of complete k-grams with positions from the original
text, but with all characters in ignore-string ignored."
(loop
for character across string
for position upfrom 0
with k-grams = ()
do (unless (find character ignore-string)
(push (make-instance 'k-gram :k k :start position) k-grams)
(loop
for k-gram in k-grams
for i upfrom 0 below k
do (setf (aref (text k-gram) i) character
(end k-gram) (1+ position))))
finally (return (nreverse (nthcdr (- k 1) k-grams)))))
来源:https://stackoverflow.com/questions/2303924/how-can-i-keep-track-of-character-positions-after-i-remove-elements-from-a-strin