Question
I am trying to convert numerical values written as words into integers. For example, "iPhone has two hundred and thirty thousand seven hundred and eighty three apps" would become "iPhone has 230783 apps".
Before I start coding, I would like to know whether any function or code already exists for this conversion.
Answer 1:
There are lots of pages discussing the conversion from numbers to words. Not so many for the reverse direction. The best I could find was some pseudo-code on Ask Yahoo. See http://answers.yahoo.com/question/index?qid=20090216103754AAONnDz for a nice algorithm:
Well, overall you are doing two things: finding tokens (words that translate to numbers) and applying grammar. In short, you are building a parser for a very limited language.
The tokens you would need are:
POWER: thousand, million, billion
HUNDRED: hundred
TEN: twenty, thirty, ... ninety
UNIT: one, two, three, ... nine
SPECIAL: ten, eleven, twelve, ... nineteen
(Drop any "and"s, as they are meaningless. Break hyphenated words into two tokens; that is, "sixty-five" should be processed as "sixty" "five".)
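The token classes above could be sketched in PHP roughly as follows. The function name `classifyNumberWords()` and the `[class, word]` return shape are assumptions made for illustration, not part of the quoted algorithm.

```php
<?php
// Hypothetical helper: split a phrase into [class, word] pairs using the
// token classes listed above.
function classifyNumberWords(string $phrase): array
{
    $classes = [
        'POWER'   => ['thousand', 'million', 'billion'],
        'HUNDRED' => ['hundred'],
        'TEN'     => ['twenty', 'thirty', 'forty', 'fifty', 'sixty',
                      'seventy', 'eighty', 'ninety'],
        'UNIT'    => ['one', 'two', 'three', 'four', 'five', 'six',
                      'seven', 'eight', 'nine'],
        'SPECIAL' => ['ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
                      'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen'],
    ];

    // Invert the table so that each word maps to its class.
    $lookup = [];
    foreach ($classes as $class => $words) {
        foreach ($words as $word) {
            $lookup[$word] = $class;
        }
    }

    // Splitting on whitespace, commas and hyphens turns "sixty-five" into two tokens.
    $words = preg_split('/[\s,-]+/', strtolower(trim($phrase)), -1, PREG_SPLIT_NO_EMPTY);

    $tokens = [];
    foreach ($words as $word) {
        if ($word === 'and') {
            continue; // "and" carries no value, so it is dropped
        }
        if (!isset($lookup[$word])) {
            throw new InvalidArgumentException("Not a number word: $word");
        }
        $tokens[] = [$lookup[$word], $word];
    }
    return $tokens;
}

print_r(classifyNumberWords('two hundred and sixty-five thousand'));
```

Rejecting unknown words with an exception is one possible design choice; a caller that scans free text would more likely skip them.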
Once you've tokenized your string, move from RIGHT TO LEFT.
1. Grab all the tokens from the RIGHT until you hit a POWER or reach the start of the string.
2. Parse the tokens after the stop point for these patterns:
SPECIAL
TEN
UNIT
TEN UNIT
UNIT HUNDRED
UNIT HUNDRED SPECIAL
UNIT HUNDRED TEN
UNIT HUNDRED UNIT
UNIT HUNDRED TEN UNIT
(This assumes that "seventeen hundred" is not allowed in this grammar.)
This gives you the last three digits of your number.
3. If you stopped at the start of the string, you are done.
4. If you stopped at a POWER, start again at step 1 until you reach a higher POWER or the start of the string.
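The right-to-left procedure above could be sketched in PHP along these lines. This is only one reading of the pseudo-code: the name `wordsToInt()`, the exception-based error handling, and the assumption of 64-bit integers are mine, not part of the quoted answer.

```php
<?php
// Sketch of the right-to-left grouping described above (hypothetical names).
function wordsToInt(string $phrase): int
{
    $unit    = ['one' => 1, 'two' => 2, 'three' => 3, 'four' => 4, 'five' => 5,
                'six' => 6, 'seven' => 7, 'eight' => 8, 'nine' => 9];
    $special = ['ten' => 10, 'eleven' => 11, 'twelve' => 12, 'thirteen' => 13,
                'fourteen' => 14, 'fifteen' => 15, 'sixteen' => 16,
                'seventeen' => 17, 'eighteen' => 18, 'nineteen' => 19];
    $ten     = ['twenty' => 20, 'thirty' => 30, 'forty' => 40, 'fifty' => 50,
                'sixty' => 60, 'seventy' => 70, 'eighty' => 80, 'ninety' => 90];
    $power   = ['thousand' => 1000, 'million' => 1000000, 'billion' => 1000000000];

    // Value (1..999) of one group, i.e. the words between two POWER tokens.
    $parseGroup = function (array $words) use ($unit, $special, $ten): int {
        if (!$words) {
            throw new InvalidArgumentException('Empty group');
        }
        $value = 0;
        // Optional "UNIT HUNDRED" prefix.
        if (count($words) >= 2 && isset($unit[$words[0]]) && $words[1] === 'hundred') {
            $value = 100 * $unit[$words[0]];
            $words = array_slice($words, 2);
        }
        // The rest must be empty, SPECIAL, TEN, UNIT, or TEN UNIT.
        if (count($words) === 1 && isset($special[$words[0]])) {
            return $value + $special[$words[0]];
        }
        if ($words && isset($ten[$words[0]])) {
            $value += $ten[array_shift($words)];
        }
        if ($words && isset($unit[$words[0]])) {
            $value += $unit[array_shift($words)];
        }
        if ($words) {
            throw new InvalidArgumentException('Unexpected: ' . implode(' ', $words));
        }
        return $value;
    };

    // Tokenize: split on spaces, commas and hyphens, then drop every "and".
    $words = preg_split('/[\s,-]+/', strtolower(trim($phrase)), -1, PREG_SPLIT_NO_EMPTY);
    $words = array_values(array_diff($words, ['and']));

    $total = 0;
    $multiplier = 1; // power that applies to the group being collected
    $group = [];
    // Walk from right to left, closing a group each time a POWER is met.
    for ($i = count($words) - 1; $i >= 0; $i--) {
        $word = $words[$i];
        if (!isset($power[$word])) {
            array_unshift($group, $word);
            continue;
        }
        if ($power[$word] <= $multiplier) {
            throw new InvalidArgumentException("Powers must grow leftwards: $word");
        }
        if ($group) {
            $total += $multiplier * $parseGroup($group);
            $group = [];
        } elseif ($multiplier > 1) {
            throw new InvalidArgumentException("No number before a power near: $word");
        }
        $multiplier = $power[$word];
    }
    // The leftmost group has no POWER to its left.
    return $total + $multiplier * $parseGroup($group);
}

echo wordsToInt('two hundred and thirty thousand seven hundred and eighty three'), "\n"; // 230783
```

Because each group is checked against the nine patterns, "seventeen hundred" raises an exception here rather than producing 1700.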
Answer 2:
Old question, but for anyone else coming across this: I had to write a solution for this today. The following takes a vaguely similar approach to the algorithm described by John Kugelman, but doesn't apply as strict a grammar; as such, it will permit some weird orderings, e.g. "one hundred thousand and one million" will still produce the same as "one million and one hundred thousand" (1,100,000). Invalid bits (e.g. misspelled numbers) will be ignored, so consider the output on invalid strings to be undefined.
Following user132513's comment on joebert's answer, I used PEAR's Numbers_Words to generate test series. The following code scored 100% on numbers between 0 and 5,000,000, then 100% on a random sample of 100,000 numbers between 0 and 10,000,000 (it takes too long to run over the whole 10 million series).
/**
 * Convert a string such as "one hundred thousand" to 100000.00.
 *
 * @param string $data The numeric string.
 *
 * @return float The numeric value.
 */
function wordsToNumber($data) {
    // Replace all number words with an equivalent numeric value
    $data = strtr(
        $data,
        array(
            'zero' => '0',
            'a' => '1',
            'one' => '1',
            'two' => '2',
            'three' => '3',
            'four' => '4',
            'five' => '5',
            'six' => '6',
            'seven' => '7',
            'eight' => '8',
            'nine' => '9',
            'ten' => '10',
            'eleven' => '11',
            'twelve' => '12',
            'thirteen' => '13',
            'fourteen' => '14',
            'fifteen' => '15',
            'sixteen' => '16',
            'seventeen' => '17',
            'eighteen' => '18',
            'nineteen' => '19',
            'twenty' => '20',
            'thirty' => '30',
            'forty' => '40',
            'fourty' => '40', // common misspelling
            'fifty' => '50',
            'sixty' => '60',
            'seventy' => '70',
            'eighty' => '80',
            'ninety' => '90',
            'hundred' => '100',
            'thousand' => '1000',
            'million' => '1000000',
            'billion' => '1000000000',
            'and' => '',
        )
    );
    // Coerce all tokens to numbers
    $parts = array_map(
        function ($val) {
            return floatval($val);
        },
        preg_split('/[\s-]+/', $data)
    );
    $stack = new SplStack; // Current work stack
    $sum = 0; // Running total
    $last = null;
    foreach ($parts as $part) {
        if (!$stack->isEmpty()) {
            // We're part way through a phrase
            if ($stack->top() > $part) {
                // Decreasing step, e.g. from hundreds to ones
                if ($last >= 1000) {
                    // If we drop from more than 1000 then we've finished the phrase
                    $sum += $stack->pop();
                    // This is the first element of a new phrase
                    $stack->push($part);
                } else {
                    // Drop down from less than 1000, just addition
                    // e.g. "seventy one" -> "70 1" -> "70 + 1"
                    $stack->push($stack->pop() + $part);
                }
            } else {
                // Increasing step, e.g. ones to hundreds
                $stack->push($stack->pop() * $part);
            }
        } else {
            // This is the first element of a new phrase
            $stack->push($part);
        }
        // Store the last processed part
        $last = $part;
    }
    return $sum + $stack->pop();
}
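To see why the function above tolerates odd orderings, it may help to trace its folding rule on tokens that are already numeric. The sketch below is a condensed restatement, not the author's code: `foldNumericTokens()` is a hypothetical name, and a plain variable stands in for the stack, which in the code above never holds more than one value at a time.

```php
<?php
// Condensed restatement of the multiply/add/bank rule, applied to numeric tokens.
function foldNumericTokens(array $parts): float
{
    $sum = 0.0;      // phrases that are already finished
    $current = null; // phrase being built (plays the role of the stack top)
    $last = null;    // previously processed token

    foreach ($parts as $part) {
        if ($current === null) {
            $current = $part;          // first element of a phrase
        } elseif ($current > $part) {
            if ($last >= 1000) {
                // Dropping below a power word: bank the phrase, start a new one.
                $sum += $current;
                $current = $part;
            } else {
                $current += $part;     // "seventy one" -> 70 + 1
            }
        } else {
            $current *= $part;         // "two hundred" -> 2 * 100
        }
        $last = $part;
    }
    return $sum + ($current ?? 0);
}

// "one hundred thousand and one million"
var_dump(foldNumericTokens([1, 100, 1000, 1, 1000000]));
// "one million and one hundred thousand"
var_dump(foldNumericTokens([1, 1000000, 1, 100, 1000]));
```

Both calls return 1100000: each power word closes a phrase as soon as a smaller token follows it, and finished phrases are simply summed, so their order does not matter.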
Answer 3:
I haven't tested this too extensively; I more or less just worked on it until I saw what I expected in the output. It seems to work, and it parses from left to right.
<?php
// Requires the BC Math extension (bcadd, bcmul, bccomp).
$str = 'twelve billion people know iPhone has two hundred and thirty thousand, seven hundred and eighty-three apps as well as over one million units sold';
// Sort callback: longest strings first
function strlen_sort($a, $b)
{
    if(strlen($a) > strlen($b))
    {
        return -1;
    }
    else if(strlen($a) < strlen($b))
    {
        return 1;
    }
    return 0;
}
$keys = array(
    'one' => '1', 'two' => '2', 'three' => '3', 'four' => '4', 'five' => '5', 'six' => '6', 'seven' => '7', 'eight' => '8', 'nine' => '9',
    'ten' => '10', 'eleven' => '11', 'twelve' => '12', 'thirteen' => '13', 'fourteen' => '14', 'fifteen' => '15', 'sixteen' => '16', 'seventeen' => '17', 'eighteen' => '18', 'nineteen' => '19',
    'twenty' => '20', 'thirty' => '30', 'forty' => '40', 'fifty' => '50', 'sixty' => '60', 'seventy' => '70', 'eighty' => '80', 'ninety' => '90',
    'hundred' => '100', 'thousand' => '1000', 'million' => '1000000', 'billion' => '1000000000'
);
// Find every run of number words joined by spaces, commas, hyphens or "and"
preg_match_all('#((?:^|and|,| |-)*(\b' . implode('\b|\b', array_keys($keys)) . '\b))+#i', $str, $tokens);
//print_r($tokens); exit;
$tokens = $tokens[0];
usort($tokens, 'strlen_sort');
foreach($tokens as $token)
{
    $token = trim(strtolower($token));
    // Match each number word on its own. The alternation is grouped so that
    // separators cannot attach to "one" and produce an unknown key such as "-one".
    preg_match_all('#\b(?:' . implode('|', array_keys($keys)) . ')\b#', $token, $words);
    $words = $words[0];
    //print_r($words);
    $num = '0'; $total = '0';
    foreach($words as $word)
    {
        $word = trim($word);
        $val = $keys[$word];
        //echo "$val\n";
        if(bccomp($val, '100') == -1)
        {
            // Below one hundred: add to the current group
            $num = bcadd($num, $val);
            continue;
        }
        else if(bccomp($val, '100') == 0)
        {
            // "hundred": multiply the current group
            $num = bcmul($num, $val);
            continue;
        }
        // thousand, million, billion: close the current group
        $num = bcmul($num, $val);
        $total = bcadd($total, $num);
        $num = '0';
    }
    $total = bcadd($total, $num);
    echo "$total:$token\n";
    $str = preg_replace("#\b$token\b#i", number_format($total), $str);
}
echo "\n$str\n";
?>
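As an alternative to sorting the matches and replacing them one by one, the same add/multiply rule can be applied inside `preg_replace_callback()`, so that each phrase is converted where it is found. This is a sketch under stated assumptions: `replaceNumberPhrases()` is a hypothetical name, plain 64-bit integers replace the BC Math calls, and the output is left unformatted instead of going through `number_format()`.

```php
<?php
// Hypothetical helper: replace every number phrase in $text with its digits.
function replaceNumberPhrases(string $text): string
{
    $values = [
        'one' => 1, 'two' => 2, 'three' => 3, 'four' => 4, 'five' => 5,
        'six' => 6, 'seven' => 7, 'eight' => 8, 'nine' => 9, 'ten' => 10,
        'eleven' => 11, 'twelve' => 12, 'thirteen' => 13, 'fourteen' => 14,
        'fifteen' => 15, 'sixteen' => 16, 'seventeen' => 17, 'eighteen' => 18,
        'nineteen' => 19, 'twenty' => 20, 'thirty' => 30, 'forty' => 40,
        'fifty' => 50, 'sixty' => 60, 'seventy' => 70, 'eighty' => 80,
        'ninety' => 90, 'hundred' => 100, 'thousand' => 1000,
        'million' => 1000000, 'billion' => 1000000000,
    ];
    $word = '\b(?:' . implode('|', array_keys($values)) . ')\b';
    // One phrase: number words joined by spaces, commas, hyphens and optional "and".
    $pattern = '#' . $word . '(?:[\s,-]+(?:and\s+)?' . $word . ')*#i';

    $result = preg_replace_callback($pattern, function (array $m) use ($values): string {
        $total = 0; // groups already closed by thousand/million/billion
        $num = 0;   // group in progress
        foreach (preg_split('/[\s,-]+/', strtolower($m[0])) as $w) {
            if ($w === 'and') {
                continue;
            }
            $val = $values[$w];
            if ($val < 100) {
                $num += $val;
            } elseif ($val === 100) {
                $num *= 100;
            } else {
                // A power word closes the current group.
                $total += $num * $val;
                $num = 0;
            }
        }
        return (string) ($total + $num);
    }, $text);

    return $result ?? $text; // preg_replace_callback() returns null on error
}

echo replaceNumberPhrases(
    'twelve billion people know iPhone has two hundred and thirty thousand, '
    . 'seven hundred and eighty-three apps as well as over one million units sold'
), "\n";
// 12000000000 people know iPhone has 230783 apps as well as over 1000000 units sold
```

The sketch shares the limits of the rule it borrows: a comma-separated list such as "one, two, three" is read as a single phrase, and a bare "hundred" with no leading number evaluates to 0.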
Answer 4:
This is a somewhat updated version of El Yobo's answer; the wordsToNumber function can now be run over (almost) any string containing numerals. See the test below.
<?php
class Converter {
    /**
     * Convert numerals to digits
     * @param string $input
     *
     * @return string
     */
    public static function wordsToNumber(string $input)
    {
        static $delims = " \-,.!?:;\\/&\(\)\[\]";
        static $tokens = [
            'zero' => ['val' => '0', 'power' => 1],
            'a' => ['val' => '1', 'power' => 1],
            'first' => ['val' => '1', 'suffix' => 'st', 'power' => 1],
            'one' => ['val' => '1', 'power' => 1],
            'second' => ['val' => '2', 'suffix' => 'nd', 'power' => 1],
            'two' => ['val' => '2', 'power' => 1],
            'third' => ['val' => '3', 'suffix' => 'rd', 'power' => 1],
            'three' => ['val' => '3', 'power' => 1],
            'fourth' => ['val' => '4', 'suffix' => 'th', 'power' => 1],
            'four' => ['val' => '4', 'power' => 1],
            'fifth' => ['val' => '5', 'suffix' => 'th', 'power' => 1],
            'five' => ['val' => '5', 'power' => 1],
            'sixth' => ['val' => '6', 'suffix' => 'th', 'power' => 1],
            'six' => ['val' => '6', 'power' => 1],
            'seventh' => ['val' => '7', 'suffix' => 'th', 'power' => 1],
            'seven' => ['val' => '7', 'power' => 1],
            'eighth' => ['val' => '8', 'suffix' => 'th', 'power' => 1],
            'eight' => ['val' => '8', 'power' => 1],
            'ninth' => ['val' => '9', 'suffix' => 'th', 'power' => 1],
            'nine' => ['val' => '9', 'power' => 1],
            'tenth' => ['val' => '10', 'suffix' => 'th', 'power' => 1],
            'ten' => ['val' => '10', 'power' => 10],
            'eleventh' => ['val' => '11', 'suffix' => 'th', 'power' => 10],
            'eleven' => ['val' => '11', 'power' => 10],
            'twelveth' => ['val' => '12', 'suffix' => 'th', 'power' => 10],
            'twelfth' => ['val' => '12', 'suffix' => 'th', 'power' => 10],
            'twelve' => ['val' => '12', 'power' => 10],
            'thirteenth' => ['val' => '13', 'suffix' => 'th', 'power' => 10],
            'thirteen' => ['val' => '13', 'power' => 10],
            'fourteenth' => ['val' => '14', 'suffix' => 'th', 'power' => 10],
            'fourteen' => ['val' => '14', 'power' => 10],
            'fifteenth' => ['val' => '15', 'suffix' => 'th', 'power' => 10],
            'fifteen' => ['val' => '15', 'power' => 10],
            'sixteenth' => ['val' => '16', 'suffix' => 'th', 'power' => 10],
            'sixteen' => ['val' => '16', 'power' => 10],
            'seventeenth' => ['val' => '17', 'suffix' => 'th', 'power' => 10],
            'seventeen' => ['val' => '17', 'power' => 10],
            'eighteenth' => ['val' => '18', 'suffix' => 'th', 'power' => 10],
            'eighteen' => ['val' => '18', 'power' => 10],
            'nineteenth' => ['val' => '19', 'suffix' => 'th', 'power' => 10],
            'nineteen' => ['val' => '19', 'power' => 10],
            'twentieth' => ['val' => '20', 'suffix' => 'th', 'power' => 10],
            'twenty' => ['val' => '20', 'power' => 10],
            'thirty' => ['val' => '30', 'power' => 10],
            'forty' => ['val' => '40', 'power' => 10],
            'fourty' => ['val' => '40', 'power' => 10], // common misspelling
            'fifty' => ['val' => '50', 'power' => 10],
            'sixty' => ['val' => '60', 'power' => 10],
            'seventy' => ['val' => '70', 'power' => 10],
            'eighty' => ['val' => '80', 'power' => 10],
            'ninety' => ['val' => '90', 'power' => 10],
            'hundred' => ['val' => '100', 'power' => 100],
            'thousand' => ['val' => '1000', 'power' => 1000],
            'million' => ['val' => '1000000', 'power' => 1000000],
            'billion' => ['val' => '1000000000', 'power' => 1000000000],
            'and' => ['val' => '', 'power' => null],
            '-' => ['val' => '', 'power' => null],
        ];
        $powers = array_column($tokens, 'power', 'val');
        $mutate = function ($parts) use (&$mutate, $powers){
            $stack = new \SplStack;
            $sum = 0;
            $last = null;
            foreach ($parts as $idx => $arr) {
                $part = $arr['val'];
                if (!$stack->isEmpty()) {
                    $check = $last ?? $part;
                    // Note: the right-hand operand of ?? is never evaluated here,
                    // because the left-hand side is always a boolean.
                    if ((float)$stack->top() < 20 && (float)$part < 20 ?? (float)$part < $stack->top() ) { // skip special numerals
                        return $stack->top().(isset($parts[$idx - $stack->count()]['suffix']) ? $parts[$idx - $stack->count()]['suffix'] : '')." ".$mutate(array_slice($parts, $idx));
                    }
                    if (isset($powers[$check]) && $powers[$check] <= $arr['power'] && $arr['power'] <= 10) { // but add powers (hundreds, thousands, millions, etc.)
                        return $stack->top().(isset($parts[$idx - $stack->count()]['suffix']) ? $parts[$idx - $stack->count()]['suffix'] : '')." ".$mutate(array_slice($parts, $idx));
                    }
                    if ($stack->top() > $part) {
                        if ($last >= 1000) {
                            $sum += $stack->pop();
                            $stack->push($part);
                        } else {
                            // twenty one -> "20 1" -> "20 + 1"
                            $stack->push($stack->pop() + (float) $part);
                        }
                    } else {
                        $stack->push($stack->pop() * (float) $part);
                    }
                } else {
                    $stack->push($part);
                }
                $last = $part;
            }
            return $sum + $stack->pop();
        };
        $prepared = preg_split('/(['.$delims.'])/', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
        // Replace words with tokens
        foreach ($prepared as $idx => $word) {
            if (is_array($word)) {continue;}
            $maybeNumPart = trim(strtolower($word));
            if (isset($tokens[$maybeNumPart])) {
                $item = $tokens[$maybeNumPart];
                if (isset($prepared[$idx+1])) {
                    $maybeDelim = $prepared[$idx+1];
                    if ($maybeDelim === " ") {
                        $item['delim'] = $maybeDelim;
                        unset($prepared[$idx + 1]);
                    } elseif ($item['power'] == null && !isset($tokens[$maybeDelim])) {
                        continue;
                    }
                }
                $prepared[$idx] = $item;
            }
        }
        $result = [];
        $accumulator = [];
        $getNumeral = function () use ($mutate, &$accumulator, &$result) {
            $last = end($accumulator);
            $result[] = $mutate($accumulator).(isset($last['suffix']) ? $last['suffix'] : '').(isset($last['delim']) ? $last['delim'] : '');
            $accumulator = [];
        };
        foreach ($prepared as $part) {
            if (is_array($part)) {
                $accumulator[] = $part;
            } else {
                if (!empty($accumulator)) {
                    $getNumeral();
                }
                $result[] = $part;
            }
        }
        if (!empty($accumulator)) {
            $getNumeral();
        }
        return implode('', array_filter($result));
    }
}
$testStrings = [
    'thirty thirty eighty one one eighty' => '30 30 81 1 80',
    'twenty twenty' => '20 20',
    'twelfth eleventh tenth' => '12th 11th 10th',
    'ten eleven twelve' => '10 11 12',
    'one two five zero' => '1 2 5 0',
    'One First Two' => '1 1st 2',
    'One First Two Second Three Third Four Fourth Five Fifth Six Sixth Seven' => '1 1st 2 2nd 3 3rd 4 4th 5 5th 6 6th 7',
    'Bus number fifteen from bus stop number Eighty three thousand one hundred thirty nine' => 'Bus number 15 from bus stop number 83139',
    'get the fifteenth cookie from fifth jar on second left shelf' => 'get the 15th cookie from 5th jar on 2nd left shelf',
    'One hundred million monkeys could not write second Macbeth' => '100000000 monkeys could not write 2nd Macbeth',
    'Taganskaya str. thirty two, three hundred fifty six' => 'Taganskaya str. 32, 356',
    'Lenina str 56/17 b. one hundred seven' => 'Lenina str 56/17 b. 107',
    'Paris & Hilton road, twenty two, house 356' => 'Paris & Hilton road, 22, house 356',
    'Wien, Wilhelmstraße zwei hundert sieben und dreißig' => 'Wien, Wilhelmstraße zwei hundert sieben und dreißig',
    'Vienna, Wilhelmstrasse two hundred and thirty seven' => 'Vienna, Wilhelmstrasse 237',
];
$converter = new Converter();
foreach ($testStrings as $input => $expected) {
    $output = $converter::wordsToNumber($input);
    echo $input."\t=>\t".$output."\n";
    if ($output != $expected) { die("words to number conversion failed!");}
}
Answer 5:
The PEAR Numbers_Words package is probably a good start: http://pear.php.net/package-info.php?package=Numbers_Words
Source: https://stackoverflow.com/questions/1077600/converting-words-to-numbers-in-php