Extract words from string with preg_match_all

你的背包 2020-12-21 17:14

I'm not good with regex, but I want to use it to extract words from a string.

The words I need should have a minimum of 4 characters, and the provided string can

7 Answers
  • 2020-12-21 17:48

    You can use the regex below for simple strings. It matches any run of at least 4 non-whitespace characters.

    preg_match_all('/(\S{4,})/i', $str, $m);
    

    Now $m[1] contains the array you want.

    Update:

    As Gordon said, the pattern will also match the '(20-40)'. The unwanted numbers can be removed using this regex:

    preg_match_all('/(\pL{4,})/iu', $str, $m);
    

    But I think it only works if PCRE is compiled with UTF-8 support. See the question "PHP PCRE (regex) doesn't support UTF-8?". It works on my computer, though.
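As a rough illustration of the `\pL` approach above, here is a minimal sketch run against the sample sentence used elsewhere on this page. It assumes the source file is saved as UTF-8 and that PCRE has Unicode support; the variable names are only illustrative.

```php
<?php
// Sample sentence taken from the other answers on this page.
$str = 'Sus azahares presentan gruesos pétalos blancos teñidos de rosa o violáceo en la parte externa, con numerosos estambres (20-40).';

// \pL matches any Unicode letter; the u modifier makes PCRE treat the
// subject as UTF-8, so an accented letter counts as one character.
$count = preg_match_all('/\pL{4,}/u', $str, $m);

// preg_match_all() returns false on failure (for example invalid UTF-8).
$words = ($count === false) ? [] : $m[0];

print_r($words);
```

Because the pattern only accepts letters, the trailing comma after "externa" and the "(20-40)" part are left out of the result.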

  • 2020-12-21 17:52
    $string = 'Sus azahares presentan gruesos pétalos blancos teñidos de rosa o violáceo en la parte externa, con numerosos estambres';
    
    $words = explode(' ', $string);
    echo $words[0];
    echo $words[1];
    

    and so on

  • 2020-12-21 17:57

    Try this one:

    $str='Sus azahares presentan gruesos pétalos blancos teñidos de rosa o violáceo en la parte externa, con numerosos estambres (20-40).';
    preg_match_all('/[^0-9\s]{4,}/u', $str, $matches); // the matched words are in $matches[0]
    echo '<pre>';
    var_dump($matches);
    echo '</pre>';
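A note on capturing groups in patterns of the shape `([^0-9\s]){4,}`: when the quantifier sits outside the group, the group is overwritten on every repetition, so only the full match at index 0 is useful. The sketch below, on a shortened made-up input, shows the difference.

```php
<?php
$str = 'rosa blancos (20-40)';

// Quantifier outside the group: the group is overwritten on every
// repetition, so $a[1] ends up holding only the last character.
preg_match_all('/([^0-9\s]){4,}/', $str, $a);

// Quantifier inside the group: the group spans the whole word.
preg_match_all('/([^0-9\s]{4,})/', $str, $b);

print_r($a[0]); // full matches
print_r($a[1]); // last character of each match
print_r($b[1]); // whole words
```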
    
  • 2020-12-21 17:59

    This should do the job for you:

    function extractCommonWords($string) {
        $stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
    
        $string = preg_replace('/\s\s+/', ' ', $string); // collapse runs of whitespace into a single space
        $string = trim($string); // trim the string
        $string = preg_replace('/[^\p{L}\p{N} _-]/u', '', $string); // keep only letters, digits, spaces, underscores and dashes
        $string = mb_strtolower($string, 'UTF-8'); // make it lowercase (multibyte-safe, needs the mbstring extension)
    
        preg_match_all('/([a-zA-Z]|\xC3[\x80-\x96\x98-\xB6\xB8-\xBF]|\xC5[\x92\x93\xA0\xA1\xB8\xBD\xBE]){4,}/', $string, $matchWords);
        $matchWords = $matchWords[0];
    
        foreach($matchWords as $key => $item) {
            if($item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3) {
                unset($matchWords[$key]);
            }
        }
    
        $wordCountArr = array();
        if(is_array($matchWords)) {
            foreach($matchWords as $key => $val) {
                $val = strtolower($val);
                if(isset($wordCountArr[$val])) {
                    $wordCountArr[$val]++;
                } else {
                    $wordCountArr[$val] = 1;
                }
            }
        }
    
        arsort($wordCountArr);
        $wordCountArr = array_slice($wordCountArr, 0, 10);
        return $wordCountArr;
    }
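The same idea (extract words, drop stop words, count, keep the most frequent) can be sketched more compactly with `array_count_values()`. This is not the answer's function: `topWords` and its parameters are hypothetical names, and the sketch assumes UTF-8 input and the mbstring extension.

```php
<?php
// Hypothetical helper: count words of at least 4 letters, skip stop words,
// and return the most frequent ones as word => count.
function topWords(string $text, array $stopWords, int $limit = 10): array
{
    $text = mb_strtolower($text, 'UTF-8');

    // Returns 0 when nothing matches and false on error; both mean "no words".
    if (!preg_match_all('/\p{L}{4,}/u', $text, $m)) {
        return [];
    }

    $words  = array_diff($m[0], $stopWords);   // remove stop words
    $counts = array_count_values($words);      // word => number of occurrences
    arsort($counts);                           // highest count first

    return array_slice($counts, 0, $limit, true);
}

$top = topWords('Rosa rosa blancos para ROSA blancos', ['para']);
print_r($top);
```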
    
  • 2020-12-21 18:01

    Explode your string on spaces (which creates an array of all the words), then check whether each word has at least 4 characters.

    //The string you want to explode
    $string = "Sus azahares presentan gruesos pétalos blancos teñidos de rosa o violáceo en la parte externa, con numerosos estambres.";
    //explode your $string, which will create an array which we will call $words
    $words = explode(' ', $string);
    
    //for each $word in $words
    foreach($words as $word)
    {
        //check if the length of $word is at least 4
        //(note: strlen counts bytes, so an accented letter counts as 2 in UTF-8)
        if(strlen($word) >= 4)
        {
            //echo the $word, one per line
            echo $word, "\n";
        }
    }
    

    strlen();

    strlen — Get string length

    explode();

    explode — Split a string by string
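Splitting on spaces leaves punctuation attached to words ("externa," or "estambres.") and a byte-based length check miscounts accented letters. A possible variation, sketched here under the assumption that the input is UTF-8 and the mbstring extension is available, splits on non-letters and counts characters instead.

```php
<?php
$string = 'Sus azahares presentan gruesos pétalos blancos teñidos de rosa o violáceo en la parte externa, con numerosos estambres.';

// Split on every run of non-letter characters, so commas and full stops
// do not stay attached to the words. PREG_SPLIT_NO_EMPTY drops empty pieces.
$tokens = preg_split('/[^\p{L}]+/u', $string, -1, PREG_SPLIT_NO_EMPTY);

// Keep tokens of at least 4 characters; mb_strlen() counts characters
// rather than bytes.
$words = array_values(array_filter($tokens, function ($token) {
    return mb_strlen($token, 'UTF-8') >= 4;
}));

echo implode("\n", $words), "\n";
```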

  • 2020-12-21 18:05

    This works if the string is UTF-8 encoded and the words to find (at least 4 characters long, as per the specs) consist of alphabetic characters from ISO-8859-15, which covers Spanish as well as English, German, French, etc.:

    $n_words = preg_match_all('/([a-zA-Z]|\xC3[\x80-\x96\x98-\xB6\xB8-\xBF]|\xC5[\x92\x93\xA0\xA1\xB8\xBD\xBE]){4,}/', $str, $match_arr);
    $word_arr = $match_arr[0];
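To see why the byte-level alternation matters, the sketch below compares it with a plain `\S{4,}` in byte mode on a small made-up input. It assumes the source file is saved as UTF-8, where "ñ" is the two bytes 0xC3 0xB1.

```php
<?php
// The byte-oriented pattern from the answer. Each alternative consumes exactly
// one character: an ASCII letter, or a two-byte UTF-8 sequence starting with
// 0xC3 or 0xC5. No u modifier is used, so PCRE works on raw bytes.
$bytePattern = '/([a-zA-Z]|\xC3[\x80-\x96\x98-\xB6\xB8-\xBF]|\xC5[\x92\x93\xA0\xA1\xB8\xBD\xBE]){4,}/';

// "año" is 3 characters but 4 bytes in UTF-8; "niño" is 4 characters.
$str = 'año niño';

preg_match_all($bytePattern, $str, $byteMatches); // repetitions are whole characters
preg_match_all('/\S{4,}/', $str, $naiveMatches);  // repetitions are single bytes

print_r($byteMatches[0]);
print_r($naiveMatches[0]);
```

The first pattern rejects "año" because it has only 3 characters, while the byte-counting pattern accepts it because it is 4 bytes long.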
    