How to find most popular word occurrences in MySQL?

后端未结

关注

 6  2297

I have a table called results with 5 columns.

I\'d like to use the title column to find rows that are say: WHERE title like \'%for sale%\


                      
              相关标签:


      
      
        
          6条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  礼貌的吻别        
                
              
                            
                2021-02-13 11:46
              
            
            
                                                                       
Update

Idea taken from https://stackoverflow.com/a/17942691/98491

This query works on my machine (MySQL 5.7), however Sqlfiddle reports an error.
The basic idea is that you should either create a table with numbers from 1 to maximum word occurence (like 4) in your field or as I did, use a UNION 1 .. 4 for simplicity.

CREATE TABLE products (
  `id` int,
  `name` varchar(45)
);

INSERT INTO products
    (`id`, `name`)
VALUES
    (1, 'for sale'),
    (2, 'for me'),
    (3, 'for you'),
    (4, 'you and me')
;

SELECT name, COUNT(*) as count FROM
(
SELECT
  product.id,
  SUBSTRING_INDEX(SUBSTRING_INDEX(product.name, ' ', numbers.n), ' ', -1) name
FROM
  (
    SELECT 1 AS n
    UNION SELECT 2
    UNION SELECT 3
    UNION SELECT 4
  ) AS numbers
  INNER JOIN products product
  ON CHAR_LENGTH(product.name)
     -CHAR_LENGTH(REPLACE(product.name, ' ', ''))>=numbers.n-1
ORDER BY
  id, n
)
AS result
GROUP BY name
ORDER BY count DESC


Result will be

for | 3
you | 2
me  | 2
and | 1
sale| 1

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  说谎        
                
              
                            
                2021-02-13 11:51
              
            
            
                                                                       
You can use ExtractValue in some interesting way. See SQL fiddle here: http://sqlfiddle.com/#!9/0b0a0/45

We need only one table:

CREATE TABLE text (`title` varchar(29));

INSERT INTO text (`title`)
VALUES
    ('cheap cars for sale'),
    ('house for sale'),
    ('cats and dogs for sale'),
    ('iphones and androids for sale'),
    ('cheap phones for sale'),
    ('house furniture for sale')
;


Now we construct series of selects which extract whole words from text converted to XML. Each select extracts N-th word from the text.

select words.word, count(*) as `count` from
(select ExtractValue(CONCAT('<w>', REPLACE(title, ' ', '</w><w>'), '</w>'), '//w[1]') as word from `text`
union all
select ExtractValue(CONCAT('<w>', REPLACE(title, ' ', '</w><w>'), '</w>'), '//w[2]') from `text`
union all
select ExtractValue(CONCAT('<w>', REPLACE(title, ' ', '</w><w>'), '</w>'), '//w[3]') from `text`
union all
select ExtractValue(CONCAT('<w>', REPLACE(title, ' ', '</w><w>'), '</w>'), '//w[4]') from `text`
union all
select ExtractValue(CONCAT('<w>', REPLACE(title, ' ', '</w><w>'), '</w>'), '//w[5]') from `text`) as words
where length(words.word) > 0
group by words.word
order by `count` desc, words.word asc

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  走了就别回头了        
                
              
                            
                2021-02-13 11:56
              
            
            
                                                                       
This would give you single words (Just if I understand what your single word means.):

select concat(val,' ',cnt) as result from(
    select (substring_index(substring_index(t.title, ' ', n.n), ' ', -1)) val,count(*) as cnt
        from result t cross join(
         select a.n + b.n * 10 + 1 n
         from 
                (select 0 as n union all select 1 union all select 2 union all select 3 
                        union all select 4 union all select 5 union all select 6 
                        union all select 7 union all select 8 union all select 9) a,
                (select 0 as n union all select 1 union all select 2 union all select 3 
                        union all select 4 union all select 5 union all select 6 
                        union all select 7 union all select 8 union all select 9) b
                order by n 
        ) n
    where n.n <= 1 + (length(t.title) - length(replace(t.title, ' ', '')))
    group by val
    order by cnt desc
) as x


Result should be looks like this :

Result
--------
for 6
sale 6
house 2
and 2
cheap 2
phones 1
iphones 1
dogs 1
furniture 1
cars 1
androids 1
cats 1


But if the single word you need like this :

result
-----------
for 6 sale 6 house 2 and 2 cheap 2 phones 1 iphones 1 dogs 1 furniture 1 cars 1 androids 1 cats 1


Just modify the query above to:

select group_concat(concat(val,' ',cnt) separator ' ') as result from( ...

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  悲哀的现实        
                
              
                            
                2021-02-13 12:00
              
            
            
                                                                       
SQL is not well suited for this task, While possible there are limitations (the number of words for example)

a quick PHP script to do the same task may be easier to use long term (and likely quicker too)

<?php
$rows = [
    "cheap cars for sale",
    "house for sale",
    "cats and dogs for sale",
    "iphones and androids for sale",
    "cheap phones for sale",
    "house furniture for sale",
];

//rows here should be replaced by the SQL result
$wordTotals = [];
foreach ($rows as $row) {
   $words = explode(" ", $row);
    foreach ($words as $word) {
        if (isset($wordTotals[$word])) {
            $wordTotals[$word]++; 
            continue;
        }

        $wordTotals[$word] = 1;
    }
}

arsort($wordTotals);

foreach($wordTotals as $word => $count) {
    echo $word . " " . $count . PHP_EOL;
}


Output

for 6
sale 6
and 2
cheap 2
house 2
phones 1
androids 1
furniture 1
cats 1
cars 1
dogs 1
iphones 1

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  遥遥无期        
                
              
                            
                2021-02-13 12:06
              
            
            
                                                                       
You can extract words with some string manipulation.  Assuming you have a numbers table and that words are separated by single spaces:

select substring_index(substring_index(r.title, ' ', n.n), ' ', -1) as word,
       count(*)
from results r join
     numbers n
     on n.n <= length(title) - length(replace(title, ' ', '')) + 1
group by word;


If you don't have a numbers table, you can construct one manually using a subquery:

from results r join
     (select 1 as n union all select 2 union all select 3 union all . . .
     ) n
     . . .


The SQL Fiddle (courtesy of @GrzegorzAdamKowalski) is here.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  走了就别回头了        
                
              
                            
                2021-02-13 12:09
              
            
            
                                                                       
Here is working SQL Fiddle: http://sqlfiddle.com/#!9/0b0a0/32

Let's start with two tables - one for texts and one for numbers:

CREATE TABLE text (`title` varchar(29));

INSERT INTO text
    (`title`)
VALUES
    ('cheap cars for sale'),
    ('house for sale'),
    ('cats and dogs for sale'),
    ('iphones and androids for sale'),
    ('cheap phones for sale'),
    ('house furniture for sale')
;

CREATE TABLE iterator (`index` int);

INSERT INTO iterator
    (`index`)
VALUES
    (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),
    (16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30)
;


The second table, iterator must contains numbers from 1 to N where N higher or equal to the lenght of the longest string in text.

Then, run this query:

select
  words.word, count(*) as `count`
from 
(select
  substring(concat(' ', t.title, ' '), i.index+1, j.index-i.index) as word
from
  text as t, iterator as i, iterator as j
where
    substring(concat(' ', t.title), i.index, 1) = ' '
and substring(concat(t.title, ' '), j.index, 1) = ' '
and i.index < j.index
) AS words
where
    length(words.word) > 0
and words.word not like '% %'
group by words.word
order by `count` desc, words.word asc


There are two selects. Outer one simply groups and counts single words (words of length greater than 0 and without any spaces). Inner one extracts all strings starting from any space character and ending with any other space character, so strings aren't words (despite naming this subquery words) because they can contain other spaces than starting and ending one.

Results:

word    count
for     6
sale    6
and     2
cheap   2
house   2
androids    1
cars    1
cats    1
dogs    1
furniture   1
iphones     1
phones  1

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复