For instance, if I have data in a column like this:
data
I love book
I love apple
I love book
I hate apple
I hate apple
How can I get the number of times each word occurs as the result?
Create a user-defined function like this and use it in your query:
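DELIMITER $$
CREATE FUNCTION `getCount`(myStr VARCHAR(1000), myword VARCHAR(100))
RETURNS INT
DETERMINISTIC
BEGIN
    -- Counts non-overlapping occurrences of myword in myStr.
    DECLARE cnt INT DEFAULT 0;
    DECLARE result INT DEFAULT 1;
    -- An empty search word always matches at position 1 and would loop forever.
    IF myword IS NULL OR CHAR_LENGTH(myword) = 0 THEN
        RETURN 0;
    END IF;
    WHILE (result > 0) DO
        -- Position of the next match, or 0 when there is none.
        SET result = INSTR(myStr, myword);
        IF (result > 0) THEN
            SET cnt = cnt + 1;
            -- Skip past the match; CHAR_LENGTH (not LENGTH) counts characters, as INSTR does.
            SET myStr = SUBSTRING(myStr, result + CHAR_LENGTH(myword));
        END IF;
    END WHILE;
    RETURN cnt;
END$$
DELIMITER ;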
Hope it helps.
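To make the loop in the function easier to follow, here is a rough Python rendering of the same idea. It is only an illustration, not part of the answer: `get_count` and the sample `rows` list are names chosen for this sketch, and Python's `str.find` stands in for `INSTR` (0-based, returning -1 when nothing is found).

```python
def get_count(my_str: str, my_word: str) -> int:
    """Count non-overlapping occurrences of my_word in my_str, left to right."""
    if not my_word:
        # An empty search word would never advance through the string.
        return 0
    cnt = 0
    pos = my_str.find(my_word)  # like INSTR, but 0-based and -1 when absent
    while pos != -1:
        cnt += 1
        # Drop everything up to and including the match, then search again.
        my_str = my_str[pos + len(my_word):]
        pos = my_str.find(my_word)
    return cnt


# Sample rows taken from the question.
rows = ["I love book", "I love apple", "I love book", "I hate apple", "I hate apple"]

# Equivalent in spirit to summing the function over every row of the column.
total_apple = sum(get_count(row, "apple") for row in rows)
print(total_apple)
```

Note that this kind of search matches substrings, so "apple" is also counted inside "pineapple"; whether that is acceptable depends on the data.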
If you want to perform this kind of text analysis, I would recommend using something like Lucene to get the term count for each term in the document.
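As a very small stand-in for what a text-analysis library gives you, the sketch below computes per-document and overall term counts in plain Python. This is not Lucene and does not use its API; the tokenizer (lower-casing, keeping runs of letters, digits and apostrophes) and the names `term_counts`, `documents`, `per_document` and `overall` are assumptions made for the example.

```python
import re
from collections import Counter


def term_counts(document: str) -> Counter:
    # Lower-case the text and treat runs of letters, digits and apostrophes as terms.
    return Counter(re.findall(r"[a-z0-9']+", document.lower()))


# Sample rows taken from the question, one "document" per row.
documents = [
    "I love book",
    "I love apple",
    "I love book",
    "I hate apple",
    "I hate apple",
]

# Term counts for each document, then summed over the whole collection.
per_document = [term_counts(doc) for doc in documents]
overall = sum(per_document, Counter())

for term, count in sorted(overall.items()):
    print(term, count)
```

A real analyzer would also handle stemming, stop words and other languages, which is the main reason to reach for a dedicated library.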
The split_string procedure is not my work. You can find it here:
http://forge.mysql.com/tools/tool.php?id=4
I wrote the rest of the code for you.
drop table if exists mytable;
create table mytable (
id int not null auto_increment primary key,
mytext varchar(1000)
) engine = myisam;
insert into mytable (mytext)
values ('I love book,but book sucks!What do you,think about it? me too'),('I love apple! it rulez.,No, it sucks a lot!!!'),('I love book'),('I hate apple!!! Me too.,!'),('I hate apple');
drop table if exists mywords;
create table mywords (
id int not null auto_increment primary key,
word varchar(50)
) engine = myisam;
delimiter //
drop procedure if exists split_string //
create procedure split_string (
in input text
, in `delimiter` varchar(10)
)
sql security invoker
begin
declare cur_position int default 1 ;
declare remainder text;
declare cur_string varchar(1000);
declare delimiter_length tinyint unsigned;
drop temporary table if exists SplitValues;
create temporary table SplitValues (
value varchar(1000) not null
) engine=myisam;
set remainder = input;
set delimiter_length = char_length(delimiter);
while char_length(remainder) > 0 and cur_position > 0 do
set cur_position = instr(remainder, `delimiter`);
if cur_position = 0 then
set cur_string = remainder;
else
set cur_string = left(remainder, cur_position - 1);
end if;
if trim(cur_string) != '' then
insert into SplitValues values (cur_string);
end if;
set remainder = substring(remainder, cur_position + delimiter_length);
end while;
end //
delimiter ;
delimiter //
drop procedure if exists single_words//
create procedure single_words()
begin
declare finish int default 0;
declare str varchar(1000); -- same length as mytable.mytext, so long rows are not truncated
declare cur_table cursor for select replace(replace(replace(replace(mytext,'!',' '),',',' '),'.',' '),'?',' ') from mytable;
declare continue handler for not found set finish = 1;
truncate table mywords;
open cur_table;
my_loop:loop
fetch cur_table into str;
if finish = 1 then
leave my_loop;
end if;
call split_string(str,' ');
insert into mywords (word) select value from SplitValues; -- table names are case-sensitive on some platforms
end loop;
close cur_table;
end;//
delimiter ;
call single_words();
select word,count(*) as word_count
from mywords
group by word;
+-------+------------+
| word  | word_count |
+-------+------------+
| a     |          1 |
| about |          1 |
| apple |          3 |
| book  |          3 |
| but   |          1 |
| do    |          1 |
| hate  |          2 |
| I     |          5 |
| it    |          3 |
| lot   |          1 |
| love  |          3 |
| me    |          2 |
| No    |          1 |
| rulez |          1 |
| sucks |          2 |
| think |          1 |
| too   |          2 |
| What  |          1 |
| you   |          1 |
+-------+------------+
19 rows in set (0.00 sec)
The code still needs to be extended to handle other punctuation, but this is the general idea.
This query is going to take a long time to run if your table is of any decent size. It may be better to keep track of the counts in a separate table and update that table as values are inserted or, if real time results are not necessary, to only run this query every so often to update the counts table and pull your data from it. That way, you're not spending minutes to get data from this complex query.
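The idea of keeping the counts in a separate table that is updated as rows are inserted can be sketched as follows. This is an illustration under stated assumptions, not a drop-in solution: it uses Python with an in-memory SQLite database as a stand-in for MySQL, and the table names `sentences` and `word_counts`, the helper `add_sentence` and the simple tokenizer are all hypothetical.

```python
import re
import sqlite3

# In-memory SQLite database used purely as a stand-in for the real server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("CREATE TABLE word_counts (word TEXT PRIMARY KEY, cnt INTEGER NOT NULL)")


def add_sentence(conn: sqlite3.Connection, body: str) -> None:
    """Insert a row and update the per-word counters in the same transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO sentences (body) VALUES (?)", (body,))
        for word in re.findall(r"[a-z0-9']+", body.lower()):
            # Make sure a counter row exists, then increment it.
            conn.execute(
                "INSERT OR IGNORE INTO word_counts (word, cnt) VALUES (?, 0)", (word,)
            )
            conn.execute("UPDATE word_counts SET cnt = cnt + 1 WHERE word = ?", (word,))


for body in ["I love book", "I love apple", "I love book", "I hate apple", "I hate apple"]:
    add_sentence(conn, body)

# Reading the counts is now a cheap lookup instead of a full re-scan.
counts = dict(conn.execute("SELECT word, cnt FROM word_counts"))
print(counts)
```

In MySQL the two counter statements could be combined into a single INSERT ... ON DUPLICATE KEY UPDATE. Deletes and updates of existing rows would need matching decrements, which this sketch leaves out.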
Here's what I have for you so far. It's a good start. The only thing left to do is modify it to iterate through the words in each row, since as written it counts whole rows. You could use a cursor or a subquery.
Create test table:
create table tbl(str varchar(100) );
insert into tbl values('data');
insert into tbl values('I love book');
insert into tbl values('I love apple');
insert into tbl values('I love book');
insert into tbl values('I hate apple');
insert into tbl values('I hate apple');
Pull data from test table:
SELECT str AS Word, COUNT(str) AS Frequency FROM tbl GROUP BY str;
Here is a solution only using a query:
SELECT SUM(total_count) AS total, value
FROM (
    SELECT COUNT(*) AS total_count,
           REPLACE(REPLACE(REPLACE(x.value, '?', ''), '.', ''), '!', '') AS value
    FROM (
        SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(t.sentence, ' ', n.n), ' ', -1) AS value
        FROM table_name t
        CROSS JOIN (
            SELECT a.N + b.N * 10 + 1 AS n
            FROM (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) a,
                 (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) b
            ORDER BY n
        ) n
        WHERE n.n <= 1 + (LENGTH(t.sentence) - LENGTH(REPLACE(t.sentence, ' ', '')))
        ORDER BY value
    ) AS x
    GROUP BY x.value
) AS y
GROUP BY value;
Here is the full working fiddle: http://sqlfiddle.com/#!2/17481a/1
First we run a query to extract all the words, as explained here by @peterm (follow his instructions if you want to change the maximum number of words processed). We then wrap that in a subquery that applies COUNT and GROUP BY to the value of each word and strips accompanying punctuation with REPLACE. Finally, another query on top of that does a second GROUP BY on the cleaned value, so that words which differ only by such signs are counted together, e.g. hello = hello!.
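The word-extraction step is the least obvious part, so here is a rough Python sketch of what the nested SUBSTRING_INDEX calls and the numbers table do. It is an approximation for illustration only: `substring_index`, `nth_word`, `words`, `clean` and `count_words` are hypothetical helpers written for this example, and `max_words=100` mirrors the 1 to 100 range generated by the two digit tables.

```python
from collections import Counter


def substring_index(s: str, delim: str, count: int) -> str:
    """Approximate SUBSTRING_INDEX for a non-empty delimiter."""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # everything before the count-th delimiter
    if count < 0:
        return delim.join(parts[count:])   # everything after the count-th delimiter from the right
    return ""


def nth_word(sentence: str, n: int) -> str:
    # Inner call keeps the first n words, outer call keeps the last of those.
    return substring_index(substring_index(sentence, " ", n), " ", -1)


def words(sentence: str, max_words: int = 100) -> list:
    # Same bound as the WHERE clause: one more than the number of spaces.
    limit = min(1 + sentence.count(" "), max_words)
    return [nth_word(sentence, n) for n in range(1, limit + 1)]


def clean(word: str) -> str:
    # Mirrors the nested REPLACE calls that strip '?', '.' and '!'.
    for ch in "?.!":
        word = word.replace(ch, "")
    return word


def count_words(sentences: list) -> Counter:
    return Counter(clean(w) for s in sentences for w in words(s))


print(words("I love book"))
print(count_words(["hello world", "hello! world?"]))
```

Sentences with more words than the numbers table provides are silently cut off at that limit, which is why the range may need to be enlarged for longer text.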