Advanced MySQL: Find correlations between poll responses

匿名 (未验证) 提交于 2019-12-03 08:30:34

问题:

I've got four MySQL tables:

users (id, name)
polls (id, text)
options (id, poll_id, text)
responses (id, poll_id, option_id, user_id)

Given a particular poll and a particular option, I'd like to generate a table that shows which options from other polls are most strongly correlated.

Suppose this is our data set:

TABLE users: +------+-------+ | id   | name  | +------+-------+ |    1 | Abe   | |    2 | Bob   | |    3 | Che   | |    4 | Den   | +------+-------+  TABLE polls: +------+-----------------------+ | id   | text                  | +------+-----------------------+ |    1 | Do you like apples?   | |    2 | What is your gender?  | |    3 | What is your height?  | |    4 | Do you like polls?    | +------+-----------------------+  TABLE options:  +------+----------+---------+ | id   | poll_id  | text    | +------+----------+---------+ |    1 | 1        | Yes     | |    2 | 1        | No      | |    3 | 2        | Male    | |    4 | 2        | Female  | |    5 | 3        | Short   | |    6 | 3        | Tall    | |    7 | 4        | Yes     | |    8 | 4        | No      | +------+----------+---------+  TABLE responses:  +------+----------+------------+----------+ | id   | poll_id  | option_id  | user_id  | +------+----------+------------+----------+ |    1 | 1        | 1          | 1        | |    2 | 1        | 2          | 2        | |    3 | 1        | 2          | 3        | |    4 | 1        | 2          | 4        | |    5 | 2        | 3          | 1        | |    6 | 2        | 3          | 2        | |    7 | 2        | 3          | 3        | |    8 | 2        | 4          | 4        | |    9 | 3        | 5          | 1        | |   10 | 3        | 6          | 2        | |   10 | 3        | 5          | 3        | |   10 | 3        | 6          | 4        | |   10 | 4        | 7          | 1        | |   10 | 4        | 7          | 2        | |   10 | 4        | 7          | 3        | |   10 | 4        | 7          | 4        | +------+----------+------------+----------+ 

Given the poll ID 1 and the option ID 2, the generated table should be something like this:

+----------+------------+-----------------------+ | poll_id  | option_id  | percent_correlated    | +----------+------------+-----------------------+ | 4        | 7          | 100                   | | 2        | 3          | 66.66                 | | 3        | 6          | 66.66                 | | 2        | 4          | 33.33                 | | 3        | 5          | 33.33                 | | 4        | 8          | 0                     | +----------+------------+-----------------------+ 

So basically, we're identifying all of the users who responded to poll ID 1 and selected option ID 2, and we're looking through all the other polls to see what percentage of them also selected each other option.

回答1:

Don't have an instance handy to test, can you see if this gets proper results:

select         poll_id,         option_id,         ((psum - (sum1 * sum2 / n)) / sqrt((sum1sq - pow(sum1, 2.0) / n) * (sum2sq - pow(sum2, 2.0) / n))) AS r,         n from (     select          poll_id,         option_id,         SUM(score) AS sum1,         SUM(score_rev) AS sum2,         SUM(score * score) AS sum1sq,         SUM(score_rev * score_rev) AS sum2sq,         SUM(score * score_rev) AS psum,         COUNT(*) AS n     from     (             select                  responses.poll_id,                  responses.option_id,                 CASE                      WHEN user_resp.user_id IS NULL THEN SELECT 0                     ELSE SELECT 1                 END CASE as score,                 CASE                      WHEN user_resp.user_id IS NULL THEN SELECT 1                     ELSE SELECT 0                 END CASE as score_rev,             from responses left outer join                      (                         select                              user_id                         from                              responses                          where                             poll_id = 1 and                              option_id = 2                     )user_resp                           ON (user_resp.user_id = responses.user_id)     ) temp1      group by         poll_id,         option_id )components  


回答2:

After a few hours of trial and error, I managed to put together a query that works correctly:

SELECT poll_id AS p_id,         option_id AS o_id,         COUNT(*) AS optCount,          (SELECT COUNT(*) FROM response WHERE option_id = o_id AND user_id IN            (SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2')) /        (SELECT COUNT(*) FROM response WHERE poll_id = p_id  AND user_id IN            (SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2'))         AS percentage   FROM response  INNER JOIN     (SELECT user_id FROM response WHERE poll_id = '1' AND option_id = '2') AS user_ids ON response.user_id = user_ids.user_id WHERE poll_id != '1'   GROUP BY option_id DESC  ORDER BY percentage DESC, optCount DESC 

Based on a tests with a small data set, this query looks to be reasonably fast, but I'd like to modify it so the "IN" subquery is not repeated three times. Any suggestions?



回答3:

This seems to give the right results for me:

select poll_stats.poll_id,        option_stats.option_id,        (100 * option_responses / poll_responses) as percent_correlated from (select response.poll_id,              count(*) as poll_responses       from response selecting_response            join response on response.user_id = selecting_response.user_id       where selecting_response.poll_id = 1 and selecting_response.option_id = 2       group by response.poll_id) poll_stats       join (select options.poll_id,                    options.id as option_id,                    count(response.id) as option_responses             from options                  left join response on response.poll_id = options.poll_id                            and response.option_id = options.id                            and exists (                             select 1 from response selecting_response                             where selecting_response.user_id = response.user_id                                   and selecting_response.poll_id = 1                                   and selecting_response.option_id = 2)             group by options.poll_id, options.id            ) as option_stats        on option_stats.poll_id = poll_stats.poll_id where poll_stats.poll_id <> 1 order by 3 desc, option_responses desc 


易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!