I've got four MySQL tables:
users (id, name)
polls (id, text)
options (id, poll_id, text)
responses (id, poll_id, option_id, user_id)
Given a particular poll and a particular option from that poll, I'd like to generate a table that shows which options from other polls are most strongly correlated with it.
Suppose this is our data set:
TABLE users:
+----+------+
| id | name |
+----+------+
|  1 | Abe  |
|  2 | Bob  |
|  3 | Che  |
|  4 | Den  |
+----+------+

TABLE polls:
+----+----------------------+
| id | text                 |
+----+----------------------+
|  1 | Do you like apples?  |
|  2 | What is your gender? |
|  3 | What is your height? |
|  4 | Do you like polls?   |
+----+----------------------+

TABLE options:
+----+---------+--------+
| id | poll_id | text   |
+----+---------+--------+
|  1 |       1 | Yes    |
|  2 |       1 | No     |
|  3 |       2 | Male   |
|  4 |       2 | Female |
|  5 |       3 | Short  |
|  6 |       3 | Tall   |
|  7 |       4 | Yes    |
|  8 |       4 | No     |
+----+---------+--------+

TABLE responses:
+----+---------+-----------+---------+
| id | poll_id | option_id | user_id |
+----+---------+-----------+---------+
|  1 |       1 |         1 |       1 |
|  2 |       1 |         2 |       2 |
|  3 |       1 |         2 |       3 |
|  4 |       1 |         2 |       4 |
|  5 |       2 |         3 |       1 |
|  6 |       2 |         3 |       2 |
|  7 |       2 |         3 |       3 |
|  8 |       2 |         4 |       4 |
|  9 |       3 |         5 |       1 |
| 10 |       3 |         6 |       2 |
| 11 |       3 |         5 |       3 |
| 12 |       3 |         6 |       4 |
| 13 |       4 |         7 |       1 |
| 14 |       4 |         7 |       2 |
| 15 |       4 |         7 |       3 |
| 16 |       4 |         7 |       4 |
+----+---------+-----------+---------+
Given the poll ID 1 and the option ID 2, the generated table should be something like this:
+---------+-----------+--------------------+
| poll_id | option_id | percent_correlated |
+---------+-----------+--------------------+
|       4 |         7 |                100 |
|       2 |         3 |              66.66 |
|       3 |         6 |              66.66 |
|       2 |         4 |              33.33 |
|       3 |         5 |              33.33 |
|       4 |         8 |                  0 |
+---------+-----------+--------------------+
So basically, we're identifying all of the users who responded to poll ID 1 and selected option ID 2, and we're looking through all the other polls to see what percentage of them also selected each other option.
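To make the computation concrete, here is a minimal sketch of one possible query, not a definitive solution. It is written in portable SQL and run against an in-memory SQLite database through Python's `sqlite3` module, purely so the example is self-contained; it has not been tested against MySQL, and the `:poll_id` / `:option_id` named parameters are SQLite's placeholder style. The helper names `build_sample_db` and `correlated_options` are made up for illustration. The idea is to LEFT JOIN every option of the other polls to the responses of the matching users, so that options nobody picked still appear with 0.

```python
import sqlite3

# For every option outside the given poll, count how many of the users who
# picked (poll_id, option_id) also picked that option, as a percentage of
# those users. The LEFT JOIN keeps options with no matching responses.
QUERY = """
SELECT o.poll_id,
       o.id AS option_id,
       100.0 * COUNT(r.id)
             / (SELECT COUNT(*) FROM responses
                WHERE poll_id = :poll_id AND option_id = :option_id)
         AS percent_correlated
FROM options AS o
LEFT JOIN responses AS r
       ON r.option_id = o.id
      AND r.user_id IN (SELECT user_id FROM responses
                        WHERE poll_id = :poll_id AND option_id = :option_id)
WHERE o.poll_id <> :poll_id
GROUP BY o.poll_id, o.id
ORDER BY percent_correlated DESC, o.poll_id, o.id
"""


def build_sample_db():
    """Load the options and responses from the sample data set."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE options (id INTEGER PRIMARY KEY, poll_id INTEGER, text TEXT);
        CREATE TABLE responses (id INTEGER PRIMARY KEY, poll_id INTEGER,
                                option_id INTEGER, user_id INTEGER);
    """)
    conn.executemany("INSERT INTO options VALUES (?, ?, ?)", [
        (1, 1, "Yes"), (2, 1, "No"), (3, 2, "Male"), (4, 2, "Female"),
        (5, 3, "Short"), (6, 3, "Tall"), (7, 4, "Yes"), (8, 4, "No"),
    ])
    # Response ids are auto-assigned; only (poll_id, option_id, user_id) matter.
    conn.executemany(
        "INSERT INTO responses (poll_id, option_id, user_id) VALUES (?, ?, ?)", [
            (1, 1, 1), (1, 2, 2), (1, 2, 3), (1, 2, 4),
            (2, 3, 1), (2, 3, 2), (2, 3, 3), (2, 4, 4),
            (3, 5, 1), (3, 6, 2), (3, 5, 3), (3, 6, 4),
            (4, 7, 1), (4, 7, 2), (4, 7, 3), (4, 7, 4),
        ])
    return conn


def correlated_options(conn, poll_id, option_id):
    """Return (poll_id, option_id, percent) rows, percent rounded to 2 places."""
    rows = conn.execute(QUERY, {"poll_id": poll_id, "option_id": option_id})
    # percent is NULL (None) when nobody selected the given option.
    return [(p, o, None if pct is None else round(pct, 2)) for p, o, pct in rows]


if __name__ == "__main__":
    for row in correlated_options(build_sample_db(), 1, 2):
        print(row)
```

For poll ID 1 and option ID 2 this yields the rows in the same order as the table above, except that the percentages are rounded rather than truncated (66.67 instead of 66.66). Ties are broken by poll ID and then option ID, which is an arbitrary choice made for this sketch.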