How to create dummy variable columns for thousands of categories in Google BigQuery?

放肆的年华 提交于 2019-11-26 08:34:16

问题


I have a simple table with 2 columns: UserID and Category, and each UserID can repeat with a few categories, like so:

UserID   Category
------   --------
1         A
1         B
2         C
3         A
3         C
3         B

I want to \"dummify\" this table: i.e. to create an output table that has a unique column for each Category consisting of dummy variables (0/1 depending on whether the UserID belongs to that particular Category):

UserID    A  B  C
------    -- -- --
1         1  1  0
2         0  0  1
3         1  1  1

My problem is that I have THOUSANDS of categories (not just 3 as in this example) and so this cannot be efficiently accomplished using CASE WHEN statement.

So my questions are:

1) Is there a way to \"dummify\" the Category column in Google BigQuery without using thousands of CASE WHEN statements.

2) Is this a situation where the UDF functionality works well? It seems like it would be the case but I am not familiar enough with UDF in BigQuery to solve this problem. Would someone be able to help out?

Thanks.


回答1:


You can use below "technic"

First run query #1. It produces the query (query #2) that you need to run to get result you need. Please, still consider Mosha's comments before going "wild" with thousands categories :o)

Query #1:

SELECT 'select UserID, ' + 
   GROUP_CONCAT_UNQUOTED(
    'sum(if(category = "' + STRING(category) + '", 1, 0)) as ' + STRING(category)
   ) 
   + ' from YourTable group by UserID'
FROM (
  SELECT category 
  FROM YourTable  
  GROUP BY category
)

Resulted will be like below - Query #2

SELECT
  UserID,
  SUM(IF(category = "A", 1, 0)) AS A,
  SUM(IF(category = "B", 1, 0)) AS B,
  SUM(IF(category = "C", 1, 0)) AS C
FROM
  YourTable
GROUP BY
  UserID

of course for three categories - you could do it manually, but for thousands it will definitelly will make day for you!!

Result of query #2 will looks as you expect:

UserID  A   B   C    
1       1   1   0    
2       0   0   1    
3       1   1   1    


来源:https://stackoverflow.com/questions/34010002/how-to-create-dummy-variable-columns-for-thousands-of-categories-in-google-bigqu

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!