Count specific pattern in URLs bigquery sql

后端 未结 3 1845
无人共我
无人共我 2021-01-26 10:45

I got a table which contains URLs and some other columns, for example dates. The URLs contain IDs, separated by different values. What the IDs have in common is that they contai

相关标签:
3条回答
  • 2021-01-26 11:21

    I see that in one of your examples IDs can be separated by / too. If you know the maximum number of "levels" (number of / in your path). You can use this query to extract all the IDs.

    with url_parts as (
        select
        date,
        split(url, '/') as parts
        from table
    )
    select
        date,
        regexp_extract_all(url_parts[safe_offset(5)], "\d") level_3,
        regexp_extract_all(url_parts[safe_offset(6)], "\d") level_4,
        ...,
        regexp_extract_all(url_parts[safe_offset(n)], "\d") level_n-2
    from  url_parts
    

    You can build on top of this to do more aggregation.

    0 讨论(0)
  • 2021-01-26 11:32

    try this one

    select
        date,
        ids_count,
        count(*) as combinations_count
    from
        (   select
                date,
                url,
                regexp_extract_all(
                    concat(
                        regexp_replace(url, r'[[:punct:]]', '~~'), '~'),
                    r'~(\d+)~') as ids,
    
                array_length(
                    regexp_extract_all(
                        concat(
                            regexp_replace(url, r'[[:punct:]]', '~~'), '~'),
                        r'~(\d+)~')) as ids_count
            from
                unnest(array[   struct(date'1999-01-01' as date, 'https://www.example.com/category1/subcategory1/71347983~7275798_fui~85092374238590235.......' as url),
                                struct(date'1999-01-02', 'https://www.example.com/category1/subcategory2/71347983_7275798/85092374238590235~773429834.......'),
                                struct(date'1999-01-01', 'https://www.example.com/category1/subcategory2/71347983_23235~773429834')])
        )
    group by
        1, 2
    
    0 讨论(0)
  • 2021-01-26 11:45

    Below is for BigQuery Standard SQL

    I'd like to construct a query that counts the amount of ID's in the URL

    #standardSQL
    SELECT date, 
      (
        SELECT COUNT(1)  
        FROM UNNEST(REGEXP_EXTRACT_ALL(url, r'[^[:punct:]]+')) part 
        WHERE NOT REGEXP_CONTAINS(part, r'[^\d]')
      ) IDs
    FROM `project.dataset.table`
    

    If to apply to sample data from your question - the output is

    Row date        IDs 
    1   01-01-1999  3        
    2   01-02-1999  4        
    3   01-02-1999  3        
    4   01-01-1999  5        
    5   01-01-1999  1        
    6   01-01-1999  1        
    

    Secondly, I'd like to group the "amounts" by date

    #standardSQL
    SELECT date, IDs, COUNT(1) combinations FROM (
      SELECT date, 
        (
          SELECT COUNT(1)  
          FROM UNNEST(REGEXP_EXTRACT_ALL(url, r'[^[:punct:]]+')) part 
          WHERE NOT REGEXP_CONTAINS(part, r'[^\d]')
        ) IDs
      FROM `project.dataset.table`
    )
    GROUP BY date, IDs   
    

    If to apply to sample data from your question - the output is

    Row date        IDs combinations     
    1   01-01-1999  3   1    
    2   01-02-1999  4   1    
    3   01-02-1999  3   1    
    4   01-01-1999  5   1    
    5   01-01-1999  1   2    
    
    0 讨论(0)
提交回复
热议问题