Extracting strings between distinct characters using Hive SQL

Posted by 纵饮孤独 on 2021-02-05 08:28:06

Question


I have a field called geo_data_display that contains country, region and DMA. The three values sit between = and & characters: country between the first "=" and the first "&", region between the second "=" and the second "&", and DMA between the third "=" and the third "&". Here's a reproducible version of the table. Country is always alphabetic, but region and DMA can be either numeric or alphabetic, and DMA doesn't exist for all countries.

A few sample values are:

country=us&region=tx&dma=625&domain=abc.net&zipcodes=76549
country=us&region=ca&dma=803&domain=abc.com&zipcodes=90404 
country=tw&region=hsz&domain=hinet.net&zipcodes=300
country=jp&region=1&dma=a&domain=hinet.net&zipcodes=300  

I have some sample SQL, but the geo_dma line isn't working at all and the geo_region line only works for character values:

SELECT 

UPPER(REGEXP_REPLACE(split(geo_data_display, '\\&')[0], 'country=', '')) AS geo_country
,UPPER(split(split(geo_data_display, '\\&')[1],'\\=')[1]) AS geo_region
,split(split(cast(geo_data_display as int), '\\&')[2],'\\=')[2] AS geo_dma
FROM mytable

Answer 1:


Source

regexp_extract(string subject, string pattern, int index)

Returns the string extracted using the pattern. For example, regexp_extract('foothebar', 'foo(.*?)(bar)', 1) returns 'the'

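select
      -- Each pattern captures lazily up to the key expected to follow (&region, &dma, &domain).
      regexp_extract(geo_data_display, 'country=(.*?)(&region)', 1) as geo_country,
      regexp_extract(geo_data_display, 'region=(.*?)(&dma)', 1)     as geo_region,
      regexp_extract(geo_data_display, 'dma=(.*?)(&domain)', 1)     as geo_dma
from  mytable
;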
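The patterns above depend on a specific key following each value, so a row without dma (such as the tw sample) will not match the region pattern. A minimal sketch of a variant that stops each capture at the next "&" instead, run against two sample values from the question. The CTE name sample_rows is hypothetical, and the query assumes a Hive version that supports CTEs and FROM-less SELECT (0.13 or later):

```sql
-- Inline sample rows copied from the question; sample_rows is a hypothetical name.
WITH sample_rows AS (
  SELECT 'country=us&region=tx&dma=625&domain=abc.net&zipcodes=76549' AS geo_data_display
  UNION ALL
  SELECT 'country=tw&region=hsz&domain=hinet.net&zipcodes=300' AS geo_data_display
)
SELECT
  -- [^&]* captures everything up to, but not including, the next '&'
  regexp_extract(geo_data_display, 'country=([^&]*)', 1) AS geo_country,
  regexp_extract(geo_data_display, 'region=([^&]*)', 1)  AS geo_region,
  regexp_extract(geo_data_display, 'dma=([^&]*)', 1)     AS geo_dma
FROM sample_rows;
```

Because no pattern names the key that follows, this variant should not depend on the order of the keys or on dma being present.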



Answer 2:


You can use str_to_map like so:

select  geo_map['country']  as geo_country
       ,geo_map['region']   as geo_region
       ,geo_map['dma']      as geo_dma

from   (select  str_to_map(geo_data_display,'&','=')    as geo_map
        from    mytable
        ) t
;

+--------------+-------------+----------+
| geo_country  | geo_region  | geo_dma  |
+--------------+-------------+----------+
| us           | tx          | 625      |
| us           | ca          | 803      |
| tw           | hsz         | NULL     |
| jp           | 1           | a        |
+--------------+-------------+----------+
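To try this without a table, str_to_map can be applied to a literal. The sketch below uses the tw sample value from the question and adds the UPPER calls from the asker's original query; it assumes a Hive version that allows SELECT without FROM (0.13 or later):

```sql
SELECT upper(geo_map['country']) AS geo_country,
       upper(geo_map['region'])  AS geo_region,
       -- a key that is absent from the map yields NULL, as in the tw row of the output above
       geo_map['dma']            AS geo_dma
FROM (
  -- '&' separates the key/value pairs, '=' separates each key from its value
  SELECT str_to_map('country=tw&region=hsz&domain=hinet.net&zipcodes=300', '&', '=') AS geo_map
) t;
```

Since the lookup is by key name, the result does not depend on the position of each pair in the string.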



Answer 3:


Please try the following:

create table ch8(details map<string,string>)
row format delimited
collection items terminated by '&'
map keys terminated by '=';

Load the data into the table.
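One way to do the load, as a sketch only: the file path below is hypothetical, and the file is assumed to be plain text with one geo_data_display value per line.

```sql
-- '/tmp/geo_data.txt' is a hypothetical local file, one record per line, e.g.
-- country=us&region=tx&dma=625&domain=abc.net&zipcodes=76549
LOAD DATA LOCAL INPATH '/tmp/geo_data.txt' INTO TABLE ch8;

-- Quick check that each line was parsed into a map
SELECT details FROM ch8 LIMIT 5;
```

Drop the LOCAL keyword if the file is already on HDFS rather than on the client machine.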

Then create another table using CTAS (CREATE TABLE ... AS SELECT):

create table ch9 as select details["country"] as country, details["region"] as region, details["dma"] as dma, details["domain"] as domain, details["zipcodes"] as zipcode from ch8;

Select * from ch9;


Source: https://stackoverflow.com/questions/46712755/extracting-strings-between-distinct-characters-using-hive-sql
