How to transform IP addresses into geolocation in BigQuery standard SQL?

偶尔善良 submitted on 2020-01-04 00:45:19

Question


So I have read https://cloudplatform.googleblog.com/2014/03/geoip-geolocation-with-google-bigquery.html

But I was wondering if there is a #standardSQL way of doing it. So far, I have had a lot of trouble converting PARSE_IP and NTH(), since the replacements suggested in the migration docs have limitations.

Going from PARSE_IP(contributor_ip) to NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(contributor_ip)) does not work for IPv6 IP addresses.
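One possible way around the IPv6 limitation, sketched here on the assumption that only IPv4 rows can be matched against an IPv4-only lookup table anyway, is to parse the string to BYTES first and convert only the 4-byte values. The inline `ips` data below is made up for illustration and is not part of the original question.

```sql
#standardSQL
-- Hypothetical inline sample data; replace with your own table.
WITH ips AS (
  SELECT ip
  FROM UNNEST(['8.8.8.8', '2001:4860:4860::8888', 'not-an-ip']) AS ip
)
SELECT
  ip,
  -- NET.SAFE_IP_FROM_STRING returns NULL for invalid input instead of raising an error.
  -- The resulting BYTES value is 4 bytes long for IPv4 and 16 bytes long for IPv6.
  BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) AS num_bytes,
  -- NET.IPV4_TO_INT64 only accepts 4-byte input, so guard the call with IF.
  IF(BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4,
     NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(ip)),
     NULL) AS clientIpNum
FROM ips
```

With this guard, IPv6 and malformed addresses should yield a NULL clientIpNum rather than an error, so they simply drop out of an inner join against the lookup table.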

Going from NTH(1, latitude) lat to latitude[SAFE_ORDINAL(1)] does not work since latitude is considered a string.
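A subscript such as [SAFE_ORDINAL(1)] applies to ARRAY values, while NTH(1, ...) in legacy SQL acts as an aggregate over a group, which is presumably why the direct translation fails on a scalar column. A minimal sketch of two aggregate-based alternatives follows; the `geo` sample rows are invented for illustration.

```sql
#standardSQL
-- Hypothetical inline sample data standing in for the joined geolocation rows.
WITH geo AS (
  SELECT 'Paris' AS city, 48.86 AS latitude UNION ALL
  SELECT 'Paris', 48.85 UNION ALL
  SELECT 'Tokyo', 35.68
)
SELECT
  city,
  -- Picks an arbitrary value from the group; enough when every row in the group agrees.
  ANY_VALUE(latitude) AS any_lat,
  -- Builds a one-element array per group, then takes its first element.
  -- The ORDER BY makes the choice deterministic.
  ARRAY_AGG(latitude ORDER BY latitude LIMIT 1)[SAFE_OFFSET(0)] AS first_lat
FROM geo
GROUP BY city
```

ANY_VALUE is the shorter option when it does not matter which row supplies the value. The ARRAY_AGG form is closer in spirit to NTH, since it can pick a specific position once an ordering is given.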

And there might be more migration problems that I have yet to encounter. Does anyone know how to transform IP addresses into geolocation in BigQuery standard SQL?

P.S. How would I go from geolocation to determining timezone?

edit: So what is the difference between this

#legacySQL
SELECT
  COUNT(*) c,
  city,
  countryLabel,
  NTH(1, latitude) lat,
  NTH(1, longitude) lng
FROM (
  SELECT
    INTEGER(PARSE_IP(contributor_ip)) AS clientIpNum,
    INTEGER(PARSE_IP(contributor_ip)/(256*256)) AS classB
  FROM
    [publicdata:samples.wikipedia]
  WHERE
    contributor_ip IS NOT NULL ) AS a
JOIN EACH
  [fh-bigquery:geocode.geolite_city_bq_b2b] AS b
ON
  a.classB = b.classB
WHERE
  a.clientIpNum BETWEEN b.startIpNum
  AND b.endIpNum
  AND city != ''
GROUP BY
  city,
  countryLabel
ORDER BY
  1 DESC

and

SELECT
  COUNT(*) c,
  city,
  countryLabel,
  ANY_VALUE(latitude) lat,
  ANY_VALUE(longitude) lng
FROM (
  SELECT
    CASE
      WHEN BYTE_LENGTH(contributor_ip) < 16 THEN SAFE_CAST(NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(contributor_ip)) AS INT64)
      ELSE NULL
    END AS clientIpNum,
    CASE
      WHEN BYTE_LENGTH(contributor_ip) < 16 THEN SAFE_CAST(NET.IPV4_TO_INT64(NET.SAFE_IP_FROM_STRING(contributor_ip)) / (256*256) AS INT64) 
      ELSE NULL
    END AS classB
  FROM
    `publicdata.samples.wikipedia`
  WHERE
    contributor_ip IS NOT NULL ) AS a
JOIN
  `fh-bigquery.geocode.geolite_city_bq_b2b` AS b
ON
  a.classB = b.classB
WHERE
  a.clientIpNum BETWEEN b.startIpNum
  AND b.endIpNum
  AND city != ''
GROUP BY
  city,
  countryLabel
ORDER BY
  1 DESC

edit2: It seems I have figured out the problem, which was not casting a float correctly. The standard SQL query now returns 41815 rows instead of the 56347 rows from the legacy SQL query. This may be due to the lack of IPv6-to-integer conversion in standard SQL, but it might be due to something else. The legacy SQL query also performs much better, running in about 10 seconds versus a full minute for the standard SQL query.
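One candidate for the "something else" is the cast itself. In standard SQL the `/` operator returns FLOAT64, and casting FLOAT64 to INT64 rounds to the nearest integer rather than truncating. If that is the cause here (an assumption, not something verified against the original data), any address whose third octet is 128 or higher would get a classB value one too high and miss the join. The sketch below compares the rounding cast with two truncating alternatives on made-up addresses.

```sql
#standardSQL
-- Hypothetical inline sample data; the second address has a third octet above 127.
WITH ips AS (
  SELECT ip FROM UNNEST(['1.2.3.4', '1.2.200.4']) AS ip
)
SELECT
  ip,
  n AS clientIpNum,
  -- FLOAT64 division followed by a cast that rounds to the nearest integer.
  SAFE_CAST(n / (256*256) AS INT64) AS classB_rounded,
  -- Integer division: discards the fractional part.
  DIV(n, 256*256) AS classB_div,
  -- Equivalent for non-negative values: shift out the low 16 bits.
  n >> 16 AS classB_shift
FROM (
  SELECT ip, NET.IPV4_TO_INT64(NET.IP_FROM_STRING(ip)) AS n
  FROM ips
)
```

For the first address all three columns should agree. For the second, classB_rounded should come out one higher than classB_div and classB_shift. Using DIV or the shift also keeps classB as INT64, which avoids comparing a FLOAT64 against an integer column in the join condition.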


Answer 1:


According to https://gist.github.com/matsukaz/a145c2553a0faa59e32ad7c25e6a92f7

#standardSQL
SELECT
  id,
  IFNULL(city, 'Other') AS city,
  IFNULL(countryLabel, 'Other') AS countryLabel,
  latitude,
  longitude
FROM (
  SELECT
    id,
    NET.IPV4_TO_INT64(NET.IP_FROM_STRING(ip)) AS clientIpNum,
    TRUNC(NET.IPV4_TO_INT64(NET.IP_FROM_STRING(ip))/(256*256)) AS classB
  FROM
    `<project>.<dataset>.log` ) AS a
LEFT OUTER JOIN
  `fh-bigquery.geocode.geolite_city_bq_b2b` AS b
ON
  a.classB = b.classB
  AND a.clientIpNum BETWEEN b.startIpNum AND b.endIpNum
ORDER BY
  id ASC


Source: https://stackoverflow.com/questions/46062598/how-to-transform-ip-addresses-into-geolocation-in-bigquery-standard-sql
