BigQuery JavaScript UDF process - per row or per processing node?

Posted by 北城余情 on 2020-06-27 05:21:05

Question


I'm thinking of using BigQuery's JavaScript UDFs as a critical component in a new data architecture. A UDF would be used to logically process each row loaded into the main table, and also to process each row during periodic and ad-hoc aggregation queries.

Using an SQL UDF for the same purpose seems to be unfeasible because each row represents a complex object, and implementing the business logic in SQL, including things such as parsing complex text fields, gets ugly very fast.

I just read the following in the Optimizing query computation documentation page:

Best practice: Avoid using JavaScript user-defined functions. Use native UDFs instead.

Calling a JavaScript UDF requires the instantiation of a subprocess. Spinning up this process and running the UDF directly impacts query performance. If possible, use a native (SQL) UDF instead.

I understand why a new process is needed for each processing node, and I know that JS tends to be deployed in a single-thread-per-process manner (even though V8 does support multithreading these days). But it's not clear to me whether, once a JS runtime process is up, it can be expected to be reused between calls to the same function (e.g. for processing different rows on the same processing node). The amount of reuse will probably affect the cost significantly. My table is not that large (tens to hundreds of millions of rows), but I still need a better understanding here.

I could not find any authoritative source on this. Has anybody done any analysis of the actual impact of using a JavaScript UDF on each processed row, in terms of execution time and cost?


Answer 1:


If it's not documented, then that's an implementation detail that could change. But let's test it:

CREATE TEMP FUNCTION randomThis(views INT64)
RETURNS FLOAT64
LANGUAGE js AS """
  if (typeof variable === 'undefined') {
     variable = Math.random()
  }
  return variable
""";

SELECT randomThis(views), COUNT(*) c
FROM (
  SELECT views
  FROM `fh-bigquery.wikipedia_v3.pageviews_2019` 
  LIMIT 10000000
)
GROUP BY 1
ORDER BY 2 DESC

I was expecting ten million different numbers, or a handful, but I only got one: The same process was reused ten million times, and variables were kept around in between calls.
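The same effect can be reproduced outside BigQuery with a rough local sketch. It assumes the runtime compiles the UDF body once and then calls it once per row inside a single long-lived JS context; `new Function` and the names `globalBody`, `randomGlobal` and `distinctResults` are only stand-ins for illustration, not how BigQuery is documented to work.

```javascript
// Same body as the UDF above. `variable` is never declared, so in
// non-strict code the assignment creates a property on the global object.
const globalBody = `
  if (typeof variable === 'undefined') {
     variable = Math.random()
  }
  return variable
`;

// Assumption: one compiled function, called once per row.
const randomGlobal = new Function('views', globalBody);

// Call fn once per simulated row and count the distinct return values.
function distinctResults(fn, rows) {
  const seen = new Set();
  for (let i = 0; i < rows; i++) {
    seen.add(fn(i));
  }
  return seen.size;
}

const distinctGlobal = distinctResults(randomGlobal, 1000);
console.log(distinctGlobal); // prints 1
```

Only the first call runs Math.random(); every later call finds the global already set. Seeing a single value in the query result is therefore consistent with one runtime being reused for all rows.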

This even happened when I went up to 100 million, signaling that parallelism is bounded by one JS VM.

Again, these are implementation details that could change. But while it stays that way, you can make the best use out of it.




Answer 2:


"I was expecting ten million different numbers, or a handful, but I only got one"

That's because you didn't allow Math.random to be called more than once.

"and variables were kept around in between calls"

That is due to the variable being defined at the global scope.

In other words, your code explicitly permits Math.random to be executed only once (by implicitly defining the variable at the global scope).

If you try this:

CREATE TEMP FUNCTION randomThis(seed INT64)
RETURNS FLOAT64
LANGUAGE js AS """
  let ret = undefined
  if (ret === undefined) {
     ret = Math.random()
  }
  return ret
""";

SELECT randomThis(size), COUNT(*) c
FROM (
  SELECT repository_size as size
  FROM `my-internal-dataset.sample-github-table` 
  LIMIT 10000000
)
GROUP BY 1
ORDER BY 2 DESC

then you get many rows. The query also takes much longer to execute, probably because the single VM became a bottleneck.
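A quick local check of why this version behaves differently, again only a sketch: `new Function` stands in for whatever the runtime actually does, and `localBody` and `randomLocal` are illustrative names. Because `ret` is declared with `let` inside the function body, it is re-created on every call, so Math.random() runs for every row.

```javascript
// Same body as the UDF in this answer: `ret` is local to each call.
const localBody = `
  let ret = undefined
  if (ret === undefined) {
     ret = Math.random()
  }
  return ret
`;

// Assumption: one compiled function, called once per row.
const randomLocal = new Function('seed', localBody);

// Collect the distinct return values over 1000 simulated rows.
const seenLocal = new Set();
for (let i = 0; i < 1000; i++) {
  seenLocal.add(randomLocal(i));
}

const distinctLocal = seenLocal.size;
console.log(distinctLocal); // many distinct values, one per call barring collisions
```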

(I used a different dataset to reduce the query cost.)

Conclusion:
1. There is one VM (or maybe a container) per query to support JS UDFs. This is in line with the single subprocess mentioned in the documentation ("Calling a JavaScript UDF requires the instantiation of a subprocess").
2. If you can apply an execute-once pattern (using some kind of cache or a coding technique such as memoisation) and write a UDF similar to the one in the previous answer, then the mere presence of a JS UDF has a limited impact on your query.
3. If you have to write a JS UDF like the one in this answer, where real work runs on every call, then the impact on your query becomes very significant, with execution time skyrocketing even for simple JS code. In that case it's certainly better to avoid JS UDFs.
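As a rough illustration of the execute-once idea in point 2, here is a minimal memoisation sketch in plain JavaScript. It assumes, as observed above but not documented, that global state survives between calls. The names `expensiveParse`, `parseCache` and `countFields` are hypothetical, and the semicolon-separated input is made up for the example.

```javascript
// Hypothetical expensive step; stands in for parsing a complex text field.
let parseCalls = 0;
function expensiveParse(text) {
  parseCalls++;
  return text
    .split(';')
    .map((s) => s.trim())
    .filter((s) => s.length > 0).length;
}

// Cache at global scope, so it would survive between calls
// if the runtime is reused (undocumented behaviour).
const parseCache = new Map();

// What the UDF body would do per row: look up first, compute on a miss.
function countFields(text) {
  if (!parseCache.has(text)) {
    parseCache.set(text, expensiveParse(text));
  }
  return parseCache.get(text);
}

const inputs = ['a; b; c', 'a; b; c', 'x; y', 'a; b; c'];
const results = inputs.map(countFields);
console.log(results, parseCalls); // [ 3, 3, 2, 3 ] 2
```

This only pays off when inputs repeat, and an unbounded Map grows with the number of distinct inputs, so in a real UDF the cache would probably need a size cap.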



Source: https://stackoverflow.com/questions/59430104/bigquery-javascript-udf-process-per-row-or-per-processing-node
