Mastering BigQuery Costs

8 Jan 2020|Technology

BigQuery is an amazing tool, it is capable of summoning an army to crunch your numbers in an instant. However, this power is also a footgun, as you can easily execute very expensive queries and not notice until the end-of-the-month bill arrives. This is a professionally unacceptable situation to be in, so I developed this list of best practices to ensure BigQuery projects are delivered in a safe state.

Look at the cost before running the query

If you use the web console, the (data) cost of the query will flash up in green before you execute it. Always check this. Always be mindful of how much a query is about to cost. Anything that processes less than 10Gb is essentially trivial.

Set billing alerts

It's trivial to get GCP to email the billing admins/project owners when spending hits a milestone. Use this feature to catch accidents before the end of the month, and early enough to take mitigating action. Set the alert at a level you never expect to reach, so that you never get in the habit of ignoring those emails. Find the option in Billing > linked accounts > Budgets and Alerts

To impose hard limits use custom quotas.

Visualize your spend in Stackdriver

You should regularly check your production spend is what you expect, particularly after deployments. You can visualize BigQuery spend using the metrics explorer, look for the Resource: "BigQuery Project" and the metric "Statement scanned bytes billed". If you Group By statement_type, you can disambiguate between the top-level SQL statement type (e.g. REPLACE or SELECT). This is handy if you spend is high and you need to narrow where in your system you are incurring costs, though a much better approach is labeling the queries.

Label your queries

If you discover you are spending a fortune in production, you have to actually figure out which query is the culprit. This can be very difficult! So best practice is to name each query with a unique label. You can do this in code at job creation time:

    bigquery.createQueryJob({
        query: …
        location: …
        labels: {name: 'Select_foos'}
    });

You can then group or filter by these labels in the Stackdriver metrics view.

Use a fat table design

BigQuery is a column store, understand this. Its performance is very different from a row store like MySQL or Postgres. If you use it like a row store … you will be burned! The main difference is that if a column needs to be accessed to calculate a result set, every single row for that column will be accessed. Your costs are a factor of the number of rows multiplied by the columns accessed (so also never do SELECT *).

Typically, you will want a fat-table design, where you minimize the number of rows by either creating many tables or have many columns per table. A good domain that demonstrates the difference is timeseries metrics collection.

For a row orientated database I might use a single general-purpose table of (timestamp, metric, value) and create an INDEX on the metric type. This schema does not work well for a column store, as it's impossible to access values for a metric without enumerating all values in the table. You generally want to access just a few specific metrics, but a column store has to scan everything everytime… and it gets expensive.

For BigQuery, I would have a table dedicated to each metric type of format (timestamp, value). Thus your queries would have to target specific tables and they won't be accessing data in other tables.

Partition your tables

Continuing with the example of a timeseries of metric values, when querying a narrow time range, you might realize that the whole dataset gets read. This means that, as your table fills up over time, your query costs grow linearly with history! Obviously, this is untenable, so BigQuery allows you to "chunk" a table by a time-based partition key.

To exploit the chunks, you need a WHERE clause that includes static bounds on the partition key, then the query planner knows which partitions might contribute to the result set and it will only scan those. Partitioning is critical to keeping costs bounded.

Enforce the use of partitions

In practice, especially during debugging, it is very easy to forget to include a partition bound, and accidentally run a full table scan on a partitioned table. If you have been in production for a while, this could be very expensive!

Also when developing complex queries, sometimes you are not even sure the query planner can exploit the partition structure or not. This is why I also consider it essential to have BigQuery reject full table scans by using the require_partitioning_filter flag on the table definition. This will prevent queries from running unless they are using the partitioning structure. You can still do full table scans by setting a very wide temporal range, so it doesn't lose expressivity, but it prevents accidents caused by omissions.

Partition gotcha - out of range timestamps

On one project I followed all the best practices suggested above, but we still saw linear growth in costs over time! Why the hell wasn't the partitioning working?

Turned out the timestamps came from Android time clocks and had ridiculous values such as the year 3154. BigQuery only allows streaming into tables 1 year in the past and 6 months into the future. Our out-of-bound values got written to the UNPARTITIONED table which is scanned for every subsequent query. Of course, more of these corrupt timestamps were added at roughly the same rate each day, and thus query costs grew linearly over time!

So to be really sure you are always taking advantage of partitioning behavior, you need to also validate that timestamps are somewhat close to the current time.

Cluster you tables

An additional hammer to smash is costs is using clustering on top of partitioning, which colocates data based on values in a column ordering you provide.

This prevents full tables scans on partitions, if the query can exploit the clustering. For example, if you table is list of timestamped user actions, you might cluster by user_id, then the following query becomes cheap.

SELECT * FROM userActions 
         WHERE user_id = ‘foo’
         AND timestamp < CURRENT_TIMESTAMP() 
         ORDER BY timestamp DESC
         LIMIT 1

Excellent for multitenancy.

Note adding this index is a trick you can only apply once to a table. (Shoutout and gratitude to Felipe Hoffa for teaching me this)

Export to a row store

Sometimes you really want to access single rows and use multiple indexes, but BigQuery has no reasonable way of doing this economically. You might be dissatisfied with query overhead latency (think 500ms). In this case you might want to export whole tables or partitions to other storage systems.

BigQuery has a very fast export to Google Cloud Storage path though, so you can use the export-to-GCS-bucket as part of a syncing pattern to other storage systems. One good option is exporting to Cloud Memorystore (Redis) for serving end-user queries, we have described how to do this elsewhere.

Go forth and BigQuery

BigQuery is a fantastic technology, its ability to scale up horizontally and process huge batches enables new and exciting data-driven architectures. However, it's a financial footgun and should be used carefully to avoid surprise bills.

To de-risk adopting BigQuery:

Ensure management is receiving billing alerts by setting the budget threshold very low before setting it to its true value
Agree with all team members that, require_partition_filter is mandatory for all timeseries tables
Team lead should double-check they can see the Stackdriver billing graphs.

Once those things are in place, you can learn the rest as you go. Enjoy the power. Use responsibly!

Author

Tom Larkworthy
Senior Cloud Architect