Recently, we developed a consumer-facing, cloud-managed IoT backend for a client using BigQuery SQL. One issue that arose was how end users access their own data. BigQuery is a columnar database, so querying for one user's data usually means scanning all users' data, which is slower and more expensive than necessary.
We therefore decided to serve end users' data through a Redis key-value cache hosted in Memorystore, which is designed for fast (<10 ms), economical lookups. But how do we export 100K computed rows from BigQuery to Memorystore quickly?
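As an aside, the read path for end users then becomes a single O(1) keyed lookup instead of a table scan. Here is a minimal sketch, assuming a node_redis v3-style client and an illustrative key scheme of user:<id>:report (not our actual key layout):

// Hypothetical read path: one GET against Memorystore per request.
const redisLib = require('redis'); // assuming the node_redis v3 client API
const client = redisLib.createClient(6379, process.env.REDISHOST); // Memorystore IP

function getUserReport(userId, callback) {
  // The key scheme is illustrative; use whatever key the loader wrote.
  client.get(`user:${userId}:report`, (err, json) => {
    if (err) return callback(err);
    callback(null, json ? JSON.parse(json) : null);
  });
}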
For the export itself, we experimented with many approaches (Dataflow, the BigQuery API, App Engine Flex, GCF) before finding the right combination of technologies to do this effectively. Hopefully this post will save you a lot of time if you need to do the same!
First, let's talk about what didn't work. A Dataflow batch job didn't, because its roughly one-minute spin-up time significantly hurts data freshness. Google App Engine (GAE) standard doesn't work either, because the native Redis client is incompatible with that environment. GAE Flex technically works, but you end up over-provisioned (a minimum of one instance at all times), so it's a bit expensive, especially if you minimize latency by using a large instance type.
Google Cloud Functions are a perfect fit: they can run Redis clients and they scale to zero, which is ideal for spiky workloads such as periodic report dumps. BUT! Access to Memorystore is restricted to the VPC, which, until recently, would have been a blocker. Now, however, the Serverless VPC Access connector lets the GCF environment reach private networks. With that technology we were able to build a fully managed, event-driven, general-purpose Redis loader that scales to zero when not in use.
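For reference, the connector itself is a one-off setup step. A sketch of the gcloud command, with placeholder name, region, and IP range (on older gcloud releases this lived under gcloud beta):

gcloud compute networks vpc-access connectors create redis-connector \
    --network default \
    --region us-central1 \
    --range 10.8.0.0/28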
Implementation
The quickest method we found to get data out of BigQuery is an export to a Cloud Storage bucket. The data moves through specially optimized managed pipes, so exporting 100K rows takes just a few seconds. This is about 20x faster than paging through the BigQuery client (roughly 1K rows per second). In our implementation we used the newline-delimited JSON export format, which supports nested fields, and we added an extra column as metadata to tell the bulk loader which Redis key to index each row under.
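As a sketch of the export step, using the Node.js BigQuery client (the dataset, table, bucket, and file names here are placeholders):

// Export a computed table as newline-delimited JSON into a GCS bucket.
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

async function exportToBucket() {
  // Placeholder names; point these at your own dataset, table, and bucket.
  const destination = new Storage().bucket('my-cache-bucket').file('report.json');
  const [job] = await new BigQuery()
    .dataset('analytics')
    .table('user_report')
    .extract(destination, {format: 'JSON'}); // newline-delimited JSON export
  console.log(`Export job ${job.id} complete`);
}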
Exporting to the bucket fires the "object.change" storage trigger, which we configured to invoke a Google Cloud Function. The function uses the Cloud Storage client to stream the file's contents into a Redis client.
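A deploy command along these lines wires the trigger, the VPC connector, and the Redis connection details together. The function, bucket, connector names, and Redis host IP are placeholders, and on current gcloud the legacy object.change event is exposed as google.storage.object.finalize:

gcloud functions deploy cacheload \
    --runtime nodejs10 \
    --trigger-resource my-cache-bucket \
    --trigger-event google.storage.object.finalize \
    --vpc-connector redis-connector \
    --memory 2048MB \
    --set-env-vars REDISHOST=10.0.0.3,EXPIRY=3600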
We found that maximizing the Cloud Function's memory allocation was a simple and effectively free optimization. Yes, the cost per second increases with the memory allocation, but the high-memory functions also complete the task faster, so total running time is lower and the overall cost stays about the same. With no serious optimization we managed 7K rows per second.
The solution is general in the sense that any newline-delimited JSON file dropped into the bucket will get loaded into Redis; the fact that our data came from BigQuery is incidental. It's only a few lines of code, but it took us a week of experimentation to perfect. Enjoy!
Google Cloud Function
// Triggered by the GCS object.change event whenever a file lands in the bucket.
// Clients are created once per function instance, outside the handler.
const {Storage} = require('@google-cloud/storage'); // Cloud Storage client
const split = require('split');                     // splits a stream into lines
const redisLib = require('redis');                  // assuming the node_redis v3 client API

const gcs = new Storage();
// Memorystore is reached over the Serverless VPC Access connector.
// REDISHOST, REDISPORT and EXPIRY are set as environment variables at deploy time.
const redis = redisLib.createClient(process.env.REDISPORT || 6379, process.env.REDISHOST);

exports.cacheload = (info, context) => new Promise((resolve, reject) => {
  const bucket = gcs.bucket(info.bucket);
  const file = bucket.file(info.name);
  let keysWritten = 0;
  file.createReadStream()
    .on('error', error => reject(error))
    .on('response', (response) => {
      // connection to GCS opened
    })
    .pipe(split()) // convert the byte stream into lines
    .on('data', function (line) {
      if (!line || line === "") return;
      keysWritten++;
      const data = JSON.parse(line);
      // Fire-and-forget write: the "key" column tells us where to index the row.
      redis.set(data.key, line, 'EX', process.env.EXPIRY, redisLib.print);
    })
    .on('end', () => {
      console.log(`Keys written: ${keysWritten}`);
      resolve();
    })
    .on('error', error => reject(error));
});
- Tom Larkworthy, Senior Cloud Architect