Automatically transfer Cloud Storage logs to BigQuery


There is a function to output Cloud Storage access logs to Cloud Storage.

This document also explains the procedure for prodding the log into BigQuery, but it is done by downloading it locally and processing it locally, but if you make full use of the function of BigQuery, such a hassle is unnecessary. Also, it is not good that the partitioning procedure is missing to rush into BigQuery.

I mean, the documentation recommends Cloud Audit Logs rather than spitting logs on Cloud Storage. However, this setting is limited that you will not be able to access Anonymous (bug?) There is.

Even if there is no such restriction, BigQuery's SQL is relatively rich, so it will be somewhat faster if you stick to BigQuery. Since BigQuery has a function to transfer CSV on Cloud Storage, explain how to utilize it.

Logging on Cloud Storage

See here for instructions This will output the access log to the specified bucket.

Transfer to BigQuery

There is a procedure to flow into BigQuery on the same page, but only borrow the schema configuration file (cloud_storage_usage_schema_v0.json).

BigQuery has a transparently sharding feature called partitioning. You can narrow down the scope of the query and keep the cost down, but if you don't do this, it will be a big deal if the amount of data increases.

It looks good to partition by creation time, but since the csv time_micros of the generated log is integer, "time unit column partitioning" can not be done. So, in order to partition by import time, I created a table with the following command.

bq mk \\
   --time_partitioning_type DAY \\
   --table storageanalysis.usage \\

After that, you can transfer from Storage to BigQuery by this procedure.

Parsing with BigQuery

When the data comes in, you can throw in SQL, but it is painful to parse the URL path with SQL. BigQuery has a function that allows UDFs to be written in JS, so you can use it easily.

For example, if you throw the following SQL:

    LANGUAGE js AS r"""
        const [a, b, c] = path?.split("/") ?? [];
        return {a, b, c};

SELECT parsePath(cs_object).* FROM `storageanalysis.usage` WHERE DATE(_PARTITIONTIME) = "2022-04-18";

This allows you to assign the first to third tiers of key to the fields a, b, and c, respectively. BigQuery can also handle records and arrays. In the previous example, parsePath returns a record with fields a, b, and c, and expands the fields in the .*.

Author by

Updated on April 17, 2022