Backfill GA4 Data in BigQuery

Backfilling GA4 data into BigQuery gives you a complete historical dataset for analysis. This guide walks you through the backfill process so you can maintain data continuity and run in-depth historical analysis.

The raw GA4 export to BigQuery is not retroactive, so backfilling historical GA4 data is often a necessary step in building a digital analytics data warehouse and reporting system.

Using the API to backfill the data requires technical skills; there are, however, other ways to export Google Analytics data.

This article is a tutorial on exporting the data using the GA4 Data API and loading it into BigQuery.

The GA4 Data API exports summarized tables, so you must know your desired metrics and dimensions beforehand. Unlike the raw data export to BigQuery, the API export requires you to understand the requirements of your eventual staging tables.

1. Enable the GA4 Data API and set service account permissions

Before starting with data backfill, it is essential to prepare your Google Cloud project, enable the GA4 Data API, and create a Service Account with the necessary permissions.

If you have never worked with Google Cloud Platform, you may benefit from Google’s Quickstart process, which creates a new GCP project and a service account (named quickstart@) and downloads the API key in one click. You will still need to give your service account the permissions to access GA4 and BigQuery, as shown below.

Project, API, and Service Account

  1. Google Cloud Project: Ensure you have a Google Cloud project set up. If not, you can create one from the Google Cloud Console.

Screenshot of GCP showing how to create a new project
  2. Enable GA4 Data API: Navigate to the APIs & Services section in your Google Cloud Console. Enable the GA4 Data API by searching for “Google Analytics Data API” and clicking the enable button.
Screenshot of GCP showing how to enable the Google Analytics Data API
  3. Create Service Account and API key: Create a Service Account within your project to manage the permissions for accessing GA4 data and BigQuery.
    • In the APIs & Services section in the Google Cloud Console click on Credentials and then on Create Credentials -> Service Account.
    • Provide a name and description for the Service Account.
Screenshot of GCP showing how to create a new service account
    • Once the service account is created, click on its name and choose KEYS in the top row. Click on Add Key at the bottom, and choose Create new key.
Screenshot of GCP showing how to add a service account API key
    • Choose JSON key format, and it will be downloaded automatically.
    • While on the Service Account screen, copy its email address. You will need it for the next step.

Permissions for Service Account

Google Analytics

  • Open the GA4 property that you are backfilling and navigate to Admin (bottom left).
  • Under Property settings, click on Property and then on Property Access Management.
  • Add your Service Account email and assign it the “Viewer” role to ensure read access to the data.
Screenshot of Google Analytics showing how to add service account as a user

BigQuery access for your service account

  • In your Google Cloud project, go to the IAM & Admin -> IAM section.
  • Choose the Grant Access option, and add your service account as a principal with the following roles:
    • BigQuery Data Editor
    • BigQuery Job User
Screenshot of GCP showing how to add BigQuery permissions to a service account

With the project and permissions set up, you’re ready to proceed with exporting GA4 data using the API.

2. Export GA4 data using the API

To export GA4 data, we will use Python in Google Colab. This environment simplifies authentication and enables interactive data processing. Google Colab offers an easy setup for running Python scripts that interact with Google APIs. It allows you to leverage Google’s libraries and credentials storage mechanisms seamlessly.

Authorize Service Account in Python/Colab and Install Packages

To give Colab access to your service account, open the left panel by clicking on the key icon and choose Add new secret; a new line appears in the list of secret keys. Name the key something descriptive and copy the contents of the whole JSON API key file into the Value field. Toggle Notebook access on so your notebook can access the key.
Screenshot of Google Colab showing how to add a secret key
Next, install the necessary packages and import required libraries:
!pip install google-analytics-data
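The BigQuery client library used in step 3 comes preinstalled in Colab; if you run this code somewhere else, you will likely need to install it as well (and replace the Colab-specific secret storage with your own way of loading the JSON key):
# Only needed outside Colab, which ships with this library preinstalled.
!pip install google-cloud-bigquery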

Import these libraries to interact with Google Analytics Data API and process data for your BigQuery backfill.

import numpy as np
import pandas as pd
from google.colab import userdata
import json
from google.oauth2 import service_account
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import DateRange
from google.analytics.data_v1beta.types import Dimension
from google.analytics.data_v1beta.types import Metric
from google.analytics.data_v1beta.types import RunReportRequest
from google.analytics.data_v1beta.types import OrderBy
To access your Google Analytics property through the API, you will need your property ID, which you can find in the Google Analytics interface by going to Admin -> Property -> Property Details (the ID is shown in the upper right corner).
## Set up global variables
property_id = '0000000'
response_limit = 10000
credentials_json = json.loads(userdata.get('your_key_name'))

These variables identify your GA4 property, set the page size for API responses, and store your service account credentials.

Construct API request, format output, and set pagination

This request structure specifies the timeframe and variables in your GA4 BigQuery backfill output. Adjust these fields to fit your needs using the full list of metrics and dimensions available in the GA4 Data API.

request = RunReportRequest(
        dimensions=[
            Dimension(name="date"),
            Dimension(name="firstUserSource"),
            Dimension(name="firstUserMedium"),
            Dimension(name="landingPage"),
        ],
        metrics=[
            Metric(name="sessions"),
        ],
        date_ranges=[DateRange(start_date="2023-07-01", end_date="today")],
    )
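The imports above include OrderBy, which this request does not use. If you want the rows returned in a stable order across pages (handy when paginating), you can append an order_bys clause; a minimal sketch ordering by the date dimension:

# Optional: order report rows by the "date" dimension so that paginated
# responses come back in a deterministic order.
request.order_bys.append(
    OrderBy(dimension=OrderBy.DimensionOrderBy(dimension_name="date"))
)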

The functions below handle pagination for large result sets and format the API responses into a tidy dataframe ready for table loading.

def format_response(response):
    # Row index
    row_index_names = [header.name for header in response.dimension_headers]
    row_header = []
    for i in range(len(row_index_names)):
        row_header.append([row.dimension_values[i].value for row in response.rows])

    row_index_named = pd.MultiIndex.from_arrays(np.array(row_header), names = np.array(row_index_names))
    # Row flat data
    metric_names = [header.name for header in response.metric_headers]
    data_values = []
    for i in range(len(metric_names)):
        data_values.append([row.metric_values[i].value for row in response.rows])

    output = pd.DataFrame(data = np.transpose(np.array(data_values, dtype = 'f')),
                          index = row_index_named, columns = metric_names)
    return output

def run_report(request, property_id, credentials_json, page_size=10000):
    #Create service credentials
    credentials = service_account.Credentials.from_service_account_info(credentials_json)
    # Update Request
    request.property = 'properties/' + property_id
    request.limit = page_size
    #Create a client
    client = BetaAnalyticsDataClient(credentials = credentials)

    #Get GA4 data
    page_token = 0
    response_size = page_size
    response = client.run_report(request)
    total_rows = response.row_count
    df = format_response(response)

    while response_size < total_rows:
      request.offset = page_token + page_size
      response = client.run_report(request)
      df = pd.concat([df, format_response(response)])
      page_token += page_size
      response_size += page_size

    df.sort_index(inplace = True)
    return df
To get the data from GA4 and format it into a dataframe, call the run_report function, specifying your request, property ID, service account credentials, and the maximum page size.
formatted_response = run_report(request, property_id, credentials_json, response_limit)
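Before loading anything into BigQuery, a quick sanity check of the dataframe is worthwhile (an optional step; the dimensions form the row index and the metrics form the columns):

# Inspect the shape and the first few rows of the formatted response.
print(formatted_response.shape)
print(formatted_response.head())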

3. Load data into BigQuery

Once the data is exported and formatted, the next step is to load it into BigQuery. Google’s guide on the Python BigQuery client is a helpful reference.

from google.cloud import bigquery
 
BQ_DATASET_NAME = 'ga4_backfill'
BQ_TABLE_NAME = 'test_table'

If you have not created your backfill dataset (the BigQuery term for a data schema) yet, you can build it using the function below. More on creating BigQuery datasets with Python is available in Google’s documentation.

def create_bq_dataset(dataset_id):
  # Uses the global BigQuery `client` created in the next code block.
  dataset_id_full = f"{client.project}.{dataset_id}"
  dataset = bigquery.Dataset(dataset_id_full)
  dataset.location = "US"
  dataset = client.create_dataset(dataset, timeout=30)  # Make an API request.
  print("Created dataset {}.{}".format(client.project, dataset.dataset_id))
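Once the BigQuery client in the next block exists, the helper can be called as shown here; passing exists_ok=True to create_dataset directly (an optional tweak, not part of the helper above) makes re-runs safe:

# Run after the BigQuery client below has been created.
create_bq_dataset(BQ_DATASET_NAME)

# Alternative: idempotent creation straight from the client.
client.create_dataset(f"{client.project}.{BQ_DATASET_NAME}", exists_ok=True)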

Set up your BigQuery environment for GA4 backfill. This code connects to BigQuery and defines where your backfilled data will be stored. For more information, refer to Google’s developer documentation.

# establish a BigQuery client
client = bigquery.Client.from_service_account_info(credentials_json)
dataset_id = BQ_DATASET_NAME
table_name = BQ_TABLE_NAME
 
# Set the destination table
table_ref = client.dataset(dataset_id).table(table_name)

# Configure the load job.

job_config = bigquery.LoadJobConfig(

    # Specify a (partial) schema. All columns are always written to the
    # table. The schema is used to assist in data type definitions.
    schema=[
    # Specify the type of columns whose type cannot be auto-detected. For
    # example, the dimension columns use the pandas dtype "object", so their
    # data type is ambiguous.
    bigquery.SchemaField("date", bigquery.enums.SqlTypeNames.STRING),  # GA4 dates are YYYYMMDD strings
    bigquery.SchemaField("firstUserSource", bigquery.enums.SqlTypeNames.STRING),
    bigquery.SchemaField("firstUserMedium", bigquery.enums.SqlTypeNames.STRING),
    bigquery.SchemaField("landingPage", bigquery.enums.SqlTypeNames.STRING),
    bigquery.SchemaField("sessions", bigquery.enums.SqlTypeNames.INTEGER),
    # Index levels (the dimensions) are written if included in the schema by name.
    ],
    # Optionally, set the write disposition. BigQuery appends loaded rows
    # to an existing table by default, but with the WRITE_TRUNCATE write
    # disposition it replaces the table with the loaded data.
    write_disposition="WRITE_APPEND",
)

# Load the dataframe into the destination table.

# As of google-cloud-bigquery 1.11.0, a fully qualified table ID can be
# used in place of a TableReference.

job = client.load_table_from_dataframe(
    formatted_response, table_ref, job_config=job_config
)  # Make an API request.
job.result()  # Wait for the job to complete.

# Check the number of rows in the target table
table = client.get_table(table_ref)
print(
    "Total of {} rows and {} columns in {}".format(
        table.num_rows, len(table.schema), table_ref
    )
)

Let’s switch back to GCP, click on BigQuery in the navigation menu and check our data:

Screenshot of GA4 backfill data table in BigQuery
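If you prefer to verify the load from the notebook instead of the console, a quick query like this (using the dataset and table names defined above) does the same check:

# Count backfilled sessions per day straight from the notebook.
query = f"""
    SELECT date, SUM(sessions) AS sessions
    FROM `{client.project}.{BQ_DATASET_NAME}.{BQ_TABLE_NAME}`
    GROUP BY date
    ORDER BY date
"""
for row in client.query(query).result():
    print(row["date"], row["sessions"])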

You’ve now backfilled GA4 data into BigQuery, providing a foundation for comprehensive analysis. 

The next step is to create the same snapshot of the GA4 data from the raw BigQuery export, so you can combine the two for unified reporting.

Tanya Zyabkina has over 15 years of experience leading analytics functions for multiple Fortune 500 companies in the retail, telecom, and higher education sectors. She works as the Director of Marketing Performance Analytics for The Ohio State University. Go Bucks!