How to Practice GA4 Analysis in MySQL/BigQuery: Build User Paths by Source/Medium
By: Roman Myskin - Sept. 29, 2025
Learn how to practice Google Analytics 4 (GA4) analysis in MySQL by building user paths and tracking source/medium. Step-by-step guide with SQL code, dynamic queries, and session analysis using GA4 event data.
Idea
We have a BigQuery table imported from GA4. The full schema explanation is available here, but since we only care about User Paths, the columns that matter are event_timestamp, user_pseudo_id, and event_params. Our final goal is a table of user_pseudo_ids and their source/medium steps.
The first thing you face when working with BigQuery is its cost. That isn't an issue for small datasets, but what's the point of practicing data analysis on data whose insights fit into a simple CSV file?
Anyway, my experience with BigQuery has always been a little nerve-wracking. For example, I paid a few dollars while working with the Google Transparency Center dataset, because even
SELECT COUNT(*) FROM `bigquery-public-data.google_ads_transparency_center.creative_stats`
cost about 150 GB per call. With 1 TB of free usage, that means roughly seven such calls before you start paying, and if you have any experience with SQL, you know you need many more calls than that to make sure everything works.
So, in the end, I don't want to pay extra just to practice.
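By the way, you can ask BigQuery how much data a query would scan before actually running it. Here is a minimal sketch using the google-cloud-bigquery client (not the pandas_gbq library I use later in this article); a dry run costs nothing and executes nothing:
from google.cloud import bigquery

# Dry run: BigQuery only estimates the bytes the query would scan, nothing is billed
client = bigquery.Client(project="my_project")
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

query = """
SELECT COUNT(*)
FROM `bigquery-public-data.google_ads_transparency_center.creative_stats`
"""

job = client.query(query, job_config=job_config)
print(f"This query would process {job.total_bytes_processed / 1e9:.1f} GB")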
Although this article primarily uses MySQL syntax, I have also translated the queries into Python and BigQuery. The Python version uses Pandas; if you are familiar with its terminology, it can be easier to follow than SQL, which is table-oriented and takes considerable practice before the logic behind the queries becomes intuitive.
Download the data set
First, you need to copy this dataset into your own BigQuery project so you can download it to your PC. In the next article, I will generate this dataset synthetically, because the BigQuery sample dataset for the Google Analytics ecommerce web implementation contains natural data that isn't well suited to User Path analysis (it has only one paid source/medium: google/cpc).
CREATE TABLE `my_project.my_dataset.my_table` AS SELECT * FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_202*`
my_project → your project ID
my_dataset → your dataset ID
my_table → the new table you are creating
The events_202* pattern matches all the daily GA4 events tables in the sample (one events_YYYYMMDD table per date).
BigQuery tables are referenced as: project_id.dataset_id.table_id
Project ID → Your Google Cloud project name. You can see it in the BigQuery console at the top left.
Dataset ID → The dataset (folder) inside your project. It groups related tables.
Table ID → The specific table you want to query or create.
Next, use Python to transfer the data from BigQuery into MySQL:
import pandas_gbq
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from google.oauth2 import service_account
# --- MySQL connection via SQLAlchemy ---
#TODO fill your local MySQL user and password
engine = create_engine(
"mysql+pymysql://user:password@localhost/mydb?charset=utf8mb4"
)
# --- BigQuery base SQL ---
base_sql = """
SELECT
event_date,
event_timestamp,
event_name,
TO_JSON_STRING(event_params) AS event_params,
event_previous_timestamp,
IFNULL(event_value_in_usd, 0) AS event_value_in_usd,
event_bundle_sequence_id,
event_server_timestamp_offset,
user_id,
user_pseudo_id,
TO_JSON_STRING(privacy_info) AS privacy_info,
TO_JSON_STRING(user_properties) AS user_properties,
user_first_touch_timestamp,
TO_JSON_STRING(user_ltv) AS user_ltv,
TO_JSON_STRING(device) AS device,
TO_JSON_STRING(geo) AS geo,
TO_JSON_STRING(app_info) AS app_info,
TO_JSON_STRING(traffic_source) AS traffic_source,
stream_id,
platform,
TO_JSON_STRING(event_dimensions) AS event_dimensions,
TO_JSON_STRING(ecommerce) AS ecommerce,
TO_JSON_STRING(items) AS items
FROM `my_project.my_dataset.my_table`
"""
credentials = service_account.Credentials.from_service_account_file('secret.json')
pandas_gbq.context.credentials = credentials
pandas_gbq.context.project = "my_project"
total_rows = 4295584
chunk_size = 100000
for offset in range(0, total_rows, chunk_size):
    # NOTE: LIMIT/OFFSET without an ORDER BY is not guaranteed to be stable;
    # it worked for this one-off bulk copy, but adding ORDER BY makes it reproducible
    sql = f"{base_sql} LIMIT {chunk_size} OFFSET {offset}"
    print(f"Fetching rows {offset} to {offset + chunk_size}...")
    df = pandas_gbq.read_gbq(sql)
    if df.empty:
        print("No more rows to fetch.")
        break
    # Fill numeric NaN with 0
    for col in df.select_dtypes(include=[np.number]).columns:
        df[col] = df[col].fillna(0)
    # Replace text NaN with None
    for col in df.select_dtypes(include=['object']).columns:
        df[col] = df[col].where(pd.notnull(df[col]), None)
    # --- Push chunk to MySQL ---
    df.to_sql(
        'ga4_ecom',
        con=engine,
        if_exists='append',
        index=False
    )
    print(f"Inserted chunk ending at row {offset + len(df)}")
print("All data inserted successfully.")
Note: it took 43 queries (chunks) to upload all 4,295,584 rows from the sample dataset.
You need a Google Cloud project with the BigQuery API enabled, plus a secret key for a service account. You can read more about credentials here; it looks complex, but in short: go to IAM & Admin -> Service Accounts, create a service account, create a secret key, and grant the account access on the IAM page.
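A quick way to confirm the key actually works before launching a 43-chunk upload is to list the datasets it can see — a minimal sketch, assuming the key file is saved as secret.json next to the script and the account has been granted access:
from google.oauth2 import service_account
from google.cloud import bigquery

# Load the service-account key and list the datasets this account has access to
credentials = service_account.Credentials.from_service_account_file('secret.json')
client = bigquery.Client(credentials=credentials, project="my_project")

for dataset in client.list_datasets():
    print(dataset.dataset_id)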
MySQL queries
In this section, I explain the step-by-step process of working with GA4 data, ending with a stored procedure that creates all the tables in a single CALL.
The first table we need to create is an unnested table of the JSON values in event_params; we're interested in ga_session_id, source, and medium.
The queries below create this first table. We uploaded the data into the ga4_ecom table, but we'll start with a smaller test_sample table created with this query:
CREATE TABLE test_sample AS SELECT * FROM ga4_ecom LIMIT 10000
We work on this sample first because the final query takes a long time to run on the full table.
Select a tab to see Python (pandas), MySQL (JSON_TABLE) or BigQuery (UNNEST + JSON_VALUE).
MySQL — using JSON_TABLE
CREATE TABLE jjson_table_sample AS
SELECT
t.event_timestamp,
t.user_pseudo_id,
tt.kkey,
tt.int_value,
tt.string_value
FROM test_sample AS t
JOIN JSON_TABLE(
t.event_params,
"$[*]" COLUMNS (
kkey VARCHAR(100) PATH "$.key",
int_value BIGINT PATH "$.value.int_value",
string_value VARCHAR(255) PATH "$.value.string_value"
)
) AS tt
WHERE tt.kkey IN ('ga_session_id','source','medium');
Python — using pandas (example)
import pandas as pd
import json
# Raw data as a list of dictionaries
data = [
{
"event_timestamp": 1695984000000,
"user_pseudo_id": "user_123",
"event_params": [
{"key": "ga_session_id", "int_value": 1, "string_value": None},
{"key": "source", "int_value": None, "string_value": "google"},
{"key": "medium", "int_value": None, "string_value": "cpc"}
]
},
{
"event_timestamp": 1695984060000,
"user_pseudo_id": "user_123",
"event_params": [
{"key": "ga_session_id", "int_value": 2, "string_value": None},
{"key": "source", "int_value": None, "string_value": "meta"},
{"key": "medium", "int_value": None, "string_value": "cpc"}
]
},
{
"event_timestamp": 1695984120000,
"user_pseudo_id": "user_456",
"event_params": [
{"key": "ga_session_id", "int_value": 3, "string_value": None},
{"key": "source", "int_value": None, "string_value": "google"},
{"key": "medium", "int_value": None, "string_value": "organic"}
]
}
]
# Flatten event_params into long format (one row per key),
# matching the output of the SQL JSON_TABLE query
rows = []
for row in data:
    for param in row["event_params"]:
        rows.append({
            "event_timestamp": row["event_timestamp"],
            "user_pseudo_id": row["user_pseudo_id"],
            "kkey": param["key"],
            "int_value": param["int_value"],
            "string_value": param["string_value"],
        })
# Create DataFrame and keep only the params we need
jjson_table_sample = pd.DataFrame(rows)
jjson_table_sample = jjson_table_sample[
    jjson_table_sample["kkey"].isin(["ga_session_id", "source", "medium"])
]
BigQuery — using UNNEST + JSON_VALUE
CREATE TABLE dataset.jjson_table_sample AS
SELECT
event_timestamp,
user_pseudo_id,
JSON_VALUE(param, '$.key') AS kkey,
SAFE_CAST(JSON_VALUE(param, '$.value.int_value') AS INT64) AS int_value,
JSON_VALUE(param, '$.value.string_value') AS string_value
FROM `project.dataset.test_sample` AS t,
UNNEST(JSON_EXTRACT_ARRAY(t.event_params)) AS param
WHERE JSON_VALUE(param, '$.key') IN ('ga_session_id','source','medium');
Note that the GA4 table contains many event types (page_view, scroll, begin_checkout, purchase, ...), and we don't need all of them. So we propagate a session_id to every row sharing the same (event_timestamp, user_pseudo_id) pair by taking the MAX of int_value over that partition — only the ga_session_id row has an int_value, so the MAX is the session id:
Applied Session ID

| Event Timestamp | User Pseudo ID | Key | Int Value | String Value | Session ID |
|---|---|---|---|---|---|
| 1695984000000 (page_view) | user_123 | ga_session_id | 000000001 | null | 000000001 |
| 1695984000000 (page_view) | user_123 | source | null | google | 000000001 |
| 1695984000000 (page_view) | user_123 | medium | null | cpc | 000000001 |
| 1695984100000 (purchase) | user_123 | ga_session_id | 000000001 | null | 000000001 |
| 1695984100000 (purchase) | user_123 | source | null | google | 000000001 |
| 1695984100000 (purchase) | user_123 | medium | null | cpc | 000000001 |
| 1695984060000 | user_123 | ga_session_id | 000000002 | null | 000000002 |
| 1695984060000 | user_123 | source | null | meta | 000000002 |
| 1695984060000 | user_123 | medium | null | cpc | 000000002 |
| 1695984120000 | user_456 | ga_session_id | 000000003 | null | 000000003 |
| 1695984120000 | user_456 | source | null | google | 000000003 |
| 1695984120000 | user_456 | medium | null | organic | 000000003 |
Select a tab to see MySQL or Python.
MySQL/BigQuery
SELECT
event_timestamp,
user_pseudo_id,
kkey,
int_value,
string_value,
MAX(int_value) OVER(PARTITION BY event_timestamp, user_pseudo_id) AS session_id
FROM jjson_table;
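For comparison, the same window logic in pandas is a grouped transform — a sketch assuming jjson_table has already been loaded into a DataFrame with the columns above:
# MAX(int_value) OVER (PARTITION BY event_timestamp, user_pseudo_id)
jjson_table["session_id"] = (
    jjson_table
    .groupby(["event_timestamp", "user_pseudo_id"])["int_value"]
    .transform("max")
)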
Now let's build a table with session_start, user_pseudo_id, and session_id by taking the minimum event_timestamp grouped by user_pseudo_id and session_id:
| Session start | User Pseudo ID | Session ID |
|---|---|---|
| 1695984000000 | user_123 | 000000001 |
| 1695984060000 | user_123 | 000000002 |
| 1695984120000 | user_456 | 000000003 |
Select a tab to see MySQL or Python.
MySQL/BigQuery
CREATE TABLE time_user_session AS
WITH sess AS
(SELECT
event_timestamp,
user_pseudo_id,
kkey,
int_value,
string_value,
MAX(int_value) OVER(PARTITION BY event_timestamp, user_pseudo_id) AS session_id
FROM jjson_table)
SELECT
MIN(event_timestamp) AS session_start,
user_pseudo_id,
session_id
FROM sess
GROUP BY
user_pseudo_id,
session_id;
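In pandas this is a plain groupby aggregation (continuing the DataFrame sketch above):
# MIN(event_timestamp) grouped by user_pseudo_id and session_id
time_user_session = (
    jjson_table
    .groupby(["user_pseudo_id", "session_id"], as_index=False)
    .agg(session_start=("event_timestamp", "min"))
)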
In the next step, we concatenate source and medium into one column. To do this, we LEFT JOIN the unnested params back onto the just-created time_user_session table (so rows that are not session starts are filtered out, which prevents duplicates). We also need to fill empty sources or mediums with 'not defined'.
Select a tab to see MySQL or Python.
MySQL/BigQuery
CREATE TABLE start_user_session_source AS
SELECT
t.session_start,
t.user_pseudo_id,
t.session_id,
CONCAT_WS(
'/',
COALESCE(MAX(CASE WHEN rt.kkey = 'source' THEN rt.string_value END), 'not defined'),
COALESCE(MAX(CASE WHEN rt.kkey = 'medium' THEN rt.string_value END), 'not defined')
) AS source_medium
FROM time_user_session t
LEFT JOIN jjson_table rt
ON t.session_start = rt.event_timestamp
AND t.user_pseudo_id = rt.user_pseudo_id
GROUP BY t.session_start, t.user_pseudo_id, t.session_id;
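The pandas equivalent merges the unnested params back onto the session starts and pivots source and medium side by side — a sketch continuing the DataFrames from the previous snippets:
# Keep only the source/medium rows that occurred exactly at a session start
params = jjson_table.loc[
    jjson_table["kkey"].isin(["source", "medium"]),
    ["event_timestamp", "user_pseudo_id", "kkey", "string_value"],
]
merged = time_user_session.merge(
    params,
    left_on=["session_start", "user_pseudo_id"],
    right_on=["event_timestamp", "user_pseudo_id"],
    how="left",
)
# One column per key, then concatenate with 'not defined' as the fallback (like COALESCE)
wide = (
    merged.pivot_table(
        index=["session_start", "user_pseudo_id", "session_id"],
        columns="kkey",
        values="string_value",
        aggfunc="max",
    )
    .reindex(columns=["source", "medium"])
    .reset_index()
)
wide["source_medium"] = (
    wide["source"].fillna("not defined") + "/" + wide["medium"].fillna("not defined")
)
# Merge back so sessions without any source/medium rows are kept as 'not defined/not defined'
start_user_session_source = time_user_session.merge(
    wide[["session_start", "user_pseudo_id", "session_id", "source_medium"]],
    on=["session_start", "user_pseudo_id", "session_id"],
    how="left",
)
start_user_session_source["source_medium"] = (
    start_user_session_source["source_medium"].fillna("not defined/not defined")
)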
The next step is as simple as it is elegant. If you know window functions, ROW_NUMBER assigns a sequential number to every row within its partition. So we sort each user's sessions by session_start and pivot them into columns: one for row number 1, one for row number 2, one for row number 3, and so on. It looks like this:
SELECT
user_id,
MAX(CASE WHEN rn = 1 THEN source END) AS step1,
MAX(CASE WHEN rn = 2 THEN source END) AS step2,
MAX(CASE WHEN rn = 3 THEN source END) AS step3
-- Continue if you need more steps
FROM (
SELECT
user_id,
source,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY start_session) AS rn
FROM sessions
) AS ordered
GROUP BY user_id;
But we want a dynamic procedure that can be called with a single statement, no matter how many steps a user has.
Select a tab to see Python, MySQL or BigQuery.
MySQL
DELIMITER $$
CREATE PROCEDURE build_user_path()
BEGIN
-- Declare constants for loops
DECLARE max_steps INT DEFAULT 0;
DECLARE i INT DEFAULT 1;
DROP TABLE IF EXISTS jjson_table,time_user_session,start_user_session_source;
CREATE TABLE jjson_table AS
SELECT event_timestamp, user_pseudo_id, kkey, int_value, string_value FROM
ga4_ecom,
JSON_TABLE(
event_params,
"$[*]" COLUMNS
(kkey VARCHAR(100) PATH "$.key",
int_value BIGINT PATH "$.value.int_value",
string_value VARCHAR(100) PATH "$.value.string_value")
) AS tt
WHERE tt.kkey IN ('ga_session_id','source','medium');
CREATE TABLE time_user_session AS
WITH sess AS
(SELECT
event_timestamp,
user_pseudo_id,
kkey,
int_value,
string_value,
MAX(int_value) OVER(PARTITION BY event_timestamp, user_pseudo_id) AS session_id
FROM jjson_table)
SELECT
MIN(event_timestamp) AS session_start,
user_pseudo_id,
session_id
FROM sess
GROUP BY
user_pseudo_id,
session_id;
CREATE TABLE start_user_session_source AS
SELECT
t.session_start,
t.user_pseudo_id,
t.session_id,
CONCAT_WS(
'/',
COALESCE(MAX(CASE WHEN rt.kkey = 'source' THEN rt.string_value END), 'not defined'),
COALESCE(MAX(CASE WHEN rt.kkey = 'medium' THEN rt.string_value END), 'not defined')
) AS source_medium
FROM time_user_session t
LEFT JOIN jjson_table rt
ON t.session_start = rt.event_timestamp
AND t.user_pseudo_id = rt.user_pseudo_id
GROUP BY t.session_start, t.user_pseudo_id, t.session_id;
-- Maximum number of steps per user in a local procedure variable
SELECT MAX(cnt) INTO max_steps
-- Creates a table user_pseudo_id and step counts
FROM (
SELECT user_pseudo_id, COUNT(*) AS cnt
FROM start_user_session_source
GROUP BY user_pseudo_id
) t;
-- 2) Build CASE expressions into a user variable (PREPARE requires a user var)
SET @case_expressions = '';
SET i = 1;
WHILE i <= max_steps DO
IF i > 1 THEN
SET @case_expressions = CONCAT(@case_expressions, ', ');
END IF;
SET @case_expressions = CONCAT(
@case_expressions,
'MAX(CASE WHEN rn = ', i, ' THEN source_medium END) AS step', i
);
SET i = i + 1;
END WHILE;
-- 3) Build the final SQL into a user variable @sql_text
SET @sql_text = CONCAT(
'SELECT user_pseudo_id, ', @case_expressions,
' FROM (',
'SELECT user_pseudo_id, source_medium, ',
'ROW_NUMBER() OVER (PARTITION BY user_pseudo_id ORDER BY session_start) AS rn ',
'FROM start_user_session_source',
') AS ordered ',
'GROUP BY user_pseudo_id'
);
-- drop existing target (if any)
DROP TABLE IF EXISTS user_paths;
-- create table from dynamic SQL
SET @create_sql = CONCAT('CREATE TABLE user_paths AS ', @sql_text);
PREPARE stmt FROM @create_sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
END $$
DELIMITER ;
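Once the procedure is in place, the whole pipeline can be rebuilt and inspected from Python — a small sketch reusing the SQLAlchemy engine from the upload script:
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@localhost/mydb?charset=utf8mb4")

# Rebuild jjson_table, time_user_session, start_user_session_source and user_paths
with engine.begin() as conn:
    conn.execute(text("CALL build_user_path()"))

# Look at the first few user paths
user_paths = pd.read_sql("SELECT * FROM user_paths LIMIT 10", engine)
print(user_paths)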
BigQuery
CREATE OR REPLACE PROCEDURE `my_project.my_dataset.build_user_path`()
BEGIN
DECLARE max_steps INT64;
DECLARE i INT64 DEFAULT 1;
DECLARE case_expressions STRING DEFAULT '';
DECLARE sql_text STRING;
-- 1) Extract event params into a temporary table
CREATE OR REPLACE TABLE `my_project.my_dataset.jjson_table` AS
SELECT
event_timestamp,
user_pseudo_id,
param.key AS kkey,
param.value.int_value AS int_value,
param.value.string_value AS string_value
FROM
`my_project.my_dataset.my_table`,
UNNEST(event_params) AS param
WHERE
param.key IN ('ga_session_id', 'source', 'medium');
-- 2) Aggregate sessions
CREATE OR REPLACE TABLE `my_project.my_dataset.time_user_session` AS
WITH sess AS (
SELECT
event_timestamp,
user_pseudo_id,
kkey,
int_value,
string_value,
MAX(int_value) OVER(PARTITION BY event_timestamp, user_pseudo_id) AS session_id
FROM `my_project.my_dataset.jjson_table`
)
SELECT
MIN(event_timestamp) AS session_start,
user_pseudo_id,
session_id
FROM sess
GROUP BY user_pseudo_id, session_id;
-- 3) Combine source/medium per session
CREATE OR REPLACE TABLE `my_project.my_dataset.start_user_session_source` AS
SELECT
t.session_start,
t.user_pseudo_id,
t.session_id,
CONCAT(
COALESCE(MAX(IF(rt.kkey = 'source', rt.string_value, NULL)), 'not defined'),
'/',
COALESCE(MAX(IF(rt.kkey = 'medium', rt.string_value, NULL)), 'not defined')
) AS source_medium
FROM `my_project.my_dataset.time_user_session` t
LEFT JOIN `my_project.my_dataset.jjson_table` rt
ON t.session_start = rt.event_timestamp
AND t.user_pseudo_id = rt.user_pseudo_id
GROUP BY t.session_start, t.user_pseudo_id, t.session_id;
-- 4) Find max steps per user
SET max_steps = (
SELECT MAX(cnt)
FROM (
SELECT user_pseudo_id, COUNT(*) AS cnt
FROM `my_project.my_dataset.start_user_session_source`
GROUP BY user_pseudo_id
)
);
-- 5) Build dynamic pivot using EXECUTE IMMEDIATE
WHILE i <= max_steps DO
IF i > 1 THEN
SET case_expressions = CONCAT(case_expressions, ', ');
END IF;
SET case_expressions = CONCAT(
case_expressions,
'MAX(IF(rn = ', i, ', source_medium, NULL)) AS step', i
);
SET i = i + 1;
END WHILE;
-- 6) Final dynamic SQL
SET sql_text = CONCAT(
'CREATE OR REPLACE TABLE `my_project.my_dataset.user_paths` AS ',
'SELECT user_pseudo_id, ', case_expressions, ' FROM (',
'SELECT user_pseudo_id, source_medium, ',
'ROW_NUMBER() OVER (PARTITION BY user_pseudo_id ORDER BY session_start) AS rn ',
'FROM `my_project.my_dataset.start_user_session_source`',
') AS ordered ',
'GROUP BY user_pseudo_id'
);
-- 7) Execute dynamic pivot
EXECUTE IMMEDIATE sql_text;
END;
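A pandas version of the dynamic pivot needs no string building at all, because the number of columns adapts automatically — a sketch assuming start_user_session_source is the DataFrame built earlier:
# ROW_NUMBER() OVER (PARTITION BY user_pseudo_id ORDER BY session_start)
ordered = start_user_session_source.sort_values(["user_pseudo_id", "session_start"]).copy()
ordered["rn"] = ordered.groupby("user_pseudo_id").cumcount() + 1

# Pivot into one column per step; as many columns as the longest path needs
user_paths = (
    ordered.pivot(index="user_pseudo_id", columns="rn", values="source_medium")
    .rename(columns=lambda rn: f"step{rn}")
    .reset_index()
)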
And the final result is pretty impressive:
User Path Table

| user_pseudo_id | step1 | step2 | step3 | step4 | step5 | step6 | step7 | step8 | step9 | step10 | step11 | step12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10028188.3857887509 | <Other>/referral | (data deleted)/(data deleted) | (data deleted)/(data deleted) | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| 10101905.8688293836 | google/organic | shop.googlemerchandisestore.com/referral | shop.googlemerchandisestore.com/referral | not defined/not defined | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| 10115718.4867090359 | google/organic | shop.googlemerchandisestore.com/referral | <Other>/<Other> | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| 1011939.1001718987 | analytics.google.com/referral | google/organic | (data deleted)/(data deleted) | not defined/not defined | creatoracademy.youtube.com/referral | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| 1018731.0091054136 | <Other>/<Other> | (data deleted)/(data deleted) | <Other>/referral | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| 1023953.9842241768 | <Other>/<Other> | <Other>/<Other> | <Other>/referral | (direct)/(none) | google/organic | not defined/not defined | not defined/not defined | NULL | NULL | NULL | NULL | NULL |
| 1026932.0858862293 | <Other>/referral | google/organic | <Other>/referral | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| 1033552.6644233006 | analytics.google.com/referral | (direct)/(none) | analytics.google.com/referral | (data deleted)/(data deleted) | google/organic | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| 1033716.6724392262 | (direct)/(none) | shop.googlemerchandisestore.com/referral | google/organic | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| 10404067.7211955661 | google/organic | (data deleted)/(data deleted) | google/organic | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
And you're left with a logical question: "Cool! But what am I going to do with 188 unique source/medium values and 6,398 unique user paths (not to mention how users are distributed between them)?"
That's where logistic regression comes in, and that's what I'm going to cover next time.