BigQueryとGA4の連携で、ユーザーのページ遷移を追ってみる

最近BigQueryを使い始めました。

興味があるものの、使うまでに時間がかかっていたのは、生データ以外の魅力として「それGA4でもよくない？」と感じる部分が多かったからです。

あとはXを見ていると、「BigQueryも使えないくせにGA4なんかのぬるいデータ見てるんじゃねーよ」といったBigQuery奉行が結構いらっしゃって、とっても怖かったからです。
~~下手な発信したら噛みつかれそう。~~

ユーザーのページ遷移を追うことの難しさ

この辺、みなさまも悩まれる部分ではないでしょうか。

UAの時代からしっくりこなかったり、GA4の経路データ探索もイマイチ。
「なんとなく想像していたものじゃない」って感じですよね。

これ、僕のセンスの無さでしょうかねｗ

タグマネージャーを使って、力技で数値を取ったり等々さまざまなサイト様も見受けられますが…。
もう少しカジュアルにこの動きを見たいなぁと思ってます。

そこで出会ったSQL

株式会社HAPPY ANALYTICS様の運営する、Google Analytics 4 ガイド – アクセス解析ツール「Google Analytics 4」の実装・設定・活用のための情報サイトという非常にありがたいサイトを見つけます。

見つけると言うより、失礼な表現ですが、BigQuery×GA4のSQLや活用を調べていて、何かとわかりやすいと感じる・ヒットするサイトがハピアナさんのサイトしか無かった…。

さすが小川さん・ハピアナさん…。

参考にしたSQL：指定ページの１つ次に見たページを取得

まず、参考にさせていただいたSQLを掲載しておきます。

with predata as (
select
    user_pseudo_id,  --ユーザーのCookieIDを取得
    (select value.int_value from unnest(event_params) where event_name = 'page_view' and key = 'ga_session_id') as session_id,--ユーザーのセッションIDを取得
    (select value.string_value from unnest(event_params) where key = 'page_location') as page,--ページのURLを取得
    event_timestamp　--イベントの発生時間を取得
from
    `ha-ga4.analytics_227084301.events_*`   　-- データの選択範囲。ここでは全期間とし、whereの部分で日付を指定する
where
    _table_suffix between '20220201' and '20220205' 　-- 日付の指定
    and  event_name = 'page_view'),　-- イベント名を指定
 
prep_nextpage as (
select 
    user_pseudo_id, --ユーザーのCookieIDを取得
    session_id, --ユーザーのセッションIDを取得
    page, --ページのURLを取得
    lead(page,1) over (partition by user_pseudo_id,session_id order by event_timestamp) as next_page,　--イベント時間昇順に並びかえた後に、leadを使ってpage列の1つ先の値を取得してnext_pageに格納
    event_timestamp --イベントの発生時間を取得
from 
    predata--上記のデータをpredataのクエリ結果から取得
)
select 
    page,　--URLを取得
    ifnull(next_page,'(exit)') as next_page, --次のページを取得
    count(distinct concat(user_pseudo_id,"-", session_id)) as count　--CookieIDとセッションIDをつなげユニーク数をカウントする
from
    prep_nextpage　--prep_nextpageのクエリ結果から取得
where
    page='https://happyanalytics.co.jp/'　--起点となるURLを指定
group by
    page,next_page　--ページと次ページでグルーピング
order by
    count desc　--カウントの降順で並び替え

指定したページの次ページのURLを追うSQLです。

結果例はこちら

引用https://lh4.googleusercontent.com/GK6P0MLkqpMfiM3Gv3tXjtGfmCKTRRy_1dF0etr6zWgr3-jKiLOaDeAS7p40TYxNWQTiQHV0sRGSTTLKc_bi6Ut1KYFjhvjTZQUy1Ubw7VabQ74OiH9JQJa00KRI4vKUp9OeLEb7zdZVNhz-xp_g9KVWG0k1dda4G2Jrq-fp_W6BGYhx989Zwrc1

遷移数と、遷移先のページが表示されます。
正直、ここまではよくあるデータかなとは思います。

今回のお題となるSQL：指定ページの次に見たページを10先まで取得

先ほど、よくあるデータと表現したのは、1つ先・後ろはUA時代にもなんとなく見れたんですよね。

生データ叩いて出しているので、同じ1つ前後でも精度が違うことは理解しているんですが、BigQuery使っている感が無いんです。

そこで、「1つ先とかではなく、最後まで追えないの？」という意向のもと、

「10ページ先までの遷移」SQLを組んでみました。

with predata as (
    select
        user_pseudo_id,  --ユーザーのCookieIDを取得
        (select value.int_value from unnest(event_params) where event_name = 'page_view' and key = 'ga_session_id') as session_id, --ユーザーのセッションIDを取得
        (select value.string_value from unnest(event_params) where key = 'page_location') as page, --ページのURLを取得
        event_timestamp --イベントの発生時間を取得
    from
        `sample-project1234.analytics_123456789.events_202410*` -- データの選択範囲
    where
        event_name = 'page_view' -- イベント名を指定
),

prep_nextpages as (
    select 
        user_pseudo_id, --ユーザーのCookieIDを取得
        session_id, --ユーザーのセッションIDを取得
        page, --ページのURLを取得
        lead(page, 1) over (partition by user_pseudo_id, session_id order by event_timestamp) as next_page_1, -- 1つ先のページ
        lead(page, 2) over (partition by user_pseudo_id, session_id order by event_timestamp) as next_page_2, -- 2つ先のページ
        lead(page, 3) over (partition by user_pseudo_id, session_id order by event_timestamp) as next_page_3, -- 3つ先のページ
        lead(page, 4) over (partition by user_pseudo_id, session_id order by event_timestamp) as next_page_4, -- 4つ先のページ
        lead(page, 5) over (partition by user_pseudo_id, session_id order by event_timestamp) as next_page_5, -- 5つ先のページ
        lead(page, 6) over (partition by user_pseudo_id, session_id order by event_timestamp) as next_page_6, -- 6つ先のページ
        lead(page, 7) over (partition by user_pseudo_id, session_id order by event_timestamp) as next_page_7, -- 7つ先のページ
        lead(page, 8) over (partition by user_pseudo_id, session_id order by event_timestamp) as next_page_8, -- 8つ先のページ
        lead(page, 9) over (partition by user_pseudo_id, session_id order by event_timestamp) as next_page_9, -- 9つ先のページ
        lead(page, 10) over (partition by user_pseudo_id, session_id order by event_timestamp) as next_page_10, -- 10つ先のページ
        event_timestamp --イベントの発生時間を取得
    from 
        predata --上記のデータをpredataのクエリ結果から取得
)
select 
    page, --URLを取得
    ifnull(next_page_1, '(exit)') as next_page_1, --次のページを取得 (1つ先)
    ifnull(next_page_2, '(exit)') as next_page_2, --次のページを取得 (2つ先)
    ifnull(next_page_3, '(exit)') as next_page_3, --次のページを取得 (3つ先)
    ifnull(next_page_4, '(exit)') as next_page_4, --次のページを取得 (4つ先)
    ifnull(next_page_5, '(exit)') as next_page_5, --次のページを取得 (5つ先)
    ifnull(next_page_6, '(exit)') as next_page_6, --次のページを取得 (6つ先)
    ifnull(next_page_7, '(exit)') as next_page_7, --次のページを取得 (7つ先)
    ifnull(next_page_8, '(exit)') as next_page_8, --次のページを取得 (8つ先)
    ifnull(next_page_9, '(exit)') as next_page_9, --次のページを取得 (9つ先)
    ifnull(next_page_10, '(exit)') as next_page_10, --次のページを取得 (10つ先)
    count(distinct concat(user_pseudo_id, "-", session_id)) as count --CookieIDとセッションIDをつなげユニーク数をカウントする
from
    prep_nextpages --prep_nextpagesのクエリ結果から取得
where
    page = 'https://sample.com' --起点となるURLを指定
group by
    page, next_page_1, next_page_2, next_page_3, next_page_4, next_page_5, next_page_6, next_page_7, next_page_8, next_page_9, next_page_10 -- ページと次ページでグルーピング
order by
    count desc --カウントの降順で並び替え

活用される際は、

`sample-project1234.analytics_123456789.events_202410*`

をご自身のデータのもの

'https://sample.com'

を軸・起点としたいURL

に合わせてくださいね。

`LEAD`関数の使用

蛇足ですが、既存コードの流れを踏襲し、コードをシンプルに拡張していきます。

lead(page, 1), lead(page, 2), lead(page, 3), lead(page, 4)と順に使うことで、1つ先から4つ先のページURLを取得しています。partition by user_pseudo_id, session_id order by event_timestampにより、同一ユーザーのセッション内でのページ遷移を追跡します。

BigQueryではこの「user_pseudo_id」と呼ばれる、ユーザーの識別Cookieを軸に触れる部分が面白いですね。

グルーピングとカウント

最終的にpage, next_page_1, next_page_2, next_page_3, next_page_4でグルーピングして、各ページ遷移パターンの数を集計します。

出口（exit）の扱い

ifnull(next_page_x, '(exit)')を使って、次のページが存在しない場合は'(exit)'と表示しています。

※ちなみに、私はSQLのプロエンジニアではありません。ある程度理解していますが、
既存コードを細かく解析し、AI等と情報を整理した上で理解・組み直しをおこなっております。ご了承ください。

結果はこちら

キャプチャでは5ページ先までしか取れていませんが、csvでデータを見たところ、MAX10まで取れているデータもちらほらありました。

なぜ、10個先にしたのか？

理由は2つありまして、

BigQuery叩く容量が大きくなる怖さ
矛盾するがあまり先まで見ても参考にしづらい

解説していきます。

BigQuery叩く容量が大きくなる怖さ

BigQueryは従量課金なので、縦横無尽にSQL叩きまくるとあっという間にお金がかかります。

もちろん、データ量等で左右されるのですが…。

テスト的に10個でどんなもんか？の感覚を試したい意図もありました。

矛盾するがあまり先まで見ても参考にしづらい

ページ遷移は見れれば見れる分だけ面白いのですが、いつまでも終わらない遷移を見てもキリがないんですよねｗ

それこそ、離脱までの傾向とCVまでの傾向や数が一定数追えれば十分です。

202412_追記：ランディングページを軸にしたverも追記

上記もとりあえず見るには良いデータでしたが、ランディングページじゃないことに気づきました。
もう少しSEOっぽいデータにしたいですよね。

with predata as (
    select
        user_pseudo_id,
        (select value.int_value from unnest(event_params) where event_name = 'page_view' and key = 'ga_session_id') as session_id,
        (select value.string_value from unnest(event_params) where key = 'page_location') as page,
        event_timestamp
    from
        `sample-project1234.analytics_123456789.events_202410*` -- データの選択範囲
    where
        event_name = 'page_view' -- イベント名を指定
),

landing_pages as (
    select
        user_pseudo_id,
        session_id,
        page as landing_page,
        event_timestamp as landing_page_time,
        row_number() over (partition by user_pseudo_id, session_id order by event_timestamp) as rn
    from
        predata
    where
        page like 'https://sample.com/blog%' -- ランディングページ候補
),

pv_with_landing as (
    select
        p.user_pseudo_id,
        p.session_id,
        p.page,
        p.event_timestamp,
        l.landing_page,
        l.landing_page_time
    from
        predata p
    left join landing_pages l
    on p.user_pseudo_id = l.user_pseudo_id
       and p.session_id = l.session_id
    where
        l.rn = 1 --ランディングページをセッションごとに取得
),

prep_nextpages as (
    select 
        user_pseudo_id,
        session_id,
        page,
        landing_page,
        lead(page, 1) over (partition by user_pseudo_id, session_id order by event_timestamp) as next_page_1,
        lead(page, 2) over (partition by user_pseudo_id, session_id order by event_timestamp) as next_page_2,
        lead(page, 3) over (partition by user_pseudo_id, session_id order by event_timestamp) as next_page_3,
        lead(page, 4) over (partition by user_pseudo_id, session_id order by event_timestamp) as next_page_4,
        lead(page, 5) over (partition by user_pseudo_id, session_id order by event_timestamp) as next_page_5,
        lead(page, 6) over (partition by user_pseudo_id, session_id order by event_timestamp) as next_page_6,
        lead(page, 7) over (partition by user_pseudo_id, session_id order by event_timestamp) as next_page_7,
        lead(page, 8) over (partition by user_pseudo_id, session_id order by event_timestamp) as next_page_8,
        lead(page, 9) over (partition by user_pseudo_id, session_id order by event_timestamp) as next_page_9,
        lead(page, 10) over (partition by user_pseudo_id, session_id order by event_timestamp) as next_page_10,
        event_timestamp
    from 
        pv_with_landing
)

select 
    landing_page, --ランディングページURL
    page, --現在のページ
    ifnull(next_page_1, '(exit)') as next_page_1,
    ifnull(next_page_2, '(exit)') as next_page_2,
    ifnull(next_page_3, '(exit)') as next_page_3,
    ifnull(next_page_4, '(exit)') as next_page_4,
    ifnull(next_page_5, '(exit)') as next_page_5,
    ifnull(next_page_6, '(exit)') as next_page_6,
    ifnull(next_page_7, '(exit)') as next_page_7,
    ifnull(next_page_8, '(exit)') as next_page_8,
    ifnull(next_page_9, '(exit)') as next_page_9,
    ifnull(next_page_10, '(exit)') as next_page_10,
    count(distinct concat(user_pseudo_id, "-", session_id)) as count --CookieIDとセッションIDをつなげユニーク数をカウントする
from
    prep_nextpages
group by
    landing_page, page, next_page_1, next_page_2, next_page_3, next_page_4, next_page_5, next_page_6, next_page_7, next_page_8, next_page_9, next_page_10 -- ランディングページとページ遷移でグルーピング
order by
    count desc;

23行目の箇所で、ランディングページを指定しています。
下記を参考に指定方法は様々変えてみてください。

-- 「https://sample.com/blog」から始まるランディングページ
  page like 'https://sample.com/blog%'

-- 「https://sample.com/blog」を含むランディングページ
  page like '%/blog%'

すみません。詳しい解説は追って追記します！

取り急ぎ。

感じたこと・まとめ

既に似たようなことをやられていたり、もう少し丁寧な仕組みを組んでいる方もいらっしゃると思います。

SQLの条件次第で、本当に色々なデータが整理できそうだなと感激です。
正直、BigQueryを舐めてました。すみません。

あとは、今後課金がしっかり発生するタイミングで、費用対効果といいますか、どこまでお金をかけて価値のあるデータ整理ができるかが重要ですね。

自分で特定のテーブルを組んで、そこにPythonを掛け合わせてクラスター分析を行う、なんて学習も進めておりますが、なかなか難しい。私のリテラシーでは、SEOベースのマーケに活用できるか怪しいのですが、この辺は別途記事にしようと思っております。

▼202411〜：追加でBigQuery記事更新しました。
【BigQuery×GA4】広告アカウント無しで線形アトリビューション分析を行う

【BigQuery×GA4】CVした・していないユーザーのPVページランキングを抽出

葛西洋介

都内Webマーケティング企業に勤務。対応サイトは過去大小含め80サイト以上。
企業サイトの案件が多く、SEOというよりは事業理解やマーケティング側面を大切にしています。
また、上流から下流まですべて対応する経験が多く、大抵のことは自分で実装まで対応可能です。
最近はAWSの理解とPythonとSEOの組み合わせ研究が楽しいです。

Galatea