After Data Sync is configured, Pendo exports your Pendo data to the provided cloud storage. This article describes how to load these exports into your data warehouse.
Prerequisites
- Data Sync configured, with at least one destination. For more information, see Set up Data Sync with Google Cloud or Set up Data Sync with Amazon S3.
- Knowledge of your data warehouse's commands or SDK for creating and loading tables from an avro file source.
Loading overview
A configured Pendo Data Sync export is delivered once per day to the cloud storage bucket path provided during Data Sync setup. Setup involves creating a cloud storage bucket and a service account that can access that bucket, along with a secret key that gives Pendo the necessary permissions to write data into the bucket.
You can find the bucket path you created in the Destinations table in Settings > Data Management > Bulk Exports > Destinations under Bucket Address.
Once your destination is saved as per the above process, you create an export. Daily exports are delivered to the path that was used as the cloud storage location in Pendo. For example, if the path gs://pendo-data was used, then the daily exports would be delivered to gs://pendo-data/datasync/<subscription-id>/<application-id>/.
We create a folder inside the datasync folder in your cloud storage for each application that Pendo exports data for. In each application folder, there is an export manifest and a unique hashed folder for each export. The hashed folder holds all the avro files and a bill of materials file for the most recent export. For a description of these files, see Pendo Data Sync schema definitions.
This directory contains the following files and directories. Each table in your data warehouse corresponds to one of the avro files exported from Pendo.
gs://pendo-data/datasync/<subscription-id>/<application-id>/
├── exportmanifest.json
└── <export-uuid>/
├── billofmaterials.json
├── allevents.avro
├── allfeatures.avro
├── allguides.avro
├── allpages.avro
├── alltracktypes.avro
└── matchedEvents/
├── Feature/
│ └── <feature_id>.avro
├── Page/
│ └── <page_id>.avro
└── TrackType/
└── <track_id>.avro
File description
For more information, see Pendo Data Sync schema definitions.
| File Name | Description |
| --- | --- |
| exportmanifest.json | Concatenated list of daily export billofmaterials.json files. |
| billofmaterials.json | A JSON representation of the export contents. This is used by ETL automation to load exported avro event files into a data warehouse. |
| allevents.avro | All event data. This includes Pendo events that are associated with a Page, Feature, or Track Event, as well as events that are not associated with any of these. |
| allfeatures.avro | Description of the application's exported Features. |
| allguides.avro | Description of the application's exported Guides. |
| allpages.avro | Description of the application's exported Pages. |
| alltracktypes.avro | Description of the application's exported Track Events. |
| <feature_id>.avro | All events for the given Feature ID. This Feature ID value is the unique identifier found in the Pendo UI. |
| <page_id>.avro | All events for the given Page ID. This Page ID value is the unique identifier found in the Pendo UI. |
| <track_id>.avro | All events for the given Track Event ID. This Track Event ID value is the unique identifier found in the Pendo UI. |
The file names are relative. An absolute file name can be obtained by prepending the rootUrl field from the export manifest to the relative file name. The rootUrl also corresponds to the path of the billofmaterials.json file (excluding its filename).
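As a minimal sketch, the join of rootUrl and a relative file name is simple string concatenation. The values below are taken from the examples in this article:

```python
# Build an absolute file path by prepending rootUrl (from the export
# manifest) to a relative file name (from the bill of materials).
root_url = "gs://pendo-data/datasync/6591622502678528/-323232/0f39bdf6-09c2-4e4d-6d4f-b02c961d8aaf"
relative = "matchedEvents/Page/OMZ5WpI3HXIhNIIf8Sl_5zJF688.avro"

# rstrip guards against a trailing slash in rootUrl.
absolute = f"{root_url.rstrip('/')}/{relative}"
print(absolute)
```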
Bill of materials
The bill of materials documents the contents of the export. Below is an example of what the billofmaterials.json file looks like for a Data Sync export.
{
  "timestamp": "2023-02-16T20:21:11Z",
  "numberOfFiles": 65,
  "application": {
    "displayName": "Acme CRM",
    "id": "-323232"
  },
  "subscription": {
    "displayName": "(Demo) Pendo Experience",
    "id": "6591622502678528"
  },
  "pageDefinitionsFile": [
    "allpages.avro"
  ],
  "featureDefinitionsFile": [
    "allfeatures.avro"
  ],
  "trackTypeDefinitionsFile": [
    "alltracktypes.avro"
  ],
  "guideDefinitionsFile": [
    "allguides.avro"
  ],
  "timeDependent": [
    {
      "periodId": 1675728000,
      "allEvents": {
        "eventCount": 9515,
        "files": [
          "allevents.avro"
        ]
      },
      "matchedEvents": [
        {
          "eventCount": 48314,
          "files": [
            "matchedEvents/Page/OMZ5WpI3HXIhNIIf8Sl_5zJF688.avro"
          ],
          "id": "Page/OMZ5WpI3HXIhNIIf8Sl_5zJF688",
          "type": "Page"
        }
      ]
    }
  ]
}
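A minimal sketch of walking a bill of materials with Python's standard json module, listing the event files to load per period. The JSON here is a trimmed version of the example above:

```python
import json

# Parse a (trimmed) bill of materials and list the avro files that
# need to be loaded for each period.
bom = json.loads("""
{
  "timestamp": "2023-02-16T20:21:11Z",
  "pageDefinitionsFile": ["allpages.avro"],
  "timeDependent": [
    {
      "periodId": 1675728000,
      "allEvents": {"eventCount": 9515, "files": ["allevents.avro"]},
      "matchedEvents": [
        {
          "eventCount": 48314,
          "files": ["matchedEvents/Page/OMZ5WpI3HXIhNIIf8Sl_5zJF688.avro"],
          "id": "Page/OMZ5WpI3HXIhNIIf8Sl_5zJF688",
          "type": "Page"
        }
      ]
    }
  ]
}
""")

for period in bom["timeDependent"]:
    period_id = period["periodId"]
    print(period_id, "allevents:", period["allEvents"]["files"])
    for matched in period["matchedEvents"]:
        print(period_id, matched["type"], matched["files"])
```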
Export manifest
The export manifest is a concatenation of multiple bills of materials, with some additional metadata. Below is an example of what the export manifest looks like for a Data Sync export.
{
  "exports": [
    {
      ...,
      "counter": 1,
      "finishTime": "2023-03-03T14:10:15.311651Z",
      "storageSize": 12130815,
      "rootUrl": "gs://pendo-data/datasync/6591622502678528/-323232/0f39bdf6-09c2-4e4d-6d4f-b02c961d8aaf"
    },
    {
      ...,
      "counter": 2,
      "finishTime": "2023-03-03T14:20:12.9489274Z",
      "storageSize": 23462682,
      "rootUrl": "gs://pendo-data/datasync/6591622502678528/-323232/b979502c-1a01-4569-74cf-e4a7f5049d8f"
    }
  ],
  "generatedTime": "2023-03-05T04:17:59.853205005Z"
}
Example load flow
This example creates a separate table in the data warehouse for each event-type file. You can load the data in whatever way suits your needs, as long as existing data is replaced correctly.
Step 1. Read the most recent export manifest
Read the most recent exportmanifest.json file to find all unprocessed exports since the last time data was loaded. You can use the counter field as a marker for load progress.
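A minimal sketch of using counter as a high-water mark, assuming your ETL job persists the last loaded counter between runs. The manifest dict mirrors the export manifest example in this article, with rootUrl values shortened to placeholders:

```python
# Select exports that haven't been processed yet, ordered by counter.
manifest = {
    "exports": [
        {"counter": 1, "rootUrl": "gs://pendo-data/datasync/.../export-1"},
        {"counter": 2, "rootUrl": "gs://pendo-data/datasync/.../export-2"},
    ],
    "generatedTime": "2023-03-05T04:17:59.853205005Z",
}

last_loaded_counter = 1  # persisted by your ETL job after the previous run

unprocessed = sorted(
    (e for e in manifest["exports"] if e["counter"] > last_loaded_counter),
    key=lambda e: e["counter"],
)
print([e["counter"] for e in unprocessed])
```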
Step 2. Iterate over each entry in the exports list
Cycle through the entries in the exports list, processing each entry and loading it into the corresponding tables.
- Load allpages. If the allpages table doesn't exist, create it. If the allpages table exists, drop all data. Then, load all avro files from the list pointed to by the pageDefinitionsFile field into the allpages table.
- Load allfeatures. If the allfeatures table doesn't exist, create it. If the allfeatures table exists, drop all data. Then, load all avro files from the list pointed to by the featureDefinitionsFile field into the allfeatures table.
- Load alltracktypes. If the alltracktypes table doesn't exist, create it. If the alltracktypes table exists, drop all data. Then, load all avro files from the list pointed to by the trackTypeDefinitionsFile field into the alltracktypes table.
- Load allguides. If the allguides table doesn't exist, create it. If the allguides table exists, drop all data. Then, load all avro files from the list pointed to by the guideDefinitionsFile field into the allguides table.
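The four definition-table reloads follow the same drop-and-reload pattern, so they can be driven by one loop. This is a sketch only: ensure_table, truncate, and load_avro are hypothetical placeholders for your warehouse SDK calls, and the recording class below just captures the actions so the flow is visible:

```python
# Map bill-of-materials fields to definition table names.
DEFINITION_TABLES = {
    "pageDefinitionsFile": "allpages",
    "featureDefinitionsFile": "allfeatures",
    "trackTypeDefinitionsFile": "alltracktypes",
    "guideDefinitionsFile": "allguides",
}

def reload_definition_tables(bom, warehouse):
    for field, table in DEFINITION_TABLES.items():
        warehouse.ensure_table(table)   # create the table if missing
        warehouse.truncate(table)       # drop the previous definitions
        for path in bom.get(field, []):
            warehouse.load_avro(table, path)

class RecordingWarehouse:
    """Stand-in for a real warehouse client; records each call."""
    def __init__(self):
        self.actions = []
    def ensure_table(self, table):
        self.actions.append(("ensure", table))
    def truncate(self, table):
        self.actions.append(("truncate", table))
    def load_avro(self, table, path):
        self.actions.append(("load", table, path))

bom = {
    "pageDefinitionsFile": ["allpages.avro"],
    "featureDefinitionsFile": ["allfeatures.avro"],
    "trackTypeDefinitionsFile": ["alltracktypes.avro"],
    "guideDefinitionsFile": ["allguides.avro"],
}
wh = RecordingWarehouse()
reload_definition_tables(bom, wh)
print(wh.actions)
```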
Step 3. Iterate over time-dependent events
Iterate over all items in the timeDependent list. Then, load allevents.avro. If the allevents table doesn't exist, create it. If the allevents table exists, drop any event data for the given periodId from the table, and append data from the allevents.avro file to the allevents table.
Then iterate over all items in the matchedEvents list. Load events for each event type designated by the id field. If the table for an event type doesn't exist, create it. If the table for an event type exists, drop any event data for the given periodId from the table, and append data from all avro files in the files field to the event type table.
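A minimal sketch of this step, again with hypothetical placeholders (delete_period and append_avro stand in for your warehouse SDK calls). Data for each periodId is dropped before new files are appended, so an empty files list simply drops that period's data:

```python
class RecordingWarehouse:
    """Stand-in for a real warehouse client; records each call."""
    def __init__(self):
        self.actions = []
    def delete_period(self, table, period_id):
        self.actions.append(("delete", table, period_id))
    def append_avro(self, table, path):
        self.actions.append(("append", table, path))

def load_time_dependent(bom, warehouse):
    for period in bom["timeDependent"]:
        pid = period["periodId"]
        # allevents: drop this period's rows, then append the new files.
        warehouse.delete_period("allevents", pid)
        for path in period["allEvents"]["files"]:
            warehouse.append_avro("allevents", path)
        # Matched events: one table per event type/ID from the id field.
        for matched in period["matchedEvents"]:
            table = matched["id"].replace("/", "_")  # e.g. Page_<page_id>
            warehouse.delete_period(table, pid)
            for path in matched["files"]:
                warehouse.append_avro(table, path)

bom = {
    "timeDependent": [{
        "periodId": 1675728000,
        "allEvents": {"files": ["allevents.avro"]},
        "matchedEvents": [{
            "id": "Page/OMZ5WpI3HXIhNIIf8Sl_5zJF688",
            "files": ["matchedEvents/Page/OMZ5WpI3HXIhNIIf8Sl_5zJF688.avro"],
        }],
    }],
}
wh = RecordingWarehouse()
load_time_dependent(bom, wh)
print(wh.actions)
```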
Loading data files into your data warehouse
When loading the resulting avro files into your data warehouse, you must ensure that you replace the previous event-type description files with each export and use logical avro type mapping.
Loading event-type descriptions
The latest version of the Pendo event-type description files is sent in each export. You must replace this data when loading each Data Sync export. The lists of files to load are referenced by the following fields:
- pageDefinitionsFile
- featureDefinitionsFile
- trackTypeDefinitionsFile
- guideDefinitionsFile
Wherever avro files are represented as a list, it's possible for that list to be empty. This signifies that content for that period should be dropped. This can occur, for example, in the following scenario:
- A Page was previously tagged and matched a non-zero number of events in a period.
- The Page’s matching rules got updated.
- Pendo reprocessed events for that period, and the new rules don't match any event.
Loading time-dependent events
The timeDependent block of the bill of materials contains data that is associated with a particular periodId value. The periodId signifies which logical time period a particular event resides in. This period equates to one day of event data. You will never find the same event in two different periods.
Load the event avro files with your data warehouse's option to use logical avro type mapping. This way, the browserTimestamp value is loaded as a TIMESTAMP and periodId is loaded as a DATE data type.
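For reference, the periodId in the bill of materials example appears to be a Unix timestamp in seconds at a day boundary (an assumption based on the example value). The equivalent conversion to a date in Python:

```python
from datetime import datetime, timezone

# Convert a periodId (Unix timestamp in seconds, UTC) to the calendar
# date of the one-day period it represents.
period_id = 1675728000  # from the bill of materials example above
period_date = datetime.fromtimestamp(period_id, tz=timezone.utc).date()
print(period_date)  # 2023-02-07
```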
Before loading any of the avro files in the timeDependent block, you must drop any existing data for the corresponding event type (Page, Feature, or Track) and periodId.
Pendo can send updates for a given <event type>/<periodId>; if you aren't replacing the data, duplicate data can accumulate in your data warehouse. Re-sending event data can happen, for example, when event-type rule definitions are updated.