After Data Sync is configured, Pendo exports your Pendo data to the provided cloud storage. This article describes how to load these exports into your data warehouse.
For sample code showing how to process Data Sync exports from Google Cloud Storage into Google BigQuery, see data-sync-gcp-export-loading-example.
- Data Sync configured, with at least one destination. For more information, see Set up Data Sync with Google Cloud, Set up Data Sync with Amazon S3, or Set up Data Sync with Microsoft Azure Storage.
- Knowledge of your data warehouse's commands or SDK for creating tables and loading them from an avro file source.
We currently cap the Avro file size at 4GB. Because these files can get large, they are intended to be streamed into your ETL and warehouse.
A configured Pendo Data Sync export is delivered to the cloud storage bucket path provided during Data Sync setup once per day. This involves creating a cloud storage bucket and a service account that can access that bucket, along with a secret key that gives Pendo the necessary permissions to write data into the bucket.
You can find the bucket path you created in the Destinations table in Settings > Data Management > Bulk Exports > Destinations under Bucket Address.
Once your destination is saved as per the above process, you create an export. Daily exports are delivered to the path that was used as the cloud storage location in Pendo. For example, if the path gs://pendo-data was used, the daily exports are delivered to gs://pendo-data/datasync/<subscription-id>/<application-id>/.
We create a folder inside the datasync folder in your cloud storage for each application that Pendo exports data for. In each application folder, there is an export manifest, which is updated after an export completes for that application, and a unique hashed folder for each export run. The hashed folder holds all the avro files and a bill of materials file for the most recent export. For a description of these files, see Pendo Data Sync schema definitions.
This directory contains the following files and directories. Each of the tables corresponds with one of the avro files you exported from Pendo.

datasync/
└── <subscription-id>/
    └── <application-id>/
        ├── exportmanifest.json
        └── <export-hash>/
            ├── billofmaterials.json
            ├── allevents.avro
            ├── allfeatures.avro
            ├── allguides.avro
            ├── allpages.avro
            ├── alltracktypes.avro
            ├── <feature_id>.avro
            ├── <page_id>.avro
            └── <track_event_id>.avro
Updates to exported data
There are a few cases where data for previously exported days needs to be re-exported. These exports can be handled the same way as regular exports because the directory structure and schema remain the same.
When rules for Features and Pages are added or updated, Pendo applies new or updated tags to the lifetime of your data since Pendo was installed. The reprocessed data is handled seamlessly in the Pendo UI.
To keep your previously exported data up to date with the reprocessing that occurred in the Pendo UI, we initiate a retroactive export whenever rules are added or changed, which re-exports data for any days previously exported through an existing historical or recurring export. The contents of this export are the same as those of a regular export, with the following exceptions:
- There is no allevents.avro file because retroactive exports only re-export event data for the reprocessed Feature or Page.
- We re-export event data for the reprocessed Feature or Page for any dates previously exported by a historical or recurring export.
Pendo event data is finalized after eight days. Until that time, small changes can occur to your aggregate data as Pendo collects additional data, for example, if session information from yesterday isn't captured until today because a user left their browser window open. These changes are handled seamlessly in the Pendo UI.
To reflect these changes in Data Sync exports, we send two separate exports for finalized data (eight days out) and unfinished data (from yesterday) for a daily export. For example, for today’s daily export, you can expect two exports in the form of two uniquely hashed folders to be delivered to
gs://pendo-data/datasync/<subscription-id>/<application-id>/: one containing data from eight days ago, and one containing data from yesterday. This also means that in eight days, you receive a finalized version of today’s data.
Prior to July 14, 2023, only finalized data was exported. If you enabled a recurring export before this date, you see two exports instead of one for each day for finalized and unfinished data starting on July 14, 2023. Unfinished data isn't exported for the seven days prior to this date. Those days are covered when the finalized data is sent for those days in the seven days following July 14, 2023.
For more information see Pendo Data Sync schema definitions.
| File | Description |
| --- | --- |
| exportmanifest.json | Concatenated list of daily export bills of materials, with some additional metadata. |
| billofmaterials.json | A JSON representation of the export contents. This is used by ETL automation to load exported avro event files into a data warehouse. |
| allevents.avro | All event data. This includes both Pendo events that are associated with a Page, Feature, or Track Event as well as events that are not associated. |
| allfeatures.avro | Description of the application's exported Features. |
| allguides.avro | Description of the application's exported Guides. |
| allpages.avro | Description of the application's exported Pages. |
| alltracktypes.avro | Description of the application's exported Track Events. |
| <feature_id>.avro | All events for the given Feature ID. This Feature ID value is the unique identifier that is found in the Pendo UI. |
| <page_id>.avro | All events for the given Page ID. This Page ID value is the unique identifier that is found in the Pendo UI. |
| <track_event_id>.avro | All events for the given Track Event ID. This Track Event ID value is the unique identifier that is found in the Pendo UI. |
The file names are relative. An absolute file name can be obtained by prepending the rootUrl field from the export manifest to the relative file name. The rootUrl also corresponds to the path of the billofmaterials.json file (excluding its filename).
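As a minimal sketch, resolving an absolute file name from the rootUrl is simple string joining. The example values below are hypothetical, not taken from a real export.

```python
# Join the export manifest's rootUrl with a relative file name from the
# bill of materials, tolerating stray slashes on either side.
def absolute_file_name(root_url: str, relative_name: str) -> str:
    return root_url.rstrip("/") + "/" + relative_name.lstrip("/")

print(absolute_file_name(
    "gs://pendo-data/datasync/sub-1/app-1/a1b2c3", "allevents.avro"))
# → gs://pendo-data/datasync/sub-1/app-1/a1b2c3/allevents.avro
```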
Bill of materials
The bill of materials documents the contents of the export in the billofmaterials.json file.
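As an abridged, illustrative sketch, a bill of materials might look like the following. The field names are those cited in this article (rootUrl, displayName, the *DefinitionsFile fields, timeDependent, periodId, matchedEvents, id, files); the allEventsFile name and the exact value shapes are assumptions, so consult a real export for the authoritative format.

```json
{
  "rootUrl": "gs://pendo-data/datasync/<subscription-id>/<application-id>/<export-hash>",
  "displayName": "Acme CRM",
  "pageDefinitionsFile": ["allpages.avro"],
  "featureDefinitionsFile": ["allfeatures.avro"],
  "trackTypeDefinitionsFile": ["alltracktypes.avro"],
  "guideDefinitionsFile": ["allguides.avro"],
  "timeDependent": [
    {
      "periodId": "2023-07-14",
      "allEventsFile": "allevents.avro",
      "matchedEvents": [
        { "id": "<feature_id>", "files": ["<feature_id>.avro"] }
      ]
    }
  ]
}
```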
The export manifest is a concatenation of multiple bills of materials, with some additional metadata.
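As an abridged sketch (the counter and exports names come from this article; the entry contents and counter placement are assumptions), the export manifest might look like:

```json
{
  "exports": [
    { "counter": 1, "displayName": "Acme CRM", "rootUrl": "gs://.../<export-hash-1>" },
    { "counter": 2, "displayName": "(Demo) Pendo Experience", "rootUrl": "gs://.../<export-hash-2>" }
  ]
}
```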
Example load flow
This example creates a separate table in the data warehouse for each event-type file. You can load data in whatever way suits your needs, as long as updated data is replaced correctly.
Step 1. Read the most recent export manifest
Read the most recent
exportmanifest.json file to find all unprocessed exports since the last time data was loaded. You can use the counter field as a marker for load progress.
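A sketch of this step, assuming each entry in the manifest's exports list carries a monotonically increasing counter field (the exact manifest shape may differ):

```python
import json

# Return the export entries that have not yet been loaded, using the
# counter field as the load-progress marker.
def unprocessed_exports(manifest_json: str, last_loaded_counter: int) -> list:
    manifest = json.loads(manifest_json)
    return [e for e in manifest.get("exports", [])
            if e.get("counter", 0) > last_loaded_counter]

sample = '{"exports": [{"counter": 1}, {"counter": 2}, {"counter": 3}]}'
print(unprocessed_exports(sample, 1))  # → [{'counter': 2}, {'counter': 3}]
```

Persist the highest counter you have loaded so the next run can resume from it.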
Step 2. Iterate over each entry in the exports list
Cycle through the entries in the list, processing each one and loading it into the corresponding tables:

- If the allpages table doesn't exist, create it. If the allpages table exists, drop all of its data. Then, load all avro files from the list pointed to by the pageDefinitionsFile field into the allpages table.
- If the allfeatures table doesn't exist, create it. If the allfeatures table exists, drop all of its data. Then, load all avro files from the list pointed to by the featureDefinitionsFile field into the allfeatures table.
- If the alltracktypes table doesn't exist, create it. If the alltracktypes table exists, drop all of its data. Then, load all avro files from the list pointed to by the trackTypeDefinitionsFile field into the alltracktypes table.
- If the allguides table doesn't exist, create it. If the allguides table exists, drop all of its data. Then, load all avro files from the list pointed to by the guideDefinitionsFile field into the allguides table.
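The create-or-truncate-then-load behavior above can be sketched like this, using an in-memory dict in place of a real warehouse and a stubbed avro reader (both are stand-ins, not Pendo or warehouse APIs):

```python
# Stand-in for the warehouse: table name -> list of rows.
warehouse: dict[str, list] = {}

def read_avro_rows(file_name: str) -> list:
    # Placeholder for a real avro reader; returns fake rows here.
    return [f"row-from-{file_name}"]

def replace_definition_table(table: str, avro_files: list) -> None:
    # Create the table if missing; drop all existing data if present.
    warehouse[table] = []
    # Load every avro file in the list into the table.
    for file_name in avro_files:
        warehouse[table].extend(read_avro_rows(file_name))

# e.g. the file list pointed to by the pageDefinitionsFile field:
replace_definition_table("allpages", ["allpages.avro"])
print(warehouse["allpages"])  # → ['row-from-allpages.avro']
```

Because the table is emptied first, reloading the same export is idempotent.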
Step 3. Iterate over time-dependent events
Iterate over all items in the timeDependent list. Then, load allevents.avro. If the allevents table doesn't exist, create it. If the allevents table exists, drop any event data for the given periodId from the table and append data from the allevents.avro file to the allevents table.

Then iterate over all items in the matchedEvents list. Load events for each event type designated by the id field. If the table for an event type doesn't exist, create it. If the table for an event type exists, drop any event data for the given periodId from the table and append data from all avro files in the files field to the event type table.
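The drop-then-append semantics for a periodId can be sketched the same way, again with an in-memory stand-in rather than a real warehouse API:

```python
# Stand-in events table: list of (period_id, event) rows.
allevents: list[tuple[str, str]] = []

def load_period(table: list, period_id: str, rows: list) -> None:
    # Drop any event data previously loaded for this period...
    table[:] = [r for r in table if r[0] != period_id]
    # ...then append the newly exported rows.
    table.extend((period_id, row) for row in rows)

load_period(allevents, "2023-07-14", ["evt-1", "evt-2"])
# A re-export for the same period replaces, rather than duplicates, the data.
load_period(allevents, "2023-07-14", ["evt-1", "evt-2", "evt-3"])
print(len(allevents))  # → 3
```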
Loading data files into your data warehouse
When loading the resulting avro files into your data warehouse, you must ensure that you replace the previous event-type description files with each export and use logical avro type mapping.
Loading event-type descriptions
The latest version of the Pendo event-type description files is sent in each export. You must replace this data when loading each Data Sync export. The list of files to load is referenced by the following fields:

- pageDefinitionsFile
- featureDefinitionsFile
- trackTypeDefinitionsFile
- guideDefinitionsFile
Wherever avro files are represented as a list, it's possible for that list to be empty. This signifies that content for that period should be dropped. This can occur, for example, in the following scenario:
- A Page was previously tagged and matched a non-zero number of events in a period.
- The Page’s matching rules got updated.
- Pendo reprocessed events for that period, and the new rules don't match any event.
Loading time-dependent events
The timeDependent block of the bill of materials contains data that is associated with a particular periodId value. The periodId signifies which logical time period a particular event resides in. This period equates to a day of event data. You will never find the same event in two different periods.
Load the event avro files with your data warehouse’s option to use logical avro type mapping. This way, the
browserTimestamp value is loaded as a TIMESTAMP and
periodId is loaded as a DATE data type.
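In avro, a date logical type is stored as an int of days since the Unix epoch, and a timestamp-micros logical type as a long of microseconds since the epoch; logical type mapping is what turns these integers back into DATE and TIMESTAMP values instead of leaving them as raw numbers. A small illustration (that periodId and browserTimestamp use these particular logical types is an assumption consistent with them loading as DATE and TIMESTAMP):

```python
from datetime import date, datetime, timedelta, timezone

def decode_avro_date(days_since_epoch: int) -> date:
    # avro "date" logical type: int days since 1970-01-01.
    return date(1970, 1, 1) + timedelta(days=days_since_epoch)

def decode_avro_timestamp_micros(micros: int) -> datetime:
    # avro "timestamp-micros" logical type: long microseconds since epoch (UTC).
    return datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)

print(decode_avro_date(19552))  # → 2023-07-14
```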
Before loading any of the avro files in the timeDependent block, you must drop any existing data that corresponds to the same event type (Page, Feature, or Track) and periodId.
Pendo can send updates for a given <event type>/<periodId>, and if you don't replace the data, you can accumulate duplicate data in your data warehouse. Re-sending event data can happen when event-type rule definitions are updated.