After Data Sync is configured, Pendo exports your Pendo data to the provided cloud storage. This article describes how to load these exports into your data warehouse.
- Data Sync configured, with at least one destination. For more information, see Set up Data Sync with Google Cloud or Set up Data Sync with Amazon S3.
- Knowledge of your data warehouse's commands or SDK for creating tables and loading them from avro file sources.
A configured Pendo Data Sync export is delivered once per day to the cloud storage bucket path provided during Data Sync setup. Setup involves creating a cloud storage bucket and a service account that can access that bucket, along with a secret key that gives Pendo the necessary permissions to write data into the bucket.
You can find the bucket path you created in the Destinations table in Settings > Data Management > Bulk Exports > Destinations under Bucket Address.
Once your destination is saved, you create an export. Daily exports are delivered to the path that you provided as the cloud storage location in Pendo. For example, if the path `gs://pendo-data` was used, the daily exports would be delivered to `gs://pendo-data/datasync/`.
We create a folder inside the datasync folder in your cloud storage for each application that Pendo exports data for. In each application folder, there is an export manifest and a unique hashed folder for each export. The hashed folder holds all the avro files and a bill of materials file for the most recent export. For a description of these files, see Pendo Data Sync schema definitions.
This directory contains the following files and directories. Each of the tables you create corresponds to one of the avro files you exported from Pendo. The layout below is a partial sketch reconstructed from the file descriptions in this article; per-ID event files may be grouped into subdirectories:

```
<bucket>/datasync/<app_id>/
├── exportmanifest.json
└── <export_hash>/
    ├── billofmaterials.json
    ├── allevents.avro
    ├── allfeatures.avro
    ├── allguides.avro
    ├── allpages.avro
    ├── alltracktypes.avro
    ├── <feature_id>.avro
    ├── <page_id>.avro
    └── <track_event_id>.avro
```
For more information see Pendo Data Sync schema definitions.
| File | Description |
|---|---|
| `exportmanifest.json` | Concatenated list of daily exports. |
| `billofmaterials.json` | A JSON representation of the export contents. This is used by ETL automation to load exported avro event files into a data warehouse. |
| `allevents.avro` | All event data. This includes both Pendo events that are associated with a Page, Feature, or Track Event and events that are not associated with any of them. |
| `allfeatures.avro` | Description of the application's exported Features. |
| `allguides.avro` | Description of the application's exported Guides. |
| `allpages.avro` | Description of the application's exported Pages. |
| `alltracktypes.avro` | Description of the application's exported Track Events. |
| `<feature_id>.avro` | All events for the given Feature ID. This Feature ID value is the unique identifier found in the Pendo UI. |
| `<page_id>.avro` | All events for the given Page ID. This Page ID value is the unique identifier found in the Pendo UI. |
| `<track_event_id>.avro` | All events for the given Track Event ID. This Track Event ID value is the unique identifier found in the Pendo UI. |
The file names are relative. An absolute file name can be obtained by prepending the `rootUrl` field from the export manifest to the relative file name. The `rootUrl` also corresponds to the path of the `billofmaterials.json` file (excluding its filename).
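A minimal sketch of this in Python; the `rootUrl` value and file names below are illustrative, not taken from a real export:

```python
# Build absolute file names by prepending the manifest's rootUrl
# to each relative file name from the bill of materials.
root_url = "gs://pendo-data/datasync/5123456789012345/ab12cd34/"  # illustrative

relative_files = ["allevents.avro", "billofmaterials.json"]
absolute_files = [root_url + name for name in relative_files]

print(absolute_files[0])
# gs://pendo-data/datasync/5123456789012345/ab12cd34/allevents.avro
```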
Bill of materials
The bill of materials documents the contents of the export. Each `billofmaterials.json` identifies its application by a `displayName` field (for example, "Acme CRM" or "(Demo) Pendo Experience"), alongside the fields described in the rest of this article.
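As an illustration only, the fields described in this article fit together roughly as follows; the nesting and values are a hedged sketch, not the authoritative schema:

```json
{
  "rootUrl": "gs://pendo-data/datasync/<app_id>/<export_hash>/",
  "displayName": "Acme CRM",
  "pageDefinitionsFile": ["allpages.avro"],
  "featureDefinitionsFile": ["allfeatures.avro"],
  "trackTypeDefinitionsFile": ["alltracktypes.avro"],
  "guideDefinitionsFile": ["allguides.avro"],
  "timeDependent": [
    {
      "periodId": "2024-01-01",
      "allevents": { "files": ["allevents.avro"] },
      "matchedEvents": [
        { "id": "<feature_id>", "files": ["<feature_id>.avro"] }
      ]
    }
  ]
}
```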
The export manifest is a concatenation of multiple bills of materials, with some additional metadata.
Example load flow
This example creates a separate table in the data warehouse for each event type file. You can structure the load differently to suit your needs, as long as previously loaded data is replaced correctly.
Step 1. Read the most recent export manifest
Read the most recent `exportmanifest.json` file to find all unprocessed exports since the last time data was loaded. You can use the `counter` field as a marker for load progress.
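A sketch of this step in Python. The manifest shown here is a stand-in: the top-level `exports` key is an assumption for illustration, while `counter` and `rootUrl` are fields this article describes:

```python
import json

# Hypothetical manifest content: a concatenation of bills of materials,
# each carrying a monotonically increasing "counter" value.
manifest = json.loads("""
{"exports": [
  {"counter": 1, "rootUrl": "gs://pendo-data/datasync/app/aaa/"},
  {"counter": 2, "rootUrl": "gs://pendo-data/datasync/app/bbb/"},
  {"counter": 3, "rootUrl": "gs://pendo-data/datasync/app/ccc/"}
]}
""")

last_loaded_counter = 1  # persisted from the previous load run

# Select only the exports that have not been processed yet.
unprocessed = [e for e in manifest["exports"]
               if e["counter"] > last_loaded_counter]

print([e["counter"] for e in unprocessed])  # [2, 3]
```

After a successful load, persist the highest `counter` you processed so the next run resumes from there.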
Step 2. Iterate over each entry in the exports list
Cycle through the entries in the exports list, processing each one and loading it into the appropriate tables.
- If the `allpages` table doesn't exist, create it. If the `allpages` table exists, drop all data. Then, load all avro files from the list pointed to by the `pageDefinitionsFile` field into the `allpages` table.
- If the `allfeatures` table doesn't exist, create it. If the `allfeatures` table exists, drop all data. Then, load all avro files from the list pointed to by the `featureDefinitionsFile` field into the `allfeatures` table.
- If the `alltracktypes` table doesn't exist, create it. If the `alltracktypes` table exists, drop all data. Then, load all avro files from the list pointed to by the `trackTypeDefinitionsFile` field into the `alltracktypes` table.
- If the `allguides` table doesn't exist, create it. If the `allguides` table exists, drop all data. Then, load all avro files from the list pointed to by the `guideDefinitionsFile` field into the `allguides` table.
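The create-or-truncate-then-load pattern above can be sketched as follows. This is a minimal illustration, not a production loader: SQLite stands in for your warehouse, `read_avro_records` is a stub for a real avro reader, and the single-column schema is invented for the example. The `*DefinitionsFile` field names come from the bill of materials:

```python
import sqlite3

def read_avro_records(path):
    # Stub standing in for a real avro reader; returns one
    # illustrative record per file.
    return [{"id": path}]

def replace_table(conn, table, avro_files):
    """Create the table if needed, clear it, then load every avro file."""
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (id TEXT)")
    conn.execute(f"DELETE FROM {table}")  # drop all previous definitions
    for path in avro_files:
        for record in read_avro_records(path):
            conn.execute(f"INSERT INTO {table} (id) VALUES (?)",
                         (record["id"],))
    conn.commit()

# Map each bill-of-materials field to its destination table.
bom = {
    "pageDefinitionsFile": ["allpages.avro"],
    "featureDefinitionsFile": ["allfeatures.avro"],
    "trackTypeDefinitionsFile": ["alltracktypes.avro"],
    "guideDefinitionsFile": ["allguides.avro"],
}
tables = {"pageDefinitionsFile": "allpages",
          "featureDefinitionsFile": "allfeatures",
          "trackTypeDefinitionsFile": "alltracktypes",
          "guideDefinitionsFile": "allguides"}

conn = sqlite3.connect(":memory:")
for field, table in tables.items():
    replace_table(conn, table, bom[field])
```

Truncating before each load is what makes the definition tables safe to reload: each export fully replaces the previous definitions.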
Step 3. Iterate over time-dependent events
Iterate over all items in the `timeDependent` list and load `allevents.avro`. If the `allevents` table doesn't exist, create it. If the `allevents` table exists, drop any event data for the given `periodId` from the table and append the data from the `allevents.avro` file to the `allevents` table.
Then iterate over all items in the `matchedEvents` list and load events for each event type designated by the `id` field. If the table for an event type doesn't exist, create it. If the table for an event type exists, drop any event data for the given `periodId` from the table and append the data from all avro files in the `files` field to the event type table.
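A sketch of this delete-then-append pattern, under the same assumptions as the previous example (SQLite as a stand-in warehouse, a stubbed avro reader, and an illustrative `timeDependent` entry whose exact nesting is inferred from the fields this article describes):

```python
import sqlite3

def read_avro_records(path):
    # Stub for a real avro reader; yields illustrative event rows.
    return [{"periodId": "2024-01-01", "event": path}]

def load_period(conn, table, period_id, avro_files):
    """Replace one period's events: delete the period, then append files.

    An empty file list still deletes the period: the export is telling
    us that period no longer has any matching events.
    """
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} (periodId TEXT, event TEXT)")
    conn.execute(f"DELETE FROM {table} WHERE periodId = ?", (period_id,))
    for path in avro_files:
        for r in read_avro_records(path):
            conn.execute(f"INSERT INTO {table} VALUES (?, ?)",
                         (r["periodId"], r["event"]))
    conn.commit()

# One illustrative timeDependent entry.
entry = {
    "periodId": "2024-01-01",
    "allevents": {"files": ["allevents.avro"]},
    "matchedEvents": [{"id": "feature123", "files": ["feature123.avro"]}],
}

conn = sqlite3.connect(":memory:")
load_period(conn, "allevents", entry["periodId"],
            entry["allevents"]["files"])
for matched in entry["matchedEvents"]:
    load_period(conn, matched["id"], entry["periodId"], matched["files"])

# Reloading the same period is idempotent: no duplicate rows accumulate.
load_period(conn, "allevents", entry["periodId"],
            entry["allevents"]["files"])
```

Deleting by `periodId` before appending is what keeps reprocessed exports from producing duplicates.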
Loading data files into your data warehouse
When loading the resulting avro files into your data warehouse, you must ensure that you replace the previous event-type description files with each export and use logical avro type mapping.
Loading event-type descriptions
The latest versions of the Pendo event-type description files are sent in each export. You must replace this data when loading each Data Sync export. The lists of files to be loaded are referenced by the following fields:

- `pageDefinitionsFile`
- `featureDefinitionsFile`
- `trackTypeDefinitionsFile`
- `guideDefinitionsFile`
Wherever avro files are represented as a list, it's possible for that list to be empty. This signifies that content for that period should be dropped. This can occur, for example, in the following scenario:
- A Page was previously tagged and matched a non-zero number of events in a period.
- The Page’s matching rules got updated.
- Pendo reprocessed events for that period, and the new rules don't match any event.
Loading time-dependent events
The `timeDependent` block of the bill of materials contains data that is associated with a particular `periodId` value. The `periodId` signifies which logical time period a particular event resides in. This period equates to one day of event data. You will never find the same event in two different periods.
Load the event avro files with your data warehouse's option to use logical avro type mapping. This way, the `browserTimestamp` value is loaded as a TIMESTAMP and `periodId` is loaded as a DATE data type.
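For example, if your destination is BigQuery (one possible warehouse; the dataset, table, and object path below are placeholders), the `bq` command-line tool enables logical avro type mapping with a flag:

```
bq load \
  --source_format=AVRO \
  --use_avro_logical_types \
  my_dataset.allevents \
  "gs://pendo-data/datasync/<app_id>/<export_hash>/allevents.avro"
```

Other warehouses expose equivalent options; consult your warehouse's avro load documentation for the matching setting.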
Prior to loading any of the avro files in the `timeDependent` block, you must drop any existing data that corresponds to the same event type (Page, Feature, or Track) and `periodId`.
Pendo can send updates to a given `<event type>/<periodId>`, and if you don't replace the existing data, duplicate rows accumulate in your data warehouse. Event data can be resent when event-type rule definitions are updated.