Data Sync export handling

After Data Sync is configured, Pendo exports your Pendo data to the provided cloud storage. This article describes how to load these exports into your data warehouse.

Loading overview

A configured Pendo Data Sync export is delivered once per day to the cloud storage bucket path provided during Data Sync setup. Setup involves creating a cloud storage bucket and a service account that can access that bucket, along with a secret key that grants Pendo the permissions needed to write data into the bucket.

You can find the bucket path you created in the Destinations table in Settings > Data Management > Bulk Exports > Destinations under Bucket Address.

Once your destination is saved, you create an export. Daily exports are delivered to the path that was used as the cloud storage location in Pendo. For example, if the path gs://pendo-data was used, the daily exports would be delivered to gs://pendo-data/datasync/<subscription-id>/<application-id>/.

We create a folder inside the datasync folder in your cloud storage for each application that Pendo exports data for. Each application folder contains an export manifest and a unique hashed folder for each export. The hashed folder holds all of the avro files and a bill of materials file for the most recent export. For a description of these files, see Pendo Data Sync schema definitions.

This directory contains the following files and directories. Each table you create in your data warehouse corresponds to one of the avro files exported from Pendo.

gs://pendo-data/datasync/<subscription-id>/<application-id>/

├── exportmanifest.json
└── <export-uuid>/
   ├── billofmaterials.json
   ├── allevents.avro
   ├── allfeatures.avro
   ├── allguides.avro
   ├── allpages.avro
   ├── alltracktypes.avro
   └── matchedEvents/
       ├── Feature/
       │   └── <feature_id>.avro
       ├── Page/
       │   └── <page_id>.avro
       └── TrackType/
            └── <track_id>.avro

File description

For more information see Pendo Data Sync schema definitions.

exportmanifest.json: Concatenated list of each daily export's billofmaterials.json.
billofmaterials.json: A JSON representation of the export contents. This is used by ETL automation to load exported avro event files into a data warehouse.
allevents.avro: All event data. This includes both Pendo events that are associated with a Page, Feature, or Track Event and events that are not associated with any of these.
allfeatures.avro: Description of the application's exported Features.
allguides.avro: Description of the application's exported Guides.
allpages.avro: Description of the application's exported Pages.
alltracktypes.avro: Description of the application's exported Track Events.
<feature_id>.avro: All events for the given Feature ID. The Feature ID value is the unique identifier found in the Pendo UI.
<page_id>.avro: All events for the given Page ID. The Page ID value is the unique identifier found in the Pendo UI.
<track_id>.avro: All events for the given Track Event ID. The Track Event ID value is the unique identifier found in the Pendo UI.

The file names are relative. An absolute file name can be obtained by prepending the rootUrl field from the export manifest to the relative file name. The rootUrl also corresponds to the path of the billofmaterials.json file (excluding its filename).
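This join can be sketched in Python. A minimal sketch; the rootUrl and file names below are taken from the examples in this article:

```python
# Minimal sketch: resolve relative file names from a bill of materials
# against the rootUrl recorded in the export manifest.
manifest_entry = {
    "counter": 1,
    "rootUrl": "gs://pendo-data/datasync/6591622502678528/-323232/0f39bdf6-09c2-4e4d-6d4f-b02c961d8aaf",
}
relative_files = [
    "allevents.avro",
    "matchedEvents/Page/OMZ5WpI3HXIhNIIf8Sl_5zJF688.avro",
]

def absolute_paths(root_url, files):
    # Prepend rootUrl to each relative file name from the bill of materials.
    return [root_url.rstrip("/") + "/" + name for name in files]

for path in absolute_paths(manifest_entry["rootUrl"], relative_files):
    print(path)
```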

Bill of materials

The bill of materials documents the contents of the export. Below is an example of what the billofmaterials.json looks like for a Data Sync export.

{
   "timestamp": "2023-02-16T20:21:11Z",
   "numberOfFiles": 65,
   "application": {
     "displayName": "Acme CRM",
     "id": "-323232"
   },
   "subscription": {
     "displayName": "(Demo) Pendo Experience",
     "id": "6591622502678528"
   },
   "pageDefinitionsFile": [
     "allpages.avro"
   ],
   "featureDefinitionsFile": [
     "allfeatures.avro"
   ],
   "trackTypeDefinitionsFile": [
     "alltracktypes.avro"
   ],
   "guideDefinitionsFile": [
     "allguides.avro"
   ],
   "timeDependent": [
     {
       "periodId": 1675728000,
       "allEvents": {
         "eventCount": 9515,
         "files": [
           "allevents.avro"
         ]
       },
       "matchedEvents": [
         {
           "eventCount": 48314,
           "files": [
             "matchedEvents/Page/OMZ5WpI3HXIhNIIf8Sl_5zJF688.avro"
           ],
           "id": "Page/OMZ5WpI3HXIhNIIf8Sl_5zJF688",
           "type": "Page"
         }
       ]
     }
   ]
}
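A sketch of walking this structure in Python; the document below is an abbreviated version of the example bill of materials above:

```python
import json

# Minimal sketch: walk a billofmaterials.json (abbreviated from the
# example above) and collect the avro files to load for each period.
bom = json.loads("""
{
  "timestamp": "2023-02-16T20:21:11Z",
  "pageDefinitionsFile": ["allpages.avro"],
  "timeDependent": [
    {
      "periodId": 1675728000,
      "allEvents": {"eventCount": 9515, "files": ["allevents.avro"]},
      "matchedEvents": [
        {
          "eventCount": 48314,
          "files": ["matchedEvents/Page/OMZ5WpI3HXIhNIIf8Sl_5zJF688.avro"],
          "id": "Page/OMZ5WpI3HXIhNIIf8Sl_5zJF688",
          "type": "Page"
        }
      ]
    }
  ]
}
""")

for period in bom["timeDependent"]:
    # Gather the all-events file plus every matched-event file for the period.
    files = list(period["allEvents"]["files"])
    for matched in period["matchedEvents"]:
        files.extend(matched["files"])
    print(period["periodId"], files)
```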

Export manifest

The export manifest is a concatenation of multiple bills of materials, with some additional metadata. Below is an example of what the export manifest looks like for a Data Sync export.

{
  "exports": [
    {
      ...,
      "counter": 1,
      "finishTime": "2023-03-03T14:10:15.311651Z",
      "storageSize": 12130815,
      "rootUrl": "gs://pendo-data/datasync/6591622502678528/-323232/0f39bdf6-09c2-4e4d-6d4f-b02c961d8aaf"
    },
    {
      ...,
      "counter": 2,
      "finishTime": "2023-03-03T14:20:12.9489274Z",
      "storageSize": 23462682,
      "rootUrl": "gs://pendo-data/datasync/6591622502678528/-323232/b979502c-1a01-4569-74cf-e4a7f5049d8f"
    }
  ],
  "generatedTime": "2023-03-05T04:17:59.853205005Z"
}

Example load flow

This example creates a separate table in the data warehouse for each event-type file. You can load the data in whatever layout suits your needs, as long as previously loaded data is replaced correctly.

Step 1. Read the most recent export manifest

Read the most recent exportmanifest.json file to find all unprocessed exports since the last time data was loaded. You can use the counter field as a marker for load progress.
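The counter-based progress check can be sketched as follows; the persisted last_loaded_counter value and the abbreviated rootUrl values are illustrative:

```python
# Minimal sketch: use the "counter" field in exportmanifest.json as a
# load-progress marker. last_loaded_counter would normally be persisted
# by your ETL job between runs; the value here is illustrative.
manifest = {
    "exports": [
        {"counter": 1, "rootUrl": "gs://pendo-data/datasync/example/export-1"},
        {"counter": 2, "rootUrl": "gs://pendo-data/datasync/example/export-2"},
    ]
}

last_loaded_counter = 1  # marker saved after the previous load

# Everything with a higher counter has not been processed yet.
unprocessed = [
    export for export in manifest["exports"]
    if export["counter"] > last_loaded_counter
]
for export in sorted(unprocessed, key=lambda e: e["counter"]):
    print("load", export["counter"], export["rootUrl"])
```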

Step 2. Iterate over each entry in the exports list

Cycle through the entries in the list to process the entries and load them into a table.

  1. Load allpages.

    If the allpages table doesn't exist, create it. If the allpages table exists, drop all data. Then, load all avro files from the list pointed to by the pageDefinitionsFile field into the allpages table.

  2. Load allfeatures.

    If the allfeatures table doesn't exist, create it. If the allfeatures table exists, drop all data. Then, load all avro files from the list pointed to by the featureDefinitionsFile field into the allfeatures table.

  3.  Load alltracktypes.

    If the alltracktypes table doesn't exist, create it. If the alltracktypes table exists, drop all data. Then, load all avro files from the list pointed to by the trackTypeDefinitionsFile field into the alltracktypes table.

  4. Load allguides.

    If the allguides table doesn't exist, create it. If the allguides table exists, drop all data. Then, load all avro files from the list pointed to by the guideDefinitionsFile field into the allguides table.
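The drop-and-reload pattern used in all four steps above can be sketched with SQLite standing in for the data warehouse. The columns and rows are illustrative; in practice the schema comes from the avro files referenced by pageDefinitionsFile, featureDefinitionsFile, and so on:

```python
import sqlite3

# Minimal sketch of the drop-and-reload pattern for definition tables.
# SQLite stands in for the warehouse; columns and rows are illustrative.
conn = sqlite3.connect(":memory:")

def reload_definitions(conn, table, rows):
    # Create the table if it doesn't exist, drop all existing data,
    # then load the new definition rows.
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (id TEXT, name TEXT)")
    conn.execute(f"DELETE FROM {table}")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    conn.commit()

reload_definitions(conn, "allpages", [("p1", "Home"), ("p2", "Settings")])
reload_definitions(conn, "allpages", [("p3", "Dashboard")])  # next export replaces all rows
count = conn.execute("SELECT COUNT(*) FROM allpages").fetchone()[0]
print(count)  # only the most recent export's rows remain
```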

Step 3. Iterate over time-dependent events

Iterate over all items in the timeDependent list. Then, load allevents.avro. If the allevents table doesn't exist, create it. If the allevents table exists, drop any event data for the given periodId from the table and append data from the allevents.avro file to the allevents table.

Then iterate over all items in the matchedEvents list. Load events for each event type designated by the id field. If the table for an event type doesn't exist, create it. If the table for an event type exists, drop any event data for the given periodId from the table and append data from all avro files in the files field to the event type table.
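The delete-then-append pattern for time-dependent tables can be sketched the same way, again with SQLite standing in for the warehouse and illustrative rows:

```python
import sqlite3

# Minimal sketch of loading time-dependent events: delete any rows for
# the incoming periodId, then append the new rows. SQLite stands in for
# the warehouse; the schema and values are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS allevents (periodId INTEGER, event TEXT)")

def load_period(conn, table, period_id, rows):
    # Replace any previously loaded data for this period, then append.
    conn.execute(f"DELETE FROM {table} WHERE periodId = ?", (period_id,))
    conn.executemany(
        f"INSERT INTO {table} VALUES (?, ?)",
        [(period_id, r) for r in rows],
    )
    conn.commit()

load_period(conn, "allevents", 1675728000, ["click", "load"])
load_period(conn, "allevents", 1675728000, ["click"])  # re-sent period replaces, no duplicates
remaining = conn.execute("SELECT COUNT(*) FROM allevents").fetchone()[0]
print(remaining)
```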

Loading data files into your data warehouse

When loading the resulting avro files into your data warehouse, replace the previous event-type description files with each export and use logical avro type mapping.

Loading event-type descriptions

The latest versions of the Pendo event-type description files are sent in each export. You must replace this data when loading each Data Sync export. The lists of files to be loaded are referenced by the following fields:

  • pageDefinitionsFile
  • featureDefinitionsFile
  • trackTypeDefinitionsFile
  • guideDefinitionsFile

Wherever avro files are represented as a list, that list can be empty. An empty list signifies that the content for that period should be dropped. This can occur, for example, in the following scenario:

  • A Page was previously tagged and matched a non-zero number of events in a period.
  • The Page’s matching rules got updated.
  • Pendo reprocessed events for that period, and the new rules don't match any event.
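In a loader, this means the period's existing rows must be dropped even when there are no new files to append. A minimal sketch with illustrative drop/load bookkeeping:

```python
# Minimal sketch: an empty "files" list is valid and means the period's
# existing data for that event type should be dropped with nothing
# loaded in its place (for example, after a Page is retagged so that no
# events match). The lists below are illustrative stand-ins for a real
# load layer.
dropped, loaded = [], []

def load_matched(entry, period_id):
    dropped.append((entry["id"], period_id))  # always drop the period first
    loaded.extend(entry["files"])             # may append nothing

load_matched({"id": "Page/abc", "files": []}, 1675728000)
print(dropped, loaded)
```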

Loading time-dependent events

The timeDependent block of the bill of materials contains data that is associated with a particular periodId value. The periodId signifies which logical time period a particular event resides in. This period equates to a day of event data. You will never find the same event in two different periods.

Load the event avro files with your data warehouse's option to use logical avro type mapping. This way, the browserTimestamp value is loaded as a TIMESTAMP and periodId is loaded as a DATE data type.
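For example, BigQuery's bq load command exposes such an option via the --use_avro_logical_types flag. A sketch only; the dataset, table, and object path below are placeholders:

```shell
# Illustrative bq CLI invocation (dataset, table, and path are
# placeholders). --use_avro_logical_types maps avro logical types so
# that browserTimestamp loads as TIMESTAMP and periodId as DATE.
bq load \
  --source_format=AVRO \
  --use_avro_logical_types \
  mydataset.allevents \
  "gs://pendo-data/datasync/<subscription-id>/<application-id>/<export-uuid>/allevents.avro"
```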

Prior to loading any of the avro files in the timeDependent block, drop any existing data that corresponds to the same event type (Page, Feature, or Track) and periodId.

Pendo can send updates for a given <event type>/<periodId>, and if you don't replace the existing data, you can accumulate duplicate data in your data warehouse. Re-sending event data can happen when event-type rule definitions are updated.
