Data Sync export handling


After Data Sync is configured, Pendo exports your Pendo data to the provided cloud storage. This article describes how to load these exports into your data warehouse.

For sample code showing how to process Data Sync exports from Google Cloud Storage into Google BigQuery, see data-sync-gcp-export-loading-example.

Prerequisites

File processing

We currently cap the Avro file size at 4GB. Because these files can be large, they are intended to be streamed into your ETL pipeline and data warehouse rather than read into memory all at once.
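A minimal, library-agnostic sketch of this streaming pattern is shown below. The chunk size and the in-memory stand-in for a cloud storage download stream are illustrative assumptions; substitute your storage client's file-like object.

```python
import io

def stream_chunks(fileobj, chunk_size=8 * 1024 * 1024):
    """Yield successive chunks from a file-like object without
    loading the whole file into memory."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Example with an in-memory stand-in for a storage download stream.
source = io.BytesIO(b"x" * 100)
total = sum(len(chunk) for chunk in stream_chunks(source, chunk_size=32))
# total == 100, read in 32-byte chunks
```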

Loading overview

A configured Pendo Data Sync export is delivered once per day to the cloud storage bucket path provided during Data Sync setup. Setup involves creating a cloud storage bucket and a service account that can access that bucket, along with a secret key that grants Pendo the permissions needed to write data into the bucket.

You can find the bucket path you created in the Destinations table in Settings > Data Management > Bulk Exports > Destinations under Bucket Address.

After your destination is saved as described above, you create an export. Daily exports are delivered to the path that was used as the cloud storage location in Pendo. For example, if the path gs://pendo-data was used, the daily exports would be delivered to gs://pendo-data/datasync/<subscription-id>/<application-id>/.

We create a folder inside the datasync folder in your cloud storage for each application for which Pendo exports data. Each application folder contains an export manifest, which is updated after an export completes for that application, and a unique hashed folder for each export run. The hashed folder holds all the Avro files and a bill of materials file for the most recent export. For a description of these files, see Pendo Data Sync schema definitions.

This directory contains the following files and directories. Each table you create in your data warehouse corresponds to one of the Avro files exported from Pendo.

gs://pendo-data/datasync/<subscription-id>/<application-id>/

├── exportmanifest.json
└── <export-uuid>/
   ├── billofmaterials.json
   ├── allevents.avro
   ├── allfeatures.avro
   ├── allguides.avro
   ├── allpages.avro
   ├── alltracktypes.avro
   └── matchedEvents/
       ├── Feature/
       │   └── <feature_id>.avro
       ├── Page/
       │   └── <page_id>.avro
       └── TrackType/
            └── <track_id>.avro

Updates to exported data

There are a few cases where data for previously exported days needs to be re-exported. These re-exports can be handled the same way as regular exports because the directory structure and schema remain the same.

Retroactive processing

When rules for Features and Pages are added or updated, Pendo applies new or updated tags to the lifetime of your data since Pendo was installed. The reprocessed data is handled seamlessly in the Pendo UI.

To keep your previously exported data up to date with the reprocessing that occurred in the Pendo UI, we initiate a retroactive export whenever rules are added or changed, which re-exports data for any days previously exported through an existing historical or recurring export. The contents of this export are the same as those of a regular export, with the following exceptions:

  • There is no allevents.avro file because retroactive exports only re-export event data for the reprocessed Feature or Page.

  • We re-export event data for the reprocessed Feature or Page for any dates previously exported by a historical or recurring export.

Unfinished days

Pendo event data is finalized after eight days. Until that time, small changes can occur to your aggregate data as Pendo collects additional data, for example, if session information from yesterday isn't captured until today because a user left their browser window open. These changes are handled seamlessly in the Pendo UI.

To reflect these changes in Data Sync exports, we send two separate exports for finalized data (eight days out) and unfinished data (from yesterday) for a daily export. For example, for today’s daily export, you can expect two exports in the form of two uniquely hashed folders to be delivered to gs://pendo-data/datasync/<subscription-id>/<application-id>/: one containing data from eight days ago, and one containing data from yesterday. This also means that in eight days, you receive a finalized version of today’s data.
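Under these rules, the two hashed folders delivered on a given day cover yesterday (unfinished) and eight days ago (finalized). A small sketch of that date arithmetic, assuming UTC calendar days:

```python
from datetime import date, timedelta

def expected_export_dates(today):
    """Return the (unfinished, finalized) event dates expected in
    today's two Data Sync deliveries."""
    unfinished = today - timedelta(days=1)  # yesterday, still subject to change
    finalized = today - timedelta(days=8)   # event data is finalized after eight days
    return unfinished, finalized

unfinished, finalized = expected_export_dates(date(2023, 7, 22))
# unfinished -> 2023-07-21, finalized -> 2023-07-14
```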

Prior to July 14, 2023, only finalized data was exported. If you enabled a recurring export before this date, you see two exports (finalized and unfinished data) instead of one for each day starting on July 14, 2023. Unfinished data isn't exported for the seven days prior to this date; those days are covered when the finalized data for them is sent in the seven days following July 14, 2023.

File description

For more information, see Pendo Data Sync schema definitions.

  • exportmanifest.json: Concatenated list of the daily export billofmaterials.json files.
  • billofmaterials.json: A JSON representation of the export contents. This is used by ETL automation to load exported Avro event files into a data warehouse.
  • allevents.avro: All event data, including both Pendo events that are associated with a Page, Feature, or Track Event and events that are not.
  • allfeatures.avro: Description of the application's exported Features.
  • allguides.avro: Description of the application's exported Guides.
  • allpages.avro: Description of the application's exported Pages.
  • alltracktypes.avro: Description of the application's exported Track Events.
  • <feature_id>.avro: All events for the given Feature ID. The Feature ID is the unique identifier found in the Pendo UI.
  • <page_id>.avro: All events for the given Page ID. The Page ID is the unique identifier found in the Pendo UI.
  • <track_id>.avro: All events for the given Track Event ID. The Track Event ID is the unique identifier found in the Pendo UI.

The file names are relative. An absolute file name can be obtained by prepending the rootUrl field from the export manifest to the relative file name. The rootUrl also corresponds to the path of the billofmaterials.json file (excluding its filename).
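For example, joining the manifest's rootUrl with a relative file name from the bill of materials might look like the following sketch (the bucket path shown is illustrative):

```python
def absolute_path(root_url, relative_name):
    """Prepend the export manifest's rootUrl to a relative file name."""
    return root_url.rstrip("/") + "/" + relative_name

root = "gs://pendo-data/datasync/6591622502678528/-323232/0f39bdf6-09c2-4e4d-6d4f-b02c961d8aaf"
path = absolute_path(root, "matchedEvents/Page/OMZ5WpI3HXIhNIIf8Sl_5zJF688.avro")
# path ends with ".../0f39bdf6-09c2-4e4d-6d4f-b02c961d8aaf/matchedEvents/Page/OMZ5WpI3HXIhNIIf8Sl_5zJF688.avro"
```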

Bill of materials

The bill of materials documents the contents of the export. Below is an example of what billofmaterials.json looks like for a Data Sync export.

{
   "timestamp": "2023-02-16T20:21:11Z",
   "numberOfFiles": 65,
   "application": {
     "displayName": "Acme CRM",
     "id": "-323232"
   },
   "subscription": {
     "displayName": "(Demo) Pendo Experience",
     "id": "6591622502678528"
   },
   "pageDefinitionsFile": [
     "allpages.avro"
   ],
   "featureDefinitionsFile": [
     "allfeatures.avro"
   ],
   "trackTypeDefinitionsFile": [
     "alltracktypes.avro"
   ],
   "guideDefinitionsFile": [
     "allguides.avro"
   ],
   "timeDependent": [
     {
       "periodId": 1675728000,
       "allEvents": {
         "eventCount": 9515,
         "files": [
           "allevents.avro"
         ]
       },
       "matchedEvents": [
         {
           "eventCount": 48314,
           "files": [
             "matchedEvents/Page/OMZ5WpI3HXIhNIIf8Sl_5zJF688.avro"
           ],
           "id": "Page/OMZ5WpI3HXIhNIIf8Sl_5zJF688",
           "type": "Page"
          }
        ]
      }
    ]
}

Export manifest

The export manifest is a concatenation of multiple bills of materials, with some additional metadata. The following code snippet is an example of what the export manifest looks like for a Data Sync export.

{
  "exports": [
    {
      ...,
      "exportType": [...],
      "counter": 1,
      "finishTime": "2023-03-03T14:10:15.311651Z",
      "storageSize": 12130815,
      "rootUrl": "gs://pendo-data/datasync/6591622502678528/-323232/0f39bdf6-09c2-4e4d-6d4f-b02c961d8aaf"
    },
    {
      ...,
      "exportType": [...],
      "counter": 2,
      "finishTime": "2023-03-03T14:20:12.9489274",
      "storageSize": 23462682,
      "rootUrl": "gs://pendo-data/datasync/6591622502678528/-323232/b979502c-1a01-4569-74cf-e4a7f5049d8f"
    }
  ],
  "generatedTime": "2023-03-05T04:17:59.853205005Z"
}

exportType can be one of the following:

  • null if the export is a one-time or a recurring export.
  • ["Retroactive"] if the export is a retroactive export.
  • ["Test"] if the export is a test export. This is only possible for non-paying Data Sync customers who execute a single test export.

Example load flow

This example creates a separate table in the data warehouse for each event-type file. You can load data however suits your needs, as long as previously loaded data is replaced correctly.

Step 1. Read the most recent export manifest

Read the most recent exportmanifest.json file to find all unprocessed exports since the last time data was loaded. You can use the counter field as a marker for load progress.
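A sketch of this step using Python's standard library, assuming the manifest has already been downloaded as JSON text. The last_loaded_counter bookkeeping is an assumption; persist it however your pipeline tracks state.

```python
import json

def unprocessed_exports(manifest_text, last_loaded_counter):
    """Return export entries newer than the last loaded counter,
    in load order."""
    manifest = json.loads(manifest_text)
    pending = [e for e in manifest["exports"] if e["counter"] > last_loaded_counter]
    return sorted(pending, key=lambda e: e["counter"])

# Illustrative manifest contents (not real export data).
manifest_text = json.dumps({
    "exports": [
        {"counter": 1, "rootUrl": "gs://example-bucket/datasync/sub/app/run-1"},
        {"counter": 2, "rootUrl": "gs://example-bucket/datasync/sub/app/run-2"},
    ],
    "generatedTime": "2023-03-05T04:17:59Z",
})
pending = unprocessed_exports(manifest_text, last_loaded_counter=1)
# pending contains only the counter-2 export
```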

Step 2. Iterate over each entry in the exports list

Cycle through the entries in the list to process the entries and load them into a table.

  1. Load allpages.

    If the allpages table doesn't exist, create it. If the allpages table exists, drop all data. Then, load all avro files from the list pointed to by the pageDefinitionsFile field into the allpages table.

  2. Load allfeatures.

    If the allfeatures table doesn't exist, create it. If the allfeatures table exists, drop all data. Then, load all avro files from the list pointed to by the featureDefinitionsFile field into the allfeatures table.

  3.  Load alltracktypes.

    If the alltracktypes table doesn't exist, create it. If the alltracktypes table exists, drop all data. Then, load all avro files from the list pointed to by the trackTypeDefinitionsFile field into the alltracktypes table.

  4. Load allguides.

    If the allguides table doesn't exist, create it. If the allguides table exists, drop all data. Then, load all avro files from the list pointed to by the guideDefinitionsFile field into the allguides table.
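The four definition loads above share one pattern: create the table if missing, truncate it, then load every Avro file in the corresponding list. Here is a warehouse-agnostic sketch; the Warehouse class and read_avro_rows callback are illustrative stand-ins, not Pendo or warehouse APIs.

```python
# Bill-of-materials field -> destination table name.
DEFINITION_TABLES = {
    "pageDefinitionsFile": "allpages",
    "featureDefinitionsFile": "allfeatures",
    "trackTypeDefinitionsFile": "alltracktypes",
    "guideDefinitionsFile": "allguides",
}

class Warehouse:
    """In-memory stand-in for a real warehouse client."""
    def __init__(self):
        self.tables = {}
    def create_if_missing(self, table):
        self.tables.setdefault(table, [])
    def truncate(self, table):
        self.tables[table] = []
    def append(self, table, rows):
        self.tables[table].extend(rows)

def load_definitions(warehouse, bill_of_materials, read_avro_rows):
    """Replace each definition table with the files listed in the bill
    of materials. read_avro_rows is whatever function your ETL uses to
    decode one Avro file into rows."""
    for field, table in DEFINITION_TABLES.items():
        warehouse.create_if_missing(table)
        warehouse.truncate(table)  # definition tables are fully replaced
        for file_name in bill_of_materials.get(field, []):
            warehouse.append(table, read_avro_rows(file_name))

wh = Warehouse()
bom = {"pageDefinitionsFile": ["allpages.avro"], "featureDefinitionsFile": []}
load_definitions(wh, bom, read_avro_rows=lambda f: [{"source": f}])
# wh.tables["allpages"] now holds the rows decoded from allpages.avro
```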

Step 3. Iterate over time-dependent events

Iterate over all items in the timeDependent list. Then, load allevents.avro. If the allevents table doesn't exist, create it. If the allevents table exists, drop any event data for the given periodId from the table and append data from the allevents.avro file to the allevents table.

Then iterate over all items in the matchedEvents list. Load events for each event type designated by the id field. If the table for an event type doesn't exist, create it. If the table for an event type exists, drop any event data for the given periodId from the table and append data from all avro files in the files field to the event type table.
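This delete-then-append step can be sketched as follows. The PeriodStore class, the read_avro_rows callback, and the "Page/abc" id are illustrative stand-ins, not Pendo or warehouse APIs. Note that an empty files list still drops the period's rows, which is the intended behavior when updated rules no longer match any event.

```python
class PeriodStore:
    """In-memory stand-in for a warehouse keyed by (table, periodId)."""
    def __init__(self):
        self.rows = {}
    def delete_period(self, table, period_id):
        self.rows.pop((table, period_id), None)
    def append(self, table, new_rows, period_id):
        self.rows.setdefault((table, period_id), []).extend(new_rows)

def load_time_dependent(store, time_dependent, read_avro_rows):
    """Delete-then-append per (table, periodId), following the bill of
    materials' timeDependent block."""
    for entry in time_dependent:
        period_id = entry["periodId"]
        # allevents: drop the period's rows, then append the new file(s).
        store.delete_period("allevents", period_id)
        for file_name in entry.get("allEvents", {}).get("files", []):
            store.append("allevents", read_avro_rows(file_name), period_id)
        # matchedEvents: one table per tagged event type (the id field).
        for match in entry.get("matchedEvents", []):
            store.delete_period(match["id"], period_id)
            for file_name in match.get("files", []):
                store.append(match["id"], read_avro_rows(file_name), period_id)

store = PeriodStore()
# Pre-existing rows that this export's period must replace.
store.append("Page/abc", [{"old": True}], 1675728000)
time_dependent = [{
    "periodId": 1675728000,
    "allEvents": {"files": ["allevents.avro"]},
    "matchedEvents": [{"id": "Page/abc", "files": []}],  # rules now match nothing
}]
load_time_dependent(store, time_dependent, read_avro_rows=lambda f: [{"file": f}])
# The Page's old rows for this period are dropped; allevents is replaced.
```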

Loading data files into your data warehouse

When loading the resulting avro files into your data warehouse, ensure that you replace the previous event-type description files with each export and use logical Avro type mapping.

Loading event-type descriptions

The latest version of the Pendo event-type description files is sent in each export. You must replace this data when loading each Data Sync export. The list of files to be loaded is referenced by the following fields:

  • pageDefinitionsFile
  • featureDefinitionsFile
  • trackTypeDefinitionsFile
  • guideDefinitionsFile

Wherever avro files are represented as a list, it's possible for that list to be empty. This signifies that content for that period should be dropped. This can occur, for example, in the following scenario:

  • A Page was previously tagged and matched a non-zero number of events in a period.
  • The Page’s matching rules got updated.
  • Pendo reprocessed events for that period, and the new rules don't match any event.

Loading time-dependent events

The timeDependent block of the bill of materials contains data that is associated with a particular periodId value. The periodId signifies which logical time period a particular event resides in. This period equates to a day of event data. You will never find the same event in two different periods.
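The periodId values in the example bill of materials (e.g. 1675728000) look like Unix epoch seconds at midnight UTC. Assuming that interpretation, which is not stated explicitly in this article, the period's calendar day can be recovered as:

```python
from datetime import datetime, timezone

def period_to_date(period_id):
    """Convert a periodId (assumed to be Unix epoch seconds) to its UTC date."""
    return datetime.fromtimestamp(period_id, tz=timezone.utc).date()

period_to_date(1675728000)
# -> datetime.date(2023, 2, 7)
```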

Load the event avro files with your data warehouse's option to use logical Avro type mapping. This way, the browserTimestamp value is loaded as a TIMESTAMP data type and periodId is loaded as a DATE data type.
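In BigQuery, for example, this corresponds to enabling use_avro_logical_types on the load job. The following is a sketch assuming the google-cloud-bigquery client library; the destination table and object path are placeholders, not values from your subscription.

```python
from google.cloud import bigquery  # requires the google-cloud-bigquery package

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    use_avro_logical_types=True,  # browserTimestamp -> TIMESTAMP, periodId -> DATE
)
load_job = client.load_table_from_uri(
    "gs://pendo-data/datasync/<subscription-id>/<application-id>/<export-uuid>/allevents.avro",
    "my_dataset.allevents",  # placeholder destination table
    job_config=job_config,
)
load_job.result()  # wait for the load to complete
```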

Before loading any of the avro files in the timeDependent block, you must drop any existing data for the corresponding event type (Page, Feature, or Track) and periodId.

Pendo can send updates for a given <event type>/<periodId>, and if you don't replace the existing data, you can accumulate duplicate data in your data warehouse. Re-sending event data can happen when event-type rule definitions are updated.
