Data Sync export handling

Data Sync allows you to send product data directly from Pendo to your preferred cloud storage destination. From there, it can be automatically pulled into your centralized data lake or warehouse on a regular cadence.

This article is written to help you understand the flow of Data Sync and to prepare you for implementation. Specifically, this article:

  • Outlines prerequisite steps.
  • Explains the flow and frequency of exports.
  • Clarifies the hierarchy and purposes of files contained within an export.
  • Details how to load an export into your data lake or warehouse.

Prerequisites

Setup

Broadly, setting up Data Sync involves the following tasks:

  1. Create and manage destinations
  2. Create exports

Create and manage destinations

Setting up Data Sync involves configuring a cloud storage destination and defining this in Pendo. The exact process and prerequisites depend on the destination you choose.

After you’ve configured a destination, you can find and optionally update the bucket path by selecting Manage destination from the Data Sync page.

Settings_DataSync_ManageDestination.png

Create exports

After your destination is saved (see Create and manage destinations), Data Sync customers can create one of two types of exports:

  • A daily Recurring export, which sends two distinct days' worth of data: yesterday's data and the day of finalized data from approximately 9 days ago. For more information about processing finalized data, see Finalized days.
  • A historical One-time export, which sends up to three calendar years of data. During implementation of Data Sync, generating smaller one-time exports of just a few days of data can help you understand the data structure.

Note: If Data Sync isn't yet included in your Pendo subscription, you can create a one-day Test export instead, which is available as a trial feature. For more information, see Overview of Pendo Data Sync.

Pendo also automatically creates a retroactive export when required. For more information, see Retroactive processing in this article.

To create an export:

  1. Go to Settings > Data Sync and select + Create export in the top-right corner of the page.

    DS_CreateExport.png

  2. In the screen that opens, enter a meaningful name, choose an application to export, choose the type of export you want to create, and choose a date range.

    Create Export Step 1.png

  3. Select Next: Export Summary.
  4. Check the summary and then select Create export.

Each export is specific to one application in a subscription and is delivered to the path configured as the cloud storage destination in Pendo. For example, if the destination path is gs://pendo-data/abcde, Pendo creates a datasync folder at gs://pendo-data/abcde/datasync. Exports are then delivered to gs://pendo-data/abcde/datasync/<subscription-id>/<application-id>.

If multiple Pendo subscriptions export to the same destination, they appear in the datasync folder, split into their respective subscription IDs.
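The path layout described above can be sketched as a small helper. This is a minimal illustration only: the function name export_prefix is hypothetical and isn't part of any Pendo tooling, and the subscription and application IDs are taken from the sample files later in this article.

```python
def export_prefix(destination: str, subscription_id: str, application_id: str) -> str:
    """Build the folder where exports for one application are delivered.

    Hypothetical helper that follows the layout
    <destination>/datasync/<subscription-id>/<application-id>.
    """
    # Strip a trailing slash so the joined path has no empty segment.
    base = destination.rstrip("/")
    return f"{base}/datasync/{subscription_id}/{application_id}"


print(export_prefix("gs://pendo-data/abcde", "6591622502678528", "-323232"))
```

Keeping this logic in one place makes it easier to locate the exportmanifest.json for each application when several subscriptions share a destination.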

Exports

Exports to the datasync folder in your cloud storage consist of a folder for each application. Each application folder contains:

  • An export manifest (exportmanifest.json in the File hierarchy) that references files in the folder for each part of the export and updates after an export completes for an application. The export manifest covers a rolling period of exports generated within the last 30 days.
  • A unique hashed folder (export-uuid in the File hierarchy) consisting of a bill of materials file for the most recent export, and event data or business object metadata in avro files.

File names alone don't fully explain their contents, so it's also important to understand the file hierarchy and the purpose of each file, which the following sections describe.

File hierarchy

The following diagram provides an overview of the structure of files included in an export. For more details about the information contained in these files, see File descriptions in this article, and the Data Sync schema definitions article.

gs://pendo-data/datasync/<subscription-id>/<application-id>/

├── exportmanifest.json
└── <export-uuid>/
   ├── billofmaterials.json
   ├── allevents.avro
   ├── allfeatures.avro
   ├── allguides.avro
   ├── allpages.avro
   ├── alltracktypes.avro
   └── matchedEvents/
       ├── Feature/
       │   └── <feature_id>.avro
       ├── Page/
       │   └── <page_id>.avro
       └── TrackType/
           └── <track_id>.avro

 

This structure of files provides the following information, which is also depicted in the following color-coded image:

Screenshot 2024-05-07 at 13.41.09.png

Unique identifiers

You can extract more information about the application and subscription from Pendo’s Aggregations API using the unique identifiers in the path at the top of the hierarchy: subscription-id and application-id.

A folder of event data, which is named according to the unique export identifier, exists at the same level in the hierarchy as the export manifest (exportmanifest.json).

Export management files

There are two lists of export contents:

  • The export manifest (exportmanifest.json), which is a concatenated list of daily bills of materials. For more information, see Export manifest in this article.
  • The daily bill of materials (billofmaterials.json) within the individual export, which is the JSON representation of the export's contents used by ETL automation to load exported avro event files into a data warehouse or data lake. For more information, see Bill of materials in this article.

Event files

There are two types of event data files:

  • The allevents.avro file, which contains all event data, including events not matched to Page or Feature rules, and including all guide events.
  • Individual avro files for each defined Page, Feature, or Track Event located in the matchedEvents/ folder. Matchable IDs correspond to the identifiers available in the definition files and in the URL when viewing the Page, Feature, or Track Event details from the Pendo application.

Each event also corresponds to a Visitor ID and Account ID.

Definition files

Find business object metadata in avro files for Features, Pages, Guides, and Track Events alongside the billofmaterials.json and allevents.avro. These files can be referenced to provide more details about an event using the matchableId.

File descriptions

The following table provides a description of the files contained in an export. For more information about the data included in exported files, see Data Sync schema definitions.

The file names are relative. An absolute file name can be obtained by prepending the rootUrl field from the export manifest to the relative file name. The rootUrl also corresponds to the path of the billofmaterials.json file (excluding its filename).
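As a rough sketch of how a relative file name can be resolved, the following helper joins the rootUrl value with a relative name. The function name absolute_file_name is hypothetical; the sample values come from the examples in this article.

```python
def absolute_file_name(root_url: str, relative_name: str) -> str:
    """Prepend an export's rootUrl to a relative file name from the bill of materials."""
    # Normalize slashes so exactly one separator sits between the two parts.
    return f"{root_url.rstrip('/')}/{relative_name.lstrip('/')}"


root_url = "gs://pendo-data/datasync/6591622502678528/-323232/0f39bdf6-09c2-4e4d-6d4f-b02c961d8aaf"
print(absolute_file_name(root_url, "matchedEvents/Page/OMZ5WpI3HXIhNIIf8Sl_5zJF688.avro"))
```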

File sizes are capped at approximately 3 GB. If a file would exceed this limit, it's automatically split, and the remaining data for the same period ID is delivered in an additional file.

| Content type | File name | Description |
| --- | --- | --- |
| Export management file: Record of files over several exports, plus some metadata. | exportmanifest.json | Contains a concatenated list of files for each part of the daily export, covering a rolling window of exports generated within the last 30 days. The counter increments with each new export received; zero (0) means there's nothing to do and the content can be deleted. |
| Export management file: Record of export contents. | billofmaterials.json | A JSON representation of the export contents. This is used by ETL automation to load exported avro event files into a data warehouse. |
| Event file. | allevents.avro | All event data. This includes both Pendo events that are associated with a Page, Feature, or Track Event and events that are not associated. |
| Definition files: Additional details about events (not event data itself). | allfeatures.avro | List of the application's exported Features and additional metadata for each Feature based on matchableID. |
| Definition files | allguides.avro | List of the application's exported guides and additional metadata for each guide based on matchableID. |
| Definition files | allpages.avro | List of the application's exported Pages and additional metadata for each Page based on matchableID. |
| Definition files | alltracktypes.avro | List of the application's exported Track Events and additional metadata for each Track Event based on matchableID. |
| Event files: Event data separated by Pages, Features, and Track Events. Events that don't match a definition are excluded. | <feature_id>.avro | All events for the given Feature ID. This Feature ID value is the unique identifier that is found in the URL of the Pendo app when viewing a Feature's details. |
| Event files | <page_id>.avro | All events for the given Page ID. This Page ID value is the unique identifier that is found in the URL of the Pendo app when viewing a Page's details. |
| Event files | <track_id>.avro | All events for the given Track Event ID. This Track Event ID value is the unique identifier that is found in the URL of the Pendo app when viewing a Track Event's details. |

Bill of materials

The bill of materials documents the contents of the export. The following code snippet is an example of what the billofmaterials.json looks like for a Data Sync export.

{
   "timestamp": "2023-02-16T20:21:11Z",
   "numberOfFiles": 65,
   "application": {
     "displayName": "Acme CRM",
     "id": "-323232"
   },
   "subscription": {
     "displayName": "(Demo) Pendo Experience",
     "id": "6591622502678528"
   },
   "pageDefinitionsFile": [
     "allpages.avro"
   ],
   "featureDefinitionsFile": [
     "allfeatures.avro"
   ],
   "trackTypeDefinitionsFile": [
     "alltracktypes.avro"
   ],
   "guideDefinitionsFile": [
     "allguides.avro"
   ],
   "timeDependent": [
     {
       "periodId": "2023-02-22T00:00:00Z",
       "allEvents": {
         "eventCount": 9515,
         "files": [
           "allevents.avro"
         ]
       },
       "matchedEvents": [
         {
           "eventCount": 48314,
           "files": [
             "matchedEvents/Page/OMZ5WpI3HXIhNIIf8Sl_5zJF688.avro"
           ],
           "id": "Page/OMZ5WpI3HXIhNIIf8Sl_5zJF688",
           "type": "Page"
         }
         // remaining matchedEvents entries omitted for brevity
       ]
     }
   ]
}

Export manifest

The export manifest is a key file for reading and ingesting exports. The export manifest is a concatenation of multiple bills of materials, with some additional metadata. It consists of a rolling record of the past 30-day period of Data Sync activity, regardless of which dates of data are exported.

While the bill of materials provides details of everything in a single export, the export manifest operates at a higher level, allowing you to keep track of what’s happening with all of your exports over time. This allows you to iterate through the counters, which is important for updates to exported data, where previously exported files might change.

The following code snippet is an example of what the exportmanifest.json looks like for a Data Sync export, excluding parts that overlap with the billofmaterials.json.

{
  "exports": [
    {
      // complete billofmaterials object present but omitted for brevity
      "exportType": [...],
      "counter": 1,
      "finishTime": "2023-03-03T14:10:15.311651Z",
      "storageSize": 12130815,
      "rootUrl": "gs://pendo-data/datasync/6591622502678528/-323232/0f39bdf6-09c2-4e4d-6d4f-b02c961d8aaf"
    },
    {
      // complete billofmaterials object present but omitted for brevity
      "exportType": [...],
      "counter": 2,
      "finishTime": "2023-03-03T14:20:12.9489274",
      "storageSize": 23462682,
      "rootUrl": "gs://pendo-data/datasync/6591622502678528/-323232/b979502c-1a01-4569-74cf-e4a7f5049d8f"
    }
  ],
  "generatedTime": "2023-03-05T04:17:59.853205005Z"
}

exportType can be one of the following:

  • null if the export is a one-time or a recurring export.
  • ["Retroactive"] if the export is a retroactive export.
  • ["Test"] if the export is a test export. This is only possible for non-paying Data Sync customers who run a single test export.

The export manifest only reflects exports and the subsequent files that have been completely loaded into your cloud storage. There is never a partial export in the export manifest. An export listed in the export manifest is always a complete export, although it can contain unfinalized data, requiring replacement later.
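The exportType values can be interpreted with a few lines of code. The following is a minimal sketch, assuming the manifest has already been downloaded and parsed; the function name export_kind and the label "standard" are our own and don't appear in Pendo's files, and the sample manifest is trimmed to the fields used here.

```python
import json


def export_kind(entry: dict) -> str:
    """Return a label for one entry in the manifest's exports list."""
    export_type = entry.get("exportType")
    if not export_type:
        # null means a one-time or recurring export; "standard" is our own label.
        return "standard"
    # Otherwise the value is a list such as ["Retroactive"] or ["Test"].
    return export_type[0]


sample_manifest = json.loads("""
{
  "exports": [
    {"counter": 1, "exportType": null},
    {"counter": 2, "exportType": ["Retroactive"]}
  ]
}
""")

for entry in sample_manifest["exports"]:
    print(entry["counter"], export_kind(entry))
```

A label like this can be useful for logging, because retroactive exports contain only a subset of files and may be worth tracking separately.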

Updates to exported data

There are two cases in which Pendo automatically re-exports data for previously exported days: finalized days and retroactive processing. These exports can be handled the same way as regular exports because the directory structure and schema remain the same. To avoid ingesting a duplicate day (period ID), make sure that your ETL process includes the appropriate "drop and replace" logic.

Finalized days

Pendo event data is finalized after approximately 9 days (the exact number of days before finalizing depends on your local time zone).

Until data is finalized, small updates can occur as Pendo collects additional data, for example, if session information from yesterday isn't captured until today because a user left their browser window open. These changes are handled seamlessly in the Pendo application. Though the volume of changes is minimal, Data Sync accommodates those changes with automatic exports. This ensures that Pendo data in your data warehouse stays aligned with Pendo data in the Pendo application over time.

To reflect these changes in Data Sync exports, every daily recurring export includes two unique exports:

  • One for yesterday's data, which is unfinalized.
  • One for finalized data, typically 8-10 days old.

For example:

| Date of export | Yesterday's data | Finalized data |
| --- | --- | --- |
| April 15, 2024 | April 14, 2024 | From April 7, 2024, which replaces an unfinalized export received April 8, 2024 |

The following changes to data could occur between yesterday’s export and a finalized export:

  • An event that doesn't appear in yesterday’s data could appear in finalized data.
  • An event that appears in yesterday’s data could be absent from finalized data.
  • An event that's only in allevents.avro in the unfinalized data could appear in matchedEvents finalized data.

The browserTimestamp for a given event doesn't change between unfinalized and finalized data.

Retroactive processing

When rules for Features and Pages are added or updated, Pendo applies new or updated tags to the lifetime of your data since Pendo was installed. 

To keep your previously exported data up-to-date, we initiate a retroactive export once a day when rules are added or changed. This re-exports data for the relevant Pages and Features for any previous export days. The schema of these exports is the same as a regular export, but there is no allevents.avro file included because retroactive exports only re-export event data for the reprocessed Feature or Page. The definition files for these Features and Pages (such as allpages.avro) are included.

If you receive data for a period that was already loaded into your warehouse, you must fully drop the corresponding event data for the period ID before loading in the new data. This ensures that you don't pull in duplicate data. The same logic applies to both finalized data and retroactive exports. The only difference is that retroactive exports provide a smaller subset of just the relevant data for that change. For more information, see Step 3. Iterate over time-dependent events.

Example load flow

This example creates a separate table in the data warehouse for each event type file. You can load data to suit your needs as long as the data is replaced correctly.

Step 1. Read the most recent export manifest

Read the most recent exportmanifest.json file to find all unprocessed exports since the last time data was loaded. You can use the counter field as a marker for load progress.
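One way to use the counter field as a progress marker is sketched below. This assumes that your ETL process stores the highest counter it has already loaded (last_counter) and that counters increase with each new export; the function name unprocessed_exports is hypothetical.

```python
def unprocessed_exports(manifest: dict, last_counter: int) -> list[dict]:
    """Return manifest entries newer than the last loaded counter, oldest first."""
    pending = [
        entry for entry in manifest.get("exports", [])
        if entry["counter"] > last_counter
    ]
    # Process in counter order so later exports replace earlier ones.
    return sorted(pending, key=lambda entry: entry["counter"])


manifest = {"exports": [{"counter": 2}, {"counter": 1}, {"counter": 3}]}
for entry in unprocessed_exports(manifest, last_counter=1):
    print(entry["counter"])
```

After each export loads successfully, the stored marker would be updated to that export's counter so that a failed run can resume from the right place.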

Step 2. Iterate over each entry in the exports list

Cycle through the entries in the list to process the entries and load them into a table.

The latest versions of the Pendo event-type description files are sent in each export. When loading these files into your data warehouse, you must replace the previous event-type description files with each export and use logical avro type mapping.

  1. Load allpages.

    If the allpages table doesn't exist, create it. If the allpages table exists, drop all data. Then, load all avro files from the list pointed to by the pageDefinitionsFile field into the allpages table.

  2. Load allfeatures.

    If the allfeatures table doesn't exist, create it. If the allfeatures table exists, drop all data. Then, load all avro files from the list pointed to by the featureDefinitionsFile field into the allfeatures table.

  3. Load alltracktypes.

    If the alltracktypes table doesn't exist, create it. If the alltracktypes table exists, drop all data. Then, load all avro files from the list pointed to by the trackTypeDefinitionsFile field into the alltracktypes table.

  4. Load allguides.

    If the allguides table doesn't exist, create it. If the allguides table exists, drop all data. Then, load all avro files from the list pointed to by the guideDefinitionsFile field into the allguides table.
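The four loads above follow the same pattern, which can be sketched as a single loop. This is an illustration under several assumptions: an in-memory SQLite database stands in for your data warehouse, the id and name columns are placeholders rather than the real schema (see Data Sync schema definitions), and read_rows is a hypothetical callable that you supply to download and decode an avro file into dictionaries.

```python
import sqlite3

# Table name mapped to the bill of materials field that lists its files.
DEFINITION_FIELDS = {
    "allpages": "pageDefinitionsFile",
    "allfeatures": "featureDefinitionsFile",
    "alltracktypes": "trackTypeDefinitionsFile",
    "allguides": "guideDefinitionsFile",
}


def reload_definitions(conn, bom, root_url, read_rows):
    """Create each definition table if needed, empty it, and reload it."""
    for table, field in DEFINITION_FIELDS.items():
        conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (id TEXT, name TEXT)")
        conn.execute(f"DELETE FROM {table}")  # drop all existing data
        for relative_name in bom.get(field, []):
            rows = read_rows(root_url, relative_name)
            conn.executemany(
                f"INSERT INTO {table} (id, name) VALUES (?, ?)",
                [(row["id"], row["name"]) for row in rows],
            )
    conn.commit()


def fake_reader(root_url, relative_name):
    # Stand-in for real avro decoding; returns one placeholder row per file.
    return [{"id": "example-id", "name": relative_name}]


conn = sqlite3.connect(":memory:")
bom = {"pageDefinitionsFile": ["allpages.avro"]}
reload_definitions(conn, bom, "gs://pendo-data/datasync/example", fake_reader)
reload_definitions(conn, bom, "gs://pendo-data/datasync/example", fake_reader)
print(conn.execute("SELECT COUNT(*) FROM allpages").fetchone()[0])  # prints 1
```

Running the load twice leaves a single row in allpages, which shows that the definitions are replaced rather than appended.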

Step 3. Iterate over time-dependent events

The timeDependent block in the bill of materials contains data that's associated with a particular periodId value. The periodId signifies which logical time period a particular event resides in. This period equates to a day of event data. The same event is never in two different periods.

Iterate over all items in the timeDependent list. Then, load allevents.avro with logical type mapping. If the allevents table doesn't exist, create it. If the allevents table exists, drop any event data for the given periodId from the table and append data from the allevents.avro file to the allevents table. This is to avoid duplication.

Then iterate over all items in the matchedEvents list. Load events for each event type designated by the id field. If the table of an event type doesn't exist, create it. If the table for an event type exists, drop any event data for the given periodId from the table and append data from all avro files in the files field to the event type table.
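The drop-and-replace logic for time-dependent events might look like the following sketch. As assumptions, an in-memory SQLite database stands in for the warehouse, the three columns are placeholders (visitorId is an assumed field name; the real fields are listed in Data Sync schema definitions), and read_rows is a hypothetical callable that decodes one avro file into dictionaries. The check for a missing allEvents entry is defensive, because retroactive exports don't include allevents.avro.

```python
import sqlite3


def replace_period(conn, table, period_id, rows):
    """Drop any existing rows for the period, then append the new rows."""
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{table}" '
        "(period_id TEXT, visitor_id TEXT, browser_timestamp INTEGER)"
    )
    conn.execute(f'DELETE FROM "{table}" WHERE period_id = ?', (period_id,))
    conn.executemany(
        f'INSERT INTO "{table}" VALUES (?, ?, ?)',
        [(period_id, r["visitorId"], r["browserTimestamp"]) for r in rows],
    )


def load_time_dependent(conn, bom, read_rows):
    """Load allevents and each matched event type for every period in the export."""
    for period in bom.get("timeDependent", []):
        period_id = period["periodId"]
        all_events = period.get("allEvents")
        if all_events:
            # Gather rows from every file first, since large periods can be split.
            rows = [row for name in all_events["files"] for row in read_rows(name)]
            replace_period(conn, "allevents", period_id, rows)
        for matched in period.get("matchedEvents", []):
            rows = [row for name in matched["files"] for row in read_rows(name)]
            replace_period(conn, matched["id"], period_id, rows)
    conn.commit()


def fake_reader(name):
    # Stand-in for real avro decoding.
    return [
        {"visitorId": "v1", "browserTimestamp": 1},
        {"visitorId": "v2", "browserTimestamp": 2},
    ]


conn = sqlite3.connect(":memory:")
bom = {"timeDependent": [{
    "periodId": "2023-02-22T00:00:00Z",
    "allEvents": {"files": ["allevents.avro"]},
    "matchedEvents": [{"id": "Page/abc", "type": "Page",
                       "files": ["matchedEvents/Page/abc.avro"]}],
}]}
load_time_dependent(conn, bom, fake_reader)
load_time_dependent(conn, bom, fake_reader)  # simulates a re-export of the same day
print(conn.execute("SELECT COUNT(*) FROM allevents").fetchone()[0])  # prints 2
```

Because the delete is scoped to one periodId, loading a finalized or retroactive export for a day replaces only that day's rows and leaves other days untouched.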

 
