Developing ETL Integrations for Adobe Experience Platform
The ETL integration guide outlines general steps for creating high-performance, secure connectors for Experience Platform and ingesting data into Platform.
This guide also includes sample API calls to use when designing an ETL connector, with links to documentation that outlines each Experience Platform service, and use of its API, in more detail.
A sample integration is available on GitHub via the ETL Ecosystem Integration Reference Code under the Apache License Version 2.0.
Workflow
The following workflow diagram provides a high-level overview for the integration of Adobe Experience Platform components with an ETL application and connector.
Adobe Experience Platform components
There are multiple Experience Platform components involved in ETL connector integrations. The following list outlines several key components and functionalities:
- Adobe Identity Management System (IMS) - Provides framework for authentication to Adobe services.
- IMS Organization - A corporate entity that can own or license products and services and allow access to its members.
- IMS User - Members of an IMS Organization. The Organization to User relationship is many to many.
- Sandbox - A virtual partition of a single Platform instance, used to help develop and evolve digital experience applications.
- Data Discovery - Records the metadata of ingested and transformed data in Experience Platform.
- Data Access - Provides users with an interface to access their data in Experience Platform.
- Data Ingestion - Pushes data to Experience Platform with Data Ingestion APIs.
- Schema Registry - Defines and stores schemas that describe the structure of data to be used in Experience Platform.
Getting started with Experience Platform APIs
The following sections provide additional information that you will need to know or have on-hand in order to successfully make calls to Experience Platform APIs.
Reading sample API calls
This guide provides example API calls to demonstrate how to format your requests. These include paths, required headers, and properly formatted request payloads. Sample JSON returned in API responses is also provided. For information on the conventions used in documentation for sample API calls, see the section on how to read example API calls in the Experience Platform troubleshooting guide.
Gather values for required headers
In order to make calls to Platform APIs, you must first complete the authentication tutorial. Completing the authentication tutorial provides the values for each of the required headers in all Experience Platform API calls, as shown below:
- Authorization: Bearer {ACCESS_TOKEN}
- x-api-key: {API_KEY}
- x-gw-ims-org-id: {ORG_ID}
All resources in Experience Platform are isolated to specific virtual sandboxes. All requests to Platform APIs require a header that specifies the name of the sandbox the operation will take place in:
- x-sandbox-name: {SANDBOX_NAME}
All requests that contain a payload (POST, PUT, PATCH) require an additional header:
- Content-Type: application/json
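For illustration, the required headers can be collected once and reused across calls. The following is a minimal Python sketch (not part of the Adobe documentation); the placeholder values come from completing the authentication tutorial:

# Base headers shared by every Experience Platform API call in this guide.
# Replace the placeholders with the values from the authentication tutorial.
BASE_HEADERS = {
    "Authorization": "Bearer {ACCESS_TOKEN}",
    "x-api-key": "{API_KEY}",
    "x-gw-ims-org-id": "{ORG_ID}",
    "x-sandbox-name": "{SANDBOX_NAME}",
}

# Requests that contain a payload (POST, PUT, PATCH) add the Content-Type header.
JSON_HEADERS = {**BASE_HEADERS, "Content-Type": "application/json"}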
General user flow
To begin, an ETL user logs into the Experience Platform user interface (UI) and creates datasets for ingestion using a standard connector or push-service connector.
In the UI, the user creates the output dataset by selecting a dataset schema. The choice of schema depends on the type of data (record or time series) being ingested into Platform. By clicking on the Schemas tab within the UI, the user will be able to view all available schemas, including the behavior type that the schema supports.
In the ETL tool, the user will start designing their mapping transforms after configuring the appropriate connection (using their credentials). The ETL tool is assumed to already have Experience Platform connectors installed (process not defined in this Integration Guide).
Mockups for a sample ETL tool and workflow have been provided in the ETL workflow. While ETL tools may differ in format, most expose similar functionality.
View list of datasets
To select the source of data for mapping, a list of all available datasets can be fetched using the Catalog API.
You can issue a single API request to view all available datasets (e.g. GET /dataSets), but best practice is to include query parameters that limit the size of the response.
In cases where full dataset information is requested, the response payload can exceed 3GB in size, which can slow overall performance. Therefore, using query parameters to filter only the information needed makes Catalog queries more efficient.
List filtering
When filtering responses, you can use multiple filters in a single call by separating parameters with an ampersand (&). Some query parameters accept comma-separated lists of values, such as the “properties” filter in the sample request below.
Catalog responses are automatically metered according to configured limits; however, the “limit” query parameter can be used to customize the constraints and limit the number of objects returned. The pre-configured Catalog response limits are:
- If a limit parameter is not specified, the maximum number of objects per response payload is 20.
- The global limit for all other Catalog queries is 100 objects.
- For dataset queries, if observableSchema is requested using the properties query parameter, the maximum number of datasets returned is 20.
- Invalid limit parameters (including limit=0) are met with an HTTP 400 error that outlines proper ranges.
- If limits or offsets are passed as query parameters, they take precedence over those passed as headers.
Query parameters are covered in more detail in the Catalog Service overview.
API format
GET /catalog/dataSets
GET /catalog/dataSets?{filter1}={value1},{value2}&{filter2}={value3}
Request
curl -X GET "https://platform.adobe.io/data/foundation/catalog/dataSets?limit=3&properties=name,description,schemaRef" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}"
Please refer to the Catalog Service overview for detailed examples of how to make calls to the Catalog API.
Response
The response includes three (limit=3) datasets showing the “name”, “description”, and “schemaRef”, as indicated by the properties query parameter.
{
"5b95b155419ec801e6eee780": {
"name": "Store Transactions",
"description": "Retails Store Transactions",
"schemaRef": {
"id": "https://ns.adobe.com/{TENANT_ID}/schemas/274f17bc5807ff307a046bab1489fb18",
"contentType": "application/vnd.adobe.xed+json;version=1"
}
},
"5c351fa2f5fee300000fa9e8": {
"name": "Loyalty Members",
"description": "Loyalty Program Members",
"schemaRef": {
"id": "https://ns.adobe.com/{TENANT_ID}/schemas/fbc52b243d04b5d4f41eaa72a8ba58be",
"contentType": "application/vnd.adobe.xed+json;version=1"
}
},
"5c1823b19e6f400000993885": {
"name": "Web Traffic",
"description": "Retail Web Traffic",
"schemaRef": {
"id": "https://ns.adobe.com/{TENANT_ID}/schemas/2025a705890c6d4a4a06b16f8cf6f4ca",
"contentType": "application/vnd.adobe.xed+json;version=1"
}
}
}
View dataset schema
The “schemaRef” property of a dataset contains a URI referencing the XDM schema upon which the dataset is based. The XDM schema (“schemaRef”) represents all potential fields that could be used by the dataset, not necessarily the fields that are being used (see “observableSchema” below).
The XDM schema is the schema you use when you need to present the user with a list of all available fields that could be written to.
The first “schemaRef.id” value in the previous response object (https://ns.adobe.com/{TENANT_ID}/schemas/274f17bc5807ff307a046bab1489fb18) is a URI that points to a specific XDM schema in the Schema Registry. The schema can be retrieved by making a lookup (GET) request to the Schema Registry API.
Note: The deprecated “schema” property was not included in the properties query parameter in the previous call. More details on the “schema” property are available in the Dataset “schema” property section that follows.
API format
GET /schemaregistry/tenant/schemas/{url encoded schemaRef.id}
Request
The request uses the URL-encoded id URI of the schema (the value of the “schemaRef.id” attribute) and requires an Accept header.
curl -X GET \
https://platform.adobe.io/data/foundation/schemaregistry/tenant/schemas/https%3A%2F%2Fns.adobe.com%2F{TENANT_ID}%2Fschemas%2F274f17bc5807ff307a046bab1489fb18 \
-H 'Authorization: Bearer {ACCESS_TOKEN}' \
-H 'x-api-key: {API_KEY}' \
-H 'x-gw-ims-org-id: {ORG_ID}' \
-H 'x-sandbox-name: {SANDBOX_NAME}' \
-H 'Accept: application/vnd.adobe.xed-full+json; version=1'
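The same lookup can be made programmatically; the main detail is URL encoding the schemaRef.id value before placing it in the path. A minimal Python sketch using urllib and requests (illustrative only, not part of the Adobe reference code):

import urllib.parse
import requests

BASE_HEADERS = {"Authorization": "Bearer {ACCESS_TOKEN}", "x-api-key": "{API_KEY}",
                "x-gw-ims-org-id": "{ORG_ID}", "x-sandbox-name": "{SANDBOX_NAME}"}

schema_id = "https://ns.adobe.com/{TENANT_ID}/schemas/274f17bc5807ff307a046bab1489fb18"
encoded_id = urllib.parse.quote(schema_id, safe="")  # https%3A%2F%2Fns.adobe.com%2F...

response = requests.get(
    f"https://platform.adobe.io/data/foundation/schemaregistry/tenant/schemas/{encoded_id}",
    headers={**BASE_HEADERS, "Accept": "application/vnd.adobe.xed-full+json; version=1"},
)
response.raise_for_status()
xdm_schema = response.json()  # the full JSON schema, with fields nested under "properties"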
The response format depends on the type of Accept header sent in the request. Lookup requests also require a version to be included in the Accept header. The following Accept headers are available for lookups:
- application/vnd.adobe.xed-id+json
- application/vnd.adobe.xed-full+json; version={major version}
- application/vnd.adobe.xed+json; version={major version}
- application/vnd.adobe.xed-notext+json; version={major version}
- application/vnd.adobe.xed-full-notext+json; version={major version}
- application/vnd.adobe.xed-full-desc+json; version={major version}
application/vnd.adobe.xed-id+json and application/vnd.adobe.xed-full+json; version={major version} are the most commonly used Accept headers. application/vnd.adobe.xed-id+json is preferred for listing resources in the Schema Registry as it returns only the “title”, “id”, and “version”. application/vnd.adobe.xed-full+json; version={major version} is preferred for viewing a specific resource (by its “id”), as it returns all fields (nested under “properties”), as well as titles and descriptions.
Response
The JSON schema that is returned describes the structure and field-level information (“type”, “format”, “minimum”, “maximum”, etc.) of the data, serialized as JSON. If using a serialization format other than JSON for ingestion (such as Parquet or Scala), the Schema Registry Guide contains a table showing the desired JSON type (“meta:xdmType”) and its corresponding representation in other formats.
Along with this table, the Schema Registry Developer Guide contains in-depth examples of all possible calls that can be made using the Schema Registry API.
Dataset “schema” property (DEPRECATED - EOL 2019-05-30)
Datasets may contain a “schema” property that is now deprecated and remains available temporarily for backwards compatibility. For example, a listing (GET) request similar to the one made previously, where “schema” is substituted for “schemaRef” in the properties query parameter, might return the following:
{
"5ba9452f7de80400007fc52a": {
"name": "Sample Dataset 1",
"description": "Description of Sample Dataset 1.",
"schema": "@/xdms/context/person"
}
}
If the “schema” property of a dataset is populated, this signals that the schema is a deprecated /xdms
schema and, where supported, the ETL connector should use the value in the “schema” property with the /xdms
endpoint (a deprecated endpoint in the Catalog API) to retrieve the legacy schema.
API format
GET /catalog/{"schema" property without the "@"}
Request
curl -X GET "https://platform.adobe.io/data/foundation/catalog/xdms/context/person?expansion=xdm" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"
The query parameter appended to the request above, expansion=xdm, tells the API to fully expand and in-line any referenced schemas. You may want to do this when presenting a list of all potential fields to the user.
Response
Similar to the steps for viewing dataset schema, the response contains a JSON schema that describes the structure and field-level information of the data, serialized as JSON.
The “observableSchema” property
The “observableSchema” property of a dataset has a JSON structure matching that of the XDM schema JSON. The “observableSchema” contains the fields that were present in the incoming input files. When writing data to Experience Platform, a user is not required to use every field from the target schema. Instead they should supply only those fields that are being used.
The observable schema is the schema that you would use if reading the data or presenting a list of fields that are available to read/map from.
{
"598d6e81b2745f000015edcb": {
"observableSchema": {
"type": "object",
"meta:xdmType": "object",
"properties": {
"name": {
"type": "string",
},
"age": {
"type": "string",
}
}
}
}
}
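When presenting fields to map from, the nested “properties” structure of the observable schema (or of the full XDM schema) can be flattened into dot-notation paths. A minimal Python sketch of such a walk (illustrative only):

def flatten_fields(schema_node, prefix=""):
    # Recursively collect dot-notation paths and types for leaf fields
    # in a schema's "properties" structure.
    fields = []
    for name, node in schema_node.get("properties", {}).items():
        path = prefix + name
        if node.get("type") == "object" and "properties" in node:
            fields.extend(flatten_fields(node, prefix=path + "."))
        else:
            fields.append((path, node.get("type")))
    return fields

observable_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "string"}},
}
print(flatten_fields(observable_schema))  # [('name', 'string'), ('age', 'string')]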
Preview data
The ETL application may provide a capability to preview data (“Figure 8” in the ETL Workflow). The data access API provides several options to preview data.
Additional information, including step-by-step guidance for previewing data using the data access API, can be found in the data access tutorial.
Get dataset details using the “properties” query parameter
As shown in the steps above to view a list of datasets, you can request “files” using the “properties” query parameter.
You can refer to the Catalog Service overview for detailed information on querying datasets and available response filters.
API format
GET /catalog/dataSets?limit={value}&properties={value}
Request
curl -X GET "https://platform.adobe.io/data/foundation/catalog/dataSets?limit=1&properties=files" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}"
Response
The response will include one dataset (limit=1) showing the “files” property.
{
"5bf479a6a8c862000050e3c7": {
"files": "@/dataSetFiles?dataSetId=5bf479a6a8c862000050e3c7"
}
}
List dataset files using the “files” attribute
You can also use a GET request to fetch file details using the “files” attribute.
API format
GET /catalog/dataSets/{DATASET_ID}/views/{VIEW_ID}/files
Request
curl -X GET "https://platform.adobe.io/data/foundation/catalog/dataSets/5bf479a6a8c862000050e3c7/views/5bf479a654f52014cfffe7f1/files" \
-H "Accept: application/json" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"
Response
The response includes each Dataset File ID as a top-level property, with the details of that file contained within the Dataset File ID object.
{
"194e89b976494c9c8113b968c27c1472-1": {
"batchId": "194e89b976494c9c8113b968c27c1472",
"dataSetViewId": "5bf479a654f52014cfffe7f1",
"imsOrg": "{ORG_ID}",
"availableDates": {},
"createdUser": "{USER_ID}",
"createdClient": "{API_KEY}",
"updatedUser": "{USER_ID}",
"version": "1.0.0",
"created": 1542749145828,
"updated": 1542749145828
},
"14d5758c107443e1a83c714e56ca79d0-1": {
"batchId": "14d5758c107443e1a83c714e56ca79d0",
"dataSetViewId": "5bf479a654f52014cfffe7f1",
"imsOrg": "{ORG_ID}",
"availableDates": {},
"createdUser": "{USER_ID}",
"createdClient": "{API_KEY}",
"updatedUser": "{USER_ID}",
"version": "1.0.0",
"created": 1542752699111,
"updated": 1542752699111
},
"ea40946ac03140ec8ac4f25da360620a-1": {
"batchId": "ea40946ac03140ec8ac4f25da360620a",
"dataSetViewId": "5bf479a654f52014cfffe7f1",
"imsOrg": "{ORG_ID}",
"availableDates": {},
"createdUser": "{USER_ID}",
"createdClient": "{API_KEY}",
"updatedUser": "{USER_ID}",
"version": "1.0.0",
"created": 1542756935535,
"updated": 1542756935535
}
}
Fetch file details
The dataset file IDs returned in the previous response can be used in a GET request to fetch further file details via the Data Access API.
The data access overview contains details on how to use the Data Access API.
API format
GET /export/files/{DATASET_FILE_ID}
Request
curl -X GET "https://platform.adobe.io/data/foundation/export/files/ea40946ac03140ec8ac4f25da360620a-1" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"
Response
[
{
"name": "{FILE_NAME}.parquet",
"length": 2576,
"_links": {
"self": {
"href": "https://platform.adobe.io/data/foundation/export/files/ea40946ac03140ec8ac4f25da360620a-1?path=samplefile.parquet"
}
}
}
]
Preview file data
The “href” property can be used to fetch preview data via the Data Access API.
API format
GET /export/files/{FILE_ID}?path={FILE_NAME}.{FILE_FORMAT}
Request
curl -X GET "https://platform.adobe.io/data/foundation/export/files/ea40946ac03140ec8ac4f25da360620a-1?path=samplefile.parquet" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"
The response to the above request contains a preview of the file's contents.
More information on the Data Access API, including detailed requests and responses, is available in the data access overview.
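Putting the preview steps together: list the dataset's files, look up one file's details, and follow the returned “_links.self.href” to fetch its contents. A minimal Python sketch using the sample IDs from above (illustrative only; see the data access overview for the exact response shapes):

import requests

BASE_HEADERS = {"Authorization": "Bearer {ACCESS_TOKEN}", "x-api-key": "{API_KEY}",
                "x-gw-ims-org-id": "{ORG_ID}", "x-sandbox-name": "{SANDBOX_NAME}"}
CATALOG = "https://platform.adobe.io/data/foundation/catalog"
EXPORT = "https://platform.adobe.io/data/foundation/export"

# 1. List the dataset files for the dataset view (keyed by dataset file ID).
files = requests.get(
    f"{CATALOG}/dataSets/5bf479a6a8c862000050e3c7/views/5bf479a654f52014cfffe7f1/files",
    headers=BASE_HEADERS,
).json()

# 2. Fetch the details of the first dataset file.
dataset_file_id = next(iter(files))
details = requests.get(f"{EXPORT}/files/{dataset_file_id}", headers=BASE_HEADERS).json()

# 3. Follow the href to download the file contents for preview.
href = details[0]["_links"]["self"]["href"]
preview = requests.get(href, headers=BASE_HEADERS)
print(len(preview.content), "bytes previewed")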
Get “fileDescription” from dataset
For the destination component (the output of transformed data), the data engineer chooses an output dataset (“Figure 12” in the ETL Workflow). The XDM schema is associated with the output dataset. The data to be written is identified by the “fileDescription” attribute of the dataset entity from the Data Discovery APIs. This information can be fetched using a dataset ID ({DATASET_ID}). The “fileDescription” property in the JSON response provides the requested information.
API format
GET /catalog/dataSets/{DATASET_ID}
{DATASET_ID}: The id value of the dataset you are trying to access.
Request
curl -X GET "https://platform.adobe.io/data/foundation/catalog/dataSets/59c93f3da7d0c00000798f68" \
-H "accept: application/json" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"
Response
{
"59c93f3da7d0c00000798f68": {
"version": "1.0.4",
"fileDescription": {
"persisted": false,
"format": "parquet"
}
}
}
Data will be written to Experience Platform using the Batch Ingestion API. Writing of data is an asynchronous process. When data is written to Adobe Experience Platform, a batch is created and marked as a success only after data is fully written.
Data in Experience Platform should be written in the form of Parquet files.
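As an illustration of producing a Parquet file from transformed records before upload, a minimal sketch using pyarrow (the record fields are hypothetical; any Parquet writer can be used):

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical transformed records, with field names matching the target XDM schema.
records = [
    {"name": "Alice", "age": "34"},
    {"name": "Bob", "age": "29"},
]

table = pa.Table.from_pylist(records)         # build an Arrow table from row dictionaries
pq.write_table(table, "transformed.parquet")  # serialize to a Parquet file ready for ingestion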
Execution phase
As execution starts, the connector (as defined in the source component) reads the data from Experience Platform using the Data Access API. The transformation process reads the data for a certain time range. Internally, it queries batches of source datasets using a parameterized (rolling for time series data, or incremental) start date, lists the dataset files for those batches, and starts making requests for the data in those dataset files.
Example transformations
The sample ETL transformations document contains a number of example transformations, including identity handling and data-type mappings. Please use these transformations for reference.
Read data from Experience Platform
Using the Catalog API, you can fetch all batches between a specified start time and end time, and sort them by the order they were created.
Request
curl -X GET "https://platform.adobe.io/data/foundation/catalog/batches?dataSet=DATASETID&createdAfter=START_TIMESTAMP&createdBefore=END_TIMESTAMP&sort=desc:created" \
-H "Accept: application/json" \
-H "Authorization:Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}"
Details on filtering batches can be found in the Data Access tutorial.
Get files out of a batch
Once you have the ID for the batch you are looking for ({BATCH_ID}
), it is possible to retrieve a list of files belonging to a specific batch via the Data Access API. Details for doing so are available in the Data Access tutorial.
Request
curl -X GET "https://platform.adobe.io/data/foundation/export/batches/{BATCH_ID}/files" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"
Access files by using file ID
Using the unique ID of a file ({FILE_ID}), the Data Access API can be used to access the specific details of the file, including its name, size in bytes, and a link to download it.
Request
curl -X GET "https://platform.adobe.io/data/foundation/export/files/{FILE_ID}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "x-api-key: {API_KEY}"
The response may point to a single file, or a directory. Details on each can be found in the Data Access tutorial.
Access file content
The Data Access API can be used to access the contents of a specific file. To fetch the contents, a GET request is made using the value returned for _links.self.href
when accessing a file using the file ID.
Request
curl -X GET "https://platform.adobe.io/data/foundation/export/files/{DATASET_FILE_ID}?path=filename1.csv" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "x-api-key: {API_KEY}"
The response to this request contains the contents of the file. For more information, including details on response pagination, see the How to Query Data via data access API tutorial.
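Chained together, the execution-phase read loop fetches the batches for a dataset inside a time window, lists each batch's files, and downloads each file's contents. A minimal Python sketch (illustrative only; the exact response shapes are documented in the Data Access tutorial):

import requests

BASE_HEADERS = {"Authorization": "Bearer {ACCESS_TOKEN}", "x-api-key": "{API_KEY}",
                "x-gw-ims-org-id": "{ORG_ID}", "x-sandbox-name": "{SANDBOX_NAME}"}
CATALOG = "https://platform.adobe.io/data/foundation/catalog"
EXPORT = "https://platform.adobe.io/data/foundation/export"

# 1. Batches created within the processing window, newest first (keyed by batch ID).
batches = requests.get(f"{CATALOG}/batches", headers=BASE_HEADERS, params={
    "dataSet": "{DATASET_ID}",
    "createdAfter": "{START_TIMESTAMP}",
    "createdBefore": "{END_TIMESTAMP}",
    "sort": "desc:created",
}).json()

for batch_id in batches:
    # 2. Files belonging to this batch; a "data" array of entries carrying a
    #    dataSetFileId is assumed here (see the Data Access tutorial).
    files = requests.get(f"{EXPORT}/batches/{batch_id}/files", headers=BASE_HEADERS).json()
    for entry in files.get("data", []):
        file_id = entry["dataSetFileId"]
        # 3. File details include a download link under _links.self.href.
        details = requests.get(f"{EXPORT}/files/{file_id}", headers=BASE_HEADERS).json()
        content = requests.get(details[0]["_links"]["self"]["href"], headers=BASE_HEADERS).content
        # ... hand the downloaded content to the transformation step ...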
Validate records for schema compliance
When data is being written, users can opt to validate data according to the validation rules defined in the XDM schema. More information on schema validation can be found in the ETL Ecosystem Integration Reference Code on GitHub.
If you are using the reference implementation found on GitHub, you can turn on schema validation in this implementation using the system property -DenableSchemaValidation=true.
Validation can be performed for logical XDM types, using attributes such as minLength and maxLength for strings, and minimum and maximum for integers. The Schema Registry API developer guide contains a table that outlines XDM types and the properties that can be used for validation.
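If the connector performs its own validation rather than using the reference implementation, a standard JSON Schema validator can enforce these attributes. A minimal Python sketch using the jsonschema package, with hypothetical field constraints:

from jsonschema import Draft7Validator

# Hypothetical field-level constraints as they might appear in an XDM schema.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "minLength": 1, "maxLength": 50},
        "age": {"type": "integer", "minimum": 0, "maximum": 150},
    },
}

validator = Draft7Validator(schema)
record = {"name": "", "age": 200}

# Surface each violation so it can be reported by the ETL tool.
for error in validator.iter_errors(record):
    print(f"{list(error.path)}: {error.message}")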
Note: The default minimum and maximum values for integer types are the MIN and MAX values that the type can support, but these values can be further constrained to minimums and maximums of your choosing.
Create a batch
Once the data is processed, the ETL tool will write the data back to Experience Platform using the Batch Ingestion API. Before data can be added to a dataset, it must be linked to a batch which will later be uploaded into a specific dataset.
Request
curl -X POST "https://platform.adobe.io/data/foundation/import/batches" \
-H "accept: application/json" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}" \
-d '{
"datasetId":"{DATASET_ID}"
}'
Details for creating a batch, including sample requests and responses can be found in the Batch Ingestion overview.
Write to dataset
After successfully creating a new batch, files can then be uploaded to a specific dataset. Multiple files can be posted in a batch until it is promoted. Files can be uploaded using the Small File Upload API; however, if your files are too large and the gateway limit is exceeded, you can use the Large File Upload API. Details for using both Large and Small File Upload can be found in the Batch Ingestion overview.
Request
Data in Experience Platform should be written in the form of Parquet files.
curl -X PUT "https://platform.adobe.io/data/foundation/import/batches/{BATCH_ID}/dataSets/{DATASET_ID}/files/{FILE_NAME}.parquet" \
-H "accept: application/json" \
-H "x-gw-ims-org-id:{ORG_ID}" \
-H "Authorization:Bearer ACCESS_TOKEN" \
-H "x-api-key: API_KEY" \
-H "content-type: application/octet-stream" \
--data-binary "@{FILE_PATH_AND_NAME}.parquet"
Mark batch upload complete
After all files have been uploaded to the batch, the batch can be signaled for completion. Doing so creates Catalog “DataSetFile” entries for the completed files and associates them with the generated batch. The Catalog batch is then marked as successful, which triggers downstream flows to ingest the available data.
Data will first land in the staging location on Adobe Experience Platform and then will be moved to the final location after cataloging and validation. Batches will be marked as successful once all the data is moved to a permanent location.
Request
curl -X POST "https://platform.adobe.io/data/foundation/import/batches/{BATCH_ID}?action=COMPLETE" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization:Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"
If successful, the response will return HTTP Status 200 OK and the response body will be empty.
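A minimal Python sketch of the complete write sequence (create a batch, upload a Parquet file, signal completion); the dataset ID and file name are placeholders, and the batch ID is taken from the creation response:

import requests

BASE_HEADERS = {"Authorization": "Bearer {ACCESS_TOKEN}", "x-api-key": "{API_KEY}",
                "x-gw-ims-org-id": "{ORG_ID}", "x-sandbox-name": "{SANDBOX_NAME}"}
IMPORT = "https://platform.adobe.io/data/foundation/import"
dataset_id = "{DATASET_ID}"

# 1. Create a batch linked to the target dataset.
batch = requests.post(f"{IMPORT}/batches",
                      headers={**BASE_HEADERS, "Content-Type": "application/json"},
                      json={"datasetId": dataset_id}).json()
batch_id = batch["id"]

# 2. Upload the Parquet file into the batch (Small File Upload).
with open("transformed.parquet", "rb") as f:
    requests.put(
        f"{IMPORT}/batches/{batch_id}/dataSets/{dataset_id}/files/transformed.parquet",
        headers={**BASE_HEADERS, "Content-Type": "application/octet-stream"},
        data=f,
    ).raise_for_status()

# 3. Mark the batch upload as complete to trigger downstream ingestion.
requests.post(f"{IMPORT}/batches/{batch_id}?action=COMPLETE",
              headers=BASE_HEADERS).raise_for_status()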
The ETL tool should note the timestamp of the source dataset(s) as the data is read.
In the next transformation execution, likely triggered by a schedule or an event invocation, the ETL will start requesting data from the previously saved timestamp onwards.
Get last batch status
Before running new tasks in the ETL tool, you must ensure that the last batch was successfully completed. The Catalog Service API provides a batch-specific option that returns the details of the relevant batches.
Request
curl -X GET "https://platform.adobe.io/data/foundation/catalog/batches?limit=1&sort=desc:created" \
-H "Accept: application/json" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"
Response
New tasks can be scheduled if the previous batch “status” value is “success” as shown below:
"{BATCH_ID}": {
"imsOrg": "{ORG_ID}",
"created": 1494349962314,
"createdClient": "{API_KEY}",
"createdUser": "CLIENT_USER_ID@AdobeID",
"updatedUser": "CLIENT_USER_ID@AdobeID",
"updated": 1494349963467,
"status": "success",
"errors": [],
"version": "1.0.1",
"availableDates": {}
}
Get last batch status by ID
An individual batch status can be retrieved through the Catalog Service API by issuing a GET request using the {BATCH_ID}
. The {BATCH_ID}
used would be the same as the ID returned when the batch was created.
Request
curl -X GET "https://platform.adobe.io/data/foundation/catalog/batches/{BATCH_ID}" \
-H "Accept: application/json" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"
Response - Success
The following response shows a “success” status:
"{BATCH_ID}": {
"imsOrg": "{ORG_ID}",
"created": 1494349962314,
"createdClient": "{API_KEY}",
"createdUser": "{CREATED_USER}",
"updatedUser": "{UPDATED_USER}",
"updated": 1494349962314,
"status": "success",
"errors": [],
"version": "1.0.1",
"availableDates": {}
}
Response - Failure
In the case of failure, the “errors” can be extracted from the response and surfaced in the ETL tool as error messages.
"{BATCH_ID}": {
"imsOrg": "{ORG_ID}",
"created": 1494349962314,
"createdClient": "{API_KEY}",
"createdUser": "{CREATED_USER}",
"updatedUser": "{UPDATED_USER}",
"updated": 1494349962314,
"status": "failure",
"errors": [
{
"code": "200",
"description": "Error in validating schema for file: 'adl://dataLake.azuredatalakestore.net/connectors-dev/stage/BATCHID/dataSetId/contact.csv' with errorMessage=adl://dataLake.azuredatalakestore.net/connectors-dev/stage/BATCHID/dataSetId/contact.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [57, 98, 55, 10] and errorType=java.lang.RuntimeException",
"rows": []
}
],
"version": "1.0.1",
"availableDates": {}
}
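For example, a scheduling step can check the last batch and surface any errors before starting a new run. A minimal Python sketch (illustrative only):

import requests

BASE_HEADERS = {"Authorization": "Bearer {ACCESS_TOKEN}", "x-api-key": "{API_KEY}",
                "x-gw-ims-org-id": "{ORG_ID}", "x-sandbox-name": "{SANDBOX_NAME}"}
CATALOG = "https://platform.adobe.io/data/foundation/catalog"

batch_id = "{BATCH_ID}"
batch = requests.get(f"{CATALOG}/batches/{batch_id}", headers=BASE_HEADERS).json()[batch_id]

if batch["status"] == "success":
    print("Previous batch succeeded; new tasks can be scheduled.")
elif batch["status"] == "failure":
    # Extract Catalog error details and surface them in the ETL tool.
    for error in batch.get("errors", []):
        print(f"Batch {batch_id} failed: [{error['code']}] {error['description']}")
else:
    print(f"Batch {batch_id} is still '{batch['status']}'; wait before scheduling new tasks.")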
Incremental vs snapshot data and events vs profiles
Data can be represented in a two-by-two matrix as follows:
- Event data typically has indexed timestamp columns in each row.
- Profile data typically has no timestamp; each row is identified by a primary/composite key.
- Incremental data is where only new/updated data comes into the system and is appended to the current data in the datasets.
- Snapshot data is when all data comes into the system and replaces some or all previous data in a dataset.
In the case of incremental events, the ETL tool should use the available dates/create date of the batch entity. In the case of a push service, available dates will not be present, so the tool should use the batch created/updated date to mark increments. Every batch of incremental events is required to be processed.
For incremental profiles, the ETL tool should use the created/updated dates of the batch entity. Commonly, every batch of incremental profile data is required to be processed.
Snapshot events are unlikely due to the sheer size of the data, but if they are required, the ETL tool must pick only the last batch for processing.
When snapshot profiles are used, the ETL tool has to pick the last batch of data that arrived in the system. However, if the requirement is to keep track of the versions of changes, then all batches will need to be processed. De-duplication processing within the ETL process will help control storage costs.
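To illustrate the difference, a minimal Python sketch that selects batches to process for the incremental and snapshot cases, assuming a watermark timestamp saved from the previous run (the values are placeholders):

import requests

BASE_HEADERS = {"Authorization": "Bearer {ACCESS_TOKEN}", "x-api-key": "{API_KEY}",
                "x-gw-ims-org-id": "{ORG_ID}", "x-sandbox-name": "{SANDBOX_NAME}"}
CATALOG = "https://platform.adobe.io/data/foundation/catalog"

last_watermark_ms = 1542749145828  # example: the "created" value saved by the previous run

# Batches created after the watermark (response is keyed by batch ID).
batches = requests.get(f"{CATALOG}/batches", headers=BASE_HEADERS, params={
    "dataSet": "{DATASET_ID}",
    "createdAfter": last_watermark_ms,
    "sort": "desc:created",
}).json()

if batches:
    # Incremental data: every new batch since the watermark is processed.
    incremental_batch_ids = list(batches)
    # Snapshot data: only the most recently created batch is processed.
    snapshot_batch_id = max(batches, key=lambda b: batches[b]["created"])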
Batch replay and data reprocessing
Batch replay and data reprocessing may be required in cases where a client discovers that, for the past ‘n’ days, the data processed by the ETL has not come through as expected, or the source data itself was not correct.
To do this, the client’s data administrators will use the Platform UI to remove the batches containing corrupt data. Then, the ETL will likely need to be re-run, thus repopulating with correct data. If the source itself had corrupt data, the data engineer/administrator will need to correct the source batches and re-ingest the data (either into Adobe Experience Platform or via ETL connectors).
Based upon the type of data being generated, it will be the data engineer’s choice to remove a single batch or all batches from certain datasets. Data will be removed/archived as per Experience Platform guidelines.
In such a scenario, ETL functionality to purge data is likely to be important.
Once purging is complete, the client admins will have to reconfigure Adobe Experience Platform to restart processing for core services from the time when the batches are deleted.
Concurrent batch processing
At the client’s discretion, data admins/engineers may decide to extract, transform, and load data in a sequential or concurrent manner, depending on the characteristics of a particular dataset. This will also be based upon the use case the client is targeting with the transformed data.
For example, if the client is persisting to an updatable persistence store and the sequence or order of events is important, the client may need to strictly process jobs with sequential ETL transformations.
In other cases, out of order data can be processed by downstream applications/processes that internally sort using a specified time stamp. In those cases, parallel ETL transformations may be viable to improve processing times.
For source batches, it will again depend upon client preference and consumer constraints. If the source data can be picked up in parallel without regard to the recency/ordering of a row, then the transformation process can create process batches with a higher degree of parallelism (optimization based on out-of-order processing). But if the transform has to honor timestamps or change precedence ordering, the Data Access API or the ETL tool scheduler/invocation will have to ensure that batches are not processed out of order where possible.
Deferral
Deferral is a process in which input data is not yet complete enough to be sent to downstream processes, but may be usable in the future. Clients will weigh their individual tolerance for data windowing for future matching against the cost of processing, to decide whether to put data aside and reprocess it in the next transformation execution, in the hope that it can be enriched and reconciled/stitched at some future time inside the retention window. This cycle continues until the row is processed sufficiently or it is deemed too stale to continue investing in. Every iteration will generate deferred data, which is a superset of all deferred data in previous iterations.
Adobe Experience Platform does not currently identify deferred data, so client implementations must rely on the ETL and manual dataset configurations to create another dataset in Platform, mirroring the source dataset, which can be used to keep deferred data. In this case, deferred data will be similar to snapshot data. In every execution of the ETL transform, the source data will be united with the deferred data and sent for processing.
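As a sketch of the pattern described above (the completeness and staleness checks are hypothetical), each run unions the new source rows with the mirrored deferred dataset, sends the rows that are now complete downstream, and re-defers the rest unless they have aged out of the retention window:

def split_deferred(source_rows, deferred_rows, is_complete, is_stale):
    # Union of newly arrived and previously deferred rows.
    candidates = source_rows + deferred_rows
    # Rows that can now be enriched/reconciled are sent for processing.
    ready = [row for row in candidates if is_complete(row)]
    # Incomplete rows are deferred again unless they are too stale to keep.
    deferred = [row for row in candidates if not is_complete(row) and not is_stale(row)]
    return ready, deferred

ready, deferred = split_deferred(
    source_rows=[{"id": 1, "email": None}],
    deferred_rows=[{"id": 2, "email": "a@example.com"}],
    is_complete=lambda row: row["email"] is not None,  # hypothetical completeness rule
    is_stale=lambda row: False,                        # hypothetical retention-window check
)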
Changelog
- The “schema” property of a dataset previously contained a reference to a deprecated /xdms endpoint in the Catalog API. This has been replaced by a “schemaRef” that provides the “id”, “version”, and “contentType” of the schema as referenced in the new Schema Registry API.