Ingest batch data
In this lesson, you will ingest batch data into Experience Platform using various methods.
Batch data ingestion allows you to ingest a large amount of data into Adobe Experience Platform at once. You can ingest batch data in a one-time upload within Platform’s interface or using the API. You can also configure regularly scheduled batch uploads from third-party services, such as cloud storage services, using Source connectors.
Data Engineers will need to ingest batch data outside of this tutorial.
Before you begin the exercises, watch this short video to learn more about data ingestion:
Permissions required
In the Configure Permissions lesson, you set up all the access controls required to complete this lesson.
You will need access to an (S)FTP server or cloud storage solution for the Sources exercise. There is a workaround if you do not have one.
Ingest data in batches with Platform user interface
Data can be uploaded directly into a dataset on the datasets screen in JSON and parquet formats. This is a great way to test ingestion of some of your data after creating a schema and dataset.
Download and prep the data
First, get the sample data and customize it for your tenant:
- Download luma-data.zip to your Luma Tutorial Assets folder.
- Unzip the file, creating a folder called luma-data, which contains the four data files we will use in this lesson.
- Open luma-loyalty.json in a text editor and replace all instances of _techmarketingdemos with your own underscore-tenant id, as seen in your own schemas (or script the replacement as sketched after these steps).
- Save the updated file.
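If you would rather script the replacement than edit the file by hand, a minimal sketch along the following lines works; the file path and the _yourtenantid placeholder are assumptions you would swap for your own values.

```python
# Hypothetical one-off helper: swap the sample tenant id for your own.
from pathlib import Path

path = Path("luma-data/luma-loyalty.json")  # adjust to where you unzipped the file
text = path.read_text(encoding="utf-8")
path.write_text(text.replace("_techmarketingdemos", "_yourtenantid"), encoding="utf-8")
```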
Ingest the data
- In the Platform user interface, select Datasets in the left navigation
- Open your Luma Loyalty Dataset
- Scroll down until you see the Add Data section in the right column
- Upload the luma-loyalty.json file
- Once the file uploads, a row for the batch will appear
- If you reload the page after a few minutes, you should see that the batch has successfully uploaded with 1000 records and 1000 profile fragments.
- Enabling error diagnostics generates data about the ingestion of your data, which you can then review using the Data Access API. Learn more about it in the documentation.
- Partial ingestion allows you to ingest data containing errors, up to a certain threshold which you can specify. Learn more about it in the documentation. (A hedged API sketch for checking on a batch follows these notes.)
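As a lighter-weight check than retrieving the full diagnostics files, you can also look up the batch itself in the Catalog Service API. The sketch below is a rough illustration using Python’s requests library; the credentials are placeholders from your own Platform project, and the exact response fields may differ slightly from the comments.

```python
# Sketch: look up a batch via the Catalog Service API (all values are placeholders).
import requests

BATCH_ID = "YOUR_BATCH_ID"
headers = {
    "Authorization": "Bearer YOUR_ACCESS_TOKEN",
    "x-api-key": "YOUR_API_KEY",
    "x-gw-ims-org-id": "YOUR_IMS_ORG_ID",
    "x-sandbox-name": "YOUR_SANDBOX_NAME",
}

resp = requests.get(
    f"https://platform.adobe.io/data/foundation/catalog/batches/{BATCH_ID}",
    headers=headers,
)
resp.raise_for_status()

# Catalog responses are keyed by id; status should read "success" once ingestion finishes.
batch = resp.json().get(BATCH_ID, {})
print(batch.get("status"), batch.get("metrics"))
```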
Validate the data
There are a few ways to confirm that the data was successfully ingested.
Validate in the Platform user interface
To confirm that the data was ingested into the dataset:
- On the same page where you ingested the data, select the Preview dataset button in the top right
- Select the Preview button and you should be able to see some of the ingested data
To confirm that the data landed in Profile (may take a few minutes for the data to land):
- Go to Profiles in the left navigation
- Select the icon next to the Select identity namespace field to open the modal
- Select your Luma Loyalty Id namespace
- Enter one of the loyaltyId values from your dataset, 5625458
- Select View
Validate with data ingestion events
If you subscribed to data ingestion events in the previous lesson, check your unique webhook.site URL. You should see three requests show up in the following order, with some time in between them, with the following eventCode values:

- ing_load_success: the batch was ingested
- ig_load_success: the batch was ingested into identity graph
- ps_load_success: the batch was ingested into profile service
See the documentation for more details on the notifications.
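If you would rather capture these notifications in code than watch webhook.site, a small receiver along the lines of the sketch below works. It assumes Flask is available and that the eventCode values appear somewhere in the JSON payload, which you should confirm against a real notification.

```python
# Sketch: minimal webhook receiver that logs any eventCode found in the payload.
# The payload structure is an assumption to verify against an actual notification.
from flask import Flask, request

app = Flask(__name__)

def find_event_codes(obj):
    """Recursively collect every 'eventCode' value in a nested payload."""
    codes = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "eventCode":
                codes.append(value)
            else:
                codes.extend(find_event_codes(value))
    elif isinstance(obj, list):
        for item in obj:
            codes.extend(find_event_codes(item))
    return codes

@app.route("/webhook", methods=["POST"])
def webhook():
    payload = request.get_json(force=True, silent=True) or {}
    for code in find_event_codes(payload):
        # Expect ing_load_success, ig_load_success, ps_load_success over time.
        print("received eventCode:", code)
    return "ok"

if __name__ == "__main__":
    app.run(port=8000)
```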
Ingest data in batches with Platform API
Now let’s upload data using the API.
Download and prep the data
- You should have already downloaded and unzipped luma-data.zip into your Luma Tutorial Assets folder.
- Open luma-crm.json in a text editor and replace all instances of _techmarketingdemos with your own underscore-tenant id, as seen in your schemas
- Save the updated file
Get the dataset id
First, let’s get the id of the dataset into which we want to ingest data:
- Open Postman
- If you don’t have an access token, open the request OAuth: Request Access Token and select Send to request a new access token, just like you did in the Postman lesson.
- Open your environment variables and make sure the value of CONTAINER_ID is still tenant
- Open the request Catalog Service API > Datasets > Retrieve a list of datasets. and select Send (a rough Python equivalent of this call is sketched after these steps)
- You should get a 200 OK response
- Copy the id of the Luma CRM Dataset from the Response body
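If you prefer to make the same call outside Postman, the sketch below is a rough Python equivalent; the header values are placeholders from your own Platform project, and matching on the dataset name is just one way to pick out the id.

```python
# Sketch: list datasets via the Catalog Service API and find the Luma CRM Dataset id.
import requests

headers = {
    "Authorization": "Bearer YOUR_ACCESS_TOKEN",
    "x-api-key": "YOUR_API_KEY",
    "x-gw-ims-org-id": "YOUR_IMS_ORG_ID",
    "x-sandbox-name": "YOUR_SANDBOX_NAME",
}

resp = requests.get("https://platform.adobe.io/data/foundation/catalog/dataSets", headers=headers)
resp.raise_for_status()

# Catalog responses are keyed by id; match on the dataset name to find yours.
for dataset_id, dataset in resp.json().items():
    if dataset.get("name") == "Luma CRM Dataset":
        print("dataset id:", dataset_id)
```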
Create the batch
Now we can create a batch in the dataset:
- Download Data Ingestion API.postman_collection.json to your Luma Tutorial Assets folder
- Import the collection into Postman
- Select the request Data Ingestion API > Batch Ingestion > Create a new batch in Catalog Service.
- Paste the following as the Body of the request, replacing the datasetId value with your own:

```json
{
  "datasetId": "REPLACE_WITH_YOUR_OWN_DATASETID",
  "inputFormat": {
    "format": "json"
  }
}
```

- Select the Send button
- You should get a 201 Created response containing the id of your new batch! (A raw-API equivalent of this request is sketched after these steps.)
- Copy the id of the new batch
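For reference, a rough raw-API equivalent of the create-batch request looks like this; the credentials are the same placeholders as in the earlier dataset-listing sketch.

```python
# Sketch: create a new batch for JSON ingestion (placeholder credentials and dataset id).
import requests

headers = {"Authorization": "Bearer YOUR_ACCESS_TOKEN", "x-api-key": "YOUR_API_KEY",
           "x-gw-ims-org-id": "YOUR_IMS_ORG_ID", "x-sandbox-name": "YOUR_SANDBOX_NAME"}

resp = requests.post(
    "https://platform.adobe.io/data/foundation/import/batches",
    headers={**headers, "Content-Type": "application/json"},
    json={"datasetId": "REPLACE_WITH_YOUR_OWN_DATASETID", "inputFormat": {"format": "json"}},
)
resp.raise_for_status()
print("batch id:", resp.json()["id"])  # a 201 Created response includes the new batch id
```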
Ingest the data
Now we can upload the data into the batch:
- Select the request Data Ingestion API > Batch Ingestion > Upload a file to a dataset in a batch.
- In the Params tab, enter your dataset id and batch id into their respective fields
- In the Params tab, enter luma-crm.json as the filePath
- In the Body tab, select the binary option
- Select the downloaded luma-crm.json from your local Luma Tutorial Assets folder
- Select Send and you should get a 200 OK response with ‘1’ in the response body (a raw-API sketch of this upload follows these steps)
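Continuing the hedged examples above, the raw-API version of this step is a PUT of the file’s bytes to a path built from the batch id, dataset id, and file name; all ids and credentials below are placeholders.

```python
# Sketch: upload luma-crm.json into the open batch (placeholder ids and credentials).
import requests

headers = {"Authorization": "Bearer YOUR_ACCESS_TOKEN", "x-api-key": "YOUR_API_KEY",
           "x-gw-ims-org-id": "YOUR_IMS_ORG_ID", "x-sandbox-name": "YOUR_SANDBOX_NAME"}
batch_id, dataset_id = "YOUR_BATCH_ID", "YOUR_DATASET_ID"

with open("luma-crm.json", "rb") as f:
    resp = requests.put(
        f"https://platform.adobe.io/data/foundation/import/batches/{batch_id}"
        f"/datasets/{dataset_id}/files/luma-crm.json",
        headers={**headers, "Content-Type": "application/octet-stream"},
        data=f,
    )
resp.raise_for_status()
print(resp.status_code)  # expect 200 OK
```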
At this point, if you look at your batch in the Platform user interface, you will see that it is in a “Loading” status:
Because the Batch API is often used to upload multiple files, you need to tell Platform when a batch is complete, which we will do in the next step.
Complete the batch
To complete the batch:
- Select the request Data Ingestion API > Batch Ingestion > Finish uploading a file to a dataset in a batch.
- In the Params tab, enter COMPLETE as the action
- In the Params tab, enter your batch id. Do not worry about dataset id or filePath, if they are present.
- Make sure that the URL of the POST is https://platform.adobe.io/data/foundation/import/batches/:batchId?action=COMPLETE and that there aren’t any unnecessary references to the datasetId or filePath
- Select Send and you should get a 200 OK response with ‘1’ in the response body (a raw-API sketch of this call follows these steps)
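And the raw-API sketch of the completion call, matching the URL shown above, with placeholder credentials:

```python
# Sketch: mark the batch COMPLETE so Platform starts processing it (placeholders throughout).
import requests

headers = {"Authorization": "Bearer YOUR_ACCESS_TOKEN", "x-api-key": "YOUR_API_KEY",
           "x-gw-ims-org-id": "YOUR_IMS_ORG_ID", "x-sandbox-name": "YOUR_SANDBOX_NAME"}
batch_id = "YOUR_BATCH_ID"

resp = requests.post(
    f"https://platform.adobe.io/data/foundation/import/batches/{batch_id}?action=COMPLETE",
    headers=headers,
)
resp.raise_for_status()
print(resp.status_code)  # expect 200 OK
```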
Validate the data
Validate in the Platform user interface
Validate the data has landed in the Platform user interface just like you did for the Loyalty dataset.
First, confirm the batch shows that 1000 records have ingested:
Next, confirm the batch using Preview dataset:
Finally, confirm that a profile has been created by looking up one of the profiles in the Luma CRM Id namespace, for example 112ca06ed53d3db37e4cea49cc45b71e
There is one interesting thing that just happened that I want to point out. Open that Danny Wright profile. The profile has both a Lumacrmid and a Lumaloyaltyid. Remember that the Luma Loyalty Schema contained two identity fields, Luma Loyalty Id and CRM Id. Now that we’ve uploaded both datasets, they’ve merged into a single profile. The Loyalty data had Daniel as the first name and “New York City” as the home address, while the CRM data had Danny as the first name and Portland as the home address for the customer with the same Loyalty Id. We will come back to why the first name displays Danny in the lesson on merge policies.
Congratulations, you’ve just merged profiles!
Validate with data ingestion events
If you subscribed to data ingestion events in the previous lesson, check your unique webhook.site URL. You should see three requests come in, just like with the loyalty data:
See the documentation for more details on the notifications.
Ingest data with Workflows
Let’s look at another way of uploading data. The workflows feature allows you to ingest CSV data which is not already modeled in XDM.
Download and prep the data
- You should have already downloaded and unzipped luma-data.zip into your Luma Tutorial Assets folder.
- Confirm that you have luma-products.csv
Create a workflow
Now let’s set up a workflow:
- Go to Workflows in the left navigation
- Select Map CSV to XDM schema and select the Launch button
- Select your Luma Product Catalog Dataset and select the Next button
- Add the luma-products.csv file you downloaded and select the Next button
- Now you are in the mapper interface, in which you can map a field from the source data (one of the column names in the luma-products.csv file) to XDM fields in the target schema. In our example, the column names are close enough to the schema field names that the mapper is able to auto-detect the right mapping! If the mapper were unable to auto-detect the right field, you would select the icon to the right of the target field to select the correct XDM field. Also, if you didn’t want to ingest one of the columns from the CSV, you could delete the row from the mapper. Feel free to play around and change column headings in the luma-products.csv file to get familiar with how the mapper works.
- Select the Finish button
Validate the data
When the batch has uploaded, verify the upload by previewing the dataset.
Since the Luma Product SKU is a non-people namespace, we won’t see any profiles for the product SKUs.
You should see the three hits to your webhook.
Ingest data with Sources
Okay, you did things the hard way. Now let’s move into the promised land of automated batch ingestion! When I say, “SET IT!” you say, “FORGET IT!” “SET IT!” “FORGET IT!” “SET IT!” “FORGET IT!” Just kidding, you would never do such a thing! Ok, back to work. You’re almost done.
Go to Sources in the left navigation to open the Sources catalog. Here you will see various out-of-the-box integrations with industry-leading data and storage providers.
Okay, let’s ingest data using a source connector.
This exercise will be choose-your-own-adventure style. I am going to show the workflow using the FTP source connector. You can either use a different Cloud Storage source connector that you use at your company, or upload the json file using the dataset user interface like we did with the loyalty data.
Many of the Sources have a similar configuration workflow, in which you:
- Enter your authentication details
- Select the data you want to ingest
- Select the Platform dataset into which you want to ingest it
- Map the fields to your XDM schema
- Choose the frequency with which you want to reingest data from that location
Download, prep, and upload the data to your preferred cloud storage vendor
- You should have already downloaded and unzipped luma-data.zip into your Luma Tutorial Assets folder.
- Open luma-offline-purchases.json in a text editor and replace all instances of _techmarketingdemos with your own underscore-tenant id, as seen in your schemas
- Update all of the timestamps so that the events occur in the last month (for example, search for "timestamp":"2022-06 and replace the year and month); a script sketch for doing this in bulk follows these steps
- Choose your preferred cloud storage provider, making sure it is available in the Sources catalog
- Upload luma-offline-purchases.json to a location in your preferred cloud storage provider
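The script sketch mentioned above for shifting the timestamps might look like the following; it assumes the sample file uses the exact "timestamp":"YYYY-MM prefix shown earlier, so spot-check the result before uploading.

```python
# Sketch: rewrite every "timestamp":"YYYY-MM prefix in the sample file to last month.
import re
from datetime import date
from pathlib import Path

today = date.today()
last_month = date(today.year - 1, 12, 1) if today.month == 1 else date(today.year, today.month - 1, 1)

path = Path("luma-data/luma-offline-purchases.json")  # adjust to where you unzipped the file
text = path.read_text(encoding="utf-8")
updated = re.sub(r'("timestamp":")\d{4}-\d{2}', rf'\g<1>{last_month:%Y-%m}', text)
path.write_text(updated, encoding="utf-8")
```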
Ingest the data from your preferred cloud storage location
- In the Platform user interface, filter the Sources catalog to Cloud storage
- Note that there are convenient links to documentation under the ...
- In the box of your preferred cloud storage vendor, select the Configure button
- Authentication is the first step. Enter the name for your account, for example Luma’s FTP Account, and your authentication details. This step should be fairly similar for all cloud storage sources, although the fields may vary slightly. Once you’ve entered the authentication details for an account, you can reuse them for other source connections that might be sending different data on different schedules from other files in the same account
- Select the Connect to source button
- When Platform has successfully connected to the Source, select the Next button
- On the Select data step, the user interface will use your credentials to open the folder on your cloud storage solution
- Select the files you would like to ingest, for example luma-offline-purchases.json
- As the Data format, select XDM JSON
- You can then preview the JSON structure and sample data in your file
- Select the Next button
- On the Mapping step, select your Luma Offline Purchase Events Dataset and select the Next button. Note in the message that since the data we are ingesting is a JSON file, there is no mapping step where we map source fields to target fields. JSON data must already be in XDM. If you were ingesting a CSV, you would see the full mapping user interface on this step:
- On the Scheduling step, you choose the frequency with which you want to reingest data from the Source. Take a moment to look at the options. We are just going to do a one-time ingestion, so leave the Frequency on Once and select the Next button:
- On the Dataflow detail step, you can choose a name for your dataflow, enter an optional description, and turn on error diagnostics and partial ingestion. Leave the settings as they are and select the Next button:
- On the Review step, you can review all of your settings together and either edit them or select the Finish button
- After saving, you will land on a screen like this:
Validate the data
When the batch has uploaded, verify the upload by previewing the dataset.
You should see the three hits to your webhook.
Look up the profile with value 5625458 in the Luma Loyalty Id namespace again to see if there are any purchase events in their profile. You should see one purchase. You can dig into the details of the purchase by selecting View JSON:
ETL Tools
Adobe partners with multiple ETL vendors to support data ingestion into Experience Platform. Because of the variety of third-party vendors, ETL is not covered in this tutorial, although you are welcome to review some of these resources:
Additional Resources
Now let’s stream data using the Web SDK