Ad-Tech

Unifying Heterogeneity: Ingesting Multi-Source First-Party Data for Cookieless Retargeting

Problem Statement

Amid rising user privacy concerns, major players such as Apple and Google have begun restricting third-party browser cookies, which compromise user privacy by enabling cross-site tracking without explicit consent. Their inevitable deprecation makes privacy-compliant first-party data the foundation for retargeting. The client needed to transition from processing third-party cookie data to utilizing first-party data for retargeting advertising campaigns. In making this shift, they encountered two challenges:

- First-party data varies in schema across companies and is stored in different systems (e.g., SharePoint, S3 buckets).
- Each schema requires distinct processing logic.

Client Info

A leading marketing agency specializing in retargeting campaigns.

Outcomes

Time to onboard a new customer reduced from 2 months to 1 week.

Resilient system that can read and process large files (up to 100 GB).

The sales team differentiated their pitch to prospective customers by citing the pilot success stories.

How did BeautifulCode do it?

We studied the diverse systems through which companies share their first-party data, such as SharePoint, S3 buckets, API endpoints, and SFTP. The goal was to extract data from these systems and land it in company-specific buckets for isolation and security compliance. Integrating each new system posed challenges, particularly around source-specific authentication and download processes. To address this, we analyzed these systems, developed tailored libraries, and exposed them through easy-to-use interfaces so that new data systems could be integrated with minimal effort.

Implementation architecture diagram

Cleansing and Transformation pipeline stages

The incoming data exists in various formats, including Avro, JSON, and CSV, and each format would otherwise require its own cleaning and transformation logic. To streamline this, we convert all incoming formats into a common CSV representation so that every dataset can run through the same pipeline. This also concentrates our optimization effort on a single format rather than tweaking the pipeline for each one, improving both efficiency and consistency across operations.
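As an illustration of this normalization step, here is a minimal Python sketch that converts JSON and Avro inputs into the common CSV format. The file paths, the newline-delimited JSON assumption, and the use of fastavro and Pandas are illustrative choices, not the client's exact implementation.

```python
# Minimal sketch: normalize heterogeneous input formats to a common CSV.
# Paths and format handling are illustrative assumptions.
from pathlib import Path

import pandas as pd
import fastavro


def to_csv(source: Path, target: Path) -> None:
    """Convert a JSON, Avro, or CSV file into the common CSV format."""
    suffix = source.suffix.lower()
    if suffix == ".json":
        # Assumes newline-delimited JSON records.
        df = pd.read_json(source, lines=True)
    elif suffix == ".avro":
        with source.open("rb") as fo:
            df = pd.DataFrame.from_records(fastavro.reader(fo))
    elif suffix == ".csv":
        df = pd.read_csv(source)
    else:
        raise ValueError(f"Unsupported format: {suffix}")
    df.to_csv(target, index=False)


to_csv(Path("input/events.avro"), Path("normalized/events.csv"))
```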

The solution is built from the following components.

Abstract data connector framework

This framework allows for seamless data retrieval from various sources, including FTP, SharePoint, APIs, and different cloud storage platforms. It simplifies the process of integrating new data sources by providing a standard interface and handling authentication and authorization mechanisms.
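The sketch below shows what such a connector abstraction could look like in Python. The class and method names are hypothetical and only illustrate the standard-interface idea; the framework's actual API is not shown in this case study.

```python
# Hypothetical sketch of a connector abstraction with a standard interface.
from abc import ABC, abstractmethod
from pathlib import Path


class DataConnector(ABC):
    """Standard interface every source-specific connector implements."""

    @abstractmethod
    def authenticate(self) -> None:
        """Perform source-specific authentication (OAuth, API keys, SSH keys, ...)."""

    @abstractmethod
    def list_files(self, remote_dir: str) -> list[str]:
        """List files available for download."""

    @abstractmethod
    def download(self, remote_path: str, local_dir: Path) -> Path:
        """Download one file and return its local path."""


class SftpConnector(DataConnector):
    """Example concrete connector; bodies elided in this sketch."""

    def __init__(self, host: str, username: str, key_path: Path):
        self.host, self.username, self.key_path = host, username, key_path

    def authenticate(self) -> None:
        ...  # e.g., open an SSH session

    def list_files(self, remote_dir: str) -> list[str]:
        ...

    def download(self, remote_path: str, local_dir: Path) -> Path:
        ...
```

New sources plug in by implementing the same three methods, so the ingestion code never needs to know which system it is talking to.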

Isolated Cloud Storage

Data from each customer is securely stored in separate cloud storage buckets, ensuring privacy and compliance with data regulations. Granular IAM policies are implemented to restrict access to specific data sets. The creation of buckets for individual customers and the assignment of detailed IAM policies to these buckets are orchestrated as infrastructure-as-code operations using Terraform.
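The provisioning itself is done with Terraform. Purely to illustrate the per-customer isolation pattern, here is a hypothetical Python sketch using the google-cloud-storage client; the bucket naming scheme, the role, and the service account are assumptions, not the client's actual configuration.

```python
# Illustrative only: the case study provisions this with Terraform.
# Hypothetical sketch of per-customer bucket isolation with granular IAM.
from google.cloud import storage


def provision_customer_bucket(project: str, customer_id: str, pipeline_sa: str) -> None:
    client = storage.Client(project=project)

    # One dedicated bucket per customer keeps data sets physically isolated.
    bucket = client.create_bucket(f"{project}-{customer_id}-raw", location="US")

    # Grant only the pipeline's service account access to this bucket.
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append(
        {
            "role": "roles/storage.objectAdmin",
            "members": {f"serviceAccount:{pipeline_sa}"},
        }
    )
    bucket.set_iam_policy(policy)


provision_customer_bucket(
    "acme-adtech", "customer-42", "pipeline@acme-adtech.iam.gserviceaccount.com"
)
```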

Event-Driven Architecture

The data ingestion process is triggered by events, automatically initiating the data cleaning and transformation pipelines. This eliminates the need for manual intervention and ensures real-time data processing.
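A minimal sketch of such a trigger, assuming Cloud Storage upload notifications are published to a Pub/Sub topic in JSON_API_V1 format; the project, subscription name, and pipeline entry point are hypothetical.

```python
# Minimal sketch: react to "object finalized" notifications from Cloud Storage
# delivered via Pub/Sub. Project, subscription, and pipeline hook are hypothetical.
import json

from google.cloud import pubsub_v1


def handle_upload(message) -> None:
    # With JSON_API_V1 payloads, the message data is the object resource.
    event = json.loads(message.data)
    bucket, name = event["bucket"], event["name"]
    # Kick off the cleaning and transformation pipeline for the new file.
    print(f"New upload gs://{bucket}/{name}; starting pipeline")
    # start_pipeline(bucket, name)  # hypothetical pipeline entry point
    message.ack()


subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("acme-adtech", "raw-file-uploads")
streaming_pull = subscriber.subscribe(subscription, callback=handle_upload)
streaming_pull.result()  # block and process events as they arrive
```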

Configurable Pipelines

Every company's data has its own format and dimensions, and building a bespoke pipeline for each one does not scale. We therefore introduced configurable pipelines: reusable building blocks that can be composed into new pipelines quickly. This accelerated pipeline creation and improved overall flexibility.
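A minimal sketch of the idea: a pipeline is just an ordered list of step names resolved against a registry of reusable building blocks. The step names, the registry, and the example column are hypothetical.

```python
# Hypothetical sketch of config-driven pipeline composition.
from typing import Callable

import pandas as pd

# Reusable building blocks, registered by name.
STEPS: dict[str, Callable[[pd.DataFrame], pd.DataFrame]] = {
    "drop_nulls": lambda df: df.dropna(),
    "dedupe": lambda df: df.drop_duplicates(),
    "lowercase_emails": lambda df: df.assign(email=df["email"].str.lower()),
}


def run_pipeline(df: pd.DataFrame, config: list[str]) -> pd.DataFrame:
    """Apply the configured steps in order."""
    for step_name in config:
        df = STEPS[step_name](df)
    return df


# A customer's pipeline is just an ordered list of step names.
customer_config = ["drop_nulls", "dedupe", "lowercase_emails"]
cleaned = run_pipeline(pd.read_csv("normalized/events.csv"), customer_config)
```

Onboarding a new customer then amounts to writing a new configuration, plus any genuinely new steps, rather than a new pipeline.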

Boosting Pipeline Performance

While Pandas excels at data analysis, it struggles with datasets that exceed available RAM, causing performance bottlenecks. To overcome this, we used DASK, whose DataFrame API mirrors Pandas'. DASK reads data in chunks and parallelizes operations across multiple cores and even distributed clusters. This boosted pipeline performance and enabled us to handle large datasets with fewer machines.
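A minimal sketch of the chunked, parallel processing pattern with DASK; the block size, column names, and Parquet output are illustrative assumptions (writing Parquet requires pyarrow or fastparquet).

```python
# Minimal sketch: process a large CSV in partitions with Dask.
import dask.dataframe as dd

# Read the (potentially ~100 GB) file in partitions instead of loading it at once.
df = dd.read_csv("normalized/events.csv", blocksize="256MB")

# The same Pandas-style operations now run per partition, in parallel.
df = df.dropna(subset=["email"]).drop_duplicates(subset=["email"])

# Write partitioned output; the computation graph executes here.
df.to_parquet("cleaned/events/", write_index=False)
```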

Technologies Used

Pandas: data manipulation and analysis
DASK: large-scale data processing
Google Cloud Storage: storing raw files and intermediary output
Google Kubernetes Engine: running the pipelines
BigQuery: storing the transformed output
PubSub: async communication layer across services
Terraform: infrastructure orchestration
