Pipeline run

c2b11bf5-0ed4-4d40-b6df-b822db58604b

Pipeline LLM cost (USD)

API 1: $0.0058 · API 2: $0.0000 · API 3: $0.0000 · Total: $0.0058

Client output enrichment

v2 Skill cluster · Nature of work · AI index · Tech stack maturity · Evidence · KRA description

role baseline loaded sources · ai_index: jd · nature_of_work: jd · tech_stack_maturity: jd

Nature of work · Data pipeline development

Build and operate real-time data pipelines and lakehouse/OLAP layers, adding observability, fault-tolerance, and SQL optimization while modernizing legacy ETL into scalable bronze→silver→gold workflows.

"Build and maintain high-throughput, real-time data pipelines using Kafka/Pulsar with Spark"

Tech stack maturity

Modern Cloud Native

The stack centers on containerization, Kubernetes, Terraform, Airflow/Dagster orchestration, Kafka, Spark/Flink, dbt, and cloud-oriented data engineering patterns, which aligns best with modern cloud-native systems.

AI index (0 = no AI use, 5 = totally AI-dependent · v2.1)

0.20 / 5

· Title match

✓ Has AI skill

· AI skill (primary)

· AI skill (secondary)

· On AI team

· Builds AI products

vocab breakdown (legacy)

Assistants (×1): —

Frameworks (×2): —

Models / concepts (×3): AI

Evidence — skills matched in JD (44)

Kafka · Pulsar · Apache Spark · Checkpointing · Replay Logic · Data Observability · Data Quality · SLA Alerts · Anomaly Detection · Data Lineage · Apache Iceberg · Polaris · Gravitino · ClickHouse · StarRocks · Bronze-Silver-Gold Data Modeling · Airflow · dbt · Dagster · SQLMesh · SQL · Trino · Apache Flink · Python · Java · +19

Skill cluster (13 dimension groups, role-scoped)

Programming Languages for Data Work

SQL · Python · Java

ETL and ELT Tooling

Apache Spark · dbt

Cloud Platforms

Distributed Systems

Container Orchestration Platforms

Kubernetes

Containerization and Image Builds

Docker

Data Pipeline Orchestration

Dagster

Data Quality and Reconciliation

Anomaly Detection

Data Serialization Standards & Protocols

Parquet

Infrastructure as Code

Terraform

Messaging and Event Streaming

Kafka

Relational Database Design

Indexing

Stream Processing Systems

Apache Flink

Cross-cutting / unaligned

Pulsar · Checkpointing · Replay Logic · Data Observability · Data Quality · SLA Alerts · Data Lineage · Apache Iceberg · Polaris · Gravitino · ClickHouse · StarRocks · Bronze-Silver-Gold Data Modeling · Airflow · SQLMesh · Trino · Iceberg REST Catalogs · Data Structures and Algorithms · Sorting · Searching · Memory Models · OLTP · OLAP · Query Execution · Storage Formats · Monitoring · Cube.js · RisingWave · Arroyo

KRA description

• Build and maintain high-throughput, real-time data pipelines using Kafka/Pulsar with Spark,
• Design fault-tolerant systems with zero-data-loss principles — checkpointing, replay logic,
• Implement data observability — quality checks, SLA alerts, anomaly detection, lineage, and
• Design and manage Iceberg-based lakehouse tables (Polaris/Gravitino catalogs, schema
• Build fast OLAP layers using ClickHouse / StarRocks.
• Model data across bronze → silver → gold layers for downstream teams.
• Migrate and modernize legacy pipelines into scalable, distributed workflows.
• Orchestrate ETL workloads using Airflow, DBT, Dagster, SQLMesh.
• Optimize SQL transformations and distributed execution across Trino/Spark.
• Ensure strict security and governance across all data layers — access control, encryption,
• Collaborate with backend, analytics, and platform teams for seamless data delivery.
• Extremely strong SQL — window functions, query planning, optimization.
• High comfort working with distributed & parallel workloads.
• Hands-on experience with some-many of these technologies : Apache Spark, Apache Flink,
• Advanced experience in Python (preferred) or Java (strong fundamentals).
• Strong understanding of Parquet, Apache Iceberg, and Iceberg REST catalogs (Polaris /
• Experience with OLAP databases — ClickHouse / StarRocks.
• Experience with semantic layers — Cube.js or similar.
• Strong experience building pipelines with Airflow, DBT, Dagster, SQLMesh.
• Solid understanding of data structures & algorithms — sorting, searching, memory models.
• Strong grasp of OLTP vs OLAP, indexing, query execution, and storage formats.
• Ability to debug distributed systems end-to-end (compute, storage, network, orchestration).
• Familiarity with cloud environments, containerization (Docker), and monitoring.
• Experience with large-scale data — high throughput, billions of rows, large parallel workloads.
• Awareness of cost optimization in compute & storage.
• Experience with emerging stream processors — Dagster, RisingWave, Arroyo.
• Kubernetes, Terraform, or cloud-native big-data stacks.
• Strong ownership — takes systems from design → build → monitor.
• Self-driven, independent, and comfortable making technical decisions.
• High attention to reliability, data accuracy, and operational excellence.
• Naturally grows into broader technical responsibility as the platform scales.

Signals

Skill data-engineer

0.29

Alias data-engineer

1.00

KRA data-engineer

0.66
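The KRA signal shown here (0.66) is consistent with the plain mean of the three sentence-level similarities that the API 1 payload lists for data-engineer under `stage3_signals.kra_match_roles` (0.7268, 0.627, 0.6256), where the role-level `score` is 0.6598. The sketch below only checks that arithmetic for this run. Whether the pipeline always averages this way is an assumption, not something the page states.

```python
from statistics import fmean

# Sentence-to-KRA similarities reported for the data-engineer role in this run.
similarities = [0.7268, 0.627, 0.6256]

# Assumption: the role-level KRA score is the unweighted mean of these values.
kra_score = round(fmean(similarities), 4)

print(kra_score)           # role-level score at four decimals
print(f"{kra_score:.2f}")  # two-decimal form, as the Signals panel displays it
```

If the averaging assumption holds, a role with a single strong sentence match and several weak ones would score lower than one with uniformly moderate matches.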

Post-classification

Centroid: updated · n=84

Alias collision log: —

New-role queue: —

New skills captured: 27

New KRA captured: —

Captured for admin review

Pulsar primary ↔ Data Engineer pending

Checkpointing primary ↔ Data Engineer pending

Replay Logic primary ↔ Data Engineer pending

Data Observability primary ↔ Data Engineer pending

Data Quality primary ↔ Data Engineer pending

SLA Alerts primary ↔ Data Engineer pending

Data Lineage primary ↔ Data Engineer pending

Apache Iceberg primary ↔ Data Engineer pending

Polaris primary ↔ Data Engineer pending

Gravitino primary ↔ Data Engineer pending

ClickHouse primary ↔ Data Engineer pending

StarRocks primary ↔ Data Engineer pending

Bronze-Silver-Gold Data Modeling primary ↔ Data Engineer pending

SQLMesh primary ↔ Data Engineer pending

Trino primary ↔ Data Engineer pending

Iceberg REST Catalogs primary ↔ Data Engineer pending

Cube.js ↔ Data Engineer pending

Data Structures and Algorithms primary ↔ Data Engineer pending

Sorting primary ↔ Data Engineer pending

Searching primary ↔ Data Engineer pending

Memory Models primary ↔ Data Engineer pending

OLTP primary ↔ Data Engineer pending

OLAP primary ↔ Data Engineer pending

Query Execution primary ↔ Data Engineer pending

Storage Formats primary ↔ Data Engineer pending

RisingWave ↔ Data Engineer pending

Arroyo ↔ Data Engineer pending

Status: extract_from_jd_done · Created: 2026-05-27T13:52:24.861248Z · Updated: 2026-05-27T13:52:28.407782Z

Flow Current 3-step pipeline

1 POST /skills/extract-from-jd

2 POST /skills/extract-details

3 POST /skills/final-role-output
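The three steps above can be sketched as an ordered list of POST requests. This is a minimal illustration only: the base URL, the `jd_text` field, and the idea of sending the same body to every step are all assumptions, since the page shows nothing beyond the three paths. The requests are built but never sent.

```python
import json
from urllib.request import Request

# Hypothetical base URL; the page only shows the three paths.
BASE_URL = "https://example.invalid"

# The three steps in the order the Flow panel lists them.
PIPELINE_STEPS = [
    "/skills/extract-from-jd",
    "/skills/extract-details",
    "/skills/final-role-output",
]

def build_requests(run_id: str, payload: dict) -> list[Request]:
    """Prepare one POST request per step without sending anything."""
    # Assumption: every step accepts a JSON body that carries the run id.
    body = json.dumps({"run_id": run_id, **payload}).encode("utf-8")
    return [
        Request(
            BASE_URL + path,
            data=body,
            method="POST",
            headers={"Content-Type": "application/json"},
        )
        for path in PIPELINE_STEPS
    ]

requests_to_send = build_requests(
    "c2b11bf5-0ed4-4d40-b6df-b822db58604b",
    {"jd_text": "..."},  # placeholder field name, not the real schema
)
for req in requests_to_send:
    print(req.get_method(), req.full_url)
```

In a real client each response would probably feed the next step; that chaining is left out here because the response shapes of steps 2 and 3 are not shown on this page.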

Role Chosen role & resolution

No chosen role stored for this run.

Job description

Experience: 5.00 + years

Salary: Confidential (based on experience)

Shift: (GMT+05:30) Asia/Kolkata (IST)

Opportunity Type: Remote

Placement Type: Full time Permanent Position

(*Note: This is a requirement for one of Uplers' clients, 1digitalstack.ai)

What do you need for this opportunity?

Must have skills required:

Python, Java, Iceberg, Kafka, Apache Beam, Apache Flink, Apache Pulsar, Spark, Trino, OLAP, ClickHouse, StarRocks

1digitalstack.ai is looking for:

Role - Senior Data Engineer

Experience - 5-7 Years

Location - Remote (India)

About 1DigitalStack.ai

1DigitalStack.ai combines AI and deep eCommerce data to help global brands grow faster on online marketplaces. Our platforms deliver advanced analytics, actionable intelligence, and media automation — enabling brands to optimize visibility, efficiency, and sales performance at scale.

We partner with India’s top consumer companies — Unilever, Marico, Coca-Cola, Tata Consumer, Dabur, and Unicharm — across 125+ marketplaces globally.

Backed by leading venture investors and powered by a 220+ member team, we’re in our $5–10M growth journey, scaling rapidly across categories and geographies to redefine how brands win on digital shelves.

🔗 Check out more at www.1digitalstack.ai

About Role

This is a high-impact, hands-on engineering role owning the core data systems that power our analytics, AI, and automation stack.

You’ll work closely with the CTO and Engineering Leads and independently manage large, high-throughput data pipelines that process millions of events.

Responsibilities:

• Build and maintain high-throughput, real-time data pipelines using Kafka/Pulsar with Spark, Flink, and distributed compute engines.
• Design fault-tolerant systems with zero-data-loss principles — checkpointing, replay logic, DLQs, deduplication, and back-pressure handling.
• Implement data observability — quality checks, SLA alerts, anomaly detection, lineage, and metadata insights.
• Design and manage Iceberg-based lakehouse tables (Polaris/Gravitino catalogs, schema evolution, compaction).
• Build fast OLAP layers using ClickHouse / StarRocks.
• Model data across bronze → silver → gold layers for downstream teams.
• Migrate and modernize legacy pipelines into scalable, distributed workflows.
• Orchestrate ETL workloads using Airflow, DBT, Dagster, SQLMesh.
• Optimize SQL transformations and distributed execution across Trino/Spark.
• Ensure strict security and governance across all data layers — access control, encryption, auditability.
• Collaborate with backend, analytics, and platform teams for seamless data delivery.

Requirements

Core Technical Skills

• Extremely strong SQL — window functions, query planning, optimization.
• High comfort working with distributed & parallel workloads.
• Hands-on experience with some or many of these technologies: Apache Spark, Apache Flink, Trino, Apache Kafka, Apache Pulsar, Apache Beam.
• Advanced experience in Python (preferred) or Java (strong fundamentals).
• Strong understanding of Parquet, Apache Iceberg, and Iceberg REST catalogs (Polaris / Gravitino).

• Experience with OLAP databases — ClickHouse / StarRocks.
• Experience with semantic layers — Cube.js or similar.
• Strong experience building pipelines with Airflow, DBT, Dagster, SQLMesh.

Foundational Strengths

• Solid understanding of data structures & algorithms — sorting, searching, memory models.
• Strong grasp of OLTP vs OLAP, indexing, query execution, and storage formats.
• Ability to debug distributed systems end-to-end (compute, storage, network, orchestration).
• Familiarity with cloud environments, containerization (Docker), and monitoring.
• Experience with large-scale data — high throughput, billions of rows, large parallel workloads.
• Awareness of cost optimization in compute & storage.

Good to Have

• Experience with emerging stream processors — Dagster, RisingWave, Arroyo.
• Kubernetes, Terraform, or cloud-native big-data stacks.

Mindset

• Strong ownership — takes systems from design → build → monitor.
• Self-driven, independent, and comfortable making technical decisions.
• High attention to reliability, data accuracy, and operational excellence.
• Naturally grows into broader technical responsibility as the platform scales.

Why 1DS is a great choice

• High-trust, no-politics culture — we value communication, ownership, and accountability
• Collaborative, ego-free team — building together is in our DNA
• Learning-first environment — mentorship, peer reviews, and exposure to real business impact
• Modern stack + autonomy — your voice shapes how we build
• VC-funded & scaling fast — 250+ strong, building from India for the world

How to apply for this opportunity?

• Step 1: Click on Apply, then register or log in on our portal.
• Step 2: Complete the screening form and upload your updated resume.
• Step 3: Increase your chances of getting shortlisted and meeting the client for the interview.

About Uplers:

Our goal is to make hiring reliable, simple, and fast. Our role will be to help all our talents find and apply for relevant contractual onsite opportunities and progress in their career. We will support any grievances or challenges you may face during the engagement.

(Note: There are many more opportunities apart from this on the portal. Depending on the assessments you clear, you can apply for them as well).

So, if you are ready for a new challenge, a great work environment, and an opportunity to take your career to the next level, don't hesitate to apply today. We are waiting for you!

Skills from this JD

Each row merges API 1 extraction, API 2 library match / v3 orchestration (dimensions + locked dims), and API 3 persistence tags.

Kafka Primary No API 2 row (run stopped after API 1 or history missing)

Pulsar Primary No API 2 row (run stopped after API 1 or history missing)

Apache Spark Primary No API 2 row (run stopped after API 1 or history missing)

Checkpointing Primary No API 2 row (run stopped after API 1 or history missing)

Replay Logic Primary No API 2 row (run stopped after API 1 or history missing)

Data Observability Primary No API 2 row (run stopped after API 1 or history missing)

Data Quality Primary No API 2 row (run stopped after API 1 or history missing)

SLA Alerts Primary No API 2 row (run stopped after API 1 or history missing)

Anomaly Detection Primary No API 2 row (run stopped after API 1 or history missing)

Data Lineage Primary No API 2 row (run stopped after API 1 or history missing)

Apache Iceberg Primary No API 2 row (run stopped after API 1 or history missing)

Polaris Primary No API 2 row (run stopped after API 1 or history missing)

Gravitino Primary No API 2 row (run stopped after API 1 or history missing)

ClickHouse Primary No API 2 row (run stopped after API 1 or history missing)

StarRocks Primary No API 2 row (run stopped after API 1 or history missing)

Bronze-Silver-Gold Data Modeling Primary No API 2 row (run stopped after API 1 or history missing)

Airflow Primary No API 2 row (run stopped after API 1 or history missing)

dbt Primary No API 2 row (run stopped after API 1 or history missing)

Dagster Primary No API 2 row (run stopped after API 1 or history missing)

SQLMesh Primary No API 2 row (run stopped after API 1 or history missing)

SQL Primary No API 2 row (run stopped after API 1 or history missing)

Trino Primary No API 2 row (run stopped after API 1 or history missing)

Apache Flink Primary No API 2 row (run stopped after API 1 or history missing)

Python Primary No API 2 row (run stopped after API 1 or history missing)

Java Primary No API 2 row (run stopped after API 1 or history missing)

Parquet Primary No API 2 row (run stopped after API 1 or history missing)

Iceberg REST Catalogs Primary No API 2 row (run stopped after API 1 or history missing)

Cube.js Secondary No API 2 row (run stopped after API 1 or history missing)

Data Structures and Algorithms Primary No API 2 row (run stopped after API 1 or history missing)

Sorting Primary No API 2 row (run stopped after API 1 or history missing)

Searching Primary No API 2 row (run stopped after API 1 or history missing)

Memory Models Primary No API 2 row (run stopped after API 1 or history missing)

OLTP Primary No API 2 row (run stopped after API 1 or history missing)

OLAP Primary No API 2 row (run stopped after API 1 or history missing)

Indexing Primary No API 2 row (run stopped after API 1 or history missing)

Query Execution Primary No API 2 row (run stopped after API 1 or history missing)

Storage Formats Primary No API 2 row (run stopped after API 1 or history missing)

Distributed Systems Primary No API 2 row (run stopped after API 1 or history missing)

Docker Primary No API 2 row (run stopped after API 1 or history missing)

Monitoring Primary No API 2 row (run stopped after API 1 or history missing)

Kubernetes Primary No API 2 row (run stopped after API 1 or history missing)

Terraform Primary No API 2 row (run stopped after API 1 or history missing)

RisingWave Secondary No API 2 row (run stopped after API 1 or history missing)

Arroyo Secondary No API 2 row (run stopped after API 1 or history missing)
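Every row above shows the same fallback text because this run's status is extract_from_jd_done, so no API 2 record exists for any skill. A minimal sketch of how such a merge could produce these rows follows. It assumes the skill name is the join key; `merge_rows` and the shape of `api2_rows` are hypothetical and not taken from the pipeline.

```python
# Message the page shows when a skill has no API 2 record.
NO_API2 = "No API 2 row (run stopped after API 1 or history missing)"

def merge_rows(
    api1_skills: list[dict], api2_rows: dict[str, str]
) -> list[tuple[str, str, str]]:
    """Join API 1 skills to API 2 rows by skill name (assumed key)."""
    merged = []
    for skill in api1_skills:
        name = skill["skill_name"]
        tier = "Primary" if skill["is_primary"] else "Secondary"
        # Fall back to the placeholder when API 2 returned nothing for this skill.
        merged.append((name, tier, api2_rows.get(name, NO_API2)))
    return merged

# Two entries copied from this run's API 1 output.
api1_skills = [
    {"is_primary": True, "skill_name": "Kafka"},
    {"is_primary": False, "skill_name": "Cube.js"},
]

# This run has no API 2 history, so the lookup is empty.
rows = merge_rows(api1_skills, {})
for row in rows:
    print(" · ".join(row))
```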

Library artifacts (this run)

No artifact rows for this run.

nano JD Parser — gpt-4.1-nano

Role: Senior Data Engineer

Company: 1DigitalStack.ai

Experience: 5-7 Years

Domain: E-commerce

Location: India (remote)

JD type: pass

Raw JSON

{
  "JD_type": "pass",
  "about_company": {
    "source_marker": {
      "first_5_words": "1DigitalStack.ai combines AI and deep",
      "last_5_words": "how brands win on digital shelves."
    },
    "text": "1DigitalStack.ai combines AI and deep eCommerce data to help global brands grow faster on online marketplaces. Our platforms deliver advanced analytics, actionable intelligence, and media automation \u2014 enabling brands to optimize visibility, efficiency, and sales performance at scale. We partner with India\u2019s top consumer companies \u2014 Unilever, Marico, Coca-Cola, Tata Consumer, Dabur, and Unicharm \u2014 across 125+ marketplaces globally. Backed by leading venture investors and powered by a 220+ member team, we\u2019re in our $5\u201310M growth journey, scaling rapidly across categories and geographies to redefine how brands win on digital shelves.",
    "word_count": 84
  },
  "certifications": [],
  "company_name": "1DigitalStack.ai",
  "ctc": null,
  "domain": {
    "primary": {
      "aliases": [
        "Online Retail",
        "Marketplaces"
      ],
      "domain": "E-commerce"
    },
    "secondary": null
  },
  "education": [],
  "experience": {
    "max": 7,
    "min": 5,
    "raw": "5-7 Years"
  },
  "job_locations": [
    {
      "aliases": [],
      "city": null,
      "country": "India",
      "state": null,
      "work_mode": "remote"
    }
  ],
  "role": "Senior Data Engineer",
  "role_aliases": [
    "Data Engineer",
    "Senior Data Engineer",
    "Big Data Engineer"
  ],
  "role_archetype": "Data",
  "roles_and_responsibilities": [
    {
      "bullet_count": 11,
      "heading": "Responsibilities",
      "heading_was_present": true,
      "source_marker": {
        "first_5_words": "\u2022 Build and maintain high-throughput,",
        "last_5_words": "and platform teams for seamless data delivery."
      },
      "text": "\u2022 Build and maintain high-throughput, real-time data pipelines using Kafka/Pulsar with Spark,\n\n\u2022 Design fault-tolerant systems with zero-data-loss principles \u2014 checkpointing, replay logic,\n\n\u2022 Implement data observability \u2014 quality checks, SLA alerts, anomaly detection, lineage, and\n\n\u2022 Design and manage Iceberg-based lakehouse tables (Polaris/Gravitino catalogs, schema\n\n\u2022 Build fast OLAP layers using ClickHouse / StarRocks.\n\u2022 Model data across bronze \u2192 silver \u2192 gold layers for downstream teams.\n\u2022 Migrate and modernize legacy pipelines into scalable, distributed workflows.\n\u2022 Orchestrate ETL workloads using Airflow, DBT, Dagster, SQLMesh.\n\u2022 Optimize SQL transformations and distributed execution across Trino/Spark.\n\u2022 Ensure strict security and governance across all data layers \u2014 access control, encryption,\n\n\u2022 Collaborate with backend, analytics, and platform teams for seamless data delivery.",
      "word_count": 134
    },
    {
      "bullet_count": 8,
      "heading": "Core Technical Skills",
      "heading_was_present": true,
      "source_marker": {
        "first_5_words": "\u2022 Extremely strong SQL \u2014 window functions,",
        "last_5_words": "with Airflow, DBT, Dagster, SQLMesh."
      },
      "text": "\u2022 Extremely strong SQL \u2014 window functions, query planning, optimization.\n\u2022 High comfort working with distributed \u0026 parallel workloads.\n\u2022 Hands-on experience with some-many of these technologies : Apache Spark, Apache Flink,\n\u2022 Advanced experience in Python (preferred) or Java (strong fundamentals).\n\u2022 Strong understanding of Parquet, Apache Iceberg, and Iceberg REST catalogs (Polaris /\n\u2022 Experience with OLAP databases \u2014 ClickHouse / StarRocks.\n\u2022 Experience with semantic layers \u2014 Cube.js or similar.\n\u2022 Strong experience building pipelines with Airflow, DBT, Dagster, SQLMesh.",
      "word_count": 104
    },
    {
      "bullet_count": 6,
      "heading": "Foundational Strengths",
      "heading_was_present": true,
      "source_marker": {
        "first_5_words": "\u2022 Solid understanding of data structures",
        "last_5_words": "in compute \u0026 storage."
      },
      "text": "\u2022 Solid understanding of data structures \u0026 algorithms \u2014 sorting, searching, memory models.\n\u2022 Strong grasp of OLTP vs OLAP, indexing, query execution, and storage formats.\n\u2022 Ability to debug distributed systems end-to-end (compute, storage, network, orchestration).\n\u2022 Familiarity with cloud environments, containerization (Docker), and monitoring.\n\u2022 Experience with large-scale data \u2014 high throughput, billions of rows, large parallel workloads.\n\u2022 Awareness of cost optimization in compute \u0026 storage.",
      "word_count": 104
    },
    {
      "bullet_count": 2,
      "heading": "Good to Have",
      "heading_was_present": true,
      "source_marker": {
        "first_5_words": "\u2022 Experience with emerging stream processors",
        "last_5_words": "or cloud-native big-data stacks."
      },
      "text": "\u2022 Experience with emerging stream processors \u2014 Dagster, RisingWave, Arroyo.\n\u2022 Kubernetes, Terraform, or cloud-native big-data stacks.",
      "word_count": 24
    },
    {
      "bullet_count": 4,
      "heading": "Mindset",
      "heading_was_present": true,
      "source_marker": {
        "first_5_words": "\u2022 Strong ownership \u2014 takes systems",
        "last_5_words": "responsibility as the platform scales."
      },
      "text": "\u2022 Strong ownership \u2014 takes systems from design \u2192 build \u2192 monitor.\n\u2022 Self-driven, independent, and comfortable making technical decisions.\n\u2022 High attention to reliability, data accuracy, and operational excellence.\n\u2022 Naturally grows into broader technical responsibility as the platform scales.",
      "word_count": 40
    }
  ],
  "urls": [
    {
      "type": "website",
      "url": "http://www.1digitalstack.ai"
    }
  ]
}
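The `experience` object in the JSON above pairs the raw string "5-7 Years" with numeric `min` and `max` fields. One way to derive such bounds is a small regular expression, sketched below. This is an illustration only, not the parser's actual logic (the page attributes parsing to gpt-4.1-nano), and `parse_experience` is a hypothetical helper.

```python
import re

def parse_experience(raw: str) -> dict:
    """Extract min/max years from strings like '5-7 Years' or '5.00 + years'."""
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", raw)]
    if not numbers:
        return {"min": None, "max": None, "raw": raw}
    low = int(min(numbers))
    # A single number (e.g. '5.00 + years') gives no upper bound.
    high = int(max(numbers)) if len(numbers) > 1 else None
    return {"min": low, "max": high, "raw": raw}

parsed = parse_experience("5-7 Years")
print(parsed)
```

The same helper leaves `max` empty for the listing header's "5.00 + years", which is one reason the two experience strings on this page cannot be treated as interchangeable.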

API 1 — extract-from-jd

{
  "final_skills": [
    {
      "is_primary": true,
      "skill_name": "Kafka"
    },
    {
      "is_primary": true,
      "skill_name": "Pulsar"
    },
    {
      "is_primary": true,
      "skill_name": "Apache Spark"
    },
    {
      "is_primary": true,
      "skill_name": "Checkpointing"
    },
    {
      "is_primary": true,
      "skill_name": "Replay Logic"
    },
    {
      "is_primary": true,
      "skill_name": "Data Observability"
    },
    {
      "is_primary": true,
      "skill_name": "Data Quality"
    },
    {
      "is_primary": true,
      "skill_name": "SLA Alerts"
    },
    {
      "is_primary": true,
      "skill_name": "Anomaly Detection"
    },
    {
      "is_primary": true,
      "skill_name": "Data Lineage"
    },
    {
      "is_primary": true,
      "skill_name": "Apache Iceberg"
    },
    {
      "is_primary": true,
      "skill_name": "Polaris"
    },
    {
      "is_primary": true,
      "skill_name": "Gravitino"
    },
    {
      "is_primary": true,
      "skill_name": "ClickHouse"
    },
    {
      "is_primary": true,
      "skill_name": "StarRocks"
    },
    {
      "is_primary": true,
      "skill_name": "Bronze-Silver-Gold Data Modeling"
    },
    {
      "is_primary": true,
      "skill_name": "Airflow"
    },
    {
      "is_primary": true,
      "skill_name": "dbt"
    },
    {
      "is_primary": true,
      "skill_name": "Dagster"
    },
    {
      "is_primary": true,
      "skill_name": "SQLMesh"
    },
    {
      "is_primary": true,
      "skill_name": "SQL"
    },
    {
      "is_primary": true,
      "skill_name": "Trino"
    },
    {
      "is_primary": true,
      "skill_name": "Apache Flink"
    },
    {
      "is_primary": true,
      "skill_name": "Python"
    },
    {
      "is_primary": true,
      "skill_name": "Java"
    },
    {
      "is_primary": true,
      "skill_name": "Parquet"
    },
    {
      "is_primary": true,
      "skill_name": "Iceberg REST Catalogs"
    },
    {
      "is_primary": false,
      "skill_name": "Cube.js"
    },
    {
      "is_primary": true,
      "skill_name": "Data Structures and Algorithms"
    },
    {
      "is_primary": true,
      "skill_name": "Sorting"
    },
    {
      "is_primary": true,
      "skill_name": "Searching"
    },
    {
      "is_primary": true,
      "skill_name": "Memory Models"
    },
    {
      "is_primary": true,
      "skill_name": "OLTP"
    },
    {
      "is_primary": true,
      "skill_name": "OLAP"
    },
    {
      "is_primary": true,
      "skill_name": "Indexing"
    },
    {
      "is_primary": true,
      "skill_name": "Query Execution"
    },
    {
      "is_primary": true,
      "skill_name": "Storage Formats"
    },
    {
      "is_primary": true,
      "skill_name": "Distributed Systems"
    },
    {
      "is_primary": true,
      "skill_name": "Docker"
    },
    {
      "is_primary": true,
      "skill_name": "Monitoring"
    },
    {
      "is_primary": true,
      "skill_name": "Kubernetes"
    },
    {
      "is_primary": true,
      "skill_name": "Terraform"
    },
    {
      "is_primary": false,
      "skill_name": "RisingWave"
    },
    {
      "is_primary": false,
      "skill_name": "Arroyo"
    }
  ],
  "jd_role": {
    "display_name": "Senior Data Engineer",
    "rationale": null,
    "role_aliases": [
      "Data Engineer",
      "Senior Data Engineer",
      "Big Data Engineer"
    ],
    "role_archetype": "Data",
    "slug": ""
  },
  "nano_parsed": {
    "JD_type": "pass",
    "about_company": {
      "source_marker": {
        "first_5_words": "1DigitalStack.ai combines AI and deep",
        "last_5_words": "how brands win on digital shelves."
      },
      "text": "1DigitalStack.ai combines AI and deep eCommerce data to help global brands grow faster on online marketplaces. Our platforms deliver advanced analytics, actionable intelligence, and media automation \u2014 enabling brands to optimize visibility, efficiency, and sales performance at scale. We partner with India\u2019s top consumer companies \u2014 Unilever, Marico, Coca-Cola, Tata Consumer, Dabur, and Unicharm \u2014 across 125+ marketplaces globally. Backed by leading venture investors and powered by a 220+ member team, we\u2019re in our $5\u201310M growth journey, scaling rapidly across categories and geographies to redefine how brands win on digital shelves.",
      "word_count": 84
    },
    "certifications": [],
    "company_name": "1DigitalStack.ai",
    "ctc": null,
    "domain": {
      "primary": {
        "aliases": [
          "Online Retail",
          "Marketplaces"
        ],
        "domain": "E-commerce"
      },
      "secondary": null
    },
    "education": [],
    "experience": {
      "max": 7,
      "min": 5,
      "raw": "5-7 Years"
    },
    "job_locations": [
      {
        "aliases": [],
        "city": null,
        "country": "India",
        "state": null,
        "work_mode": "remote"
      }
    ],
    "role": "Senior Data Engineer",
    "role_aliases": [
      "Data Engineer",
      "Senior Data Engineer",
      "Big Data Engineer"
    ],
    "role_archetype": "Data",
    "roles_and_responsibilities": [
      {
        "bullet_count": 11,
        "heading": "Responsibilities",
        "heading_was_present": true,
        "source_marker": {
          "first_5_words": "\u2022 Build and maintain high-throughput,",
          "last_5_words": "and platform teams for seamless data delivery."
        },
        "text": "\u2022 Build and maintain high-throughput, real-time data pipelines using Kafka/Pulsar with Spark,\n\n\u2022 Design fault-tolerant systems with zero-data-loss principles \u2014 checkpointing, replay logic,\n\n\u2022 Implement data observability \u2014 quality checks, SLA alerts, anomaly detection, lineage, and\n\n\u2022 Design and manage Iceberg-based lakehouse tables (Polaris/Gravitino catalogs, schema\n\n\u2022 Build fast OLAP layers using ClickHouse / StarRocks.\n\u2022 Model data across bronze \u2192 silver \u2192 gold layers for downstream teams.\n\u2022 Migrate and modernize legacy pipelines into scalable, distributed workflows.\n\u2022 Orchestrate ETL workloads using Airflow, DBT, Dagster, SQLMesh.\n\u2022 Optimize SQL transformations and distributed execution across Trino/Spark.\n\u2022 Ensure strict security and governance across all data layers \u2014 access control, encryption,\n\n\u2022 Collaborate with backend, analytics, and platform teams for seamless data delivery.",
        "word_count": 134
      },
      {
        "bullet_count": 8,
        "heading": "Core Technical Skills",
        "heading_was_present": true,
        "source_marker": {
          "first_5_words": "\u2022 Extremely strong SQL \u2014 window functions,",
          "last_5_words": "with Airflow, DBT, Dagster, SQLMesh."
        },
        "text": "\u2022 Extremely strong SQL \u2014 window functions, query planning, optimization.\n\u2022 High comfort working with distributed \u0026 parallel workloads.\n\u2022 Hands-on experience with some-many of these technologies : Apache Spark, Apache Flink,\n\u2022 Advanced experience in Python (preferred) or Java (strong fundamentals).\n\u2022 Strong understanding of Parquet, Apache Iceberg, and Iceberg REST catalogs (Polaris /\n\u2022 Experience with OLAP databases \u2014 ClickHouse / StarRocks.\n\u2022 Experience with semantic layers \u2014 Cube.js or similar.\n\u2022 Strong experience building pipelines with Airflow, DBT, Dagster, SQLMesh.",
        "word_count": 104
      },
      {
        "bullet_count": 6,
        "heading": "Foundational Strengths",
        "heading_was_present": true,
        "source_marker": {
          "first_5_words": "\u2022 Solid understanding of data structures",
          "last_5_words": "in compute \u0026 storage."
        },
        "text": "\u2022 Solid understanding of data structures \u0026 algorithms \u2014 sorting, searching, memory models.\n\u2022 Strong grasp of OLTP vs OLAP, indexing, query execution, and storage formats.\n\u2022 Ability to debug distributed systems end-to-end (compute, storage, network, orchestration).\n\u2022 Familiarity with cloud environments, containerization (Docker), and monitoring.\n\u2022 Experience with large-scale data \u2014 high throughput, billions of rows, large parallel workloads.\n\u2022 Awareness of cost optimization in compute \u0026 storage.",
        "word_count": 104
      },
      {
        "bullet_count": 2,
        "heading": "Good to Have",
        "heading_was_present": true,
        "source_marker": {
          "first_5_words": "\u2022 Experience with emerging stream processors",
          "last_5_words": "or cloud-native big-data stacks."
        },
        "text": "\u2022 Experience with emerging stream processors \u2014 Dagster, RisingWave, Arroyo.\n\u2022 Kubernetes, Terraform, or cloud-native big-data stacks.",
        "word_count": 24
      },
      {
        "bullet_count": 4,
        "heading": "Mindset",
        "heading_was_present": true,
        "source_marker": {
          "first_5_words": "\u2022 Strong ownership \u2014 takes systems",
          "last_5_words": "responsibility as the platform scales."
        },
        "text": "\u2022 Strong ownership \u2014 takes systems from design \u2192 build \u2192 monitor.\n\u2022 Self-driven, independent, and comfortable making technical decisions.\n\u2022 High attention to reliability, data accuracy, and operational excellence.\n\u2022 Naturally grows into broader technical responsibility as the platform scales.",
        "word_count": 40
      }
    ],
    "urls": [
      {
        "type": "website",
        "url": "http://www.1digitalstack.ai"
      }
    ]
  },
  "rejected": false,
  "rejection_reason": null,
  "run_id": "c2b11bf5-0ed4-4d40-b6df-b822db58604b",
  "stage3_signals": {
    "alias_found": true,
    "alias_match_roles": [
      {
        "display_name": "Data Engineer",
        "kra_matches": null,
        "matched_count": null,
        "matched_skills": null,
        "role_id": 2,
        "score": 1.0,
        "slug": "data-engineer",
        "total_count": null
      }
    ],
    "kra_match_roles": [
      {
        "display_name": "Data Engineer",
        "kra_matches": [
          {
            "kra_text": "Develops batch and real-time streaming data pipelines using Apache Spark, Apache Kafka, Apache Flink, or Airflow for data movement and processing at scale.",
            "sentence": "Build and maintain high-throughput, real-time data pipelines using Kafka/Pulsar with Spark,",
            "similarity": 0.7268
          },
          {
            "kra_text": "Works with data analysts, data scientists, and business stakeholders to define data models, ingestion schedules, and data delivery requirements.",
            "sentence": "Collaborate with backend, analytics, and platform teams for seamless data delivery.",
            "similarity": 0.627
          },
          {
            "kra_text": "Develops batch and real-time streaming data pipelines using Apache Spark, Apache Kafka, Apache Flink, or Airflow for data movement and processing at scale.",
            "sentence": "Orchestrate ETL workloads using Airflow, DBT, Dagster, SQLMesh.",
            "similarity": 0.6256
          }
        ],
        "matched_count": null,
        "matched_skills": null,
        "role_id": 2,
        "score": 0.6598,
        "slug": "data-engineer",
        "total_count": null
      },
      {
        "display_name": "Backend Developer",
        "kra_matches": [
          {
            "kra_text": "Adds structured logging, metrics, distributed tracing, and alerting to improve system observability and support production debugging.",
            "sentence": "Implement data observability \u2014 quality checks, SLA alerts, anomaly detection, lineage, and",
            "similarity": 0.5875
          },
          {
            "kra_text": "Adds structured logging, metrics, distributed tracing, and alerting to improve system observability and support production debugging.",
            "sentence": "Ability to debug distributed systems end-to-end (compute, storage, network, orchestration).",
            "similarity": 0.5049
          },
          {
            "kra_text": "Identifies and resolves backend performance bottlenecks through query optimization, indexing strategies, connection pooling, and distributed caching with Redis.",
            "sentence": "Optimize SQL transformations and distributed execution across Trino/Spark.",
            "similarity": 0.495
          }
        ],
        "matched_count": null,
        "matched_skills": null,
        "role_id": 1,
        "score": 0.5291,
        "slug": "backend-engineer",
        "total_count": null
      },
      {
        "display_name": "Cloud Architect",
        "kra_matches": [
          {
            "kra_text": "Establishes cloud governance guardrails including budget alerts, resource quotas, policy-as-code enforcement, and compliance posture management.",
            "sentence": "Ensure strict security and governance across all data layers \u2014 access control, encryption,",
            "similarity": 0.5281
          },
          {
            "kra_text": "Evaluates cloud-native managed services, serverless compute, PaaS databases, and CDN solutions for workload fit and total cost of ownership.",
            "sentence": "Awareness of cost optimization in compute \u0026 storage.",
            "similarity": 0.5227
          },
          {
            "kra_text": "Designs multi-region and multi-availability-zone cloud infrastructure architectures for high availability, fault tolerance, and horizontal scalability.",
            "sentence": "High comfort working with distributed \u0026 parallel workloads.",
            "similarity": 0.5168
          }
        ],
        "matched_count": null,
        "matched_skills": null,
        "role_id": 9,
        "score": 0.5225,
        "slug": "cloud-architect",
        "total_count": null
      },
      {
        "display_name": "DevOps Engineer",
        "kra_matches": [
          {
            "kra_text": "Provisions and manages cloud infrastructure on AWS, Azure, or GCP using Terraform or CloudFormation to enforce infrastructure-as-code standards.",
            "sentence": "Kubernetes, Terraform, or cloud-native big-data stacks.",
            "similarity": 0.5323
          },
          {
            "kra_text": "Monitors CI/CD pipeline reliability, identifies bottlenecks in delivery workflows, and improves deployment frequency, lead time, and failure recovery rate.",
            "sentence": "Implement data observability \u2014 quality checks, SLA alerts, anomaly detection, lineage, and",
            "similarity": 0.5209
          },
          {
            "kra_text": "Collaborates with development teams to improve build processes, reduce deployment friction, containerize applications, and adopt DevOps best practices.",
            "sentence": "Collaborate with backend, analytics, and platform teams for seamless data delivery.",
            "similarity": 0.5046
          }
        ],
        "matched_count": null,
        "matched_skills": null,
        "role_id": 10,
        "score": 0.5193,
        "slug": "devops-engineer",
        "total_count": null
      },
      {
        "display_name": "MLOps Engineer",
        "kra_matches": [
          {
            "kra_text": "Orchestrates model serving deployments to production using Kubernetes, MLflow Model Registry, SageMaker, or Kubeflow Serving infrastructure.",
            "sentence": "Orchestrate ETL workloads using Airflow, DBT, Dagster, SQLMesh.",
            "similarity": 0.5392
          },
          {
            "kra_text": "Sets up model monitoring dashboards, data drift detection, prediction performance tracking, and alert routing for production ML systems.",
            "sentence": "Implement data observability \u2014 quality checks, SLA alerts, anomaly detection, lineage, and",
            "similarity": 0.51
          },
          {
            "kra_text": "Orchestrates model serving deployments to production using Kubernetes, MLflow Model Registry, SageMaker, or Kubeflow Serving infrastructure.",
            "sentence": "Kubernetes, Terraform, or cloud-native big-data stacks.",
            "similarity": 0.4988
          }
        ],
        "matched_count": null,
        "matched_skills": null,
        "role_id": 16,
        "score": 0.516,
        "slug": "ml-ops-engineer",
        "total_count": null
      }
    ],
    "skill_match_roles": [
      {
        "display_name": "Data Engineer",
        "kra_matches": null,
        "matched_count": 12,
        "matched_skills": [
          "Anomaly detection",
          "Apache Flink",
          "Apache Spark",
          "Dagster",
          "Distributed Systems",
          "Flink",
          "Java",
          "Kafka",
          "Parquet",
          "Python",
          "SQL",
          "dbt"
        ],
        "role_id": 2,
        "score": 0.2927,
        "slug": "data-engineer",
        "total_count": 41
      },
      {
        "display_name": "ML Engineer",
        "kra_matches": null,
        "matched_count": 7,
        "matched_skills": [
          "Airflow",
          "Anomaly detection",
          "Dagster",
          "Distributed Systems",
          "Kubernetes",
          "Python",
          "Terraform"
        ],
        "role_id": 3,
        "score": 0.1707,
        "slug": "ml-engineer",
        "total_count": 41
      },
      {
        "display_name": "MLOps Engineer",
        "kra_matches": null,
        "matched_count": 6,
        "matched_skills": [
          "Airflow",
          "Anomaly detection",
          "Dagster",
          "Distributed Systems",
          "Kubernetes",
          "Python"
        ],
        "role_id": 16,
        "score": 0.1463,
        "slug": "ml-ops-engineer",
        "total_count": 41
      },
      {
        "display_name": "Backend Developer",
        "kra_matches": null,
        "matched_count": 6,
        "matched_skills": [
          "Distributed Systems",
          "Docker",
          "Java",
          "Kafka",
          "Python",
          "indexing"
        ],
        "role_id": 1,
        "score": 0.1463,
        "slug": "backend-engineer",
        "total_count": 41
      },
      {
        "display_name": "DevOps Engineer",
        "kra_matches": null,
        "matched_count": 5,
        "matched_skills": [
          "Distributed Systems",
          "Docker",
          "Kubernetes",
          "Monitoring",
          "Terraform"
        ],
        "role_id": 10,
        "score": 0.122,
        "slug": "devops-engineer",
        "total_count": 41
      }
    ]
  },
  "stage4_decision": {
    "alias_collision_detected": false,
    "case": "A",
    "chosen_role": {
      "display_name": "Data Engineer",
      "kra_matches": null,
      "matched_count": null,
      "matched_skills": null,
      "role_id": 2,
      "score": 1.0,
      "slug": "data-engineer",
      "total_count": null
    },
    "confidence": 1.0,
    "is_new_role": false,
    "llm2_fired": false,
    "llm2_reasoning": null,
    "matched_dimensions": [],
    "matched_kras": [],
    "matched_skills": [],
    "new_role_display_name": null,
    "new_role_slug": null,
    "queued": false,
    "reasoning": "Exact alias hit on data-engineer (1.0) \u2014 no other alias at this confidence; skill_top data-engineer 0.29 does not contradict",
    "sub_role": null
  },
  "stage5_updates": {
    "centroid_n_after": 84,
    "centroid_updated": true,
    "collision_log_id": null,
    "new_kra_attached": null,
    "new_skills_attached": [
      {
        "is_primary": true,
        "queue_id": 5334,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Pulsar",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5335,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Checkpointing",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5336,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Replay Logic",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5337,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Data Observability",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5338,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Data Quality",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5339,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "SLA Alerts",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5340,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Data Lineage",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5341,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Apache Iceberg",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5342,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Polaris",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5343,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Gravitino",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5344,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "ClickHouse",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5345,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "StarRocks",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5346,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Bronze-Silver-Gold Data Modeling",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5347,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "SQLMesh",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5348,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Trino",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5349,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Iceberg REST Catalogs",
        "status": "pending"
      },
      {
        "is_primary": false,
        "queue_id": 5350,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Cube.js",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5351,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Data Structures and Algorithms",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5352,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Sorting",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5353,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Searching",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5354,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Memory Models",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5355,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "OLTP",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5356,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "OLAP",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5357,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Query Execution",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 5358,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Storage Formats",
        "status": "pending"
      },
      {
        "is_primary": false,
        "queue_id": 5359,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "RisingWave",
        "status": "pending"
      },
      {
        "is_primary": false,
        "queue_id": 5360,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Arroyo",
        "status": "pending"
      }
    ],
    "queue_entry_id": null,
    "v3_pipeline_triggered": false,
    "v3_role_slug": null,
    "v3_run_id": null
  }
}
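
The scores in the `stage3_signals` block above appear to follow two simple rules, though the pipeline's actual scoring code is not shown on this page. Each `skill_match_roles` score equals `matched_count / total_count`, and each `kra_match_roles` score equals the mean of that role's three `similarity` values, both rounded to four decimals. The minimal Python sketch below checks that reading against the numbers in this run. The function names `skill_score` and `kra_score` are hypothetical and are not taken from the pipeline.

```python
from statistics import mean


def skill_score(matched_count: int, total_count: int) -> float:
    """Assumed rule: fraction of the role's skills found in the JD."""
    return round(matched_count / total_count, 4)


def kra_score(similarities: list[float]) -> float:
    """Assumed rule: mean similarity of the role's top KRA matches."""
    return round(mean(similarities), 4)


# (matched_count, total_count, reported score), copied from skill_match_roles.
skill_rows = {
    "data-engineer": (12, 41, 0.2927),
    "ml-engineer": (7, 41, 0.1707),
    "ml-ops-engineer": (6, 41, 0.1463),
    "backend-engineer": (6, 41, 0.1463),
    "devops-engineer": (5, 41, 0.122),
}

# (similarity values, reported score), copied from kra_match_roles.
kra_rows = {
    "data-engineer": ([0.7268, 0.627, 0.6256], 0.6598),
    "backend-engineer": ([0.5875, 0.5049, 0.495], 0.5291),
    "cloud-architect": ([0.5281, 0.5227, 0.5168], 0.5225),
    "devops-engineer": ([0.5323, 0.5209, 0.5046], 0.5193),
    "ml-ops-engineer": ([0.5392, 0.51, 0.4988], 0.516),
}

# True for a role when the assumed rule reproduces the reported score.
skill_ok = {
    slug: skill_score(matched, total) == reported
    for slug, (matched, total, reported) in skill_rows.items()
}
kra_ok = {
    slug: kra_score(sims) == reported
    for slug, (sims, reported) in kra_rows.items()
}

print(skill_ok)
print(kra_ok)
```

Under these assumptions the alias score of 1.0 is the outlier: it is an exact-match signal rather than a ratio or an average, which fits the `stage4_decision` reasoning that the alias hit decides the role and the 0.29 skill score merely does not contradict it.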

API 2 — extract-details

{}

API 3 — final-role-output

{}

LLM Calls

Every model call made for this run, in pipeline order.
