
Pipeline run

64d5bf86-3685-447f-8f2e-fbb0580362d0

Client output enrichment

v2 Skill cluster · Nature of work · AI index · Tech stack maturity · Evidence · KRA description
Nature of work
no_db_connection
Tech stack maturity
Mainstream Modern
AI index (0 = no AI use, 5 = totally AI-dependent · v2.1)
0.20 / 5
· Title match
Has AI skill
· AI skill (primary)
· AI skill (secondary)
· On AI team
· Builds AI products
vocab breakdown (legacy)
Assistants (×1):
Frameworks (×2):
Models / concepts (×3): ML
Evidence — skills matched in JD (23)
SQL, Python, Apache Spark, Airflow, Kafka, AWS, GCP, Azure, Scala, Java, Dagster, Parquet, ORC, BigQuery, Snowflake, Redshift, Delta Lake, Iceberg, Hudi, dbt, Monte Carlo, Great Expectations, Terraform
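The evidence list can be sanity-checked against the job description text. The following is a minimal sketch of one way to do that, not a description of how the pipeline itself computes evidence: the function name `skills_found_in_text` and the whole-word, case-insensitive matching rule are assumptions made for illustration.

```python
import re
from typing import Iterable, List


def skills_found_in_text(skills: Iterable[str], text: str) -> List[str]:
    """Return the skills that occur in `text` as whole words, case-insensitively."""
    found = []
    for skill in skills:
        # Lookarounds instead of \b so multi-word names such as
        # "Delta Lake" are handled the same way as single words.
        pattern = r"(?<!\w)" + re.escape(skill) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(skill)
    return found


if __name__ == "__main__":
    excerpt = (
        "Experience with Airflow / Dagster (or similar orchestration tools). "
        "Familiarity with columnar storage formats (Parquet, ORC)."
    )
    print(skills_found_in_text(["Airflow", "Dagster", "ORC", "Hudi"], excerpt))
```

Whole-word matching matters here because a plain substring test would report "ORC" for any text containing "orchestration".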
Skill cluster (0 dimension groups, role-scoped)
No dimension groups computed for this JD.
Status: extract_from_jd_done
Created: 2026-05-11T11:19:33.245981Z
Updated: 2026-05-11T11:19:33.245981Z
Flow: Current 3-step pipeline

1. POST /skills/extract-from-jd
2. POST /skills/extract-details
3. POST /skills/final-role-output
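The three steps above could be driven by a small client like the sketch below. Only the endpoint paths come from this page. The request payload keys (`job_description`, `previous`), the injected `post` callable, and the stop-on-empty-response rule are hypothetical and would need to be checked against the real API.

```python
from typing import Any, Callable, Dict, List, Tuple

# Endpoint paths are taken from the page; everything else is illustrative.
PIPELINE_STEPS: List[str] = [
    "/skills/extract-from-jd",
    "/skills/extract-details",
    "/skills/final-role-output",
]

PostFn = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def run_pipeline(post: PostFn, job_description: str) -> List[Tuple[str, Dict[str, Any]]]:
    """Call the steps in order, feeding each response into the next request.

    Stops after the first step that returns an empty body, so the caller
    can see how far the run got.
    """
    results: List[Tuple[str, Dict[str, Any]]] = []
    payload: Dict[str, Any] = {"job_description": job_description}  # assumed key
    for path in PIPELINE_STEPS:
        response = post(path, payload)
        results.append((path, response))
        if not response:
            break
        payload = {"previous": response}  # assumed chaining shape
    return results


def fake_post(path: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in transport for the sketch: only the first step returns data."""
    if path == "/skills/extract-from-jd":
        return {"final_skills": [{"skill_name": "SQL", "is_primary": True}], "run_id": None}
    return {}


if __name__ == "__main__":
    for path, response in run_pipeline(fake_post, "Job Title: Data Engineer"):
        print(path, "->", "data" if response else "empty")
```

Passing the transport in as a callable keeps the sketch runnable without a network connection; a real client would replace `fake_post` with an HTTP call.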

Role: Chosen role & resolution

No chosen role stored for this run.

Job description

Job Title: Data Engineer (Strict / High-Bar)
Role Overview

We are hiring a Data Engineer to build and scale reliable, high-performance data systems. This role requires strong ownership of data pipelines, infrastructure, and data quality. You will work on large-scale data processing, ensuring availability, consistency, and efficiency across the data platform.

Core Responsibilities
Design, build, and maintain scalable ETL/ELT pipelines for structured and unstructured data
Develop batch and real-time data processing systems
Own data modeling (star/snowflake schemas, normalization, denormalization trade-offs)
Build and optimize data warehouses and data lakes
Ensure data quality, validation, lineage, and observability
Optimize query performance and storage (partitioning, indexing, clustering)
Implement data security and governance controls
Collaborate with backend, analytics, and ML teams for data consumption
Automate workflows using orchestration tools
Troubleshoot production data issues and ensure SLA adherence
Must-Have Skills (Non-Negotiable)
Strong SQL (complex joins, window functions, query optimization)
Proficiency in Python / Scala / Java for data processing
Hands-on with Apache Spark (or equivalent distributed processing framework)
Experience with Airflow / Dagster (or similar orchestration tools)
Deep understanding of data modeling and warehousing concepts
Experience with streaming systems like Kafka
Strong knowledge of distributed systems fundamentals
Experience with cloud data platforms (AWS / GCP / Azure)
Familiarity with columnar storage formats (Parquet, ORC)
Preferred / High-Value Skills
Experience with BigQuery / Snowflake / Redshift
Knowledge of Delta Lake / Iceberg / Hudi
Exposure to dbt for transformation workflows
Experience with data observability tools (Monte Carlo, Great Expectations)
Infrastructure as Code (Terraform)
CI/CD for data pipelines
Strict Requirements
3–6+ years of hands-on Data Engineering experience
Must have built production-grade pipelines handling large datasets (GB–TB scale)
Strong debugging and performance tuning skills
Ability to write clean, testable, and maintainable code
Experience working in production environments with SLAs
Red Flags (Auto-Reject)
Only dashboard/BI experience (Tableau/Power BI without backend pipelines)
Weak SQL fundamentals
No experience with distributed systems (Spark/Kafka)
Only academic/project-level exposure without production systems
Tech Stack (Example)
Languages: Python, SQL
Processing: Spark
Orchestration: Airflow
Storage: S3 / GCS + Parquet
Warehouse: Snowflake / BigQuery
Streaming: Kafka
What Success Looks Like
Reliable pipelines with >99.9% uptime
Efficient queries with optimized cost and latency
Clean, well-documented datasets trusted by downstream teams
Minimal data incidents and fast recovery when issues occur

Skills from this JD

Each row merges API 1 extraction, API 2 library match / v3 orchestration (dimensions + locked dims), and API 3 persistence tags.

Skill | API 1 priority | API 2 library match
SQL | Primary | No API 2 row (run stopped after API 1 or history missing)
Python | Primary | No API 2 row (run stopped after API 1 or history missing)
Scala | Secondary | No API 2 row (run stopped after API 1 or history missing)
Java | Secondary | No API 2 row (run stopped after API 1 or history missing)
Apache Spark | Primary | No API 2 row (run stopped after API 1 or history missing)
Airflow | Primary | No API 2 row (run stopped after API 1 or history missing)
Dagster | Secondary | No API 2 row (run stopped after API 1 or history missing)
Kafka | Primary | No API 2 row (run stopped after API 1 or history missing)
AWS | Primary | No API 2 row (run stopped after API 1 or history missing)
GCP | Primary | No API 2 row (run stopped after API 1 or history missing)
Azure | Primary | No API 2 row (run stopped after API 1 or history missing)
Parquet | Secondary | No API 2 row (run stopped after API 1 or history missing)
ORC | Secondary | No API 2 row (run stopped after API 1 or history missing)
BigQuery | Secondary | No API 2 row (run stopped after API 1 or history missing)
Snowflake | Secondary | No API 2 row (run stopped after API 1 or history missing)
Redshift | Secondary | No API 2 row (run stopped after API 1 or history missing)
Delta Lake | Secondary | No API 2 row (run stopped after API 1 or history missing)
Iceberg | Secondary | No API 2 row (run stopped after API 1 or history missing)
Hudi | Secondary | No API 2 row (run stopped after API 1 or history missing)
dbt | Secondary | No API 2 row (run stopped after API 1 or history missing)
Monte Carlo | Secondary | No API 2 row (run stopped after API 1 or history missing)
Great Expectations | Secondary | No API 2 row (run stopped after API 1 or history missing)
Terraform | Secondary | No API 2 row (run stopped after API 1 or history missing)
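The rows above could be produced by a merge along the following lines. This is a sketch under stated assumptions: the `final_skills`, `skill_name`, and `is_primary` keys and the fallback message appear on this page, but the shape of a populated API 2 body (a `skills` mapping keyed by skill name) is hypothetical, since this run has no API 2 data to show it.

```python
from typing import Any, Dict, List

# Fallback text copied from the rows on this page.
NO_API2_NOTE = "No API 2 row (run stopped after API 1 or history missing)"


def merge_skill_rows(api1: Dict[str, Any], api2: Dict[str, Any]) -> List[Dict[str, str]]:
    """Build one display row per API 1 skill, attaching API 2 detail if present.

    Assumes (hypothetically) that API 2 maps skill names to detail records
    under a "skills" key; an empty API 2 body yields the fallback note.
    """
    details = api2.get("skills", {})
    rows = []
    for item in api1.get("final_skills", []):
        name = item["skill_name"]
        rows.append({
            "skill": name,
            "priority": "Primary" if item["is_primary"] else "Secondary",
            "api2": str(details[name]) if name in details else NO_API2_NOTE,
        })
    return rows


if __name__ == "__main__":
    api1 = {"final_skills": [
        {"is_primary": True, "skill_name": "SQL"},
        {"is_primary": False, "skill_name": "Scala"},
    ]}
    for row in merge_skill_rows(api1, {}):
        print(row["skill"], row["priority"], row["api2"])
```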

Library artifacts (this run)

No artifact rows for this run.
API 1 — extract-from-jd
{
  "final_skills": [
    {
      "is_primary": true,
      "skill_name": "SQL"
    },
    {
      "is_primary": true,
      "skill_name": "Python"
    },
    {
      "is_primary": false,
      "skill_name": "Scala"
    },
    {
      "is_primary": false,
      "skill_name": "Java"
    },
    {
      "is_primary": true,
      "skill_name": "Apache Spark"
    },
    {
      "is_primary": true,
      "skill_name": "Airflow"
    },
    {
      "is_primary": false,
      "skill_name": "Dagster"
    },
    {
      "is_primary": true,
      "skill_name": "Kafka"
    },
    {
      "is_primary": true,
      "skill_name": "AWS"
    },
    {
      "is_primary": true,
      "skill_name": "GCP"
    },
    {
      "is_primary": true,
      "skill_name": "Azure"
    },
    {
      "is_primary": false,
      "skill_name": "Parquet"
    },
    {
      "is_primary": false,
      "skill_name": "ORC"
    },
    {
      "is_primary": false,
      "skill_name": "BigQuery"
    },
    {
      "is_primary": false,
      "skill_name": "Snowflake"
    },
    {
      "is_primary": false,
      "skill_name": "Redshift"
    },
    {
      "is_primary": false,
      "skill_name": "Delta Lake"
    },
    {
      "is_primary": false,
      "skill_name": "Iceberg"
    },
    {
      "is_primary": false,
      "skill_name": "Hudi"
    },
    {
      "is_primary": false,
      "skill_name": "dbt"
    },
    {
      "is_primary": false,
      "skill_name": "Monte Carlo"
    },
    {
      "is_primary": false,
      "skill_name": "Great Expectations"
    },
    {
      "is_primary": false,
      "skill_name": "Terraform"
    }
  ],
  "run_id": null
}
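A consumer of this response might parse it into typed records before use. The sketch below assumes only the three keys visible in the payload (`final_skills`, `skill_name`, `is_primary`) plus `run_id`; the `ExtractedSkill` class and `parse_extract_response` function are hypothetical names, and the sample is a shortened copy of the payload.

```python
import json
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ExtractedSkill:
    skill_name: str
    is_primary: bool


def parse_extract_response(raw: str) -> Tuple[Optional[str], List[ExtractedSkill]]:
    """Parse an extract-from-jd body into (run_id, skills), rejecting bad rows."""
    body = json.loads(raw)
    skills = []
    for item in body.get("final_skills", []):
        name = item.get("skill_name")
        primary = item.get("is_primary")
        if not isinstance(name, str) or not isinstance(primary, bool):
            raise ValueError(f"malformed skill entry: {item!r}")
        skills.append(ExtractedSkill(name, primary))
    return body.get("run_id"), skills


# Shortened copy of the payload, for illustration only.
SAMPLE = """
{
  "final_skills": [
    {"is_primary": true, "skill_name": "SQL"},
    {"is_primary": false, "skill_name": "Scala"},
    {"is_primary": true, "skill_name": "Kafka"}
  ],
  "run_id": null
}
"""

if __name__ == "__main__":
    run_id, skills = parse_extract_response(SAMPLE)
    primary = [s.skill_name for s in skills if s.is_primary]
    secondary = [s.skill_name for s in skills if not s.is_primary]
    print(run_id, primary, secondary)
```

Applied to the full payload, the same split gives 8 primary and 15 secondary skills, matching the 23 skills listed as evidence.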
API 2 — extract-details
{}
API 3 — final-role-output
{}
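The empty API 2 and API 3 bodies are consistent with the run status extract_from_jd_done. A reader could derive the furthest completed step from the stored payloads roughly as follows. The step names come from the endpoint paths on this page; treating an empty body as "step not run" and the function name `last_completed_step` are assumptions of this sketch.

```python
from typing import Any, Dict, List, Optional

# Step names mirror the endpoint paths shown on the page.
STEP_ORDER: List[str] = ["extract-from-jd", "extract-details", "final-role-output"]


def last_completed_step(outputs: Dict[str, Dict[str, Any]]) -> Optional[str]:
    """Return the last step, in pipeline order, that has a non-empty stored output.

    Scanning stops at the first empty or missing output, because a later
    step cannot have run without its predecessor.
    """
    completed = None
    for step in STEP_ORDER:
        if not outputs.get(step):
            break
        completed = step
    return completed


if __name__ == "__main__":
    this_run = {
        "extract-from-jd": {"final_skills": [], "run_id": None},
        "extract-details": {},
        "final-role-output": {},
    }
    print(last_completed_step(this_run))
```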

LLM Calls

Every model call made for this run, in pipeline order.