Pipeline run
be82179c-6ed0-4ebe-86ad-7235eb18a2f3
Client output enrichment
v2 Skill cluster · Nature of work · AI index · Tech stack maturity · Evidence · KRA description
Nature of work
—
Tech stack maturity
Mainstream Modern
AI index (0 = no AI use, 5 = totally AI-dependent · v2.1)
0.20 / 5
· Title match
✓ Has AI skill
· AI skill (primary)
· AI skill (secondary)
· On AI team
· Builds AI products
vocab breakdown (legacy)
Assistants (×1):
—
Frameworks (×2):
—
Models / concepts (×3):
ML
Evidence — skills matched in JD (21)
SQL
Python
Apache Spark
Airflow
Kafka
AWS
GCP
Azure
Scala
Java
Dagster
Parquet
Snowflake
BigQuery
Delta Lake
Iceberg
Hudi
dbt
Monte Carlo
Great Expectations
Terraform
Skill cluster (0 dimension groups, role-scoped)
Status: extract_from_jd_done
Created: 2026-05-11T10:43:11.200993Z
Updated: 2026-05-11T10:43:11.200993Z
Flow
Current 3-step pipeline
1 POST /skills/extract-from-jd
2 POST /skills/extract-details
3 POST /skills/final-role-output
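The three steps above can be sketched as a sequential driver. This is only an illustration: the endpoint paths come from the listing, but the `send` transport hook, the `job_description` payload field, and the idea that each step consumes the previous step's output are assumptions, not the pipeline's confirmed behavior.

```python
from typing import Callable

# Endpoint paths are taken from the flow listing; everything else here
# (payload shape, chaining of responses) is assumed for illustration.
PIPELINE_STEPS = [
    "/skills/extract-from-jd",
    "/skills/extract-details",
    "/skills/final-role-output",
]


def run_pipeline(send: Callable[[str, dict], dict], job_description: str) -> list[dict]:
    """Call each step in order, stopping at the first empty response.

    `send(path, payload)` is a hypothetical hook that performs the POST
    and returns the decoded JSON body as a dict.
    """
    responses = []
    payload = {"job_description": job_description}  # hypothetical field name
    for path in PIPELINE_STEPS:
        body = send(path, payload)
        responses.append(body)
        if not body:  # an empty {} leaves nothing to pass to the next step
            break
        payload = body  # assumption: each step consumes the previous output
    return responses
```

Passing the transport in as a callable keeps the sketch independent of any particular HTTP client and makes it easy to exercise with a stub.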
Role
Chosen role & resolution
No chosen role stored for this run.
Job description
Job Title: Data Engineer (Strict / High-Bar)

Role Overview
We are hiring a Data Engineer to build and scale reliable, high-performance data systems. This role requires strong ownership of data pipelines, infrastructure, and data quality. You will work on large-scale data processing, ensuring availability, consistency, and efficiency across the data platform.

Core Responsibilities
Design, build, and maintain scalable ETL/ELT pipelines for structured and unstructured data
Develop batch and real-time data processing systems
Own data modeling (star/snowflake schemas, normalization, denormalization trade-offs)
Build and optimize data warehouses and data lakes
Ensure data quality, validation, lineage, and observability
Optimize query performance and storage (partitioning, indexing, clustering)
Implement data security and governance controls
Collaborate with backend, analytics, and ML teams for data consumption
Automate workflows using orchestration tools
Troubleshoot production data issues and ensure SLA adherence

Must-Have Skills (Non-Negotiable)
Strong SQL (complex joins, window functions, query optimization)
Proficiency in Python / Scala / Java for data processing
Hands-on with Apache Spark (or equivalent distributed processing framework)
Experience with Airflow / Dagster (or similar orchestration tools)
Deep understanding of data modeling and warehousing concepts
Experience with streaming systems like Kafka
Strong knowledge of distributed systems fundamentals
Experience with cloud data platforms (AWS / GCP / Azure)
Familiarity with columnar storage formats (Parquet, ORC)

Preferred / High-Value Skills
Experience with BigQuery / Snowflake / Redshift
Knowledge of Delta Lake / Iceberg / Hudi
Exposure to dbt for transformation workflows
Experience with data observability tools (Monte Carlo, Great Expectations)
Infrastructure as Code (Terraform)
CI/CD for data pipelines

Strict Requirements
3–6+ years of hands-on Data Engineering experience
Must have built production-grade pipelines handling large datasets (GB–TB scale)
Strong debugging and performance tuning skills
Ability to write clean, testable, and maintainable code
Experience working in production environments with SLAs

Red Flags (Auto-Reject)
Only dashboard/BI experience (Tableau/Power BI without backend pipelines)
Weak SQL fundamentals
No experience with distributed systems (Spark/Kafka)
Only academic/project-level exposure without production systems

Tech Stack (Example)
Languages: Python, SQL
Processing: Spark
Orchestration: Airflow
Storage: S3 / GCS + Parquet
Warehouse: Snowflake / BigQuery
Streaming: Kafka

What Success Looks Like
Reliable pipelines with >99.9% uptime
Efficient queries with optimized cost and latency
Clean, well-documented datasets trusted by downstream teams
Minimal data incidents and fast recovery when issues occur
Skills from this JD
Each row merges API 1 extraction, API 2 library match / v3 orchestration (dimensions + locked dims), and API 3 persistence tags.
SQL · Primary · No API 2 row (run stopped after API 1 or history missing)
Python · Primary · No API 2 row (run stopped after API 1 or history missing)
Scala · Secondary · No API 2 row (run stopped after API 1 or history missing)
Java · Secondary · No API 2 row (run stopped after API 1 or history missing)
Apache Spark · Primary · No API 2 row (run stopped after API 1 or history missing)
Airflow · Primary · No API 2 row (run stopped after API 1 or history missing)
Dagster · Secondary · No API 2 row (run stopped after API 1 or history missing)
Kafka · Primary · No API 2 row (run stopped after API 1 or history missing)
AWS · Primary · No API 2 row (run stopped after API 1 or history missing)
GCP · Primary · No API 2 row (run stopped after API 1 or history missing)
Azure · Primary · No API 2 row (run stopped after API 1 or history missing)
Parquet · Secondary · No API 2 row (run stopped after API 1 or history missing)
Snowflake · Secondary · No API 2 row (run stopped after API 1 or history missing)
BigQuery · Secondary · No API 2 row (run stopped after API 1 or history missing)
Delta Lake · Secondary · No API 2 row (run stopped after API 1 or history missing)
Iceberg · Secondary · No API 2 row (run stopped after API 1 or history missing)
Hudi · Secondary · No API 2 row (run stopped after API 1 or history missing)
dbt · Secondary · No API 2 row (run stopped after API 1 or history missing)
Monte Carlo · Secondary · No API 2 row (run stopped after API 1 or history missing)
Great Expectations · Secondary · No API 2 row (run stopped after API 1 or history missing)
Terraform · Secondary · No API 2 row (run stopped after API 1 or history missing)
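The rows above could be produced by a merge along the following lines. This is a rough sketch, not the page's actual logic: the `final_skills`, `skill_name`, and `is_primary` fields come from the API 1 payload, while the shape of the API 2 data (a mapping from skill name to a detail string) is purely hypothetical, since this run has no API 2 output to inspect.

```python
NO_API2 = "No API 2 row (run stopped after API 1 or history missing)"


def merge_rows(api1: dict, api2: dict) -> list[tuple[str, str, str]]:
    """Build one display row per API 1 skill.

    Assumes `api2` maps skill name -> detail string when present; that
    shape is hypothetical because this run carries no API 2 data.
    """
    rows = []
    for skill in api1.get("final_skills", []):
        name = skill["skill_name"]
        priority = "Primary" if skill["is_primary"] else "Secondary"
        # Fall back to the placeholder text when API 2 has nothing for the skill.
        rows.append((name, priority, api2.get(name, NO_API2)))
    return rows
```

With an empty API 2 response, every row falls back to the placeholder, which matches what this run shows.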
Library artifacts (this run)
No artifact rows for this run.
API 1 — extract-from-jd
{
  "final_skills": [
    {
      "is_primary": true,
      "skill_name": "SQL"
    },
    {
      "is_primary": true,
      "skill_name": "Python"
    },
    {
      "is_primary": false,
      "skill_name": "Scala"
    },
    {
      "is_primary": false,
      "skill_name": "Java"
    },
    {
      "is_primary": true,
      "skill_name": "Apache Spark"
    },
    {
      "is_primary": true,
      "skill_name": "Airflow"
    },
    {
      "is_primary": false,
      "skill_name": "Dagster"
    },
    {
      "is_primary": true,
      "skill_name": "Kafka"
    },
    {
      "is_primary": true,
      "skill_name": "AWS"
    },
    {
      "is_primary": true,
      "skill_name": "GCP"
    },
    {
      "is_primary": true,
      "skill_name": "Azure"
    },
    {
      "is_primary": false,
      "skill_name": "Parquet"
    },
    {
      "is_primary": false,
      "skill_name": "Snowflake"
    },
    {
      "is_primary": false,
      "skill_name": "BigQuery"
    },
    {
      "is_primary": false,
      "skill_name": "Delta Lake"
    },
    {
      "is_primary": false,
      "skill_name": "Iceberg"
    },
    {
      "is_primary": false,
      "skill_name": "Hudi"
    },
    {
      "is_primary": false,
      "skill_name": "dbt"
    },
    {
      "is_primary": false,
      "skill_name": "Monte Carlo"
    },
    {
      "is_primary": false,
      "skill_name": "Great Expectations"
    },
    {
      "is_primary": false,
      "skill_name": "Terraform"
    }
  ],
  "run_id": null
}
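A response of this shape can be split into primary and secondary skills with a few lines of standard-library Python. The sketch below uses a shortened excerpt of the payload; the function name `split_skills` is made up for illustration, and only the field names (`final_skills`, `is_primary`, `skill_name`) are taken from the response itself.

```python
import json

# Shortened excerpt of the API 1 response shown above.
SAMPLE = """
{
  "final_skills": [
    {"is_primary": true, "skill_name": "SQL"},
    {"is_primary": false, "skill_name": "Scala"},
    {"is_primary": true, "skill_name": "Apache Spark"}
  ],
  "run_id": null
}
"""


def split_skills(raw: str) -> tuple[list[str], list[str]]:
    """Return (primary, secondary) skill names, preserving response order."""
    data = json.loads(raw)
    primary, secondary = [], []
    for skill in data["final_skills"]:
        target = primary if skill["is_primary"] else secondary
        target.append(skill["skill_name"])
    return primary, secondary


primary, secondary = split_skills(SAMPLE)
print(primary)    # ['SQL', 'Apache Spark']
print(secondary)  # ['Scala']
```

Run against the full payload, the same split gives 8 primary and 13 secondary skills, 21 in total.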
API 2 — extract-details
{}
API 3 — final-role-output
{}
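Since API 2 and API 3 both returned `{}`, one way to tell how far a run progressed is to count the leading non-empty responses. This is a minimal sketch under that assumption; how the service actually derives a status such as `extract_from_jd_done` is not shown on this page, and `completed_steps` is a hypothetical helper.

```python
def completed_steps(*responses: dict) -> int:
    """Count leading non-empty responses, stopping at the first empty one."""
    done = 0
    for body in responses:
        if not body:  # {} (or None) means the step produced no output
            break
        done += 1
    return done


# For this run: API 1 returned skills, API 2 and API 3 returned {}.
print(completed_steps({"final_skills": ["..."]}, {}, {}))  # 1
```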