Pipeline run
d1284c9b-3959-4f53-b9f9-09085e1072b9
Client output enrichment
v2 Skill cluster · Nature of work · AI index · Tech stack maturity · Evidence · KRA description · vocab breakdown (legacy)
Signals
Post-classification
Captured for admin review
1 POST /skills/extract-from-jd
2 POST /skills/extract-details
3 POST /skills/final-role-output
Data Engineer
CASE A · slug: data-engineer · id: 2 · source: db
The primary skills require expertise in data processing and orchestration tools, fitting the Data Engineer role well.
Resolution: in_db (role exists in library; skill↔dim and role↔dim links saved when applicable).
Job description
Data Platform Engineer What You ll Do You will play a critical role in expanding and optimizing our data platform and reporting capabilities You will work on the development of a scalable high-impact high volume data systems powering reporting and analytics for our advertising business This is a multi-faceted role requiring expertise in backend service development streaming and batch processing and operational excellence Your responsibilities will include Design architecture and development of high-scale data pipelines and backend services for data processing and storage Work closely with product teams to understand data needs and translate them into reliable performant systems Design and implement batch and real-time processing pipelines using modern big data tools e g Spark Flink Kafka Airflow Drive data modeling best practices to ensure consistent extensible data definitions across the organization Ensure data quality correctness and completeness through robust monitoring validation and testing strategies Mentor junior engineers foster engineering excellence and help shape technical direction across the broader organization Partner with infrastructure and platform teams to ensure systems are cost-efficient observable and resilient at scale Develop and enforce data engineering security data quality standards through automation Participate in supporting platform 24X7 Be passionate about growing a team - hire and mentor engineers What to Bring Bachelor s degree in computer science or similar discipline 11 - 15 years of experience in software engineering with a strong background in data-intensive systems Deep experience with distributed data processing frameworks e g Apache Spark Beam Flink Proficiency in one or more programming languages such as Java Python or Go Strong understanding of data modelling ETL best practices and big data architecture Experience building reporting pipelines or systems that support forecasting attribution reach frequency or audience 
measurement Exposure to ML-based forecasting systems or time-series modelling Expertise in building and managing large volume stream or batch processing platform is a must Experience working with data warehousing solutions like Databricks Practical knowledge of with CI CD pipelines preferably GitHub Actions Workflows Experience with Microservice Architecture principles and implementations Practical knowledge of containerization and orchestration platforms like Docker and EKS Experience with cloud services especially AWS is highly desirable Strong interpersonal communication and presentation skills
Skills from this JD
Each row merges API 1 extraction, API 2 library match / v3 orchestration (dimensions + locked dims), and API 3 persistence tags.
Skill enrichment (orchestrator / LLM)
Apache Spark appears in many data engineering and analytics job descriptions and remains a standard big-data processing stack alongside Databricks and Hadoop ecosystems.
Apache Software Foundation · apache_2 · since 2010 (0.95)
“Spark” in JDs typically refers to Apache Spark for data processing; other common meanings are less likely in this engineering context.
Versioned 3.5
{
"apache spark 3": "3",
"apache spark 3.5": "3.5",
"spark 3": "3",
"spark 3.5": "3.5",
"spark 3.x": "3",
"spark3": "3",
"spark3.5": "3.5"
}
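The alias map above can be read as a lookup from a surface form found in a JD to a version tag. A minimal sketch of such a lookup follows, assuming case-insensitive, whitespace-normalized matching. The names `normalize`, `resolve_version`, and `SPARK_VERSION_ALIASES` are hypothetical and are not taken from the pipeline.

```python
from typing import Dict, Optional

# Alias map copied from the enrichment output above.
SPARK_VERSION_ALIASES: Dict[str, str] = {
    "apache spark 3": "3",
    "apache spark 3.5": "3.5",
    "spark 3": "3",
    "spark 3.5": "3.5",
    "spark 3.x": "3",
    "spark3": "3",
    "spark3.5": "3.5",
}


def normalize(term: str) -> str:
    """Lowercase and collapse runs of whitespace (an assumed normalization)."""
    return " ".join(term.lower().split())


def resolve_version(term: str, aliases: Dict[str, str]) -> Optional[str]:
    """Return the version tag for a surface form, or None if it is unknown."""
    # Normalize the keys too, so maps with mixed-case keys behave the same way.
    lookup = {normalize(key): value for key, value in aliases.items()}
    return lookup.get(normalize(term))
```

Under these assumptions, `resolve_version("Spark 3.5", SPARK_VERSION_ALIASES)` resolves to the `3.5` tag, and an unlisted form such as `"Spark 2"` returns `None` rather than a guessed version.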
Framework · data_processing_framework · confidence 0.93
Spark is fundamentally a distributed application framework that users build data-processing jobs inside, not a standalone tool they merely operate.
- Category: Framework
- Sub-category: data_processing_framework
- Skill nature: FRAMEWORK
- Volatility: STABLE
- Typical lifespan: EVERGREEN
- Version strategy: SEPARATE_ENTITY
Dimensions (API 2 worklist)
- React Frontend Development · Catalog dimension db id 96 · Library dimension (catalog)
- Systems Programming · Catalog dimension db id 166 · Library dimension (catalog)
Locked dimensions (v3 placement)
- Distributed Data Processing · Pipeline tentative id: Batch and streaming data processing frameworks used to transform large datasets across clusters. Spark belongs here because it is a core engine for distributed ETL, analytics, and scalable data pipelines.
- Big Data Analytics Engines · Pipeline tentative id: Large-scale analytics engines used to query, transform, and aggregate data on distributed storage. Spark fits because it is commonly used as the execution layer for big data batch analytics and interactive processing.
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| React Frontend Development (d_init_01) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
| Systems Programming (d_init_02) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
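In this run, the catalog dimensions linked to Spark carry the slugs `d_init_01` and `d_init_02`. The API 2 payload lists the same two strings as the tentative ids of the locked dimensions, while the display names differ. A hedged sketch of a check that would surface such rows for admin review is shown here; `find_name_mismatches`, `LOCKED`, and `CATALOG_BY_SLUG` are hypothetical names, and the check itself is an assumption rather than part of the pipeline.

```python
from typing import Dict, List, Tuple

# Values copied from this run: the locked dimensions for Spark and the
# catalog dimensions that the link attempts resolved to.
LOCKED: List[Dict[str, str]] = [
    {"tentative_id": "d_init_01", "name": "Distributed Data Processing"},
    {"tentative_id": "d_init_02", "name": "Big Data Analytics Engines"},
]
CATALOG_BY_SLUG: Dict[str, Dict[str, object]] = {
    "d_init_01": {"id": 96, "display_name": "React Frontend Development"},
    "d_init_02": {"id": 166, "display_name": "Systems Programming"},
}


def find_name_mismatches(
    locked: List[Dict[str, str]],
    catalog_by_slug: Dict[str, Dict[str, object]],
) -> List[Tuple[str, str, str]]:
    """List (id, locked name, catalog name) wherever a tentative id equals a
    catalog slug but the two display names disagree."""
    mismatches = []
    for dim in locked:
        hit = catalog_by_slug.get(dim["tentative_id"])
        if hit is not None and hit["display_name"] != dim["name"]:
            mismatches.append(
                (dim["tentative_id"], dim["name"], str(hit["display_name"]))
            )
    return mismatches
```

With the values above, both locked dimensions are reported, which matches the two rows in the link-attempt table.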
Skill enrichment (orchestrator / LLM)
Apache Flink appears in many data/streaming job postings and is a standard choice alongside Kafka/Spark for real-time ETL; its GitHub and vendor ecosystem remain active, indicating broad adoption.
Apache Software Foundation · apache_2 · since 2014 (0.95)
“Flink” in JDs typically refers specifically to Apache Flink (stream/batch processing), not another catalog skill with a similar name.
Versioned 1.20
{
"Apache Flink": "1.20",
"Flink 1.20": "1.20",
"Flink 1.20.x": "1.20",
"Flink 1.x": "1.20"
}
Framework · stream_processing_framework · confidence 0.90
Flink is fundamentally a structured distributed processing framework that developers build stream and batch applications on, rather than a standalone tool they merely operate.
- Category: Framework
- Sub-category: stream_processing_framework
- Skill nature: FRAMEWORK
- Volatility: STABLE
- Typical lifespan: EVERGREEN
- Version strategy: SEPARATE_ENTITY
Dimensions (API 2 worklist)
- ETL and ELT Tooling · Catalog dimension db id 24 · Library dimension (catalog) · Roles linked in library: Data Engineer
- React Frontend Development · Catalog dimension db id 96 · Library dimension (catalog)
Locked dimensions (v3 placement)
- Stream Processing Frameworks · Reuses catalog slug: Frameworks used to build and operate batch and streaming data pipelines. Flink belongs here because it is a core engine for stateful stream processing, event-time handling, and real-time ETL in data platforms.
- Distributed Stream Processing · Pipeline tentative id: Distributed engines and concepts for processing high-volume event streams with state, fault tolerance, and low latency. Flink fits here because it is widely used as a distributed runtime for continuous data pipelines and real-time analytics.
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| ETL and ELT Tooling (etl-and-elt-tooling) | ✓ | ✓ | New skill saved · Existing dimension (library) · Role↔dimension saved |
| React Frontend Development (d_init_01) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
Aliases — catalog
- Kafka (CANONICAL) primary
Context tags (catalog)
Stored enrichment (catalog DB)
- Category: Datastore
- Sub-category: Event Stream Store
- Vendor: Confluent
- License: apache_2
- Year introduced: 2011
- Confidence: 0.90
- Version strategy: NOT_APPLICABLE
Maturity reasoning: Kafka appears in many production JDs for event streaming and data pipelines, and remains a standard platform in cloud/vendor offerings (e.g., Confluent, AWS MSK), indicating broad hiring demand.
Skill profile (library / DB)
- Skill nature: PLATFORM
- Volatility: STABLE
- Typical lifespan: EVERGREEN
- Category id: 9
- Sub-category id: 47
- Extractable: True
- Also category: False
Dimensions (API 2 worklist)
- Messaging and Event Streaming · Catalog dimension db id 8 · Library dimension (catalog) · Roles linked in library: Backend Engineer, Data Engineer
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| Messaging and Event Streaming (messaging-and-event-streaming) | ✓ | ✓ | Existing dimension (library) · Role↔dimension saved |
Aliases — catalog
- Airflow (CANONICAL) primary
- airflow 2 (VERSION)
- airflow-2 (VERSION)
- airflow2 (VERSION)
- airflow2.x (VERSION)
- apache airflow 2 (VERSION)
Context tags (catalog)
Stored enrichment (catalog DB)
- Category: Tool
- Sub-category: Workflow Orchestration Tool
- Vendor: Apache Software Foundation
- License: apache_2
- Year introduced: 2014
- Confidence: 0.95
- Version strategy: SEPARATE_ENTITY
- Version tag: 2.x
Maturity reasoning: Apache Airflow appears in many data engineering job postings and is a common orchestration choice in production stacks; its GitHub activity and ecosystem remain strong, with no vendor sunset or clear replacement dominating JDs.
Skill profile (library / DB)
- Skill nature: TOOL
- Volatility: STABLE
- Typical lifespan: EVERGREEN
- Category id: 13
- Sub-category id: 130
- Extractable: True
- Also category: False
Dimensions (API 2 worklist)
- Workflow Orchestration for ML Pipelines · Catalog dimension db id 54 · Library dimension (catalog) · Roles linked in library: ML Engineer
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| Workflow Orchestration for ML Pipelines (workflow-orchestration-for-ml-pipelines) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
All API 3 persistence rows
Same grid as the skill-extractor “Persistence items” table: one row per (skill × dimension) work item.
| Skill | Tag | Dimension | Skill↔dim | Role↔dim | Outcome | Notes |
|---|---|---|---|---|---|---|
| Kafka | in_db | Messaging and Event Streaming (messaging-and-event-streaming) | ✓ | ✓ | Existing dimension (library) · Role↔dimension saved | |
| Airflow | in_db | Workflow Orchestration for ML Pipelines (workflow-orchestration-for-ml-pipelines) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Spark | in_db | React Frontend Development (d_init_01) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Spark | in_db | Systems Programming (d_init_02) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Flink | in_db | ETL and ELT Tooling (etl-and-elt-tooling) | ✓ | ✓ | New skill saved · Existing dimension (library) · Role↔dimension saved | |
| Flink | in_db | React Frontend Development (d_init_01) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
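The Role↔dim column in the grid above is consistent with a simple rule: the link is saved when the chosen role is among the roles already attached to the dimension in the library, and skipped otherwise. The sketch below encodes that reading. It is inferred from the outcomes in this run, not taken from the service code, and `role_link_outcome` and `DIMENSION_ROLES` are hypothetical names.

```python
from typing import Dict, List


def role_link_outcome(chosen_role: str, dimension_roles: List[str]) -> str:
    """Return the Role↔dim outcome text for one (skill × dimension) row."""
    if chosen_role in dimension_roles:
        return "Role↔dimension saved"
    return "Role↔dimension skipped (dimension not under chosen role)"


# Role slugs linked to each dimension in the library, as reported in this run.
DIMENSION_ROLES: Dict[str, List[str]] = {
    "messaging-and-event-streaming": ["backend-engineer", "data-engineer"],
    "workflow-orchestration-for-ml-pipelines": ["ml-engineer"],
    "etl-and-elt-tooling": ["data-engineer"],
    "d_init_01": [],
    "d_init_02": [],
}

# Outcome per dimension slug for the role chosen in this run.
OUTCOMES = {
    slug: role_link_outcome("data-engineer", roles)
    for slug, roles in DIMENSION_ROLES.items()
}
```

For the chosen role `data-engineer`, this reproduces the saved links for Kafka and for Flink's ETL dimension, and the skipped links for Airflow and for the two `d_init_*` dimensions.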
Library artifacts (this run)
| Kind | Detail | DB id |
|---|---|---|
| canonical_skill_added | Spark | 1348 |
| canonical_skill_added | Flink | 1349 |
| dimension_skill_link | Spark ↔ React Frontend Development | 96 |
| dimension_skill_link | Spark ↔ Systems Programming | 166 |
| dimension_skill_link | Flink ↔ ETL and ELT Tooling | 24 |
| dimension_skill_link | Flink ↔ React Frontend Development | 96 |
nano JD Parser — gpt-4.1-nano
{
"JD_type": "pass",
"about_company": null,
"certifications": [],
"company_name": null,
"ctc": null,
"domain": {
"primary": {
"aliases": [],
"domain": "Other"
},
"secondary": null
},
"education": [
{
"level": "Bachelor\u0027s",
"qualification": "BTECH/BE - Computer Science (or similar)",
"raw": "Bachelor s degree in computer science or similar discipline",
"requirement": "required"
}
],
"experience": {
"max": 15,
"min": 11,
"raw": "11 - 15 years of experience"
},
"job_locations": [],
"role": "Data Platform Engineer",
"role_archetype": "Data",
"roles_and_responsibilities": [
{
"bullet_count": 10,
"heading": "What You ll Do",
"heading_was_present": true,
"source_marker": {
"first_5_words": "You will play a critical",
"last_5_words": "hire and mentor engineers."
},
"text": "You will play a critical role in expanding and optimizing our data platform and reporting capabilities. You will work on the development of a scalable high-impact high volume data systems powering reporting and analytics for our advertising business. This is a multi-faceted role requiring expertise in backend service development, streaming and batch processing, and operational excellence. Your responsibilities will include:\n\nDesign architecture and development of high-scale data pipelines and backend services for data processing and storage.\nWork closely with product teams to understand data needs and translate them into reliable performant systems.\nDesign and implement batch and real-time processing pipelines using modern big data tools (e.g., Spark, Flink, Kafka, Airflow).\nDrive data modeling best practices to ensure consistent extensible data definitions across the organization.\nEnsure data quality, correctness, and completeness through robust monitoring, validation, and testing strategies.\nMentor junior engineers, foster engineering excellence, and help shape technical direction across the broader organization.\nPartner with infrastructure and platform teams to ensure systems are cost-efficient, observable, and resilient at scale.\nDevelop and enforce data engineering security, data quality standards through automation.\nParticipate in supporting platform 24X7.\nBe passionate about growing a team - hire and mentor engineers.",
"word_count": 263
}
],
"urls": []
}
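The `source_marker` object in the parser output records how the extracted section begins and ends. One plausible use is a sanity check that the extracted `text` really starts and ends with those markers. A minimal sketch under that assumption follows; `marker_matches` is a hypothetical helper and `SAMPLE_TEXT` is a shortened excerpt built from the first and last sentences of the section.

```python
from typing import Dict

# Marker copied from the parser output above.
MARKER: Dict[str, str] = {
    "first_5_words": "You will play a critical",
    "last_5_words": "hire and mentor engineers.",
}

# Shortened excerpt: first and last sentences of the extracted section only.
SAMPLE_TEXT = (
    "You will play a critical role in expanding and optimizing our data "
    "platform and reporting capabilities.\n"
    "Be passionate about growing a team - hire and mentor engineers."
)


def marker_matches(text: str, marker: Dict[str, str]) -> bool:
    """True when the text starts and ends with the recorded marker strings."""
    stripped = text.strip()
    return stripped.startswith(marker["first_5_words"]) and stripped.endswith(
        marker["last_5_words"]
    )
```

A failed check would suggest the section was truncated or pulled from the wrong span of the JD.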
API 1 — extract-from-jd
{
"final_skills": [
{
"is_primary": true,
"skill_name": "Spark"
},
{
"is_primary": true,
"skill_name": "Flink"
},
{
"is_primary": true,
"skill_name": "Kafka"
},
{
"is_primary": true,
"skill_name": "Airflow"
}
],
"jd_role": {
"display_name": "Data Platform Engineer",
"rationale": null,
"role_archetype": "Data",
"slug": ""
},
"nano_parsed": {
"JD_type": "pass",
"about_company": null,
"certifications": [],
"company_name": null,
"ctc": null,
"domain": {
"primary": {
"aliases": [],
"domain": "Other"
},
"secondary": null
},
"education": [
{
"level": "Bachelor\u0027s",
"qualification": "BTECH/BE - Computer Science (or similar)",
"raw": "Bachelor s degree in computer science or similar discipline",
"requirement": "required"
}
],
"experience": {
"max": 15,
"min": 11,
"raw": "11 - 15 years of experience"
},
"job_locations": [],
"role": "Data Platform Engineer",
"role_archetype": "Data",
"roles_and_responsibilities": [
{
"bullet_count": 10,
"heading": "What You ll Do",
"heading_was_present": true,
"source_marker": {
"first_5_words": "You will play a critical",
"last_5_words": "hire and mentor engineers."
},
"text": "You will play a critical role in expanding and optimizing our data platform and reporting capabilities. You will work on the development of a scalable high-impact high volume data systems powering reporting and analytics for our advertising business. This is a multi-faceted role requiring expertise in backend service development, streaming and batch processing, and operational excellence. Your responsibilities will include:\n\nDesign architecture and development of high-scale data pipelines and backend services for data processing and storage.\nWork closely with product teams to understand data needs and translate them into reliable performant systems.\nDesign and implement batch and real-time processing pipelines using modern big data tools (e.g., Spark, Flink, Kafka, Airflow).\nDrive data modeling best practices to ensure consistent extensible data definitions across the organization.\nEnsure data quality, correctness, and completeness through robust monitoring, validation, and testing strategies.\nMentor junior engineers, foster engineering excellence, and help shape technical direction across the broader organization.\nPartner with infrastructure and platform teams to ensure systems are cost-efficient, observable, and resilient at scale.\nDevelop and enforce data engineering security, data quality standards through automation.\nParticipate in supporting platform 24X7.\nBe passionate about growing a team - hire and mentor engineers.",
"word_count": 263
}
],
"urls": []
},
"rejected": false,
"rejection_reason": null,
"run_id": "d1284c9b-3959-4f53-b9f9-09085e1072b9",
"stage3_signals": {
"alias_match_roles": [
{
"display_name": "Data Engineer",
"matched_count": null,
"role_id": 2,
"score": 0.6087,
"slug": "data-engineer",
"total_count": null
},
{
"display_name": "ML Engineer",
"matched_count": null,
"role_id": 3,
"score": 0.3462,
"slug": "ml-engineer",
"total_count": null
},
{
"display_name": "Frontend Engineer",
"matched_count": null,
"role_id": 7,
"score": 0.3462,
"slug": "frontend-engineer",
"total_count": null
},
{
"display_name": "AR/VR Engineer",
"matched_count": null,
"role_id": 8,
"score": 0.3462,
"slug": "ar-vr-engineer",
"total_count": null
},
{
"display_name": "AI Engineer",
"matched_count": null,
"role_id": 13,
"score": 0.3462,
"slug": "ai-engineer",
"total_count": null
}
],
"kra_match_roles": [
{
"display_name": "Data Engineer",
"matched_count": null,
"role_id": 2,
"score": 0.4618,
"slug": "data-engineer",
"total_count": null
},
{
"display_name": "Android Engineer",
"matched_count": null,
"role_id": 4,
"score": 0.4519,
"slug": "android-engineer",
"total_count": null
},
{
"display_name": "Backend Engineer",
"matched_count": null,
"role_id": 1,
"score": 0.4271,
"slug": "backend-engineer",
"total_count": null
},
{
"display_name": "AR/VR Engineer",
"matched_count": null,
"role_id": 8,
"score": 0.4137,
"slug": "ar-vr-engineer",
"total_count": null
},
{
"display_name": "Cloud Architect",
"matched_count": null,
"role_id": 9,
"score": 0.4114,
"slug": "cloud-architect",
"total_count": null
}
],
"skill_match_roles": [
{
"display_name": "Backend Engineer",
"matched_count": 1,
"role_id": 1,
"score": 0.25,
"slug": "backend-engineer",
"total_count": 4
},
{
"display_name": "Data Engineer",
"matched_count": 1,
"role_id": 2,
"score": 0.25,
"slug": "data-engineer",
"total_count": 4
},
{
"display_name": "ML Engineer",
"matched_count": 1,
"role_id": 3,
"score": 0.25,
"slug": "ml-engineer",
"total_count": 4
}
],
"stage35_ran": false
},
"stage4_decision": {
"alias_collision_detected": false,
"case": "A",
"chosen_role": {
"display_name": "Data Engineer",
"matched_count": null,
"role_id": 2,
"score": 1.0,
"slug": "data-engineer",
"total_count": null
},
"confidence": 0.4618,
"llm2_fired": false,
"llm2_reasoning": null,
"queued": false,
"reasoning": "Stage 1 title \u0027Data Engineer\u0027 (embedding match, sim 0.79); KRA agrees (0.46)"
},
"stage5_updates": {
"centroid_n_after": 15,
"centroid_updated": true,
"collision_log_id": null,
"new_kra_attached": null,
"new_skills_attached": [
{
"is_primary": true,
"queue_id": 1086,
"role_display_name": "Data Engineer",
"role_slug": "data-engineer",
"skill_name": "Spark",
"status": "pending"
},
{
"is_primary": true,
"queue_id": 1087,
"role_display_name": "Data Engineer",
"role_slug": "data-engineer",
"skill_name": "Flink",
"status": "pending"
}
],
"queue_entry_id": null,
"v3_pipeline_triggered": false,
"v3_role_slug": null,
"v3_run_id": null
}
}
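Two numeric relationships in the payload above can be checked by hand. Each `skill_match_roles` score equals `matched_count / total_count` (1 of 4 gives 0.25), and the stage 4 `confidence` of 0.4618 equals the top `kra_match_roles` score, which belongs to the chosen role. The sketch below reproduces both under the assumption that this is how the values are derived. The actual scoring code is not shown in this run, and `skill_match_score` and `top_role` are hypothetical names.

```python
from typing import Dict, List, Union

Role = Dict[str, Union[str, float]]

# KRA match scores copied from stage3_signals above.
KRA_MATCH_ROLES: List[Role] = [
    {"slug": "data-engineer", "score": 0.4618},
    {"slug": "android-engineer", "score": 0.4519},
    {"slug": "backend-engineer", "score": 0.4271},
    {"slug": "ar-vr-engineer", "score": 0.4137},
    {"slug": "cloud-architect", "score": 0.4114},
]


def skill_match_score(matched_count: int, total_count: int) -> float:
    """Fraction of JD skills already linked to a role (assumed definition)."""
    return matched_count / total_count if total_count else 0.0


def top_role(candidates: List[Role]) -> Role:
    """Return the candidate with the highest score."""
    return max(candidates, key=lambda role: role["score"])
```

Here `top_role(KRA_MATCH_ROLES)` selects `data-engineer`, in line with the stage 4 reasoning string that reports the KRA signal agreeing with the title match.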
API 2 — extract-details
{
"alias_matches": [
{
"alias_persist_skipped_reason": "alias_text already exists for this canonical skill",
"alias_persisted": false,
"existing_alias_id": 173,
"existing_alias_text": "Kafka",
"input_term": "Kafka",
"matched_canonical": {
"category_id": 9,
"display_name": "Kafka",
"id": 36,
"is_also_category": false,
"is_extractable": true,
"skill_nature": "PLATFORM",
"slug": "kafka",
"sub_category_id": 47,
"typical_lifespan": "EVERGREEN",
"volatility": "STABLE"
},
"matched_via": "alias"
},
{
"alias_persist_skipped_reason": "alias_text already exists for this canonical skill",
"alias_persisted": false,
"existing_alias_id": 526,
"existing_alias_text": "Airflow",
"input_term": "Airflow",
"matched_canonical": {
"category_id": 13,
"display_name": "Airflow",
"id": 265,
"is_also_category": false,
"is_extractable": true,
"skill_nature": "TOOL",
"slug": "airflow",
"sub_category_id": 130,
"typical_lifespan": "EVERGREEN",
"volatility": "STABLE"
},
"matched_via": "alias"
}
],
"candidate_roles": [
{
"display_name": "Backend Engineer",
"id": 1,
"rationale": null,
"role_archetype": "A Backend Engineer designs, builds, and maintains the server-side logic and data handling that power applications and services. They focus on implementing reliable business functionality, integrating with other systems, and ensuring the backend is scalable, maintainable, and observable.",
"slug": "backend-engineer",
"source": "db"
},
{
"display_name": "Data Engineer",
"id": 2,
"rationale": null,
"role_archetype": null,
"slug": "data-engineer",
"source": "db"
},
{
"display_name": "ML Engineer",
"id": 3,
"rationale": null,
"role_archetype": null,
"slug": "ml-engineer",
"source": "db"
}
],
"chosen_role": {
"display_name": "Data Engineer",
"id": 2,
"rationale": "The primary skills require expertise in data processing and orchestration tools, fitting the Data Engineer role well.",
"role_archetype": null,
"slug": "data-engineer",
"source": "db"
},
"dimensions": [
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Messaging and Event Streaming",
"id": 8,
"rationale": "Transport-layer systems used to move events and decouple producers from consumers. Data engineers use these systems to ingest, buffer, and distribute event data before downstream processing.",
"slug": "messaging-and-event-streaming",
"source": "db"
},
"input_skill": "Kafka",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Backend Engineer",
"id": 1,
"rationale": null,
"role_archetype": "A Backend Engineer designs, builds, and maintains the server-side logic and data handling that power applications and services. They focus on implementing reliable business functionality, integrating with other systems, and ensuring the backend is scalable, maintainable, and observable.",
"slug": "backend-engineer",
"source": "db"
},
{
"display_name": "Data Engineer",
"id": 2,
"rationale": null,
"role_archetype": null,
"slug": "data-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Workflow Orchestration for ML Pipelines",
"id": 54,
"rationale": "Workflow engines used to coordinate training, evaluation, deployment, and retraining jobs. This cluster covers dependencies, retries, scheduling, and pipeline composition for ML lifecycle automation.",
"slug": "workflow-orchestration-for-ml-pipelines",
"source": "db"
},
"input_skill": "Airflow",
"llm_role": null,
"roles_from_db": [
{
"display_name": "ML Engineer",
"id": 3,
"rationale": null,
"role_archetype": null,
"slug": "ml-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "React Frontend Development",
"id": 96,
"rationale": "Building interactive web user interfaces with React.js, including component composition, state management, hooks, and rendering patterns. React.js belongs here because it is a core library for client-side UI development in modern web applications.",
"slug": "d_init_01",
"source": "db"
},
"input_skill": "Spark",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Systems Programming",
"id": 166,
"rationale": "Systems programming covers low-level software development where performance, memory safety, and direct control over resources matter. Rust fits here because it is commonly used for OS-adjacent services, infrastructure components, and other performance-sensitive systems code.",
"slug": "d_init_02",
"source": "db"
},
"input_skill": "Spark",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "ETL and ELT Tooling",
"id": 24,
"rationale": "Packaged tools for extracting, loading, and transforming data across systems. This dimension covers connector-based ingestion, transformation frameworks, and managed integration products.",
"slug": "etl-and-elt-tooling",
"source": "db"
},
"input_skill": "Flink",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Data Engineer",
"id": 2,
"rationale": null,
"role_archetype": null,
"slug": "data-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "React Frontend Development",
"id": 96,
"rationale": "Building interactive web user interfaces with React.js, including component composition, state management, hooks, and rendering patterns. React.js belongs here because it is a core library for client-side UI development in modern web applications.",
"slug": "d_init_01",
"source": "db"
},
"input_skill": "Flink",
"llm_role": null,
"roles_from_db": []
}
],
"input_final_skills": [
"Spark",
"Flink",
"Kafka",
"Airflow"
],
"input_llm_skills": [
"Spark",
"Flink",
"Kafka",
"Airflow"
],
"new_aliases_persisted": 0,
"run_id": "d1284c9b-3959-4f53-b9f9-09085e1072b9",
"skills_detail": [
{
"aliases_in_db": [],
"canonical": null,
"dimensions": [
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "React Frontend Development",
"id": 96,
"rationale": "Building interactive web user interfaces with React.js, including component composition, state management, hooks, and rendering patterns. React.js belongs here because it is a core library for client-side UI development in modern web applications.",
"slug": "d_init_01",
"source": "db"
},
"input_skill": "Spark",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Systems Programming",
"id": 166,
"rationale": "Systems programming covers low-level software development where performance, memory safety, and direct control over resources matter. Rust fits here because it is commonly used for OS-adjacent services, infrastructure components, and other performance-sensitive systems code.",
"slug": "d_init_02",
"source": "db"
},
"input_skill": "Spark",
"llm_role": null,
"roles_from_db": []
}
],
"input_skill": "Spark",
"matched_via": null,
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": {
"derived": {
"category": "Framework",
"skill_nature": "FRAMEWORK",
"sub_category": "data_processing_framework",
"typical_lifespan": "EVERGREEN",
"version_strategy": "SEPARATE_ENTITY",
"volatility": "STABLE"
},
"enrichment": {
"ambiguity": {
"ambiguity_flag": false,
"confused_with": [],
"reasoning": "\u201cSpark\u201d in JDs typically refers to Apache Spark for data processing; other common meanings are less likely in this engineering context."
},
"context_keywords": {
"context_keywords": [
"Hadoop",
"RDD",
"DataFrame",
"Spark SQL",
"MLlib",
"Streaming",
"PySpark",
"Cluster",
"Resilient Distributed Dataset",
"GraphX",
"Apache",
"ETL",
"Big Data",
"Scala",
"Java"
]
},
"maturity": {
"confidence": 0.95,
"maturity": "well_known",
"reasoning": "Apache Spark appears in many data engineering and analytics job descriptions and remains a standard big-data processing stack alongside Databricks and Hadoop ecosystems."
},
"skill_id": "spark",
"vendor_license": {
"confidence": 0.95,
"license": "apache_2",
"vendor": "Apache Software Foundation",
"year_introduced": 2010
},
"versioning": {
"current_version": "3.5",
"version_aliases": {
"apache spark 3": "3",
"apache spark 3.5": "3.5",
"spark 3": "3",
"spark 3.5": "3.5",
"spark 3.x": "3",
"spark3": "3",
"spark3.5": "3.5"
},
"versioned": true
}
},
"keep_log": [],
"locked_dimensions": [
{
"description": "Batch and streaming data processing frameworks used to transform large datasets across clusters. Spark belongs here because it is a core engine for distributed ETL, analytics, and scalable data pipelines.",
"exemplar_skills": [
"Spark",
"Spark SQL",
"DataFrame API",
"RDDs",
"Structured Streaming",
"PySpark",
"shuffle optimization",
"partition tuning"
],
"in_scope": "Spark, Spark SQL, DataFrame API, RDDs, Structured Streaming, cluster execution, shuffle, partitioning, joins, window functions, broadcast joins, UDFs, PySpark",
"name": "Distributed Data Processing",
"out_of_scope": "Workflow orchestration tools like Airflow, connector-first ETL products, and warehouse modeling belong to ETL and ELT Tooling; low-level JVM or Python language syntax belongs to Programming Languages and Scripting",
"overlap_flags": [
{
"reason": "Spark is often used inside ETL/ELT pipelines, but this dimension is about the processing engine rather than orchestration or packaged ingestion tools.",
"with_dim_id": "etl-and-elt-tooling",
"with_dim_name": null,
"with_role": "Data Engineer"
},
{
"reason": "Spark tuning frequently involves performance work, but the primary focus here is distributed data processing semantics and APIs.",
"with_dim_id": "performance-and-scalability-tuning",
"with_dim_name": null,
"with_role": "Backend Engineer"
}
],
"tentative_id": "d_init_01"
},
{
"description": "Large-scale analytics engines used to query, transform, and aggregate data on distributed storage. Spark fits because it is commonly used as the execution layer for big data batch analytics and interactive processing.",
"exemplar_skills": [
"Spark",
"Spark SQL",
"distributed aggregations",
"large-scale joins",
"parquet processing",
"batch analytics",
"interactive queries"
],
"in_scope": "Spark, Spark SQL, distributed aggregations, large-scale joins, parquet processing, cluster-based analytics, notebook-driven exploration, batch analytics, interactive queries",
"name": "Big Data Analytics Engines",
"out_of_scope": "Standalone BI dashboards and semantic layers belong to BI and Visualization Tools; storage systems like data lakes and warehouses belong to Cloud Storage and Data Services",
"overlap_flags": [
{
"reason": "Spark commonly reads from and writes to cloud data stores, but the engine itself is the analytics layer rather than the storage layer.",
"with_dim_id": "cloud-storage-and-data-services",
"with_dim_name": null,
"with_role": "Cloud Architect"
}
],
"tentative_id": "d_init_02"
}
],
"merge_log": [],
"placed": {
"name": "Spark",
"placement_confidence": 0.92,
"primary_dimension": "d_init_01",
"reasoning": "Deterministic JD placement: locked_dimensions has 2 dimension(s) from skill-driven dimension generation after reconciliation; primary_dimension is the first locked dim.",
"secondary_dimensions": [
"d_init_02"
],
"skill_id": "spark"
},
"relationships": {
"child_skills": [],
"parent_skills": [],
"related_to": [
"databricks",
"aws",
"azure",
"kubernetes",
"jvm",
"sqlite",
"git",
"github"
],
"requires": [],
"skill_id": "spark",
"suppress_on_match": []
},
"skill_id": "spark",
"split_log": [],
"typed": {
"alternatives_considered": [],
"confidence": 0.93,
"name": "Spark",
"reasoning": "Spark is fundamentally a distributed application framework that users build data-processing jobs inside, not a standalone tool they merely operate.",
"skill_id": "spark",
"subtype": "data_processing_framework",
"type": "Framework"
},
"warnings": []
},
"source_tag": "llm",
"was_in_llm_skills": true
},
{
"aliases_in_db": [],
"canonical": null,
"dimensions": [
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "ETL and ELT Tooling",
"id": 24,
"rationale": "Packaged tools for extracting, loading, and transforming data across systems. This dimension covers connector-based ingestion, transformation frameworks, and managed integration products.",
"slug": "etl-and-elt-tooling",
"source": "db"
},
"input_skill": "Flink",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Data Engineer",
"id": 2,
"rationale": null,
"role_archetype": null,
"slug": "data-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "React Frontend Development",
"id": 96,
"rationale": "Building interactive web user interfaces with React.js, including component composition, state management, hooks, and rendering patterns. React.js belongs here because it is a core library for client-side UI development in modern web applications.",
"slug": "d_init_01",
"source": "db"
},
"input_skill": "Flink",
"llm_role": null,
"roles_from_db": []
}
],
"input_skill": "Flink",
"matched_via": null,
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": {
"derived": {
"category": "Framework",
"skill_nature": "FRAMEWORK",
"sub_category": "stream_processing_framework",
"typical_lifespan": "EVERGREEN",
"version_strategy": "SEPARATE_ENTITY",
"volatility": "STABLE"
},
"enrichment": {
"ambiguity": {
"ambiguity_flag": false,
"confused_with": [],
"reasoning": "\u201cFlink\u201d in JDs typically refers specifically to Apache Flink (stream/batch processing), not another catalog skill with a similar name."
},
"context_keywords": {
"context_keywords": [
"Kafka",
"streaming",
"data pipeline",
"event time",
"windowing",
"stateful processing",
"checkpointing",
"Flink SQL",
"Apache Beam",
"dataflow",
"real-time analytics",
"backpressure",
"flink-connector",
"flink-ml",
"flink-runtime"
]
},
"maturity": {
"confidence": 0.84,
"maturity": "well_known",
"reasoning": "Apache Flink appears in many data/streaming job postings and is a standard choice alongside Kafka/Spark for real-time ETL; its GitHub and vendor ecosystem remain active, indicating broad adoption."
},
"skill_id": "flink",
"vendor_license": {
"confidence": 0.95,
"license": "apache_2",
"vendor": "Apache Software Foundation",
"year_introduced": 2014
},
"versioning": {
"current_version": "1.20",
"version_aliases": {
"Apache Flink": "1.20",
"Flink 1.20": "1.20",
"Flink 1.20.x": "1.20",
"Flink 1.x": "1.20"
},
"versioned": true
}
},
"keep_log": [],
"locked_dimensions": [
{
"description": "Frameworks used to build and operate batch and streaming data pipelines. Flink belongs here because it is a core engine for stateful stream processing, event-time handling, and real-time ETL in data platforms.",
"exemplar_skills": [
"Flink",
"Apache Flink",
"stream processing",
"event-time processing",
"windowing",
"checkpointing",
"watermarks",
"stateful stream processing",
"real-time ETL"
],
"in_scope": "Flink, Apache Flink, stream processing jobs, event-time processing, windowing, stateful transformations, checkpointing, watermarks, connectors, sink and source integration, real-time ETL, batch processing with Flink",
"name": "Stream Processing Frameworks",
"out_of_scope": "SQL-only transformations and warehouse modeling, which belong to analytics engineering; low-level distributed systems internals, which belong to platform architecture; orchestration of scheduled workflows, which belongs to workflow orchestration tools",
"overlap_flags": [
{
"reason": "Flink uses parallel execution and coordination concepts, but the skill is primarily about a data processing framework rather than general concurrency patterns.",
"with_dim_id": "concurrency-and-parallel-processing",
"with_dim_name": null,
"with_role": "Backend Engineer"
},
{
"reason": "Flink tuning often involves throughput, latency, and state backend optimization, which can overlap with general performance work.",
"with_dim_id": "performance-and-scalability-tuning",
"with_dim_name": null,
"with_role": "Backend Engineer"
}
],
"tentative_id": "etl-and-elt-tooling"
},
{
"description": "Distributed engines and concepts for processing high-volume event streams with state, fault tolerance, and low latency. Flink fits here because it is widely used as a distributed runtime for continuous data pipelines and real-time analytics.",
"exemplar_skills": [
"Flink",
"distributed stream processing",
"exactly-once processing",
"event-time semantics",
"backpressure",
"checkpointing",
"stateful operators",
"watermarks"
],
"in_scope": "Flink, distributed stream processing, event-driven pipelines, stateful operators, fault tolerance, exactly-once processing, event-time semantics, watermarks, backpressure, checkpointing, parallel stream execution",
"name": "Distributed Stream Processing",
"out_of_scope": "General-purpose message brokers and queues, which belong to messaging infrastructure; warehouse ELT tools, which belong to ETL and ELT tooling; application-level concurrency primitives, which belong to programming and parallel processing",
"overlap_flags": [
{
"reason": "Many teams use Flink as an ETL/ELT engine, so the boundary between pipeline tooling and stream-processing architecture can be blurred.",
"with_dim_id": "etl-and-elt-tooling",
"with_dim_name": null,
"with_role": "Data Engineer"
},
{
"reason": "Flink\u0027s execution model relies on parallelism, but this dimension focuses on distributed dataflow rather than generic concurrency techniques.",
"with_dim_id": "concurrency-and-parallel-processing",
"with_dim_name": null,
"with_role": "Backend Engineer"
}
],
"tentative_id": "d_init_01"
}
],
"merge_log": [],
"placed": {
"name": "Flink",
"placement_confidence": 0.92,
"primary_dimension": "etl-and-elt-tooling",
"reasoning": "Deterministic JD placement: locked_dimensions has 2 dimension(s) from skill-driven dimension generation after reconciliation; primary_dimension is the first locked dim.",
"secondary_dimensions": [
"d_init_01"
],
"skill_id": "flink"
},
"relationships": {
"child_skills": [],
"parent_skills": [],
"related_to": [
"databricks",
"splunk",
"nosql",
"scrum",
"mlops",
"langchain",
"kotlin"
],
"requires": [],
"skill_id": "flink",
"suppress_on_match": []
},
"skill_id": "flink",
"split_log": [],
"typed": {
"alternatives_considered": [],
"confidence": 0.9,
"name": "Flink",
"reasoning": "Flink is fundamentally a structured distributed processing framework that developers build stream and batch applications on, rather than a standalone tool they merely operate.",
"skill_id": "flink",
"subtype": "stream_processing_framework",
"type": "Framework"
},
"warnings": []
},
"source_tag": "llm",
"was_in_llm_skills": true
},
{
"aliases_in_db": [
{
"alias_text": "Kafka",
"alias_type": "CANONICAL",
"id": 173,
"is_primary": true,
"match_strategy": "CASE_INSENSITIVE"
}
],
"canonical": {
"category_id": 9,
"display_name": "Kafka",
"id": 36,
"is_also_category": false,
"is_extractable": true,
"skill_nature": "PLATFORM",
"slug": "kafka",
"sub_category_id": 47,
"typical_lifespan": "EVERGREEN",
"volatility": "STABLE"
},
"dimensions": [
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Messaging and Event Streaming",
"id": 8,
"rationale": "Transport-layer systems used to move events and decouple producers from consumers. Data engineers use these systems to ingest, buffer, and distribute event data before downstream processing.",
"slug": "messaging-and-event-streaming",
"source": "db"
},
"input_skill": "Kafka",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Backend Engineer",
"id": 1,
"rationale": null,
"role_archetype": "A Backend Engineer designs, builds, and maintains the server-side logic and data handling that power applications and services. They focus on implementing reliable business functionality, integrating with other systems, and ensuring the backend is scalable, maintainable, and observable.",
"slug": "backend-engineer",
"source": "db"
},
{
"display_name": "Data Engineer",
"id": 2,
"rationale": null,
"role_archetype": null,
"slug": "data-engineer",
"source": "db"
}
]
}
],
"input_skill": "Kafka",
"matched_via": "alias",
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": null,
"source_tag": "db",
"was_in_llm_skills": true
},
{
"aliases_in_db": [
{
"alias_text": "Airflow",
"alias_type": "CANONICAL",
"id": 526,
"is_primary": true,
"match_strategy": "CASE_INSENSITIVE"
}
],
"canonical": {
"category_id": 13,
"display_name": "Airflow",
"id": 265,
"is_also_category": false,
"is_extractable": true,
"skill_nature": "TOOL",
"slug": "airflow",
"sub_category_id": 130,
"typical_lifespan": "EVERGREEN",
"volatility": "STABLE"
},
"dimensions": [
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Workflow Orchestration for ML Pipelines",
"id": 54,
"rationale": "Workflow engines used to coordinate training, evaluation, deployment, and retraining jobs. This cluster covers dependencies, retries, scheduling, and pipeline composition for ML lifecycle automation.",
"slug": "workflow-orchestration-for-ml-pipelines",
"source": "db"
},
"input_skill": "Airflow",
"llm_role": null,
"roles_from_db": [
{
"display_name": "ML Engineer",
"id": 3,
"rationale": null,
"role_archetype": null,
"slug": "ml-engineer",
"source": "db"
}
]
}
],
"input_skill": "Airflow",
"matched_via": "alias",
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": null,
"source_tag": "db",
"was_in_llm_skills": true
}
],
"unmatched_skills": [
"Spark",
"Flink"
]
}
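The "placed" reasoning strings in the response above describe a positional rule: the first entry of locked_dimensions becomes primary_dimension and the remaining entries become secondary_dimensions. A minimal sketch of that rule follows, assuming the selection depends on list order alone; place_skill is a hypothetical helper name, not a function from the pipeline, and the input is reduced to the two fields the rule reads.

```python
from typing import Any


def place_skill(skill_id: str, locked_dimensions: list[dict[str, Any]]) -> dict[str, Any]:
    """Pick primary/secondary dimensions purely by position (hypothetical helper)."""
    if not locked_dimensions:
        raise ValueError("locked_dimensions must not be empty")
    ids = [dim["tentative_id"] for dim in locked_dimensions]
    return {
        "skill_id": skill_id,
        "primary_dimension": ids[0],       # first locked dimension
        "secondary_dimensions": ids[1:],   # everything after the first
    }


# Flink's two locked dimensions from the response, trimmed to name and tentative_id.
flink_locked = [
    {"name": "Stream Processing Frameworks", "tentative_id": "etl-and-elt-tooling"},
    {"name": "Distributed Stream Processing", "tentative_id": "d_init_01"},
]
placed = place_skill("flink", flink_locked)
```

For Flink this gives the same primary_dimension and secondary_dimensions as the "placed" object in the response. The placement_confidence value is not reproduced here because the response does not say how it is derived.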
API 3 — final-role-output
{
"chosen_role": {
"display_name": "Data Engineer",
"id": 2,
"rationale": "The primary skills require expertise in data processing and orchestration tools, fitting the Data Engineer role well.",
"role_archetype": null,
"slug": "data-engineer",
"source": "db"
},
"chosen_role_resolution": "in_db",
"final_input_skills": [
{
"skill": "Spark",
"tag": "new"
},
{
"skill": "Flink",
"tag": "new"
},
{
"skill": "Kafka",
"tag": "in_db"
},
{
"skill": "Airflow",
"tag": "in_db"
}
],
"persistence": {
"items": [
{
"chosen_role_id": 2,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Messaging and Event Streaming",
"id": 8,
"rationale": "Transport-layer systems used to move events and decouple producers from consumers. Data engineers use these systems to ingest, buffer, and distribute event data before downstream processing.",
"slug": "messaging-and-event-streaming",
"source": "db"
},
"dimension_id": 8,
"input_skill": "Kafka",
"llm_role": null,
"matched_chosen_role": true,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension saved",
"role_dimension_saved": true,
"roles_from_db": [
{
"display_name": "Backend Engineer",
"id": 1,
"rationale": null,
"role_archetype": "A Backend Engineer designs, builds, and maintains the server-side logic and data handling that power applications and services. They focus on implementing reliable business functionality, integrating with other systems, and ensuring the backend is scalable, maintainable, and observable.",
"slug": "backend-engineer",
"source": "db"
},
{
"display_name": "Data Engineer",
"id": 2,
"rationale": null,
"role_archetype": null,
"slug": "data-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 36,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 2,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Workflow Orchestration for ML Pipelines",
"id": 54,
"rationale": "Workflow engines used to coordinate training, evaluation, deployment, and retraining jobs. This cluster covers dependencies, retries, scheduling, and pipeline composition for ML lifecycle automation.",
"slug": "workflow-orchestration-for-ml-pipelines",
"source": "db"
},
"dimension_id": 54,
"input_skill": "Airflow",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [
{
"display_name": "ML Engineer",
"id": 3,
"rationale": null,
"role_archetype": null,
"slug": "ml-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 265,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 2,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "React Frontend Development",
"id": 96,
"rationale": "Building interactive web user interfaces with React.js, including component composition, state management, hooks, and rendering patterns. React.js belongs here because it is a core library for client-side UI development in modern web applications.",
"slug": "d_init_01",
"source": "db"
},
"dimension_id": 96,
"input_skill": "Spark",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "New skill saved \u00b7 Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [],
"skill_dimension_saved": true,
"skill_id": 1348,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 2,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Systems Programming",
"id": 166,
"rationale": "Systems programming covers low-level software development where performance, memory safety, and direct control over resources matter. Rust fits here because it is commonly used for OS-adjacent services, infrastructure components, and other performance-sensitive systems code.",
"slug": "d_init_02",
"source": "db"
},
"dimension_id": 166,
"input_skill": "Spark",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "New skill saved \u00b7 Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [],
"skill_dimension_saved": true,
"skill_id": 1348,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 2,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "ETL and ELT Tooling",
"id": 24,
"rationale": "Packaged tools for extracting, loading, and transforming data across systems. This dimension covers connector-based ingestion, transformation frameworks, and managed integration products.",
"slug": "etl-and-elt-tooling",
"source": "db"
},
"dimension_id": 24,
"input_skill": "Flink",
"llm_role": null,
"matched_chosen_role": true,
"outcome_line": "New skill saved \u00b7 Existing dimension (library) \u00b7 Role\u2194dimension saved",
"role_dimension_saved": true,
"roles_from_db": [
{
"display_name": "Data Engineer",
"id": 2,
"rationale": null,
"role_archetype": null,
"slug": "data-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 1349,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 2,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "React Frontend Development",
"id": 96,
"rationale": "Building interactive web user interfaces with React.js, including component composition, state management, hooks, and rendering patterns. React.js belongs here because it is a core library for client-side UI development in modern web applications.",
"slug": "d_init_01",
"source": "db"
},
"dimension_id": 96,
"input_skill": "Flink",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "New skill saved \u00b7 Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [],
"skill_dimension_saved": true,
"skill_id": 1349,
"skill_tag": "in_db",
"skipped_reason": null
}
],
"new_skills_created": 2,
"role_dimension_saved": 0,
"skill_dimension_saved": 4,
"skipped": 0
},
"planner_output": null,
"run_id": "d1284c9b-3959-4f53-b9f9-09085e1072b9"
}
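The summary counters at the end of "persistence" can be cross-checked against the "items" list. The sketch below assumes the counters are per-item tallies of the boolean flags, which the response does not state; tally_persistence is a hypothetical helper, and the items are reduced to the fields the tally reads.

```python
from typing import Any, Iterable


def tally_persistence(items: Iterable[dict[str, Any]]) -> dict[str, int]:
    """Count per-item flags (hypothetical helper, not the pipeline's own logic)."""
    counts = {"role_dimension_saved": 0, "skill_dimension_saved": 0, "skipped": 0}
    for item in items:
        counts["role_dimension_saved"] += bool(item.get("role_dimension_saved"))
        counts["skill_dimension_saved"] += bool(item.get("skill_dimension_saved"))
        # An item counts as skipped only when a skipped_reason is recorded.
        counts["skipped"] += item.get("skipped_reason") is not None
    return counts


# The six persistence items of this run, trimmed to the relevant fields.
items = [
    {"input_skill": "Kafka", "dimension_id": 8, "role_dimension_saved": True, "skill_dimension_saved": True, "skipped_reason": None},
    {"input_skill": "Airflow", "dimension_id": 54, "role_dimension_saved": False, "skill_dimension_saved": True, "skipped_reason": None},
    {"input_skill": "Spark", "dimension_id": 96, "role_dimension_saved": False, "skill_dimension_saved": True, "skipped_reason": None},
    {"input_skill": "Spark", "dimension_id": 166, "role_dimension_saved": False, "skill_dimension_saved": True, "skipped_reason": None},
    {"input_skill": "Flink", "dimension_id": 24, "role_dimension_saved": True, "skill_dimension_saved": True, "skipped_reason": None},
    {"input_skill": "Flink", "dimension_id": 96, "role_dimension_saved": False, "skill_dimension_saved": True, "skipped_reason": None},
]
recount = tally_persistence(items)
```

Under this assumption the items give role_dimension_saved = 2 and skill_dimension_saved = 6, while the summary reports 0 and 4. The counters may instead count only newly created rows, so the difference is worth checking during admin review rather than treating it as a confirmed error.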
LLM Calls
Every model call made for this run, in pipeline order.