Pipeline run

5c415ce7-e9d4-4ca3-97c6-e28132bfcdbe

Pipeline LLM cost (USD)

API 1: $0.0035 · API 2: $0.0002 · API 3: $0.0000 · Total: $0.0038

Client output enrichment

v2 Skill cluster · Nature of work · AI index · Tech stack maturity · Evidence · KRA description

role baseline loaded sources · ai_index: jd · nature_of_work: jd · tech_stack_maturity: jd

Nature of work · Data pipeline development

Operate Airflow DAGs and SQL/Spark pipelines to migrate and deprecate workflows, backfill and validate data, and keep data processing jobs reliable, performant, and version-controlled in Git.

"Use Apache Airflow to schedule, monitor, and automate data workflows."

Tech stack maturity

Mainstream Modern

Apache Airflow, Apache Spark, Git, and SQL are widely adopted, current data engineering tools that fit a mainstream modern stack rather than legacy or bleeding-edge.

AI index (0 = no AI use, 5 = totally AI-dependent · v2.1)

0.00 / 5

· Title match

· Has AI skill

· AI skill (primary)

· AI skill (secondary)

· On AI team

· Builds AI products

vocab breakdown (legacy)

Assistants (×1): —

Frameworks (×2): —

Models / concepts (×3): —

Evidence — skills matched in JD (10)

Apache Airflow · DAGs · SQL · Apache Spark · Git · Data Migration · Data Validation · Anomaly Detection · Data Governance · Data Security

Skill cluster (5 dimension groups, role-scoped)

Data Pipeline Orchestration

Apache Airflow

Data Quality and Reconciliation

Anomaly Detection

ETL and ELT Tooling

Apache Spark

Programming Languages for Data Work

SQL

Cross-cutting / unaligned

DAGs · Git · Data Migration · Data Validation · Data Governance · Data Security

KRA description

1. Workflow Deprecation
   - Plan and execute the deprecation of migrated workflows by evaluating current workflows' dependencies and consumption.
   - Utilize tools and best practices to identify, mark, and communicate deprecated workflows to stakeholders.
2. Data Migration
   - Plan and execute data migration tasks to move data between different storage systems or formats.
   - Ensure the accuracy and completeness of data during migration processes.
   - Implement strategies to accelerate the pace of data migration by backfilling, validating, and making new data assets ready for use.
3. Data Validation
   - Define and implement data validation rules to ensure data accuracy, completeness, and reliability.
   - Utilize data validation solutions and anomaly detection methods to monitor data quality.
4. Workflow Management
   - Use Apache Airflow to schedule, monitor, and automate data workflows.
   - Develop and manage DAGs (Directed Acyclic Graphs) in Airflow to orchestrate complex data processing tasks.
5. Data Processing
   - Develop and maintain data processing scripts using SQL and Apache Spark.
   - Optimize data processing for performance and efficiency.
6. Version Control
   - Use Git for version control, collaborating with the team to manage the codebase and track changes.
   - Ensure best practices in code quality and repository management.
7. Continuous Improvement
   - Keep up to date with the latest developments in data engineering and related technologies.
   - Continuously improve and refactor data pipelines, tooling, and processes to enhance performance and reliability.

- Proficient in Git for version control and collaborative development.
- Proficiency in SQL and experience with database technologies.
- Experience in data pipeline tools such as Apache Airflow.
- Strong knowledge of Apache Spark for data processing and transformation.
- Experience with data migration and validation techniques.
- Knowledge of data governance and security practices.
- Strong problem-solving skills and the ability to work independently and in a team.
- Ability to communicate with global team
- Ability to work as a team in high performing environment.

Signals

Skill data-engineer

0.43

Alias data-engineer

1.00

KRA data-engineer

0.72
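
The three signal values above look derivable from the `stage3_signals` payload of this run. The sketch below is an inference from the numbers, not a confirmed description of the pipeline: it assumes the skill score is `matched_count / total_count` and the KRA score is the plain mean of the listed `kra_matches` similarities. The function names `skill_score` and `kra_score` are hypothetical.

```python
from statistics import mean

# Values copied from this run's stage3_signals entries for data-engineer.
MATCHED_COUNT = 3          # Apache Airflow, Apache Spark, SQL
TOTAL_COUNT = 7            # total_count as reported in the payload
KRA_SIMILARITIES = [0.7516, 0.7061, 0.6936]
ALIAS_SCORE = 1.0          # exact alias hit


def skill_score(matched: int, total: int) -> float:
    """Assumed formula: share of matched skills, rounded to 4 places."""
    return round(matched / total, 4)


def kra_score(similarities: list[float]) -> float:
    """Assumed formula: mean of the listed KRA sentence similarities."""
    return round(mean(similarities), 4)


signals = {
    "Skill": skill_score(MATCHED_COUNT, TOTAL_COUNT),
    "Alias": ALIAS_SCORE,
    "KRA": kra_score(KRA_SIMILARITIES),
}

# The dashboard shows two decimals for each signal.
for name, value in signals.items():
    print(f"{name} data-engineer {value:.2f}")
```

Under these assumptions the same mean also matches the scores reported for the other KRA candidates in the payload, which is why it is used here; the real aggregation may differ.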

Post-classification

Centroid updated · n=164

Alias collision log—

New-role queue—

New skills captured: 5

New KRA captured—

Captured for admin review

DAGs primary ↔ Data Engineer pending

Data Migration primary ↔ Data Engineer pending

Data Validation primary ↔ Data Engineer pending

Data Governance ↔ Data Engineer pending

Data Security ↔ Data Engineer pending

Status: completed · Created: 2026-05-27T14:22:49.498017Z · Updated: 2026-05-27T14:24:01.260927Z · API 3 duration: 19343 ms

Flow: Current 3-step pipeline

1 POST /skills/extract-from-jd

2 POST /skills/extract-details

3 POST /skills/final-role-output

Role: Chosen role & resolution

Data Engineer

CASE A

slug: data-engineer · id: 2 · source: db

Exact alias hit on data-engineer (1.0) — no other alias at this confidence; skill_top data-engineer 0.43 does not contradict

Resolution: in_db — role exists in library; skill↔dim and role↔dim links saved when applicable.

New skills

Skill↔dim saved

Role↔dim saved

Skipped
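
The CASE A explanation above can be read as a small decision rule. The sketch below is hypothetical: `Candidate` and `is_case_a` are illustrative names, and it assumes that "does not contradict" means the top skill-matched role is either absent or has the same slug as the alias hit. The pipeline's actual rule and its other cases are not shown in this run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    slug: str
    score: float


def is_case_a(alias_hits: list[Candidate], skill_top: Optional[Candidate]) -> bool:
    """Assumed CASE A test: exactly one exact alias hit, not contradicted by skills."""
    exact = [c for c in alias_hits if c.score == 1.0]
    if len(exact) != 1:
        # No exact hit, or several aliases tie at this confidence.
        return False
    # Assumption: a contradiction is a top skill match pointing at another role.
    return skill_top is None or skill_top.slug == exact[0].slug


# Values from this run: alias data-engineer 1.0, skill_top data-engineer 0.43.
alias_hits = [Candidate("data-engineer", 1.0)]
skill_top = Candidate("data-engineer", 0.4286)
print("CASE A" if is_case_a(alias_hits, skill_top) else "other case")
```

Note that the low skill score (0.43) does not block the resolution in this reading; only a different slug would.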

Job description

JD


Data Engineer II Job Desc Deprecation Accelerator scope

We are looking for a Data Engineer who has working knowledge of building and maintaining scalable data pipelines on-premises and on the cloud. This includes understanding the input and output data sources, upstream downstream dependencies and ensuring data quality. A key aspect of this role will be focusing on the deprecation of migrated workflows and migration of workflows into new systems (if needed). The ideal candidate should be experienced with tools and technologies such as Git, Apache Airflow, Apache Spark, SQL, data migration, and data validation.

Key Responsibilities:

1. Workflow Deprecation
   - Plan and execute the deprecation of migrated workflows by evaluating current workflows' dependencies and consumption.
   - Utilize tools and best practices to identify, mark, and communicate deprecated workflows to stakeholders.
2. Data Migration
   - Plan and execute data migration tasks to move data between different storage systems or formats.
   - Ensure the accuracy and completeness of data during migration processes.
   - Implement strategies to accelerate the pace of data migration by backfilling, validating, and making new data assets ready for use.
3. Data Validation
   - Define and implement data validation rules to ensure data accuracy, completeness, and reliability.
   - Utilize data validation solutions and anomaly detection methods to monitor data quality.
4. Workflow Management
   - Use Apache Airflow to schedule, monitor, and automate data workflows.
   - Develop and manage DAGs (Directed Acyclic Graphs) in Airflow to orchestrate complex data processing tasks.
5. Data Processing
   - Develop and maintain data processing scripts using SQL and Apache Spark.
   - Optimize data processing for performance and efficiency.
6. Version Control
   - Use Git for version control, collaborating with the team to manage the codebase and track changes.
   - Ensure best practices in code quality and repository management.
7. Continuous Improvement
   - Keep up to date with the latest developments in data engineering and related technologies.
   - Continuously improve and refactor data pipelines, tooling, and processes to enhance performance and reliability.

Skills and Qualifications:

- Bachelor's degree in Computer Science, Engineering, or a related field.
- Proficient in Git for version control and collaborative development.
- Proficiency in SQL and experience with database technologies.
- Experience in data pipeline tools such as Apache Airflow.
- Strong knowledge of Apache Spark for data processing and transformation.
- Experience with data migration and validation techniques.
- Knowledge of data governance and security practices.
- Strong problem-solving skills and the ability to work independently and in a team.
- Ability to communicate with global team
- Ability to work as a team in high performing environment.

Skills from this JD

Each row merges API 1 extraction, API 2 library match / v3 orchestration (dimensions + locked dims), and API 3 persistence tags.

Apache Airflow · Primary · Library skill · API 3: existing canonical (in_db) · Existing skill (matched library)

Canonical: Apache Airflow id=110 · apache-airflow

Aliases — catalog

Apache Airflow (CANONICAL) primary

Context tags (catalog)

CeleryExecutor DAG ETL KubernetesExecutor Sensors XCom backfill catchup cron data pipelines executor hooks operators scheduler task dependencies

Stored enrichment (catalog DB)

Category: Tool
Sub-category: Workflow Orchestration Tool
Vendor: Apache Software Foundation
License: apache_2
Year introduced: 2015
Confidence: 0.98
Version strategy: NOT_APPLICABLE

Maturity reasoning: Frequently listed in data engineering JDs and widely adopted for workflow orchestration; strong GitHub activity and managed offerings from AWS/GCP/Azure signal broad market demand.

Skill profile (library / DB)

Skill nature: TOOL
Volatility: STABLE
Typical lifespan: EVERGREEN
Category id: 13
Sub-category id: 130
Extractable: True
Also category: False

Dimensions (API 2 worklist)

Data Pipeline Orchestration Catalog dimension db id 23

Library dimension (catalog)

Roles linked in library: Data Engineer

API 3 link attempts (this skill)

Dimension	Skill↔dim	Role↔dim	Outcome
Data Pipeline Orchestration data-pipeline-orchestration	✓	✓	Existing dimension (library) · Role↔dimension saved

DAGs · Primary · New / orchestrated · API 3: new canonical path (new) · New / unmatched skill (orchestrated in API 2)

Skill enrichment (orchestrator / LLM)

No Stage 7 enrichment blob on this skill (orchestrator skipped enrichment).

Derived legacy fields

Category: Data Engineering Tools
Sub-category: general
Skill nature: CONCEPT
Volatility: MEDIUM
Typical lifespan: MULTI_YEAR
Version strategy: UNVERSIONED

SQL · Primary · Library skill · API 3: existing canonical (in_db) · Existing skill (matched library)

Canonical: SQL id=101 · sql

Aliases — catalog

SQL (CANONICAL) primary

Context tags (catalog)

ACID CTE DDL DML ETL JOIN MySQL NoSQL OLAP ORM PostgreSQL SQL injection SQLite T-SQL data modeling data warehousing database normalization execution plan indexing joins normalization query optimization stored procedures subquery transaction isolation transaction management window functions

Stored enrichment (catalog DB)

Category: Language
Sub-category: Query Language
Vendor: ANSI
License: unknown
Year introduced: 1974
Confidence: 0.99
Version strategy: NOT_APPLICABLE

Maturity reasoning: SQL appears in a large share of data, backend, and analytics job descriptions and remains the default query language for PostgreSQL, MySQL, and cloud warehouses like Snowflake/BigQuery.

Skill profile (library / DB)

Skill nature: LANGUAGE
Volatility: STABLE
Typical lifespan: EVERGREEN
Category id: 6
Sub-category id: 97
Extractable: True
Also category: False

Dimensions (API 2 worklist)

Pega Programming Languages & DSLs Catalog dimension db id 267

Library dimension (catalog)

Roles linked in library: Pega Developer
Programming Languages for Data Work Catalog dimension db id 21

Library dimension (catalog)

Roles linked in library: Data Engineer

API 3 link attempts (this skill)

Dimension	Skill↔dim	Role↔dim	Outcome
Pega Programming Languages & DSLs pega-programming-languages-dsls	✓	—	Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role)
Programming Languages for Data Work programming-languages-for-data-work	✓	✓	Existing dimension (library) · Role↔dimension saved

Apache Spark · Primary · Library skill · API 3: existing canonical (in_db) · Existing skill (matched library)

Canonical: Apache Spark id=1350 · apache-spark

Aliases — catalog

Apache Spark (CANONICAL)
apache spark 3 (VERSION)
spark (VERSION)
spark 3 (VERSION)
spark 3.x (VERSION)
spark3 (VERSION)

Context tags (catalog)

Apache Kafka Cluster Manager DAGScheduler Data Lake DataFrame ETL Hadoop MLlib Machine Learning PySpark RDD Scala Spark SQL Spark Streaming SparkSession

Stored enrichment (catalog DB)

Category: Framework
Sub-category: Distributed Data Processing Framework
Vendor: Apache Software Foundation
License: apache_2
Year introduced: 2010
Confidence: 0.94
Version strategy: SEPARATE_ENTITY
Version tag: 3.x

Maturity reasoning: Apache Spark appears in many data engineering JDs and remains a standard for distributed ETL/ELT; its GitHub and vendor ecosystem activity stay strong, with Databricks and cloud platforms still promoting it.

Skill profile (library / DB)

Skill nature: FRAMEWORK
Volatility: STABLE
Typical lifespan: EVERGREEN
Category id: 5
Sub-category id: 1021
Extractable: True
Also category: False

Dimensions (API 2 worklist)

ETL and ELT Tooling Catalog dimension db id 24

Library dimension (catalog)

Roles linked in library: Data Engineer

API 3 link attempts (this skill)

Dimension	Skill↔dim	Role↔dim	Outcome
ETL and ELT Tooling etl-and-elt-tooling	✓	✓	Existing dimension (library) · Role↔dimension saved

Git · Primary · Library skill · API 3: existing canonical (in_db) · Existing skill (matched library)

Canonical: Git id=1002 · git

Aliases — catalog

Git (CANONICAL)

Context tags (catalog)

CI/CD GitHub GitLab branching checkout clone commit fork merging pull request rebase remote repository stash versioning

Stored enrichment (catalog DB)

Category: Tool
Sub-category: Version Control Tool
Vendor: Linus Torvalds
License: gpl_v2
Year introduced: 2005
Confidence: 0.99
Version strategy: NOT_APPLICABLE

Maturity reasoning: Git is a hiring-pipeline staple: it appears in the vast majority of software engineering job descriptions and is the default VCS on GitHub/GitLab/Bitbucket.

Skill profile (library / DB)

Skill nature: TOOL
Volatility: STABLE
Typical lifespan: EVERGREEN
Category id: 13
Sub-category id: 730
Extractable: True
Also category: False

Dimensions (API 2 worklist)

React Frontend Development Catalog dimension db id 96

Library dimension (catalog)

API 3 link attempts (this skill)

Dimension	Skill↔dim	Role↔dim	Outcome
React Frontend Development d_init_01	✓	—	Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role)

Data Migration · Primary · New / orchestrated · API 3: new canonical path (new) · New / unmatched skill (orchestrated in API 2)

Skill enrichment (orchestrator / LLM)

No Stage 7 enrichment blob on this skill (orchestrator skipped enrichment).

Derived legacy fields

Category: Data Engineering Tools
Sub-category: general
Skill nature: PRACTICE
Volatility: MEDIUM
Typical lifespan: MULTI_YEAR
Version strategy: UNVERSIONED

Data Validation · Primary · New / orchestrated · API 3: new canonical path (new) · New / unmatched skill (orchestrated in API 2)

Skill enrichment (orchestrator / LLM)

No Stage 7 enrichment blob on this skill (orchestrator skipped enrichment).

Derived legacy fields

Category: Data Engineering Tools
Sub-category: general
Skill nature: PRACTICE
Volatility: MEDIUM
Typical lifespan: MULTI_YEAR
Version strategy: UNVERSIONED

Anomaly Detection · Secondary · Library skill · API 3: existing canonical (in_db) · Existing skill (matched library)

Canonical: Anomaly detection id=134 · anomaly-detection

Aliases — catalog

Anomaly detection (CANONICAL) primary

Context tags (catalog)

CUSUM EWMA Mahalanobis distance autoencoder automated alerts change point detection control charts data drift density estimation false positives feature engineering isolation forest machine learning model validation monitoring novelty detection one-class SVM outlier outlier detection predictive maintenance real-time analysis root cause analysis seasonality statistical methods thresholding time series unsupervised learning z-score

Stored enrichment (catalog DB)

Category: Concept
Sub-category: Ml Monitoring Concept
Confidence: 0.90
Version strategy: NOT_APPLICABLE

Maturity reasoning: Common in ML/observability job descriptions and vendor docs (Datadog, Splunk, AWS, Azure) for fraud, monitoring, and alerting; broad market adoption across production systems.

Skill profile (library / DB)

Skill nature: CONCEPT
Volatility: STABLE
Typical lifespan: EVERGREEN
Category id: 2
Sub-category id: 1117
Extractable: True
Also category: False

Dimensions (API 2 worklist)

Data Quality and Reconciliation Catalog dimension db id 27

Library dimension (catalog)

Roles linked in library: Data Engineer
Model Monitoring and Drift Detection Catalog dimension db id 45

Library dimension (catalog)

Roles linked in library: ML Engineer, MLOps Engineer

API 3 link attempts (this skill)

Dimension	Skill↔dim	Role↔dim	Outcome
Data Quality and Reconciliation data-quality-and-reconciliation	✓	✓	Existing dimension (library) · Role↔dimension saved
Model Monitoring and Drift Detection model-monitoring-and-drift-detection	✓	—	Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role)

Data Governance · Secondary · New / orchestrated · API 3: new canonical path (new) · New / unmatched skill (orchestrated in API 2)

Skill enrichment (orchestrator / LLM)

No Stage 7 enrichment blob on this skill (orchestrator skipped enrichment).

Derived legacy fields

Category: Data Engineering Tools
Sub-category: general
Skill nature: CONCEPT
Volatility: MEDIUM
Typical lifespan: MULTI_YEAR
Version strategy: UNVERSIONED

Data Security · Secondary · New / orchestrated · API 3: new canonical path (new) · New / unmatched skill (orchestrated in API 2)

Skill enrichment (orchestrator / LLM)

No Stage 7 enrichment blob on this skill (orchestrator skipped enrichment).

Derived legacy fields

Category: Security Tools
Sub-category: general
Skill nature: CONCEPT
Volatility: MEDIUM
Typical lifespan: MULTI_YEAR
Version strategy: UNVERSIONED

All API 3 persistence rows

Same grid as the skill-extractor “Persistence items” table: one row per (skill × dimension) work item.

Skill	Tag	Dimension	Skill↔dim	Role↔dim	Outcome
Apache Airflow	in_db	Data Pipeline Orchestration data-pipeline-orchestration	✓	✓	Existing dimension (library) · Role↔dimension saved
SQL	in_db	Pega Programming Languages & DSLs pega-programming-languages-dsls	✓	—	Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role)
SQL	in_db	Programming Languages for Data Work programming-languages-for-data-work	✓	✓	Existing dimension (library) · Role↔dimension saved
Apache Spark	in_db	ETL and ELT Tooling etl-and-elt-tooling	✓	✓	Existing dimension (library) · Role↔dimension saved
Git	in_db	React Frontend Development d_init_01	✓	—	Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role)
Anomaly Detection	in_db	Data Quality and Reconciliation data-quality-and-reconciliation	✓	✓	Existing dimension (library) · Role↔dimension saved
Anomaly Detection	in_db	Model Monitoring and Drift Detection model-monitoring-and-drift-detection	✓	—	Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role)
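
The Role↔dim column in the grid above follows a pattern that can be reproduced from the "Roles linked in library" panels. This is a minimal sketch, assuming the only condition is whether the chosen role is linked to the dimension; `DIMENSION_ROLES` and `role_dim_outcome` are illustrative names, not the pipeline's own.

```python
CHOSEN_ROLE = "Data Engineer"

# Roles linked to each dimension, as listed in this run's catalog panels.
DIMENSION_ROLES = {
    "Data Pipeline Orchestration": ["Data Engineer"],
    "Pega Programming Languages & DSLs": ["Pega Developer"],
    "Programming Languages for Data Work": ["Data Engineer"],
    "ETL and ELT Tooling": ["Data Engineer"],
    "React Frontend Development": [],  # no linked roles shown in this run
    "Data Quality and Reconciliation": ["Data Engineer"],
    "Model Monitoring and Drift Detection": ["ML Engineer", "MLOps Engineer"],
}

# One (skill, dimension) work item per persistence row.
ROWS = [
    ("Apache Airflow", "Data Pipeline Orchestration"),
    ("SQL", "Pega Programming Languages & DSLs"),
    ("SQL", "Programming Languages for Data Work"),
    ("Apache Spark", "ETL and ELT Tooling"),
    ("Git", "React Frontend Development"),
    ("Anomaly Detection", "Data Quality and Reconciliation"),
    ("Anomaly Detection", "Model Monitoring and Drift Detection"),
]


def role_dim_outcome(dimension: str, chosen_role: str) -> str:
    """Assumed rule: save the role link only if the dimension sits under the role."""
    if chosen_role in DIMENSION_ROLES.get(dimension, []):
        return "Role↔dimension saved"
    return "Role↔dimension skipped (dimension not under chosen role)"


outcomes = [(skill, dim, role_dim_outcome(dim, CHOSEN_ROLE)) for skill, dim in ROWS]
for skill, dim, outcome in outcomes:
    print(f"{skill}\t{dim}\t{outcome}")
```

In every row the Skill↔dim link is saved regardless, so only the role link depends on the chosen role.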

Library artifacts (this run)

Kind	Detail	DB id
canonical_skill_proposed	DAGs | type=Data Engineering Tools subtype=general nature=CONCEPT lifespan=MULTI_YEAR
canonical_skill_proposed	Data Migration | type=Data Engineering Tools subtype=general nature=PRACTICE lifespan=MULTI_YEAR
canonical_skill_proposed	Data Validation | type=Data Engineering Tools subtype=general nature=PRACTICE lifespan=MULTI_YEAR
canonical_skill_proposed	Data Governance | type=Data Engineering Tools subtype=general nature=CONCEPT lifespan=MULTI_YEAR
canonical_skill_proposed	Data Security | type=Security Tools subtype=general nature=CONCEPT lifespan=MULTI_YEAR

nano JD Parser · gpt-4.1-nano

Role: Data Engineer II

Domain: Other

JD type: pass

Raw JSON

{
  "JD_type": "pass",
  "about_company": null,
  "certifications": [],
  "company_name": null,
  "ctc": null,
  "domain": {
    "primary": {
      "aliases": [],
      "domain": "Other"
    },
    "secondary": null
  },
  "education": [
    {
      "level": "Bachelor\u0027s",
      "qualification": "BTECH/BE - Computer Science (or related)",
      "raw": "Bachelor\u0027s degree in Computer Science, Engineering, or a related field.",
      "requirement": "required"
    }
  ],
  "experience": null,
  "job_locations": [],
  "role": "Data Engineer II",
  "role_aliases": [
    "Data Engineer",
    "Data Engineer II",
    "Data Pipeline Engineer"
  ],
  "role_archetype": "Data",
  "roles_and_responsibilities": [
    {
      "bullet_count": 7,
      "heading": "Key Responsibilities",
      "heading_was_present": true,
      "source_marker": {
        "first_5_words": "1. Workflow Deprecation",
        "last_5_words": "and reliability."
      },
      "text": "1. Workflow Deprecation\n\n\nPlan and execute the deprecation of migrated workflows by evaluating current workflows\u0027 dependencies and consumption.\nUtilize tools and best practices to identify, mark, and communicate deprecated workflows to stakeholders.\n\n\n2. Data Migration\n\n\nPlan and execute data migration tasks to move data between different storage systems or formats.\nEnsure the accuracy and completeness of data during migration processes.\nImplement strategies to accelerate the pace of data migration by backfilling, validating, and making new data assets ready for use.\n\n\n3. Data Validation\n\n\nDefine and implement data validation rules to ensure data accuracy, completeness, and reliability.\nUtilize data validation solutions and anomaly detection methods to monitor data quality.\n\n\n4. Workflow Management\n\n\nUse Apache Airflow to schedule, monitor, and automate data workflows.\nDevelop and manage DAGs (Directed Acyclic Graphs) in Airflow to orchestrate complex data processing tasks.\n\n\n5. Data Processing\n\n\nDevelop and maintain data processing scripts using SQL and Apache Spark.\nOptimize data processing for performance and efficiency.\n\n\n6. Version Control\n\n\nUse Git for version control, collaborating with the team to manage the codebase and track changes.\nEnsure best practices in code quality and repository management.\n\n\n7. Continuous Improvement\n\n\nKeep up to date with the latest developments in data engineering and related technologies.\nContinuously improve and refactor data pipelines, tooling, and processes to enhance performance and reliability.",
      "word_count": 366
    },
    {
      "bullet_count": 9,
      "heading": "Skills and Qualifications",
      "heading_was_present": true,
      "source_marker": {
        "first_5_words": "Proficient in Git for version",
        "last_5_words": "high performing environment."
      },
      "text": "\uf0b7 Proficient in Git for version control and collaborative development.\n\uf0b7 Proficiency in SQL and experience with database technologies.\n\uf0b7 Experience in data pipeline tools such as Apache Airflow.\n\uf0b7 Strong knowledge of Apache Spark for data processing and transformation.\n\uf0b7 Experience with data migration and validation techniques.\n\uf0b7 Knowledge of data governance and security practices.\n\uf0b7 Strong problem-solving skills and the ability to work independently and in a team.\n\uf0b7 Ability to communicate with global team\n\uf0b7 Ability to work as a team in high performing environment.",
      "word_count": 81
    }
  ],
  "urls": []
}

API 1: extract-from-jd

{
  "final_skills": [
    {
      "is_primary": true,
      "skill_name": "Apache Airflow"
    },
    {
      "is_primary": true,
      "skill_name": "DAGs"
    },
    {
      "is_primary": true,
      "skill_name": "SQL"
    },
    {
      "is_primary": true,
      "skill_name": "Apache Spark"
    },
    {
      "is_primary": true,
      "skill_name": "Git"
    },
    {
      "is_primary": true,
      "skill_name": "Data Migration"
    },
    {
      "is_primary": true,
      "skill_name": "Data Validation"
    },
    {
      "is_primary": false,
      "skill_name": "Anomaly Detection"
    },
    {
      "is_primary": false,
      "skill_name": "Data Governance"
    },
    {
      "is_primary": false,
      "skill_name": "Data Security"
    }
  ],
  "jd_role": {
    "display_name": "Data Engineer II",
    "rationale": null,
    "role_aliases": [
      "Data Engineer",
      "Data Engineer II",
      "Data Pipeline Engineer"
    ],
    "role_archetype": "Data",
    "slug": ""
  },
  "nano_parsed": {
    "JD_type": "pass",
    "about_company": null,
    "certifications": [],
    "company_name": null,
    "ctc": null,
    "domain": {
      "primary": {
        "aliases": [],
        "domain": "Other"
      },
      "secondary": null
    },
    "education": [
      {
        "level": "Bachelor\u0027s",
        "qualification": "BTECH/BE - Computer Science (or related)",
        "raw": "Bachelor\u0027s degree in Computer Science, Engineering, or a related field.",
        "requirement": "required"
      }
    ],
    "experience": null,
    "job_locations": [],
    "role": "Data Engineer II",
    "role_aliases": [
      "Data Engineer",
      "Data Engineer II",
      "Data Pipeline Engineer"
    ],
    "role_archetype": "Data",
    "roles_and_responsibilities": [
      {
        "bullet_count": 7,
        "heading": "Key Responsibilities",
        "heading_was_present": true,
        "source_marker": {
          "first_5_words": "1. Workflow Deprecation",
          "last_5_words": "and reliability."
        },
        "text": "1. Workflow Deprecation\n\n\nPlan and execute the deprecation of migrated workflows by evaluating current workflows\u0027 dependencies and consumption.\nUtilize tools and best practices to identify, mark, and communicate deprecated workflows to stakeholders.\n\n\n2. Data Migration\n\n\nPlan and execute data migration tasks to move data between different storage systems or formats.\nEnsure the accuracy and completeness of data during migration processes.\nImplement strategies to accelerate the pace of data migration by backfilling, validating, and making new data assets ready for use.\n\n\n3. Data Validation\n\n\nDefine and implement data validation rules to ensure data accuracy, completeness, and reliability.\nUtilize data validation solutions and anomaly detection methods to monitor data quality.\n\n\n4. Workflow Management\n\n\nUse Apache Airflow to schedule, monitor, and automate data workflows.\nDevelop and manage DAGs (Directed Acyclic Graphs) in Airflow to orchestrate complex data processing tasks.\n\n\n5. Data Processing\n\n\nDevelop and maintain data processing scripts using SQL and Apache Spark.\nOptimize data processing for performance and efficiency.\n\n\n6. Version Control\n\n\nUse Git for version control, collaborating with the team to manage the codebase and track changes.\nEnsure best practices in code quality and repository management.\n\n\n7. Continuous Improvement\n\n\nKeep up to date with the latest developments in data engineering and related technologies.\nContinuously improve and refactor data pipelines, tooling, and processes to enhance performance and reliability.",
        "word_count": 366
      },
      {
        "bullet_count": 9,
        "heading": "Skills and Qualifications",
        "heading_was_present": true,
        "source_marker": {
          "first_5_words": "Proficient in Git for version",
          "last_5_words": "high performing environment."
        },
        "text": "\uf0b7 Proficient in Git for version control and collaborative development.\n\uf0b7 Proficiency in SQL and experience with database technologies.\n\uf0b7 Experience in data pipeline tools such as Apache Airflow.\n\uf0b7 Strong knowledge of Apache Spark for data processing and transformation.\n\uf0b7 Experience with data migration and validation techniques.\n\uf0b7 Knowledge of data governance and security practices.\n\uf0b7 Strong problem-solving skills and the ability to work independently and in a team.\n\uf0b7 Ability to communicate with global team\n\uf0b7 Ability to work as a team in high performing environment.",
        "word_count": 81
      }
    ],
    "urls": []
  },
  "rejected": false,
  "rejection_reason": null,
  "run_id": "5c415ce7-e9d4-4ca3-97c6-e28132bfcdbe",
  "stage3_signals": {
    "alias_found": true,
    "alias_match_roles": [
      {
        "display_name": "Data Engineer",
        "kra_matches": null,
        "matched_count": null,
        "matched_skills": null,
        "role_id": 2,
        "score": 1.0,
        "slug": "data-engineer",
        "total_count": null
      }
    ],
    "kra_match_roles": [
      {
        "display_name": "Data Engineer",
        "kra_matches": [
          {
            "kra_text": "Implements data quality validation rules, reconciliation checks, and anomaly detection to ensure data completeness, accuracy, and consistency.",
            "sentence": "Define and implement data validation rules to ensure data accuracy, completeness, and reliability.",
            "similarity": 0.7516
          },
          {
            "kra_text": "Implements data quality validation rules, reconciliation checks, and anomaly detection to ensure data completeness, accuracy, and consistency.",
            "sentence": "Utilize data validation solutions and anomaly detection methods to monitor data quality.",
            "similarity": 0.7061
          },
          {
            "kra_text": "Develops batch and real-time streaming data pipelines using Apache Spark, Apache Kafka, Apache Flink, or Airflow for data movement and processing at scale.",
            "sentence": "\uf0b7 Experience in data pipeline tools such as Apache Airflow.",
            "similarity": 0.6936
          }
        ],
        "matched_count": null,
        "matched_skills": null,
        "role_id": 2,
        "score": 0.7171,
        "slug": "data-engineer",
        "total_count": null
      },
      {
        "display_name": "React Native Developer",
        "kra_matches": [
          {
            "kra_text": "maintain code quality",
            "sentence": "Ensure best practices in code quality and repository management.",
            "similarity": 0.7165
          },
          {
            "kra_text": "support offline-aware data flow",
            "sentence": "Implement strategies to accelerate the pace of data migration by backfilling, validating, and making new data assets ready for use.",
            "similarity": 0.4426
          },
          {
            "kra_text": "maintain code quality",
            "sentence": "Continuously improve and refactor data pipelines, tooling, and processes to enhance performance and reliability.",
            "similarity": 0.4355
          }
        ],
        "matched_count": null,
        "matched_skills": null,
        "role_id": 73,
        "score": 0.5315,
        "slug": "react-native-developer",
        "total_count": null
      },
      {
        "display_name": "Fullstack Developer",
        "kra_matches": [
          {
            "kra_text": "Optimizes application performance from database query efficiency through API response latency to frontend rendering speed and bundle size.",
            "sentence": "Optimize data processing for performance and efficiency.",
            "similarity": 0.5783
          },
          {
            "kra_text": "Delivers features through CI/CD pipelines using automated tests, staged rollouts, feature flags, and incremental deployments.",
            "sentence": "Continuously improve and refactor data pipelines, tooling, and processes to enhance performance and reliability.",
            "similarity": 0.5286
          },
          {
            "kra_text": "Designs and queries relational databases like PostgreSQL and document stores like MongoDB, writing migrations, indexes, and optimized queries.",
            "sentence": "\uf0b7 Proficiency in SQL and experience with database technologies.",
            "similarity": 0.4813
          }
        ],
        "matched_count": null,
        "matched_skills": null,
        "role_id": 15,
        "score": 0.5294,
        "slug": "full-stack-engineer",
        "total_count": null
      },
      {
        "display_name": "Java Backend Developer",
        "kra_matches": [
          {
            "kra_text": "backend performance tuning",
            "sentence": "Optimize data processing for performance and efficiency.",
            "similarity": 0.5833
          },
          {
            "kra_text": "code refactoring and defect fixes",
            "sentence": "Ensure best practices in code quality and repository management.",
            "similarity": 0.5049
          },
          {
            "kra_text": "request validation and error handling",
            "sentence": "Define and implement data validation rules to ensure data accuracy, completeness, and reliability.",
            "similarity": 0.4997
          }
        ],
        "matched_count": null,
        "matched_skills": null,
        "role_id": 79,
        "score": 0.5293,
        "slug": "java-backend-developer",
        "total_count": null
      },
      {
        "display_name": "Scala Backend Developer",
        "kra_matches": [
          {
            "kra_text": "business rule and validation logic",
            "sentence": "Define and implement data validation rules to ensure data accuracy, completeness, and reliability.",
            "similarity": 0.539
          },
          {
            "kra_text": "performance and reliability tuning",
            "sentence": "Optimize data processing for performance and efficiency.",
            "similarity": 0.5095
          },
          {
            "kra_text": "backend workflow orchestration",
            "sentence": "Develop and manage DAGs (Directed Acyclic Graphs) in Airflow to orchestrate complex data processing tasks.",
            "similarity": 0.4915
          }
        ],
        "matched_count": null,
        "matched_skills": null,
        "role_id": 87,
        "score": 0.5133,
        "slug": "scala-backend-developer",
        "total_count": null
      }
    ],
    "skill_match_roles": [
      {
        "display_name": "Data Engineer",
        "kra_matches": null,
        "matched_count": 3,
        "matched_skills": [
          "Apache Airflow",
          "Apache Spark",
          "SQL"
        ],
        "role_id": 2,
        "score": 0.4286,
        "slug": "data-engineer",
        "total_count": 7
      },
      {
        "display_name": "Pega Developer",
        "kra_matches": null,
        "matched_count": 1,
        "matched_skills": [
          "SQL"
        ],
        "role_id": 24,
        "score": 0.1429,
        "slug": "pega-developer",
        "total_count": 7
      }
    ]
  },
  "stage4_decision": {
    "alias_collision_detected": false,
    "case": "A",
    "chosen_role": {
      "display_name": "Data Engineer",
      "kra_matches": null,
      "matched_count": null,
      "matched_skills": null,
      "role_id": 2,
      "score": 1.0,
      "slug": "data-engineer",
      "total_count": null
    },
    "confidence": 1.0,
    "is_new_role": false,
    "llm2_fired": false,
    "llm2_reasoning": null,
    "matched_dimensions": [],
    "matched_kras": [],
    "matched_skills": [],
    "new_role_display_name": null,
    "new_role_slug": null,
    "queued": false,
    "reasoning": "Exact alias hit on data-engineer (1.0) \u2014 no other alias at this confidence; skill_top data-engineer 0.43 does not contradict",
    "sub_role": null
  },
  "stage5_updates": {
    "centroid_n_after": 164,
    "centroid_updated": true,
    "collision_log_id": null,
    "new_kra_attached": null,
    "new_skills_attached": [
      {
        "is_primary": true,
        "queue_id": 8628,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "DAGs",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 8629,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Data Migration",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 8630,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Data Validation",
        "status": "pending"
      },
      {
        "is_primary": false,
        "queue_id": 8631,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Data Governance",
        "status": "pending"
      },
      {
        "is_primary": false,
        "queue_id": 8632,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Data Security",
        "status": "pending"
      }
    ],
    "queue_entry_id": null,
    "v3_pipeline_triggered": false,
    "v3_role_slug": null,
    "v3_run_id": null
  }
}
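
Note on the skill_match_roles scores above: both values look consistent with a plain ratio of matched_count to total_count (3 of 7 for Data Engineer, 1 of 7 for Pega Developer). The sketch below reproduces that arithmetic. It assumes the ratio is how the score is derived, which the output itself does not confirm, and the function name skill_match_score is hypothetical.

```python
def skill_match_score(matched_count: int, total_count: int) -> float:
    """Hypothetical helper: ratio of matched skills to total skills.

    Assumption: the pipeline rounds to four decimal places, as the
    logged values (0.4286, 0.1429) suggest.
    """
    if total_count <= 0:
        # Guard against division by zero when a role has no skills listed.
        return 0.0
    return round(matched_count / total_count, 4)


# Values taken from the skill_match_roles entries in the API 1 response.
data_engineer = skill_match_score(3, 7)
pega_developer = skill_match_score(1, 7)
print(data_engineer, pega_developer)
```

If the ratio assumption holds, the kra_matches-based scores (for example 0.5294 for Fullstack Developer) come from a different calculation, since those entries report null counts.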

API 2 — extract-details

{
  "alias_matches": [
    {
      "alias_persist_skipped_reason": "alias_text already exists for this canonical skill",
      "alias_persisted": false,
      "existing_alias_id": 304,
      "existing_alias_text": "Apache Airflow",
      "input_term": "Apache Airflow",
      "matched_canonical": {
        "category_id": 13,
        "display_name": "Apache Airflow",
        "id": 110,
        "is_also_category": false,
        "is_extractable": true,
        "skill_nature": "TOOL",
        "slug": "apache-airflow",
        "sub_category_id": 130,
        "typical_lifespan": "EVERGREEN",
        "volatility": "STABLE"
      },
      "matched_via": "alias"
    },
    {
      "alias_persist_skipped_reason": "alias_text already exists for this canonical skill",
      "alias_persisted": false,
      "existing_alias_id": 271,
      "existing_alias_text": "SQL",
      "input_term": "SQL",
      "matched_canonical": {
        "category_id": 6,
        "display_name": "SQL",
        "id": 101,
        "is_also_category": false,
        "is_extractable": true,
        "skill_nature": "LANGUAGE",
        "slug": "sql",
        "sub_category_id": 97,
        "typical_lifespan": "EVERGREEN",
        "volatility": "STABLE"
      },
      "matched_via": "alias"
    },
    {
      "alias_persist_skipped_reason": "alias_text already exists for this canonical skill",
      "alias_persisted": false,
      "existing_alias_id": 2004,
      "existing_alias_text": "Apache Spark",
      "input_term": "Apache Spark",
      "matched_canonical": {
        "category_id": 5,
        "display_name": "Apache Spark",
        "id": 1350,
        "is_also_category": false,
        "is_extractable": true,
        "skill_nature": "FRAMEWORK",
        "slug": "apache-spark",
        "sub_category_id": 1021,
        "typical_lifespan": "EVERGREEN",
        "volatility": "STABLE"
      },
      "matched_via": "alias"
    },
    {
      "alias_persist_skipped_reason": "alias_text already exists for this canonical skill",
      "alias_persisted": false,
      "existing_alias_id": 1613,
      "existing_alias_text": "Git",
      "input_term": "Git",
      "matched_canonical": {
        "category_id": 13,
        "display_name": "Git",
        "id": 1002,
        "is_also_category": false,
        "is_extractable": true,
        "skill_nature": "TOOL",
        "slug": "git",
        "sub_category_id": 730,
        "typical_lifespan": "EVERGREEN",
        "volatility": "STABLE"
      },
      "matched_via": "alias"
    },
    {
      "alias_persist_skipped_reason": "alias_text already exists for this canonical skill",
      "alias_persisted": false,
      "existing_alias_id": 338,
      "existing_alias_text": "Anomaly detection",
      "input_term": "Anomaly Detection",
      "matched_canonical": {
        "category_id": 2,
        "display_name": "Anomaly detection",
        "id": 134,
        "is_also_category": false,
        "is_extractable": true,
        "skill_nature": "CONCEPT",
        "slug": "anomaly-detection",
        "sub_category_id": 1117,
        "typical_lifespan": "EVERGREEN",
        "volatility": "STABLE"
      },
      "matched_via": "alias"
    }
  ],
  "candidate_roles": [
    {
      "display_name": "Data Engineer",
      "id": 2,
      "rationale": null,
      "role_archetype": null,
      "slug": "data-engineer",
      "source": "db"
    },
    {
      "display_name": "Pega Developer",
      "id": 24,
      "rationale": null,
      "role_archetype": null,
      "slug": "pega-developer",
      "source": "db"
    },
    {
      "display_name": "ML Engineer",
      "id": 3,
      "rationale": null,
      "role_archetype": null,
      "slug": "ml-engineer",
      "source": "db"
    },
    {
      "display_name": "MLOps Engineer",
      "id": 16,
      "rationale": null,
      "role_archetype": null,
      "slug": "ml-ops-engineer",
      "source": "db"
    }
  ],
  "chosen_role": {
    "display_name": "Data Engineer",
    "id": 2,
    "rationale": "Exact alias hit on data-engineer (1.0) \u2014 no other alias at this confidence; skill_top data-engineer 0.43 does not contradict",
    "role_archetype": null,
    "slug": "data-engineer",
    "source": "db"
  },
  "dimensions": [
    {
      "dimension": {
        "difficulty_hint": "well_known",
        "display_name": "Data Pipeline Orchestration",
        "id": 23,
        "rationale": "Workflow engines that schedule, coordinate, and recover batch data jobs. This cluster covers dependency management, retries, backfills, sensors, and operational control of pipeline DAGs.",
        "slug": "data-pipeline-orchestration",
        "source": "db"
      },
      "input_skill": "Apache Airflow",
      "llm_role": null,
      "roles_from_db": [
        {
          "display_name": "Data Engineer",
          "id": 2,
          "rationale": null,
          "role_archetype": null,
          "slug": "data-engineer",
          "source": "db"
        }
      ]
    },
    {
      "dimension": {
        "difficulty_hint": "well_known",
        "display_name": "Pega Programming Languages \u0026 DSLs",
        "id": 267,
        "rationale": "Programming languages and domain-specific languages used in Pega development.",
        "slug": "pega-programming-languages-dsls",
        "source": "db"
      },
      "input_skill": "SQL",
      "llm_role": null,
      "roles_from_db": [
        {
          "display_name": "Pega Developer",
          "id": 24,
          "rationale": null,
          "role_archetype": null,
          "slug": "pega-developer",
          "source": "db"
        }
      ]
    },
    {
      "dimension": {
        "difficulty_hint": "well_known",
        "display_name": "Programming Languages for Data Work",
        "id": 21,
        "rationale": "Languages used to implement data pipelines, transformations, and operational glue. This is the primary coding surface for building ingestion, enrichment, and automation logic in data engineering.",
        "slug": "programming-languages-for-data-work",
        "source": "db"
      },
      "input_skill": "SQL",
      "llm_role": null,
      "roles_from_db": [
        {
          "display_name": "Data Engineer",
          "id": 2,
          "rationale": null,
          "role_archetype": null,
          "slug": "data-engineer",
          "source": "db"
        }
      ]
    },
    {
      "dimension": {
        "difficulty_hint": "well_known",
        "display_name": "ETL and ELT Tooling",
        "id": 24,
        "rationale": "Packaged tools for extracting, loading, and transforming data across systems. This dimension covers connector-based ingestion, transformation frameworks, and managed integration products.",
        "slug": "etl-and-elt-tooling",
        "source": "db"
      },
      "input_skill": "Apache Spark",
      "llm_role": null,
      "roles_from_db": [
        {
          "display_name": "Data Engineer",
          "id": 2,
          "rationale": null,
          "role_archetype": null,
          "slug": "data-engineer",
          "source": "db"
        }
      ]
    },
    {
      "dimension": {
        "difficulty_hint": "well_known",
        "display_name": "React Frontend Development",
        "id": 96,
        "rationale": "Building interactive web user interfaces with React.js, including component composition, state management, hooks, and rendering patterns. React.js belongs here because it is a core library for client-side UI development in modern web applications.",
        "slug": "d_init_01",
        "source": "db"
      },
      "input_skill": "Git",
      "llm_role": null,
      "roles_from_db": []
    },
    {
      "dimension": {
        "difficulty_hint": "well_known",
        "display_name": "Data Quality and Reconciliation",
        "id": 27,
        "rationale": "Validation and reconciliation practices that ensure data is accurate, complete, and trustworthy. This includes rule-based checks, anomaly detection, cross-system reconciliation, and failure triage.",
        "slug": "data-quality-and-reconciliation",
        "source": "db"
      },
      "input_skill": "Anomaly Detection",
      "llm_role": null,
      "roles_from_db": [
        {
          "display_name": "Data Engineer",
          "id": 2,
          "rationale": null,
          "role_archetype": null,
          "slug": "data-engineer",
          "source": "db"
        }
      ]
    },
    {
      "dimension": {
        "difficulty_hint": "well_known",
        "display_name": "Model Monitoring and Drift Detection",
        "id": 45,
        "rationale": "Production observability for model behavior, data drift, concept drift, latency, and quality regressions. ML engineers use this to detect degradation and trigger remediation or retraining.",
        "slug": "model-monitoring-and-drift-detection",
        "source": "db"
      },
      "input_skill": "Anomaly Detection",
      "llm_role": null,
      "roles_from_db": [
        {
          "display_name": "ML Engineer",
          "id": 3,
          "rationale": null,
          "role_archetype": null,
          "slug": "ml-engineer",
          "source": "db"
        },
        {
          "display_name": "MLOps Engineer",
          "id": 16,
          "rationale": null,
          "role_archetype": null,
          "slug": "ml-ops-engineer",
          "source": "db"
        }
      ]
    }
  ],
  "input_final_skills": [
    "Apache Airflow",
    "DAGs",
    "SQL",
    "Apache Spark",
    "Git",
    "Data Migration",
    "Data Validation",
    "Anomaly Detection",
    "Data Governance",
    "Data Security"
  ],
  "input_llm_skills": [
    "Apache Airflow",
    "DAGs",
    "SQL",
    "Apache Spark",
    "Git",
    "Data Migration",
    "Data Validation",
    "Anomaly Detection",
    "Data Governance",
    "Data Security"
  ],
  "new_aliases_persisted": 0,
  "run_id": "5c415ce7-e9d4-4ca3-97c6-e28132bfcdbe",
  "skills_detail": [
    {
      "aliases_in_db": [
        {
          "alias_text": "Apache Airflow",
          "alias_type": "CANONICAL",
          "id": 304,
          "is_primary": true,
          "match_strategy": "CASE_INSENSITIVE"
        }
      ],
      "canonical": {
        "category_id": 13,
        "display_name": "Apache Airflow",
        "id": 110,
        "is_also_category": false,
        "is_extractable": true,
        "skill_nature": "TOOL",
        "slug": "apache-airflow",
        "sub_category_id": 130,
        "typical_lifespan": "EVERGREEN",
        "volatility": "STABLE"
      },
      "dimensions": [
        {
          "dimension": {
            "difficulty_hint": "well_known",
            "display_name": "Data Pipeline Orchestration",
            "id": 23,
            "rationale": "Workflow engines that schedule, coordinate, and recover batch data jobs. This cluster covers dependency management, retries, backfills, sensors, and operational control of pipeline DAGs.",
            "slug": "data-pipeline-orchestration",
            "source": "db"
          },
          "input_skill": "Apache Airflow",
          "llm_role": null,
          "roles_from_db": [
            {
              "display_name": "Data Engineer",
              "id": 2,
              "rationale": null,
              "role_archetype": null,
              "slug": "data-engineer",
              "source": "db"
            }
          ]
        }
      ],
      "input_skill": "Apache Airflow",
      "matched_via": "alias",
      "new_alias_persisted": false,
      "new_alias_text": null,
      "new_skill_meta": null,
      "source_tag": "db",
      "was_in_llm_skills": true
    },
    {
      "aliases_in_db": [],
      "canonical": null,
      "dimensions": [],
      "input_skill": "DAGs",
      "matched_via": null,
      "new_alias_persisted": false,
      "new_alias_text": null,
      "new_skill_meta": {
        "derived": {
          "category": "Data Engineering Tools",
          "skill_nature": "CONCEPT",
          "sub_category": "general",
          "typical_lifespan": "MULTI_YEAR",
          "version_strategy": "UNVERSIONED",
          "volatility": "MEDIUM"
        },
        "enrichment": null,
        "keep_log": [],
        "locked_dimensions": [],
        "merge_log": [],
        "placed": null,
        "relationships": null,
        "skill_id": "dags",
        "split_log": [],
        "typed": null,
        "warnings": []
      },
      "source_tag": "llm",
      "was_in_llm_skills": true
    },
    {
      "aliases_in_db": [
        {
          "alias_text": "SQL",
          "alias_type": "CANONICAL",
          "id": 271,
          "is_primary": true,
          "match_strategy": "CASE_INSENSITIVE"
        }
      ],
      "canonical": {
        "category_id": 6,
        "display_name": "SQL",
        "id": 101,
        "is_also_category": false,
        "is_extractable": true,
        "skill_nature": "LANGUAGE",
        "slug": "sql",
        "sub_category_id": 97,
        "typical_lifespan": "EVERGREEN",
        "volatility": "STABLE"
      },
      "dimensions": [
        {
          "dimension": {
            "difficulty_hint": "well_known",
            "display_name": "Pega Programming Languages \u0026 DSLs",
            "id": 267,
            "rationale": "Programming languages and domain-specific languages used in Pega development.",
            "slug": "pega-programming-languages-dsls",
            "source": "db"
          },
          "input_skill": "SQL",
          "llm_role": null,
          "roles_from_db": [
            {
              "display_name": "Pega Developer",
              "id": 24,
              "rationale": null,
              "role_archetype": null,
              "slug": "pega-developer",
              "source": "db"
            }
          ]
        },
        {
          "dimension": {
            "difficulty_hint": "well_known",
            "display_name": "Programming Languages for Data Work",
            "id": 21,
            "rationale": "Languages used to implement data pipelines, transformations, and operational glue. This is the primary coding surface for building ingestion, enrichment, and automation logic in data engineering.",
            "slug": "programming-languages-for-data-work",
            "source": "db"
          },
          "input_skill": "SQL",
          "llm_role": null,
          "roles_from_db": [
            {
              "display_name": "Data Engineer",
              "id": 2,
              "rationale": null,
              "role_archetype": null,
              "slug": "data-engineer",
              "source": "db"
            }
          ]
        }
      ],
      "input_skill": "SQL",
      "matched_via": "alias",
      "new_alias_persisted": false,
      "new_alias_text": null,
      "new_skill_meta": null,
      "source_tag": "db",
      "was_in_llm_skills": true
    },
    {
      "aliases_in_db": [
        {
          "alias_text": "Apache Spark",
          "alias_type": "CANONICAL",
          "id": 2004,
          "is_primary": false,
          "match_strategy": "CASE_INSENSITIVE"
        },
        {
          "alias_text": "apache spark 3",
          "alias_type": "VERSION",
          "id": 2006,
          "is_primary": false,
          "match_strategy": "CASE_INSENSITIVE"
        },
        {
          "alias_text": "spark",
          "alias_type": "VERSION",
          "id": 2510,
          "is_primary": false,
          "match_strategy": "CASE_INSENSITIVE"
        },
        {
          "alias_text": "spark 3",
          "alias_type": "VERSION",
          "id": 2007,
          "is_primary": false,
          "match_strategy": "CASE_INSENSITIVE"
        },
        {
          "alias_text": "spark 3.x",
          "alias_type": "VERSION",
          "id": 2009,
          "is_primary": false,
          "match_strategy": "CASE_INSENSITIVE"
        },
        {
          "alias_text": "spark3",
          "alias_type": "VERSION",
          "id": 2008,
          "is_primary": false,
          "match_strategy": "CASE_INSENSITIVE"
        }
      ],
      "canonical": {
        "category_id": 5,
        "display_name": "Apache Spark",
        "id": 1350,
        "is_also_category": false,
        "is_extractable": true,
        "skill_nature": "FRAMEWORK",
        "slug": "apache-spark",
        "sub_category_id": 1021,
        "typical_lifespan": "EVERGREEN",
        "volatility": "STABLE"
      },
      "dimensions": [
        {
          "dimension": {
            "difficulty_hint": "well_known",
            "display_name": "ETL and ELT Tooling",
            "id": 24,
            "rationale": "Packaged tools for extracting, loading, and transforming data across systems. This dimension covers connector-based ingestion, transformation frameworks, and managed integration products.",
            "slug": "etl-and-elt-tooling",
            "source": "db"
          },
          "input_skill": "Apache Spark",
          "llm_role": null,
          "roles_from_db": [
            {
              "display_name": "Data Engineer",
              "id": 2,
              "rationale": null,
              "role_archetype": null,
              "slug": "data-engineer",
              "source": "db"
            }
          ]
        }
      ],
      "input_skill": "Apache Spark",
      "matched_via": "alias",
      "new_alias_persisted": false,
      "new_alias_text": null,
      "new_skill_meta": null,
      "source_tag": "db",
      "was_in_llm_skills": true
    },
    {
      "aliases_in_db": [
        {
          "alias_text": "Git",
          "alias_type": "CANONICAL",
          "id": 1613,
          "is_primary": false,
          "match_strategy": "CASE_INSENSITIVE"
        }
      ],
      "canonical": {
        "category_id": 13,
        "display_name": "Git",
        "id": 1002,
        "is_also_category": false,
        "is_extractable": true,
        "skill_nature": "TOOL",
        "slug": "git",
        "sub_category_id": 730,
        "typical_lifespan": "EVERGREEN",
        "volatility": "STABLE"
      },
      "dimensions": [
        {
          "dimension": {
            "difficulty_hint": "well_known",
            "display_name": "React Frontend Development",
            "id": 96,
            "rationale": "Building interactive web user interfaces with React.js, including component composition, state management, hooks, and rendering patterns. React.js belongs here because it is a core library for client-side UI development in modern web applications.",
            "slug": "d_init_01",
            "source": "db"
          },
          "input_skill": "Git",
          "llm_role": null,
          "roles_from_db": []
        }
      ],
      "input_skill": "Git",
      "matched_via": "alias",
      "new_alias_persisted": false,
      "new_alias_text": null,
      "new_skill_meta": null,
      "source_tag": "db",
      "was_in_llm_skills": true
    },
    {
      "aliases_in_db": [],
      "canonical": null,
      "dimensions": [],
      "input_skill": "Data Migration",
      "matched_via": null,
      "new_alias_persisted": false,
      "new_alias_text": null,
      "new_skill_meta": {
        "derived": {
          "category": "Data Engineering Tools",
          "skill_nature": "PRACTICE",
          "sub_category": "general",
          "typical_lifespan": "MULTI_YEAR",
          "version_strategy": "UNVERSIONED",
          "volatility": "MEDIUM"
        },
        "enrichment": null,
        "keep_log": [],
        "locked_dimensions": [],
        "merge_log": [],
        "placed": null,
        "relationships": null,
        "skill_id": "data-migration",
        "split_log": [],
        "typed": null,
        "warnings": []
      },
      "source_tag": "llm",
      "was_in_llm_skills": true
    },
    {
      "aliases_in_db": [],
      "canonical": null,
      "dimensions": [],
      "input_skill": "Data Validation",
      "matched_via": null,
      "new_alias_persisted": false,
      "new_alias_text": null,
      "new_skill_meta": {
        "derived": {
          "category": "Data Engineering Tools",
          "skill_nature": "PRACTICE",
          "sub_category": "general",
          "typical_lifespan": "MULTI_YEAR",
          "version_strategy": "UNVERSIONED",
          "volatility": "MEDIUM"
        },
        "enrichment": null,
        "keep_log": [],
        "locked_dimensions": [],
        "merge_log": [],
        "placed": null,
        "relationships": null,
        "skill_id": "data-validation",
        "split_log": [],
        "typed": null,
        "warnings": []
      },
      "source_tag": "llm",
      "was_in_llm_skills": true
    },
    {
      "aliases_in_db": [
        {
          "alias_text": "Anomaly detection",
          "alias_type": "CANONICAL",
          "id": 338,
          "is_primary": true,
          "match_strategy": "CASE_INSENSITIVE"
        }
      ],
      "canonical": {
        "category_id": 2,
        "display_name": "Anomaly detection",
        "id": 134,
        "is_also_category": false,
        "is_extractable": true,
        "skill_nature": "CONCEPT",
        "slug": "anomaly-detection",
        "sub_category_id": 1117,
        "typical_lifespan": "EVERGREEN",
        "volatility": "STABLE"
      },
      "dimensions": [
        {
          "dimension": {
            "difficulty_hint": "well_known",
            "display_name": "Data Quality and Reconciliation",
            "id": 27,
            "rationale": "Validation and reconciliation practices that ensure data is accurate, complete, and trustworthy. This includes rule-based checks, anomaly detection, cross-system reconciliation, and failure triage.",
            "slug": "data-quality-and-reconciliation",
            "source": "db"
          },
          "input_skill": "Anomaly Detection",
          "llm_role": null,
          "roles_from_db": [
            {
              "display_name": "Data Engineer",
              "id": 2,
              "rationale": null,
              "role_archetype": null,
              "slug": "data-engineer",
              "source": "db"
            }
          ]
        },
        {
          "dimension": {
            "difficulty_hint": "well_known",
            "display_name": "Model Monitoring and Drift Detection",
            "id": 45,
            "rationale": "Production observability for model behavior, data drift, concept drift, latency, and quality regressions. ML engineers use this to detect degradation and trigger remediation or retraining.",
            "slug": "model-monitoring-and-drift-detection",
            "source": "db"
          },
          "input_skill": "Anomaly Detection",
          "llm_role": null,
          "roles_from_db": [
            {
              "display_name": "ML Engineer",
              "id": 3,
              "rationale": null,
              "role_archetype": null,
              "slug": "ml-engineer",
              "source": "db"
            },
            {
              "display_name": "MLOps Engineer",
              "id": 16,
              "rationale": null,
              "role_archetype": null,
              "slug": "ml-ops-engineer",
              "source": "db"
            }
          ]
        }
      ],
      "input_skill": "Anomaly Detection",
      "matched_via": "alias",
      "new_alias_persisted": false,
      "new_alias_text": null,
      "new_skill_meta": null,
      "source_tag": "db",
      "was_in_llm_skills": true
    },
    {
      "aliases_in_db": [],
      "canonical": null,
      "dimensions": [],
      "input_skill": "Data Governance",
      "matched_via": null,
      "new_alias_persisted": false,
      "new_alias_text": null,
      "new_skill_meta": {
        "derived": {
          "category": "Data Engineering Tools",
          "skill_nature": "CONCEPT",
          "sub_category": "general",
          "typical_lifespan": "MULTI_YEAR",
          "version_strategy": "UNVERSIONED",
          "volatility": "MEDIUM"
        },
        "enrichment": null,
        "keep_log": [],
        "locked_dimensions": [],
        "merge_log": [],
        "placed": null,
        "relationships": null,
        "skill_id": "data-governance",
        "split_log": [],
        "typed": null,
        "warnings": []
      },
      "source_tag": "llm",
      "was_in_llm_skills": true
    },
    {
      "aliases_in_db": [],
      "canonical": null,
      "dimensions": [],
      "input_skill": "Data Security",
      "matched_via": null,
      "new_alias_persisted": false,
      "new_alias_text": null,
      "new_skill_meta": {
        "derived": {
          "category": "Security Tools",
          "skill_nature": "CONCEPT",
          "sub_category": "general",
          "typical_lifespan": "MULTI_YEAR",
          "version_strategy": "UNVERSIONED",
          "volatility": "MEDIUM"
        },
        "enrichment": null,
        "keep_log": [],
        "locked_dimensions": [],
        "merge_log": [],
        "placed": null,
        "relationships": null,
        "skill_id": "data-security",
        "split_log": [],
        "typed": null,
        "warnings": []
      },
      "source_tag": "llm",
      "was_in_llm_skills": true
    }
  ],
  "unmatched_skills": [
    "DAGs",
    "Data Migration",
    "Data Validation",
    "Data Governance",
    "Data Security"
  ]
}
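
Note on unmatched_skills above: the five names listed are exactly the skills_detail entries whose canonical field is null. A minimal sketch of that relationship follows, using a trimmed sample of the response. It assumes a null canonical is what marks a skill as unmatched; the function name is hypothetical and not part of the pipeline.

```python
import json

# Trimmed sample of the skills_detail array from the API 2 response.
SAMPLE = json.loads("""
[
  {"input_skill": "Apache Airflow", "canonical": {"id": 110, "slug": "apache-airflow"}},
  {"input_skill": "DAGs", "canonical": null},
  {"input_skill": "SQL", "canonical": {"id": 101, "slug": "sql"}},
  {"input_skill": "Data Migration", "canonical": null}
]
""")


def unmatched_from_detail(skills_detail: list) -> list:
    """Hypothetical helper: input skills with no canonical skill record."""
    return [
        entry["input_skill"]
        for entry in skills_detail
        if entry.get("canonical") is None
    ]


print(unmatched_from_detail(SAMPLE))
```

Running the same filter over the full skills_detail array should reproduce the logged unmatched_skills list, in input order, if the assumption is right.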

API 3 — final-role-output

{
  "chosen_role": {
    "display_name": "Data Engineer",
    "id": 2,
    "rationale": "Exact alias hit on data-engineer (1.0) \u2014 no other alias at this confidence; skill_top data-engineer 0.43 does not contradict",
    "role_archetype": null,
    "slug": "data-engineer",
    "source": "db"
  },
  "chosen_role_resolution": "in_db",
  "final_input_skills": [
    {
      "skill": "Apache Airflow",
      "tag": "in_db"
    },
    {
      "skill": "DAGs",
      "tag": "new"
    },
    {
      "skill": "SQL",
      "tag": "in_db"
    },
    {
      "skill": "Apache Spark",
      "tag": "in_db"
    },
    {
      "skill": "Git",
      "tag": "in_db"
    },
    {
      "skill": "Data Migration",
      "tag": "new"
    },
    {
      "skill": "Data Validation",
      "tag": "new"
    },
    {
      "skill": "Anomaly Detection",
      "tag": "in_db"
    },
    {
      "skill": "Data Governance",
      "tag": "new"
    },
    {
      "skill": "Data Security",
      "tag": "new"
    }
  ],
  "llm_cost_api1_usd": null,
  "llm_cost_api2_usd": null,
  "llm_cost_api3_usd": null,
  "llm_cost_total_usd": null,
  "persistence": {
    "items": [
      {
        "chosen_role_id": 2,
        "dimension": {
          "difficulty_hint": "well_known",
          "display_name": "Data Pipeline Orchestration",
          "id": 23,
          "rationale": "Workflow engines that schedule, coordinate, and recover batch data jobs. This cluster covers dependency management, retries, backfills, sensors, and operational control of pipeline DAGs.",
          "slug": "data-pipeline-orchestration",
          "source": "db"
        },
        "dimension_id": 23,
        "input_skill": "Apache Airflow",
        "llm_role": null,
        "matched_chosen_role": true,
        "outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension saved",
        "role_dimension_saved": true,
        "roles_from_db": [
          {
            "display_name": "Data Engineer",
            "id": 2,
            "rationale": null,
            "role_archetype": null,
            "slug": "data-engineer",
            "source": "db"
          }
        ],
        "skill_dimension_saved": true,
        "skill_id": 110,
        "skill_tag": "in_db",
        "skipped_reason": null
      },
      {
        "chosen_role_id": 2,
        "dimension": {
          "difficulty_hint": "well_known",
          "display_name": "Pega Programming Languages \u0026 DSLs",
          "id": 267,
          "rationale": "Programming languages and domain-specific languages used in Pega development.",
          "slug": "pega-programming-languages-dsls",
          "source": "db"
        },
        "dimension_id": 267,
        "input_skill": "SQL",
        "llm_role": null,
        "matched_chosen_role": false,
        "outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
        "role_dimension_saved": false,
        "roles_from_db": [
          {
            "display_name": "Pega Developer",
            "id": 24,
            "rationale": null,
            "role_archetype": null,
            "slug": "pega-developer",
            "source": "db"
          }
        ],
        "skill_dimension_saved": true,
        "skill_id": 101,
        "skill_tag": "in_db",
        "skipped_reason": null
      },
      {
        "chosen_role_id": 2,
        "dimension": {
          "difficulty_hint": "well_known",
          "display_name": "Programming Languages for Data Work",
          "id": 21,
          "rationale": "Languages used to implement data pipelines, transformations, and operational glue. This is the primary coding surface for building ingestion, enrichment, and automation logic in data engineering.",
          "slug": "programming-languages-for-data-work",
          "source": "db"
        },
        "dimension_id": 21,
        "input_skill": "SQL",
        "llm_role": null,
        "matched_chosen_role": true,
        "outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension saved",
        "role_dimension_saved": true,
        "roles_from_db": [
          {
            "display_name": "Data Engineer",
            "id": 2,
            "rationale": null,
            "role_archetype": null,
            "slug": "data-engineer",
            "source": "db"
          }
        ],
        "skill_dimension_saved": true,
        "skill_id": 101,
        "skill_tag": "in_db",
        "skipped_reason": null
      },
      {
        "chosen_role_id": 2,
        "dimension": {
          "difficulty_hint": "well_known",
          "display_name": "ETL and ELT Tooling",
          "id": 24,
          "rationale": "Packaged tools for extracting, loading, and transforming data across systems. This dimension covers connector-based ingestion, transformation frameworks, and managed integration products.",
          "slug": "etl-and-elt-tooling",
          "source": "db"
        },
        "dimension_id": 24,
        "input_skill": "Apache Spark",
        "llm_role": null,
        "matched_chosen_role": true,
        "outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension saved",
        "role_dimension_saved": true,
        "roles_from_db": [
          {
            "display_name": "Data Engineer",
            "id": 2,
            "rationale": null,
            "role_archetype": null,
            "slug": "data-engineer",
            "source": "db"
          }
        ],
        "skill_dimension_saved": true,
        "skill_id": 1350,
        "skill_tag": "in_db",
        "skipped_reason": null
      },
      {
        "chosen_role_id": 2,
        "dimension": {
          "difficulty_hint": "well_known",
          "display_name": "React Frontend Development",
          "id": 96,
          "rationale": "Building interactive web user interfaces with React.js, including component composition, state management, hooks, and rendering patterns. React.js belongs here because it is a core library for client-side UI development in modern web applications.",
          "slug": "d_init_01",
          "source": "db"
        },
        "dimension_id": 96,
        "input_skill": "Git",
        "llm_role": null,
        "matched_chosen_role": false,
        "outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
        "role_dimension_saved": false,
        "roles_from_db": [],
        "skill_dimension_saved": true,
        "skill_id": 1002,
        "skill_tag": "in_db",
        "skipped_reason": null
      },
      {
        "chosen_role_id": 2,
        "dimension": {
          "difficulty_hint": "well_known",
          "display_name": "Data Quality and Reconciliation",
          "id": 27,
          "rationale": "Validation and reconciliation practices that ensure data is accurate, complete, and trustworthy. This includes rule-based checks, anomaly detection, cross-system reconciliation, and failure triage.",
          "slug": "data-quality-and-reconciliation",
          "source": "db"
        },
        "dimension_id": 27,
        "input_skill": "Anomaly Detection",
        "llm_role": null,
        "matched_chosen_role": true,
        "outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension saved",
        "role_dimension_saved": true,
        "roles_from_db": [
          {
            "display_name": "Data Engineer",
            "id": 2,
            "rationale": null,
            "role_archetype": null,
            "slug": "data-engineer",
            "source": "db"
          }
        ],
        "skill_dimension_saved": true,
        "skill_id": 134,
        "skill_tag": "in_db",
        "skipped_reason": null
      },
      {
        "chosen_role_id": 2,
        "dimension": {
          "difficulty_hint": "well_known",
          "display_name": "Model Monitoring and Drift Detection",
          "id": 45,
          "rationale": "Production observability for model behavior, data drift, concept drift, latency, and quality regressions. ML engineers use this to detect degradation and trigger remediation or retraining.",
          "slug": "model-monitoring-and-drift-detection",
          "source": "db"
        },
        "dimension_id": 45,
        "input_skill": "Anomaly Detection",
        "llm_role": null,
        "matched_chosen_role": false,
        "outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
        "role_dimension_saved": false,
        "roles_from_db": [
          {
            "display_name": "ML Engineer",
            "id": 3,
            "rationale": null,
            "role_archetype": null,
            "slug": "ml-engineer",
            "source": "db"
          },
          {
            "display_name": "MLOps Engineer",
            "id": 16,
            "rationale": null,
            "role_archetype": null,
            "slug": "ml-ops-engineer",
            "source": "db"
          }
        ],
        "skill_dimension_saved": true,
        "skill_id": 134,
        "skill_tag": "in_db",
        "skipped_reason": null
      }
    ],
    "new_skills_created": 0,
    "role_dimension_saved": 0,
    "skill_dimension_saved": 0,
    "skipped": 0
  },
  "planner_output": null,
  "run_id": "5c415ce7-e9d4-4ca3-97c6-e28132bfcdbe"
}

LLM Calls

Every model call made for this run, in pipeline order.
