Pipeline run

c9b2985f-96f8-4315-8d64-ebc774498e2d

Pipeline LLM cost (USD)
API 1: $0.0035 · API 2: $0.0002 · API 3: $0.0000 · Total: $0.0038
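
The three per-API figures as displayed add up to $0.0037, while the total reads $0.0038. One plausible explanation, assumed here and not confirmed by the run output, is that the total is computed from unrounded costs and every figure is rounded to four decimals only for display. The unrounded values in this sketch are hypothetical, chosen only to reproduce the displayed numbers.

```python
from decimal import Decimal

FOUR_DP = Decimal("0.0001")

# Hypothetical unrounded costs; only their rounded forms appear on the page.
raw = [Decimal("0.00354"), Decimal("0.00024"), Decimal("0.00002")]

# Round each component for display, and round the true total separately.
shown = [cost.quantize(FOUR_DP) for cost in raw]
total_shown = sum(raw).quantize(FOUR_DP)

print([str(s) for s in shown])  # per-API figures as displayed
print(sum(shown), total_shown)  # sum of displayed figures vs displayed total
```

If this is what happens, the one-digit gap is a display effect and not a billing discrepancy.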

Client output enrichment

v2 Skill cluster · Nature of work · AI index · Tech stack maturity · Evidence · KRA description
role baseline loaded sources · ai_index: jd · nature_of_work: jd · tech_stack_maturity: jd
Nature of work · Data pipeline development
Build and tune Databricks PySpark ETL pipelines and workflows, moving data into lake/warehouse layers, applying Medallion transformations, and optimizing SQL/Spark jobs for scale, cost, and analytics readiness. Also add logging/monitoring, document lineage, and coordinate with analytics teams.
"designing, developing, and optimizing scalable ETL pipelines and data workflows using Databricks and Apache Spark"
Tech stack maturity
Modern Cloud Native
Apache Spark and Databricks are cloud-first big data technologies, and SQL is a standard data engineering skill aligned with modern cloud-native data platforms.
AI index (0 = no AI use, 5 = totally AI-dependent · v2.1)
0.00 / 5
· Title match
· Has AI skill
· AI skill (primary)
· AI skill (secondary)
· On AI team
· Builds AI products
vocab breakdown (legacy)
Assistants (×1):
Frameworks (×2):
Models / concepts (×3):
Evidence — skills matched in JD (10)
Databricks · PySpark · Apache Spark · Spark SQL · SQL · ETL · Data Lake · Data Warehouse · Medallion Architecture · Databricks Workflows
Skill cluster (3 dimension groups, role-scoped)
ETL and ELT Tooling
Apache Spark
Programming Languages for Data Work
SQL
Cross-cutting / unaligned
Databricks · PySpark · Spark SQL · ETL · Data Lake · Data Warehouse · Medallion Architecture · Databricks Workflows
KRA description
We are looking for a highly skilled Databricks PySpark Developer to join our data platform implementation team. In this role, you will be responsible for designing, developing, and optimizing scalable ETL pipelines and data workflows using Databricks and Apache Spark. You will work closely with data engineers, data scientists, and BI teams to support advanced analytics and reporting requirements.
• ETL Development & Data Engineering Design, develop, and maintain scalable ETL processes using Databricks PySpark. Extract, transform, and load data from heterogeneous sources into Data Lake and Data Warehouse environments. Optimize ETL workflows for performance, scalability, and cost efficiency using Spark SQL and PySpark. Implement robust error handling, logging, and monitoring mechanisms for ETL jobs. Design and implement data solutions following Medallion Architecture (Bronze, Silver, Gold layers). Ensure data is cleansed, enriched, validated, and optimized at each layer for analytics consumption.
• Data Pipeline Management Hands-on experience in building and managing advanced data pipelines using Databricks Workflows. Develop and maintain reliable, reusable, and scalable pipelines ensuring data quality and integrity. Collaborate with cross-functional teams to translate business and analytics requirements into efficient data pipelines.
• Data Analysis & Query Optimization Write, review, and optimize complex SQL queries for data transformation, aggregation, and analysis. Perform query tuning and performance optimization on large-scale datasets within Databricks.
• Project Coordination & Continuous Improvement Participate in project planning, estimation, and delivery activities. Stay updated with the latest features in Databricks, Spark, and cloud data platforms, and recommend best practices. Document ETL processes, data lineage, metadata, and workflows to support data governance and compliance. Mentor junior developers and contribute to team knowledge sharing where required.

Signals

Skill · data-engineer · 0.20
Alias · data-engineer · 1.00
KRA · data-engineer · 0.68
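
The KRA signal of 0.68 is consistent with the arithmetic mean of the three sentence similarities that the API 1 response lists for data-engineer under stage3_signals.kra_match_roles (0.696, 0.6691, 0.6656), whose role-level score is 0.6769. The sketch below checks that arithmetic for the three complete role entries. That the pipeline aggregates with a plain mean is an assumption inferred from these numbers, and `mean_similarity` is a hypothetical helper, not part of the pipeline.

```python
from statistics import fmean


def mean_similarity(similarities: list[float]) -> float:
    """Average per-sentence KRA similarities (assumed aggregation rule)."""
    return round(fmean(similarities), 4)


# Similarities copied from stage3_signals.kra_match_roles in the API 1 response.
data_engineer = [0.696, 0.6691, 0.6656]
fullstack = [0.5774, 0.4976, 0.4848]
devops = [0.554, 0.5081, 0.4943]

for name, sims in [("data-engineer", data_engineer),
                   ("full-stack-engineer", fullstack),
                   ("devops-engineer", devops)]:
    print(name, mean_similarity(sims))
```

The results line up with the reported role scores of 0.6769, 0.5199, and 0.5188.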

Post-classification

Centroid updated · n=226
Alias collision log
New-role queue
New skills captured 7
New KRA captured

Captured for admin review

PySpark · primary · Data Engineer · pending
Spark SQL · primary · Data Engineer · pending
ETL · primary · Data Engineer · pending
Data Lake · primary · Data Engineer · pending
Data Warehouse · primary · Data Engineer · pending
Medallion Architecture · primary · Data Engineer · pending
Databricks Workflows · primary · Data Engineer · pending
Status: completed · Created: 2026-05-27T14:50:34.284364Z · Updated: 2026-06-12T17:12:05.014956Z · API 3 duration: 12156 ms
Flow · Current 3-step pipeline

1 POST /skills/extract-from-jd

2 POST /skills/extract-details

3 POST /skills/final-role-output
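
The three steps above can be sketched as a sequential client. Only the endpoint paths come from this page; the payload keys, the chaining of one response into the next request, and the `post` transport are hypothetical placeholders for whatever the real service expects.

```python
from typing import Any, Callable

# Endpoint paths as listed in the flow; everything else is illustrative.
STEPS = [
    "/skills/extract-from-jd",
    "/skills/extract-details",
    "/skills/final-role-output",
]

Post = Callable[[str, dict[str, Any]], dict[str, Any]]


def run_pipeline(jd_text: str, post: Post) -> dict[str, Any]:
    """Call the three steps in order, passing each response to the next step.

    `post(path, payload)` is a hypothetical transport, for example a thin
    wrapper around an HTTP client. The payload keys are assumptions.
    """
    payload: dict[str, Any] = {"jd_text": jd_text}
    result: dict[str, Any] = {}
    for path in STEPS:
        result = post(path, payload)
        payload = {"previous": result}
    return result


def fake_post(path: str, payload: dict[str, Any]) -> dict[str, Any]:
    """Stand-in transport that echoes the path, so the sketch runs offline."""
    return {"handled_by": path}


print(run_pipeline("Role: Databricks PySpark Developer", fake_post))
```

Injecting the transport keeps the ordering logic testable without a running service.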

Role · Chosen role & resolution

Data Engineer

CASE A

slug: data-engineer · id: 2 · source: db

Exact alias hit on data-engineer (1.0) — no other alias at this confidence; skill_top data-engineer 0.20 does not contradict

Resolution: in_db — role exists in library; skill↔dim and role↔dim links saved when applicable.
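
The CASE A reasoning above (a single exact alias hit that the top skill signal does not contradict) can be sketched as a small decision function. The rule encoded here is an assumed reading of that one-line explanation: "contradict" is taken to mean that the top skill candidate names a different role. `Signal` and `resolve_case_a` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signal:
    slug: str
    score: float


def resolve_case_a(alias_hits: list[Signal],
                   skill_top: Optional[Signal]) -> Optional[str]:
    """Return the role slug when one exact alias hit is not contradicted.

    Assumed rule: exactly one alias scores 1.0, and the top skill-based
    candidate (if any) points at the same role.
    """
    exact = [hit for hit in alias_hits if hit.score == 1.0]
    if len(exact) != 1:
        return None  # no exact hit, or several roles tie at full confidence
    if skill_top is not None and skill_top.slug != exact[0].slug:
        return None  # skill signal points elsewhere; not CASE A
    return exact[0].slug


# Values from this run: alias data-engineer 1.0, skill_top data-engineer 0.20.
print(resolve_case_a([Signal("data-engineer", 1.0)],
                     Signal("data-engineer", 0.20)))
```

Under this reading, the low skill score of 0.20 does not block the match because it still points at the same role.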

New skills: 0
Skill↔dim saved: 0
Role↔dim saved: 0
Skipped: 3

Job description

Role: Databricks PySpark Developer

Experience: 5+ years

Location: Bangalore (onsite-5days) /no relocation candidates

Notice period-immediate joiners/serving notice period

Role Overview :

We are looking for a highly skilled Databricks PySpark Developer to join our data platform implementation team. In this role, you will be responsible for designing, developing, and optimizing scalable ETL pipelines and data workflows using Databricks and Apache Spark. You will work closely with data engineers, data scientists, and BI teams to support advanced analytics and reporting requirements.

Key Responsibilities :

• ETL Development & Data Engineering Design, develop, and maintain scalable ETL processes using Databricks PySpark. Extract, transform, and load data from heterogeneous sources into Data Lake and Data Warehouse environments. Optimize ETL workflows for performance, scalability, and cost efficiency using Spark SQL and PySpark. Implement robust error handling, logging, and monitoring mechanisms for ETL jobs. Design and implement data solutions following Medallion Architecture (Bronze, Silver, Gold layers). Ensure data is cleansed, enriched, validated, and optimized at each layer for analytics consumption.
• Data Pipeline Management Hands-on experience in building and managing advanced data pipelines using Databricks Workflows. Develop and maintain reliable, reusable, and scalable pipelines ensuring data quality and integrity. Collaborate with cross-functional teams to translate business and analytics requirements into efficient data pipelines.
• Data Analysis & Query Optimization Write, review, and optimize complex SQL queries for data transformation, aggregation, and analysis. Perform query tuning and performance optimization on large-scale datasets within Databricks.
• Project Coordination & Continuous Improvement Participate in project planning, estimation, and delivery activities. Stay updated with the latest features in Databricks, Spark, and cloud data platforms, and recommend best practices. Document ETL processes, data lineage, metadata, and workflows to support data governance and compliance. Mentor junior developers and contribute to team knowledge sharing where required.


Required Qualifications :

Bachelor’s degree in Computer Science, Engineering, or a related field.

5+ years of experience in ETL/Data Engineering roles with strong focus on Databricks PySpark.

Strong proficiency in Python, with hands-on experience in developing and debugging PySpark applications.

In-depth understanding of Apache Spark architecture, including RDDs, DataFrames, and Spark SQL.

Expertise in SQL development and optimization for large-scale data processing.

Proven experience working with data warehousing concepts and ETL frameworks.

Strong problem-solving and troubleshooting skills.

Excellent communication and collaboration skills.

Preferred Qualifications :

Experience working on cloud platforms, preferably AWS.

Hands-on experience with tools such as Databricks, Snowflake, Tableau, or similar data platforms.

Strong understanding of data governance, data quality, and best practices in data engineering.

Relevant certifications in Databricks, PySpark, Spark SQL, or cloud technologies.

Skills from this JD

Each row merges API 1 extraction, API 2 library match / v3 orchestration (dimensions + locked dims), and API 3 persistence tags.
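
A minimal sketch of that merge, assuming each API's output can be keyed by skill name. The `skill_name` and `is_primary` fields match the API 1 `final_skills` entries shown on this page; the shapes of the API 2 and API 3 inputs and the output keys are hypothetical.

```python
from typing import Any, Optional


def merge_skill_rows(
    api1_skills: list[dict[str, Any]],
    api2_matches: dict[str, dict[str, Any]],
    api3_tags: dict[str, str],
) -> list[dict[str, Any]]:
    """Build one display row per extracted skill, in API 1 order."""
    rows = []
    for item in api1_skills:
        name = item["skill_name"]
        match: Optional[dict[str, Any]] = api2_matches.get(name)
        rows.append({
            "skill_name": name,
            "is_primary": item["is_primary"],
            "library_match": match,              # None for unmatched skills
            "persistence_tag": api3_tags.get(name),
        })
    return rows


api1 = [
    {"is_primary": True, "skill_name": "Databricks"},
    {"is_primary": True, "skill_name": "Spark SQL"},
]
api2 = {"Databricks": {"canonical": "Databricks", "id": 1202}}
api3 = {"Databricks": "in_db", "Spark SQL": "new"}

for row in merge_skill_rows(api1, api2, api3):
    print(row)
```

Skills without a library match keep `library_match` as `None`, which corresponds to the "New / unmatched skill" rows below.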

Databricks Primary Library skill API 3: existing canonical (in_db) Existing skill (matched library)
Canonical: Databricks id=1202 · databricks

Aliases — catalog

  • Databricks (CANONICAL)

Context tags (catalog)

Apache Spark Databricks Runtime Delta Lake MLflow SQL Analytics Spark cloud integration collaborative workspace data engineering data lakes data pipelines data visualization job scheduling machine learning notebooks real-time analytics

Stored enrichment (catalog DB)

Category
Platform
Sub-category
Data Analytics Platform
Vendor
Databricks, Inc.
License
other_open
Year introduced
2013
Confidence
0.97
Version strategy
NOT_APPLICABLE

Maturity reasoning: Databricks appears frequently in data engineering and analytics job postings, especially alongside Spark, Delta Lake, and lakehouse stacks; strong vendor adoption and broad enterprise usage signal mainstream demand.

Skill profile (library / DB)

Skill nature
PLATFORM
Volatility
STABLE
Typical lifespan
EVERGREEN
Category id
9
Sub-category id
911
Extractable
True
Also category
False

Dimensions (API 2 worklist)

  • React Frontend Development Catalog dimension db id 96

    Library dimension (catalog)

API 3 link attempts (this skill)

Dimension | Skill↔dim | Role↔dim | Outcome
React Frontend Development (d_init_01) | | | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role)
PySpark Primary Library skill API 3: existing canonical (in_db) Existing skill (matched library)
Canonical: Apache Spark id=1350 · apache-spark

Aliases — catalog

  • Apache Spark (CANONICAL)
  • apache spark 3 (VERSION)
  • spark (VERSION)
  • spark 3 (VERSION)
  • spark 3.x (VERSION)
  • spark3 (VERSION)
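
A catalog alias list like this one can be used as a case-insensitive lookup table. The sketch below is an assumption about how such a lookup might work, not the pipeline's actual matcher; `resolve` is a hypothetical helper and the alias data is copied from the list above.

```python
from typing import Optional

CANONICAL = "Apache Spark"

# Alias -> alias kind, as shown in the catalog list.
ALIASES = {
    "Apache Spark": "CANONICAL",
    "apache spark 3": "VERSION",
    "spark": "VERSION",
    "spark 3": "VERSION",
    "spark 3.x": "VERSION",
    "spark3": "VERSION",
}

# Lower-cased index so lookups ignore case.
_INDEX = {alias.lower(): kind for alias, kind in ALIASES.items()}


def resolve(name: str) -> Optional[tuple[str, str]]:
    """Return (canonical name, alias kind) for a known alias, else None."""
    kind = _INDEX.get(name.strip().lower())
    return (CANONICAL, kind) if kind is not None else None


print(resolve("Spark 3.x"))
print(resolve("PySpark"))
```

Note that "PySpark" is not in the alias list, so a lookup of this kind alone would not reproduce the PySpark to Apache Spark match shown for this skill.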

Context tags (catalog)

Apache Kafka Cluster Manager DAGScheduler Data Lake DataFrame ETL Hadoop MLlib Machine Learning PySpark RDD Scala Spark SQL Spark Streaming SparkSession

Stored enrichment (catalog DB)

Category
Framework
Sub-category
Distributed Data Processing Framework
Vendor
Apache Software Foundation
License
apache_2
Year introduced
2010
Confidence
0.94
Version strategy
SEPARATE_ENTITY
Version tag
3.x

Maturity reasoning: Apache Spark appears in many data engineering JDs and remains a standard for distributed ETL/ELT; its GitHub and vendor ecosystem activity stay strong, with Databricks and cloud platforms still promoting it.

Skill profile (library / DB)

Skill nature
FRAMEWORK
Volatility
STABLE
Typical lifespan
EVERGREEN
Category id
5
Sub-category id
1021
Extractable
True
Also category
False

Dimensions (API 2 worklist)

  • ETL and ELT Tooling Catalog dimension db id 24

    Library dimension (catalog)

    Roles linked in library: Data Engineer

API 3 link attempts (this skill)

Dimension | Skill↔dim | Role↔dim | Outcome
ETL and ELT Tooling (etl-and-elt-tooling) | | | Skipped — no persistable v3 meta for new skill (skill_not_in_db_v3_proposed)
Apache Spark Primary Library skill API 3: existing canonical (in_db) Existing skill (matched library)
Canonical: Apache Spark id=1350 · apache-spark

Aliases — catalog

  • Apache Spark (CANONICAL)
  • apache spark 3 (VERSION)
  • spark (VERSION)
  • spark 3 (VERSION)
  • spark 3.x (VERSION)
  • spark3 (VERSION)

Context tags (catalog)

Apache Kafka Cluster Manager DAGScheduler Data Lake DataFrame ETL Hadoop MLlib Machine Learning PySpark RDD Scala Spark SQL Spark Streaming SparkSession

Stored enrichment (catalog DB)

Category
Framework
Sub-category
Distributed Data Processing Framework
Vendor
Apache Software Foundation
License
apache_2
Year introduced
2010
Confidence
0.94
Version strategy
SEPARATE_ENTITY
Version tag
3.x

Maturity reasoning: Apache Spark appears in many data engineering JDs and remains a standard for distributed ETL/ELT; its GitHub and vendor ecosystem activity stay strong, with Databricks and cloud platforms still promoting it.

Skill profile (library / DB)

Skill nature
FRAMEWORK
Volatility
STABLE
Typical lifespan
EVERGREEN
Category id
5
Sub-category id
1021
Extractable
True
Also category
False

Dimensions (API 2 worklist)

  • ETL and ELT Tooling Catalog dimension db id 24

    Library dimension (catalog)

    Roles linked in library: Data Engineer

API 3 link attempts (this skill)

Dimension | Skill↔dim | Role↔dim | Outcome
ETL and ELT Tooling (etl-and-elt-tooling) | | | Existing dimension (library) · Role↔dimension saved
Spark SQL Primary New / orchestrated API 3: new canonical path (new) New / unmatched skill (orchestrated in API 2)

Skill enrichment (orchestrator / LLM)

No Stage 7 enrichment blob on this skill (orchestrator skipped enrichment).

Derived legacy fields
Category
Data Engineering Tools
Sub-category
general
Skill nature
TOOL
Volatility
MEDIUM
Typical lifespan
MULTI_YEAR
Version strategy
UNVERSIONED
SQL Primary Library skill API 3: existing canonical (in_db) Existing skill (matched library)
Canonical: SQL id=101 · sql

Aliases — catalog

  • SQL (CANONICAL) primary

Context tags (catalog)

ACID CTE DDL DML ETL JOIN MySQL NoSQL OLAP ORM PostgreSQL SQL injection SQLite T-SQL data modeling data warehousing database normalization execution plan indexing joins normalization query optimization stored procedures subquery transaction isolation transaction management window functions

Stored enrichment (catalog DB)

Category
Language
Sub-category
Query Language
Vendor
ANSI
License
unknown
Year introduced
1974
Confidence
0.99
Version strategy
NOT_APPLICABLE

Maturity reasoning: SQL appears in a large share of data, backend, and analytics job descriptions and remains the default query language for PostgreSQL, MySQL, and cloud warehouses like Snowflake/BigQuery.

Skill profile (library / DB)

Skill nature
LANGUAGE
Volatility
STABLE
Typical lifespan
EVERGREEN
Category id
6
Sub-category id
97
Extractable
True
Also category
False

Dimensions (API 2 worklist)

  • Pega Programming Languages & DSLs Catalog dimension db id 267

    Library dimension (catalog)

    Roles linked in library: Pega Developer

  • Programming Languages for Data Work Catalog dimension db id 21

    Library dimension (catalog)

    Roles linked in library: Data Engineer

API 3 link attempts (this skill)

Dimension | Skill↔dim | Role↔dim | Outcome
Pega Programming Languages & DSLs (pega-programming-languages-dsls) | | | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role)
Programming Languages for Data Work (programming-languages-for-data-work) | | | Existing dimension (library) · Role↔dimension saved
ETL Primary New / orchestrated API 3: new canonical path (new) New / unmatched skill (orchestrated in API 2)

Skill enrichment (orchestrator / LLM)

No Stage 7 enrichment blob on this skill (orchestrator skipped enrichment).

Derived legacy fields
Category
Data Engineering Tools
Sub-category
general
Skill nature
PRACTICE
Volatility
MEDIUM
Typical lifespan
MULTI_YEAR
Version strategy
UNVERSIONED
Data Lake Primary Library skill API 3: existing canonical (in_db) Existing skill (matched library)
Canonical: Data Lakes id=1358 · data-lakes

Aliases — catalog

  • Data Lakes (CANONICAL)

Context tags (catalog)

AWS Lake Formation Azure Data Lake ETL big data data catalog data governance data ingestion data lakes vs data warehouses data modeling data pipelines data warehousing partitioning real-time analytics schema evolution serverless architecture

Stored enrichment (catalog DB)

Category
Architecture
Sub-category
Data Lake Architecture
Confidence
0.90
Version strategy
NOT_APPLICABLE

Maturity reasoning: Data lakes are widely listed in cloud/data platform job descriptions and are a standard architecture in AWS, Azure, and GCP ecosystems; they’re a common hiring-pipeline staple rather than a niche pattern.

Skill profile (library / DB)

Skill nature
PATTERN
Volatility
STABLE
Typical lifespan
EVERGREEN
Category id
1
Sub-category id
1025
Extractable
True
Also category
False

Dimensions (API 2 worklist)

  • Cloud Storage and Data Services Catalog dimension db id 144

    Library dimension (catalog)

    Roles linked in library: Cloud Architect

  • React Frontend Development Catalog dimension db id 96

    Library dimension (catalog)

API 3 link attempts (this skill)

Dimension | Skill↔dim | Role↔dim | Outcome
Cloud Storage and Data Services (cloud-storage-and-data-services) | | | Skipped — no persistable v3 meta for new skill (skill_not_in_db_v3_proposed)
React Frontend Development (d_init_01) | | | Skipped — no persistable v3 meta for new skill (skill_not_in_db_v3_proposed)
Data Warehouse Primary New / orchestrated API 3: new canonical path (new) New / unmatched skill (orchestrated in API 2)

Skill enrichment (orchestrator / LLM)

No Stage 7 enrichment blob on this skill (orchestrator skipped enrichment).

Derived legacy fields
Category
Databases
Sub-category
general
Skill nature
CONCEPT
Volatility
STABLE
Typical lifespan
EVERGREEN
Version strategy
UNVERSIONED
Medallion Architecture Primary New / orchestrated API 3: new canonical path (new) New / unmatched skill (orchestrated in API 2)

Skill enrichment (orchestrator / LLM)

No Stage 7 enrichment blob on this skill (orchestrator skipped enrichment).

Derived legacy fields
Category
Data Engineering Tools
Sub-category
general
Skill nature
CONCEPT
Volatility
MEDIUM
Typical lifespan
MULTI_YEAR
Version strategy
UNVERSIONED
Databricks Workflows Primary New / orchestrated API 3: new canonical path (new) New / unmatched skill (orchestrated in API 2)

Skill enrichment (orchestrator / LLM)

No Stage 7 enrichment blob on this skill (orchestrator skipped enrichment).

Derived legacy fields
Category
Data Engineering Tools
Sub-category
general
Skill nature
TOOL
Volatility
MEDIUM
Typical lifespan
MULTI_YEAR
Version strategy
UNVERSIONED

All API 3 persistence rows

Same grid as the skill-extractor “Persistence items” table: one row per (skill × dimension) work item.

Skill | Tag | Dimension | Skill↔dim | Role↔dim | Outcome | Notes
Databricks | in_db | React Frontend Development (d_init_01) | | | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
PySpark | new | ETL and ELT Tooling (etl-and-elt-tooling) | | | Skipped — no persistable v3 meta for new skill | skill_not_in_db_v3_proposed
Apache Spark | in_db | ETL and ELT Tooling (etl-and-elt-tooling) | | | Existing dimension (library) · Role↔dimension saved |
SQL | in_db | Pega Programming Languages & DSLs (pega-programming-languages-dsls) | | | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
SQL | in_db | Programming Languages for Data Work (programming-languages-for-data-work) | | | Existing dimension (library) · Role↔dimension saved |
Data Lake | new | Cloud Storage and Data Services (cloud-storage-and-data-services) | | | Skipped — no persistable v3 meta for new skill | skill_not_in_db_v3_proposed
Data Lake | new | React Frontend Development (d_init_01) | | | Skipped — no persistable v3 meta for new skill | skill_not_in_db_v3_proposed
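
The rows above can be tallied to cross-check the summary counters. The sketch below copies the seven (skill × dimension) rows from the table and counts them by tag and by the `skill_not_in_db_v3_proposed` note. The tuple layout is an illustrative choice, not the pipeline's data model.

```python
from collections import Counter

SKIP_NOTE = "skill_not_in_db_v3_proposed"

# (skill, tag, dimension slug, note) copied from the persistence rows above.
ROWS = [
    ("Databricks", "in_db", "d_init_01", ""),
    ("PySpark", "new", "etl-and-elt-tooling", SKIP_NOTE),
    ("Apache Spark", "in_db", "etl-and-elt-tooling", ""),
    ("SQL", "in_db", "pega-programming-languages-dsls", ""),
    ("SQL", "in_db", "programming-languages-for-data-work", ""),
    ("Data Lake", "new", "cloud-storage-and-data-services", SKIP_NOTE),
    ("Data Lake", "new", "d_init_01", SKIP_NOTE),
]

by_tag = Counter(tag for _, tag, _, _ in ROWS)
skipped = sum(1 for *_, note in ROWS if note == SKIP_NOTE)

print(dict(by_tag))
print(skipped)
```

The three rows carrying the skip note agree with the Skipped counter of 3 shown in the role resolution summary.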

Library artifacts (this run)

Kind Detail DB id
canonical_skill_proposed Spark SQL | type=Data Engineering Tools subtype=general nature=TOOL lifespan=MULTI_YEAR
canonical_skill_proposed ETL | type=Data Engineering Tools subtype=general nature=PRACTICE lifespan=MULTI_YEAR
canonical_skill_proposed Data Warehouse | type=Databases subtype=general nature=CONCEPT lifespan=EVERGREEN
canonical_skill_proposed Medallion Architecture | type=Data Engineering Tools subtype=general nature=CONCEPT lifespan=MULTI_YEAR
canonical_skill_proposed Databricks Workflows | type=Data Engineering Tools subtype=general nature=TOOL lifespan=MULTI_YEAR
dimension_skill_link_proposed PySpark ↔ ETL and ELT Tooling
role_dimension_link_proposed Data Engineer ↔ ETL and ELT Tooling
dimension_skill_link_proposed Data Lake ↔ Cloud Storage and Data Services
dimension_skill_link_proposed Data Lake ↔ React Frontend Development
nano JD Parser · gpt-4.1-nano
Role: Databricks PySpark Developer
Experience: 5+ years
Domain: IT Services & Consulting
Location: Bangalore, India (onsite)
JD type: pass

Certifications

Databricks · PySpark · Spark SQL
Raw JSON
{
  "JD_type": "pass",
  "about_company": null,
  "certifications": [
    "Databricks",
    "PySpark",
    "Spark SQL"
  ],
  "company_name": null,
  "ctc": null,
  "domain": {
    "primary": {
      "aliases": [],
      "domain": "IT Services \u0026 Consulting"
    },
    "secondary": null
  },
  "education": [
    {
      "level": "Bachelor\u0027s",
      "qualification": "BTECH/BE - Computer Science (or related)",
      "raw": "Bachelor\u2019s degree in Computer Science, Engineering, or a related field.",
      "requirement": "required"
    }
  ],
  "experience": {
    "max": null,
    "min": 5,
    "raw": "5+ years"
  },
  "job_locations": [
    {
      "aliases": [
        "Bengaluru"
      ],
      "city": "Bangalore",
      "country": "India",
      "state": null,
      "work_mode": "onsite"
    }
  ],
  "role": "Databricks PySpark Developer",
  "role_aliases": [
    "PySpark Developer",
    "ETL Developer",
    "Data Engineer"
  ],
  "role_archetype": "Data",
  "roles_and_responsibilities": [
    {
      "bullet_count": 0,
      "heading": "Role Overview",
      "heading_was_present": true,
      "source_marker": {
        "first_5_words": "We are looking for a",
        "last_5_words": "analytics and reporting requirements."
      },
      "text": "We are looking for a highly skilled Databricks PySpark Developer to join our data platform implementation team. In this role, you will be responsible for designing, developing, and optimizing scalable ETL pipelines and data workflows using Databricks and Apache Spark. You will work closely with data engineers, data scientists, and BI teams to support advanced analytics and reporting requirements.",
      "word_count": 52
    },
    {
      "bullet_count": 4,
      "heading": "Key Responsibilities",
      "heading_was_present": true,
      "source_marker": {
        "first_5_words": "\u2022 ETL Development \u0026 Data Engineering",
        "last_5_words": "knowledge sharing where required."
      },
      "text": "\u2022 ETL Development \u0026 Data Engineering Design, develop, and maintain scalable ETL processes using Databricks PySpark. Extract, transform, and load data from heterogeneous sources into Data Lake and Data Warehouse environments. Optimize ETL workflows for performance, scalability, and cost efficiency using Spark SQL and PySpark. Implement robust error handling, logging, and monitoring mechanisms for ETL jobs. Design and implement data solutions following Medallion Architecture (Bronze, Silver, Gold layers). Ensure data is cleansed, enriched, validated, and optimized at each layer for analytics consumption.\n\u2022 Data Pipeline Management Hands-on experience in building and managing advanced data pipelines using Databricks Workflows. Develop and maintain reliable, reusable, and scalable pipelines ensuring data quality and integrity. Collaborate with cross-functional teams to translate business and analytics requirements into efficient data pipelines.\n\u2022 Data Analysis \u0026 Query Optimization Write, review, and optimize complex SQL queries for data transformation, aggregation, and analysis. Perform query tuning and performance optimization on large-scale datasets within Databricks.\n\u2022 Project Coordination \u0026 Continuous Improvement Participate in project planning, estimation, and delivery activities. Stay updated with the latest features in Databricks, Spark, and cloud data platforms, and recommend best practices. Document ETL processes, data lineage, metadata, and workflows to support data governance and compliance. Mentor junior developers and contribute to team knowledge sharing where required.",
      "word_count": 309
    }
  ],
  "urls": []
}
API 1 · extract-from-jd
{
  "final_skills": [
    {
      "is_primary": true,
      "skill_name": "Databricks"
    },
    {
      "is_primary": true,
      "skill_name": "PySpark"
    },
    {
      "is_primary": true,
      "skill_name": "Apache Spark"
    },
    {
      "is_primary": true,
      "skill_name": "Spark SQL"
    },
    {
      "is_primary": true,
      "skill_name": "SQL"
    },
    {
      "is_primary": true,
      "skill_name": "ETL"
    },
    {
      "is_primary": true,
      "skill_name": "Data Lake"
    },
    {
      "is_primary": true,
      "skill_name": "Data Warehouse"
    },
    {
      "is_primary": true,
      "skill_name": "Medallion Architecture"
    },
    {
      "is_primary": true,
      "skill_name": "Databricks Workflows"
    }
  ],
  "jd_role": {
    "display_name": "Databricks PySpark Developer",
    "rationale": null,
    "role_aliases": [
      "PySpark Developer",
      "ETL Developer",
      "Data Engineer"
    ],
    "role_archetype": "Data",
    "slug": ""
  },
  "nano_parsed": {
    "JD_type": "pass",
    "about_company": null,
    "certifications": [
      "Databricks",
      "PySpark",
      "Spark SQL"
    ],
    "company_name": null,
    "ctc": null,
    "domain": {
      "primary": {
        "aliases": [],
        "domain": "IT Services \u0026 Consulting"
      },
      "secondary": null
    },
    "education": [
      {
        "level": "Bachelor\u0027s",
        "qualification": "BTECH/BE - Computer Science (or related)",
        "raw": "Bachelor\u2019s degree in Computer Science, Engineering, or a related field.",
        "requirement": "required"
      }
    ],
    "experience": {
      "max": null,
      "min": 5,
      "raw": "5+ years"
    },
    "job_locations": [
      {
        "aliases": [
          "Bengaluru"
        ],
        "city": "Bangalore",
        "country": "India",
        "state": null,
        "work_mode": "onsite"
      }
    ],
    "role": "Databricks PySpark Developer",
    "role_aliases": [
      "PySpark Developer",
      "ETL Developer",
      "Data Engineer"
    ],
    "role_archetype": "Data",
    "roles_and_responsibilities": [
      {
        "bullet_count": 0,
        "heading": "Role Overview",
        "heading_was_present": true,
        "source_marker": {
          "first_5_words": "We are looking for a",
          "last_5_words": "analytics and reporting requirements."
        },
        "text": "We are looking for a highly skilled Databricks PySpark Developer to join our data platform implementation team. In this role, you will be responsible for designing, developing, and optimizing scalable ETL pipelines and data workflows using Databricks and Apache Spark. You will work closely with data engineers, data scientists, and BI teams to support advanced analytics and reporting requirements.",
        "word_count": 52
      },
      {
        "bullet_count": 4,
        "heading": "Key Responsibilities",
        "heading_was_present": true,
        "source_marker": {
          "first_5_words": "\u2022 ETL Development \u0026 Data Engineering",
          "last_5_words": "knowledge sharing where required."
        },
        "text": "\u2022 ETL Development \u0026 Data Engineering Design, develop, and maintain scalable ETL processes using Databricks PySpark. Extract, transform, and load data from heterogeneous sources into Data Lake and Data Warehouse environments. Optimize ETL workflows for performance, scalability, and cost efficiency using Spark SQL and PySpark. Implement robust error handling, logging, and monitoring mechanisms for ETL jobs. Design and implement data solutions following Medallion Architecture (Bronze, Silver, Gold layers). Ensure data is cleansed, enriched, validated, and optimized at each layer for analytics consumption.\n\u2022 Data Pipeline Management Hands-on experience in building and managing advanced data pipelines using Databricks Workflows. Develop and maintain reliable, reusable, and scalable pipelines ensuring data quality and integrity. Collaborate with cross-functional teams to translate business and analytics requirements into efficient data pipelines.\n\u2022 Data Analysis \u0026 Query Optimization Write, review, and optimize complex SQL queries for data transformation, aggregation, and analysis. Perform query tuning and performance optimization on large-scale datasets within Databricks.\n\u2022 Project Coordination \u0026 Continuous Improvement Participate in project planning, estimation, and delivery activities. Stay updated with the latest features in Databricks, Spark, and cloud data platforms, and recommend best practices. Document ETL processes, data lineage, metadata, and workflows to support data governance and compliance. Mentor junior developers and contribute to team knowledge sharing where required.",
        "word_count": 309
      }
    ],
    "urls": []
  },
  "rejected": false,
  "rejection_reason": null,
  "run_id": "c9b2985f-96f8-4315-8d64-ebc774498e2d",
  "stage3_signals": {
    "alias_found": true,
    "alias_match_roles": [
      {
        "display_name": "Data Engineer",
        "kra_matches": null,
        "matched_count": null,
        "matched_skills": null,
        "role_id": 2,
        "score": 1.0,
        "slug": "data-engineer",
        "total_count": null
      }
    ],
    "kra_match_roles": [
      {
        "display_name": "Data Engineer",
        "kra_matches": [
          {
            "kra_text": "Implements data transformation, cleansing, deduplication, and enrichment logic to convert raw source data into analytics-ready curated datasets.",
            "sentence": "Ensure data is cleansed, enriched, validated, and optimized at each layer for analytics consumption.",
            "similarity": 0.696
          },
          {
            "kra_text": "Works with data analysts, data scientists, and business stakeholders to define data models, ingestion schedules, and data delivery requirements.",
            "sentence": "Collaborate with cross-functional teams to translate business and analytics requirements into efficient data pipelines.",
            "similarity": 0.6691
          },
          {
            "kra_text": "Maintains data catalog entries, column-level data lineage, and technical documentation to support data discoverability and governance across the organization.",
            "sentence": "Document ETL processes, data lineage, metadata, and workflows to support data governance and compliance.",
            "similarity": 0.6656
          }
        ],
        "matched_count": null,
        "matched_skills": null,
        "role_id": 2,
        "score": 0.6769,
        "slug": "data-engineer",
        "total_count": null
      },
      {
        "display_name": "Fullstack Developer",
        "kra_matches": [
          {
            "kra_text": "Designs and queries relational databases like PostgreSQL and document stores like MongoDB, writing migrations, indexes, and optimized queries.",
            "sentence": "Data Analysis \u0026 Query Optimization Write, review, and optimize complex SQL queries for data transformation, aggregation, and analysis.",
            "similarity": 0.5774
          },
          {
            "kra_text": "Delivers features through CI/CD pipelines using automated tests, staged rollouts, feature flags, and incremental deployments.",
            "sentence": "Develop and maintain reliable, reusable, and scalable pipelines ensuring data quality and integrity.",
            "similarity": 0.4976
          },
          {
            "kra_text": "Designs and queries relational databases like PostgreSQL and document stores like MongoDB, writing migrations, indexes, and optimized queries.",
            "sentence": "Perform query tuning and performance optimization on large-scale datasets within Databricks.",
            "similarity": 0.4848
          }
        ],
        "matched_count": null,
        "matched_skills": null,
        "role_id": 15,
        "score": 0.5199,
        "slug": "full-stack-engineer",
        "total_count": null
      },
      {
        "display_name": "DevOps Engineer",
        "kra_matches": [
          {
            "kra_text": "Monitors CI/CD pipeline reliability, identifies bottlenecks in delivery workflows, and improves deployment frequency, lead time, and failure recovery rate.",
            "sentence": "Develop and maintain reliable, reusable, and scalable pipelines ensuring data quality and integrity.",
            "similarity": 0.554
          },
          {
            "kra_text": "Collaborates with development teams to improve build processes, reduce deployment friction, containerize applications, and adopt DevOps best practices.",
            "sentence": "Collaborate with cross-functional teams to translate business and analytics requirements into efficient data pipelines.",
            "similarity": 0.5081
          },
          {
            "kra_text": "Collaborates with development teams to improve build processes, reduce deployment friction, containerize applications, and adopt DevOps best practices.",
            "sentence": "Mentor junior developers and contribute to team knowledge sharing where required.",
            "similarity": 0.4943
          }
        ],
        "matched_count": null,
        "matched_skills": null,
        "role_id": 10,
        "score": 0.5188,
        "slug": "devops-engineer",
        "total_count": null
      },
      {
        "display_name": "ML Engineer",
        "kra_matches": [
          {
            "kra_text": "Prepares, cleans, and transforms training datasets, manages feature stores, and builds feature engineering pipelines for model training.",
            "sentence": "ETL Development \u0026 Data Engineering Design, develop, and maintain scalable ETL processes using Databricks PySpark.",
            "similarity": 0.5113
          },
          {
            "kra_text": "Prepares, cleans, and transforms training datasets, manages feature stores, and builds feature engineering pipelines for model training.",
            "sentence": "Data Pipeline Management Hands-on experience in building and managing advanced data pipelines using Databricks Workflows.",
            "similarity": 0.4995
          },
          {
            "kra_text": "Prepares, cleans, and transforms training datasets, manages feature stores, and builds feature engineering pipelines for model training.",
            "sentence": "Develop and maintain reliable, reusable, and scalable pipelines ensuring data quality and integrity.",
            "similarity": 0.4839
          }
        ],
        "matched_count": null,
        "matched_skills": null,
        "role_id": 3,
        "score": 0.4982,
        "slug": "ml-engineer",
        "total_count": null
      },
      {
        "display_name": "Svelte Frontend Developer",
        "kra_matches": [
          {
            "kra_text": "performance tuning",
            "sentence": "Perform query tuning and performance optimization on large-scale datasets within Databricks.",
            "similarity": 0.5384
          },
          {
            "kra_text": "backend data integration",
            "sentence": "Collaborate with cross-functional teams to translate business and analytics requirements into efficient data pipelines.",
            "similarity": 0.4781
          },
          {
            "kra_text": "backend data integration",
            "sentence": "Ensure data is cleansed, enriched, validated, and optimized at each layer for analytics consumption.",
            "similarity": 0.4507
          }
        ],
        "matched_count": null,
        "matched_skills": null,
        "role_id": 92,
        "score": 0.4891,
        "slug": "svelte-frontend-developer",
        "total_count": null
      }
    ],
    "skill_match_roles": [
      {
        "display_name": "Data Engineer",
        "kra_matches": null,
        "matched_count": 2,
        "matched_skills": [
          "Apache Spark",
          "SQL"
        ],
        "role_id": 2,
        "score": 0.2,
        "slug": "data-engineer",
        "total_count": 10
      },
      {
        "display_name": "Pega Developer",
        "kra_matches": null,
        "matched_count": 1,
        "matched_skills": [
          "SQL"
        ],
        "role_id": 24,
        "score": 0.1,
        "slug": "pega-developer",
        "total_count": 10
      }
    ]
  },
  "stage4_decision": {
    "alias_collision_detected": false,
    "case": "A",
    "chosen_role": {
      "display_name": "Data Engineer",
      "kra_matches": null,
      "matched_count": null,
      "matched_skills": null,
      "role_id": 2,
      "score": 1.0,
      "slug": "data-engineer",
      "total_count": null
    },
    "confidence": 1.0,
    "is_new_role": false,
    "llm2_fired": false,
    "llm2_reasoning": null,
    "matched_dimensions": [],
    "matched_kras": [],
    "matched_skills": [],
    "new_role_display_name": null,
    "new_role_slug": null,
    "queued": false,
    "reasoning": "Exact alias hit on data-engineer (1.0) \u2014 no other alias at this confidence; skill_top data-engineer 0.20 does not contradict",
    "sub_role": null
  },
  "stage5_updates": {
    "centroid_n_after": 226,
    "centroid_updated": true,
    "collision_log_id": null,
    "new_kra_attached": null,
    "new_skills_attached": [
      {
        "is_primary": true,
        "queue_id": 11203,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "PySpark",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 11204,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Spark SQL",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 11205,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "ETL",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 11206,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Data Lake",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 11207,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Data Warehouse",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 11208,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Medallion Architecture",
        "status": "pending"
      },
      {
        "is_primary": true,
        "queue_id": 11209,
        "role_display_name": "Data Engineer",
        "role_slug": "data-engineer",
        "skill_name": "Databricks Workflows",
        "status": "pending"
      }
    ],
    "queue_entry_id": null,
    "v3_pipeline_triggered": false,
    "v3_role_slug": null,
    "v3_run_id": null
  }
}
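Note on the scores in the payload above: the numbers shown are consistent with two simple formulas, sketched below. This is an observation from the values in this dump, not a confirmed description of the pipeline's implementation. The function names `kra_role_score` and `skill_match_score` are hypothetical and exist only for illustration. Each KRA-matched role's `score` equals the mean of its three listed `similarity` values rounded to four decimals, and each `skill_match_roles` entry's `score` equals `matched_count / total_count`.

```python
from statistics import mean


def kra_role_score(similarities):
    """Mean of the listed KRA similarities, rounded to 4 decimals.

    Assumption: the role-level score is a plain average of the
    per-sentence similarities shown under kra_matches.
    """
    return round(mean(similarities), 4)


def skill_match_score(matched_count, total_count):
    """Fraction of input skills that matched the role's skill list.

    Assumption: score is matched_count divided by total_count.
    """
    return round(matched_count / total_count, 4)


# Values copied from the payload above.
fullstack = kra_role_score([0.5774, 0.4976, 0.4848])
devops = kra_role_score([0.554, 0.5081, 0.4943])
ml_engineer = kra_role_score([0.5113, 0.4995, 0.4839])
svelte = kra_role_score([0.5384, 0.4781, 0.4507])

data_engineer_skill = skill_match_score(2, 10)
pega_skill = skill_match_score(1, 10)

print(fullstack, devops, ml_engineer, svelte)
print(data_engineer_skill, pega_skill)
```

If these formulas hold, the low skill-match score for data-engineer (0.2) reflects only that 2 of the 10 input skills matched the role's existing skill list, which fits the stage 4 reasoning that it "does not contradict" the alias hit.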
API 2 — extract-details
{
  "alias_matches": [
    {
      "alias_persist_skipped_reason": "alias_text already exists for this canonical skill",
      "alias_persisted": false,
      "existing_alias_id": 1838,
      "existing_alias_text": "Databricks",
      "input_term": "Databricks",
      "matched_canonical": {
        "category_id": 9,
        "display_name": "Databricks",
        "id": 1202,
        "is_also_category": false,
        "is_extractable": true,
        "skill_nature": "PLATFORM",
        "slug": "databricks",
        "sub_category_id": 911,
        "typical_lifespan": "EVERGREEN",
        "volatility": "STABLE"
      },
      "matched_via": "alias"
    },
    {
      "alias_persist_skipped_reason": "TODO: REMOVE AFTER TESTING \u2014 alias DB write disabled",
      "alias_persisted": false,
      "existing_alias_id": 2004,
      "existing_alias_text": "Apache Spark",
      "input_term": "PySpark",
      "matched_canonical": {
        "category_id": 5,
        "display_name": "Apache Spark",
        "id": 1350,
        "is_also_category": false,
        "is_extractable": true,
        "skill_nature": "FRAMEWORK",
        "slug": "apache-spark",
        "sub_category_id": 1021,
        "typical_lifespan": "EVERGREEN",
        "volatility": "STABLE"
      },
      "matched_via": "embedding_alias"
    },
    {
      "alias_persist_skipped_reason": "alias_text already exists for this canonical skill",
      "alias_persisted": false,
      "existing_alias_id": 2004,
      "existing_alias_text": "Apache Spark",
      "input_term": "Apache Spark",
      "matched_canonical": {
        "category_id": 5,
        "display_name": "Apache Spark",
        "id": 1350,
        "is_also_category": false,
        "is_extractable": true,
        "skill_nature": "FRAMEWORK",
        "slug": "apache-spark",
        "sub_category_id": 1021,
        "typical_lifespan": "EVERGREEN",
        "volatility": "STABLE"
      },
      "matched_via": "alias"
    },
    {
      "alias_persist_skipped_reason": "alias_text already exists for this canonical skill",
      "alias_persisted": false,
      "existing_alias_id": 271,
      "existing_alias_text": "SQL",
      "input_term": "SQL",
      "matched_canonical": {
        "category_id": 6,
        "display_name": "SQL",
        "id": 101,
        "is_also_category": false,
        "is_extractable": true,
        "skill_nature": "LANGUAGE",
        "slug": "sql",
        "sub_category_id": 97,
        "typical_lifespan": "EVERGREEN",
        "volatility": "STABLE"
      },
      "matched_via": "alias"
    },
    {
      "alias_persist_skipped_reason": "TODO: REMOVE AFTER TESTING \u2014 alias DB write disabled",
      "alias_persisted": false,
      "existing_alias_id": 2017,
      "existing_alias_text": "Data Lakes",
      "input_term": "Data Lake",
      "matched_canonical": {
        "category_id": 1,
        "display_name": "Data Lakes",
        "id": 1358,
        "is_also_category": false,
        "is_extractable": true,
        "skill_nature": "PATTERN",
        "slug": "data-lakes",
        "sub_category_id": 1025,
        "typical_lifespan": "EVERGREEN",
        "volatility": "STABLE"
      },
      "matched_via": "embedding_alias"
    }
  ],
  "candidate_roles": [
    {
      "display_name": "Data Engineer",
      "id": 2,
      "rationale": null,
      "role_archetype": null,
      "slug": "data-engineer",
      "source": "db"
    },
    {
      "display_name": "Pega Developer",
      "id": 24,
      "rationale": null,
      "role_archetype": null,
      "slug": "pega-developer",
      "source": "db"
    },
    {
      "display_name": "Cloud Architect",
      "id": 9,
      "rationale": null,
      "role_archetype": null,
      "slug": "cloud-architect",
      "source": "db"
    }
  ],
  "chosen_role": {
    "display_name": "Data Engineer",
    "id": 2,
    "rationale": "Exact alias hit on data-engineer (1.0) \u2014 no other alias at this confidence; skill_top data-engineer 0.20 does not contradict",
    "role_archetype": null,
    "slug": "data-engineer",
    "source": "db"
  },
  "dimensions": [
    {
      "dimension": {
        "difficulty_hint": "well_known",
        "display_name": "React Frontend Development",
        "id": 96,
        "rationale": "Building interactive web user interfaces with React.js, including component composition, state management, hooks, and rendering patterns. React.js belongs here because it is a core library for client-side UI development in modern web applications.",
        "slug": "d_init_01",
        "source": "db"
      },
      "input_skill": "Databricks",
      "llm_role": null,
      "roles_from_db": []
    },
    {
      "dimension": {
        "difficulty_hint": "well_known",
        "display_name": "ETL and ELT Tooling",
        "id": 24,
        "rationale": "Packaged tools for extracting, loading, and transforming data across systems. This dimension covers connector-based ingestion, transformation frameworks, and managed integration products.",
        "slug": "etl-and-elt-tooling",
        "source": "db"
      },
      "input_skill": "PySpark",
      "llm_role": null,
      "roles_from_db": [
        {
          "display_name": "Data Engineer",
          "id": 2,
          "rationale": null,
          "role_archetype": null,
          "slug": "data-engineer",
          "source": "db"
        }
      ]
    },
    {
      "dimension": {
        "difficulty_hint": "well_known",
        "display_name": "ETL and ELT Tooling",
        "id": 24,
        "rationale": "Packaged tools for extracting, loading, and transforming data across systems. This dimension covers connector-based ingestion, transformation frameworks, and managed integration products.",
        "slug": "etl-and-elt-tooling",
        "source": "db"
      },
      "input_skill": "Apache Spark",
      "llm_role": null,
      "roles_from_db": [
        {
          "display_name": "Data Engineer",
          "id": 2,
          "rationale": null,
          "role_archetype": null,
          "slug": "data-engineer",
          "source": "db"
        }
      ]
    },
    {
      "dimension": {
        "difficulty_hint": "well_known",
        "display_name": "Pega Programming Languages \u0026 DSLs",
        "id": 267,
        "rationale": "Programming languages and domain-specific languages used in Pega development.",
        "slug": "pega-programming-languages-dsls",
        "source": "db"
      },
      "input_skill": "SQL",
      "llm_role": null,
      "roles_from_db": [
        {
          "display_name": "Pega Developer",
          "id": 24,
          "rationale": null,
          "role_archetype": null,
          "slug": "pega-developer",
          "source": "db"
        }
      ]
    },
    {
      "dimension": {
        "difficulty_hint": "well_known",
        "display_name": "Programming Languages for Data Work",
        "id": 21,
        "rationale": "Languages used to implement data pipelines, transformations, and operational glue. This is the primary coding surface for building ingestion, enrichment, and automation logic in data engineering.",
        "slug": "programming-languages-for-data-work",
        "source": "db"
      },
      "input_skill": "SQL",
      "llm_role": null,
      "roles_from_db": [
        {
          "display_name": "Data Engineer",
          "id": 2,
          "rationale": null,
          "role_archetype": null,
          "slug": "data-engineer",
          "source": "db"
        }
      ]
    },
    {
      "dimension": {
        "difficulty_hint": "well_known",
        "display_name": "Cloud Storage and Data Services",
        "id": 144,
        "rationale": "Cloud-native storage and managed data services used to place workloads, choose durability tiers, and define platform boundaries. This is a coherent cluster because architects evaluate storage fit, access patterns, and managed service tradeoffs.",
        "slug": "cloud-storage-and-data-services",
        "source": "db"
      },
      "input_skill": "Data Lake",
      "llm_role": null,
      "roles_from_db": [
        {
          "display_name": "Cloud Architect",
          "id": 9,
          "rationale": null,
          "role_archetype": null,
          "slug": "cloud-architect",
          "source": "db"
        }
      ]
    },
    {
      "dimension": {
        "difficulty_hint": "well_known",
        "display_name": "React Frontend Development",
        "id": 96,
        "rationale": "Building interactive web user interfaces with React.js, including component composition, state management, hooks, and rendering patterns. React.js belongs here because it is a core library for client-side UI development in modern web applications.",
        "slug": "d_init_01",
        "source": "db"
      },
      "input_skill": "Data Lake",
      "llm_role": null,
      "roles_from_db": []
    }
  ],
  "input_final_skills": [
    "Databricks",
    "PySpark",
    "Apache Spark",
    "Spark SQL",
    "SQL",
    "ETL",
    "Data Lake",
    "Data Warehouse",
    "Medallion Architecture",
    "Databricks Workflows"
  ],
  "input_llm_skills": [
    "Databricks",
    "PySpark",
    "Apache Spark",
    "Spark SQL",
    "SQL",
    "ETL",
    "Data Lake",
    "Data Warehouse",
    "Medallion Architecture",
    "Databricks Workflows"
  ],
  "new_aliases_persisted": 0,
  "run_id": "c9b2985f-96f8-4315-8d64-ebc774498e2d",
  "skills_detail": [
    {
      "aliases_in_db": [
        {
          "alias_text": "Databricks",
          "alias_type": "CANONICAL",
          "id": 1838,
          "is_primary": false,
          "match_strategy": "CASE_INSENSITIVE"
        }
      ],
      "canonical": {
        "category_id": 9,
        "display_name": "Databricks",
        "id": 1202,
        "is_also_category": false,
        "is_extractable": true,
        "skill_nature": "PLATFORM",
        "slug": "databricks",
        "sub_category_id": 911,
        "typical_lifespan": "EVERGREEN",
        "volatility": "STABLE"
      },
      "dimensions": [
        {
          "dimension": {
            "difficulty_hint": "well_known",
            "display_name": "React Frontend Development",
            "id": 96,
            "rationale": "Building interactive web user interfaces with React.js, including component composition, state management, hooks, and rendering patterns. React.js belongs here because it is a core library for client-side UI development in modern web applications.",
            "slug": "d_init_01",
            "source": "db"
          },
          "input_skill": "Databricks",
          "llm_role": null,
          "roles_from_db": []
        }
      ],
      "input_skill": "Databricks",
      "matched_via": "alias",
      "new_alias_persisted": false,
      "new_alias_text": null,
      "new_skill_meta": null,
      "source_tag": "db",
      "was_in_llm_skills": true
    },
    {
      "aliases_in_db": [
        {
          "alias_text": "Apache Spark",
          "alias_type": "CANONICAL",
          "id": 2004,
          "is_primary": false,
          "match_strategy": "CASE_INSENSITIVE"
        },
        {
          "alias_text": "apache spark 3",
          "alias_type": "VERSION",
          "id": 2006,
          "is_primary": false,
          "match_strategy": "CASE_INSENSITIVE"
        },
        {
          "alias_text": "spark",
          "alias_type": "VERSION",
          "id": 2510,
          "is_primary": false,
          "match_strategy": "CASE_INSENSITIVE"
        },
        {
          "alias_text": "spark 3",
          "alias_type": "VERSION",
          "id": 2007,
          "is_primary": false,
          "match_strategy": "CASE_INSENSITIVE"
        },
        {
          "alias_text": "spark 3.x",
          "alias_type": "VERSION",
          "id": 2009,
          "is_primary": false,
          "match_strategy": "CASE_INSENSITIVE"
        },
        {
          "alias_text": "spark3",
          "alias_type": "VERSION",
          "id": 2008,
          "is_primary": false,
          "match_strategy": "CASE_INSENSITIVE"
        }
      ],
      "canonical": {
        "category_id": 5,
        "display_name": "Apache Spark",
        "id": 1350,
        "is_also_category": false,
        "is_extractable": true,
        "skill_nature": "FRAMEWORK",
        "slug": "apache-spark",
        "sub_category_id": 1021,
        "typical_lifespan": "EVERGREEN",
        "volatility": "STABLE"
      },
      "dimensions": [
        {
          "dimension": {
            "difficulty_hint": "well_known",
            "display_name": "ETL and ELT Tooling",
            "id": 24,
            "rationale": "Packaged tools for extracting, loading, and transforming data across systems. This dimension covers connector-based ingestion, transformation frameworks, and managed integration products.",
            "slug": "etl-and-elt-tooling",
            "source": "db"
          },
          "input_skill": "PySpark",
          "llm_role": null,
          "roles_from_db": [
            {
              "display_name": "Data Engineer",
              "id": 2,
              "rationale": null,
              "role_archetype": null,
              "slug": "data-engineer",
              "source": "db"
            }
          ]
        }
      ],
      "input_skill": "PySpark",
      "matched_via": "embedding_alias",
      "new_alias_persisted": false,
      "new_alias_text": null,
      "new_skill_meta": null,
      "source_tag": "db",
      "was_in_llm_skills": true
    },
    {
      "aliases_in_db": [
        {
          "alias_text": "Apache Spark",
          "alias_type": "CANONICAL",
          "id": 2004,
          "is_primary": false,
          "match_strategy": "CASE_INSENSITIVE"
        },
        {
          "alias_text": "apache spark 3",
          "alias_type": "VERSION",
          "id": 2006,
          "is_primary": false,
          "match_strategy": "CASE_INSENSITIVE"
        },
        {
          "alias_text": "spark",
          "alias_type": "VERSION",
          "id": 2510,
          "is_primary": false,
          "match_strategy": "CASE_INSENSITIVE"
        },
        {
          "alias_text": "spark 3",
          "alias_type": "VERSION",
          "id": 2007,
          "is_primary": false,
          "match_strategy": "CASE_INSENSITIVE"
        },
        {
          "alias_text": "spark 3.x",
          "alias_type": "VERSION",
          "id": 2009,
          "is_primary": false,
          "match_strategy": "CASE_INSENSITIVE"
        },
        {
          "alias_text": "spark3",
          "alias_type": "VERSION",
          "id": 2008,
          "is_primary": false,
          "match_strategy": "CASE_INSENSITIVE"
        }
      ],
      "canonical": {
        "category_id": 5,
        "display_name": "Apache Spark",
        "id": 1350,
        "is_also_category": false,
        "is_extractable": true,
        "skill_nature": "FRAMEWORK",
        "slug": "apache-spark",
        "sub_category_id": 1021,
        "typical_lifespan": "EVERGREEN",
        "volatility": "STABLE"
      },
      "dimensions": [
        {
          "dimension": {
            "difficulty_hint": "well_known",
            "display_name": "ETL and ELT Tooling",
            "id": 24,
            "rationale": "Packaged tools for extracting, loading, and transforming data across systems. This dimension covers connector-based ingestion, transformation frameworks, and managed integration products.",
            "slug": "etl-and-elt-tooling",
            "source": "db"
          },
          "input_skill": "Apache Spark",
          "llm_role": null,
          "roles_from_db": [
            {
              "display_name": "Data Engineer",
              "id": 2,
              "rationale": null,
              "role_archetype": null,
              "slug": "data-engineer",
              "source": "db"
            }
          ]
        }
      ],
      "input_skill": "Apache Spark",
      "matched_via": "alias",
      "new_alias_persisted": false,
      "new_alias_text": null,
      "new_skill_meta": null,
      "source_tag": "db",
      "was_in_llm_skills": true
    },
    {
      "aliases_in_db": [],
      "canonical": null,
      "dimensions": [],
      "input_skill": "Spark SQL",
      "matched_via": null,
      "new_alias_persisted": false,
      "new_alias_text": null,
      "new_skill_meta": {
        "derived": {
          "category": "Data Engineering Tools",
          "skill_nature": "TOOL",
          "sub_category": "general",
          "typical_lifespan": "MULTI_YEAR",
          "version_strategy": "UNVERSIONED",
          "volatility": "MEDIUM"
        },
        "enrichment": null,
        "keep_log": [],
        "locked_dimensions": [],
        "merge_log": [],
        "placed": null,
        "relationships": null,
        "skill_id": "spark-sql",
        "split_log": [],
        "typed": null,
        "warnings": []
      },
      "source_tag": "llm",
      "was_in_llm_skills": true
    },
    {
      "aliases_in_db": [
        {
          "alias_text": "SQL",
          "alias_type": "CANONICAL",
          "id": 271,
          "is_primary": true,
          "match_strategy": "CASE_INSENSITIVE"
        }
      ],
      "canonical": {
        "category_id": 6,
        "display_name": "SQL",
        "id": 101,
        "is_also_category": false,
        "is_extractable": true,
        "skill_nature": "LANGUAGE",
        "slug": "sql",
        "sub_category_id": 97,
        "typical_lifespan": "EVERGREEN",
        "volatility": "STABLE"
      },
      "dimensions": [
        {
          "dimension": {
            "difficulty_hint": "well_known",
            "display_name": "Pega Programming Languages \u0026 DSLs",
            "id": 267,
            "rationale": "Programming languages and domain-specific languages used in Pega development.",
            "slug": "pega-programming-languages-dsls",
            "source": "db"
          },
          "input_skill": "SQL",
          "llm_role": null,
          "roles_from_db": [
            {
              "display_name": "Pega Developer",
              "id": 24,
              "rationale": null,
              "role_archetype": null,
              "slug": "pega-developer",
              "source": "db"
            }
          ]
        },
        {
          "dimension": {
            "difficulty_hint": "well_known",
            "display_name": "Programming Languages for Data Work",
            "id": 21,
            "rationale": "Languages used to implement data pipelines, transformations, and operational glue. This is the primary coding surface for building ingestion, enrichment, and automation logic in data engineering.",
            "slug": "programming-languages-for-data-work",
            "source": "db"
          },
          "input_skill": "SQL",
          "llm_role": null,
          "roles_from_db": [
            {
              "display_name": "Data Engineer",
              "id": 2,
              "rationale": null,
              "role_archetype": null,
              "slug": "data-engineer",
              "source": "db"
            }
          ]
        }
      ],
      "input_skill": "SQL",
      "matched_via": "alias",
      "new_alias_persisted": false,
      "new_alias_text": null,
      "new_skill_meta": null,
      "source_tag": "db",
      "was_in_llm_skills": true
    },
    {
      "aliases_in_db": [],
      "canonical": null,
      "dimensions": [],
      "input_skill": "ETL",
      "matched_via": null,
      "new_alias_persisted": false,
      "new_alias_text": null,
      "new_skill_meta": {
        "derived": {
          "category": "Data Engineering Tools",
          "skill_nature": "PRACTICE",
          "sub_category": "general",
          "typical_lifespan": "MULTI_YEAR",
          "version_strategy": "UNVERSIONED",
          "volatility": "MEDIUM"
        },
        "enrichment": null,
        "keep_log": [],
        "locked_dimensions": [],
        "merge_log": [],
        "placed": null,
        "relationships": null,
        "skill_id": "etl",
        "split_log": [],
        "typed": null,
        "warnings": []
      },
      "source_tag": "llm",
      "was_in_llm_skills": true
    },
    {
      "aliases_in_db": [
        {
          "alias_text": "Data Lakes",
          "alias_type": "CANONICAL",
          "id": 2017,
          "is_primary": false,
          "match_strategy": "CASE_INSENSITIVE"
        }
      ],
      "canonical": {
        "category_id": 1,
        "display_name": "Data Lakes",
        "id": 1358,
        "is_also_category": false,
        "is_extractable": true,
        "skill_nature": "PATTERN",
        "slug": "data-lakes",
        "sub_category_id": 1025,
        "typical_lifespan": "EVERGREEN",
        "volatility": "STABLE"
      },
      "dimensions": [
        {
          "dimension": {
            "difficulty_hint": "well_known",
            "display_name": "Cloud Storage and Data Services",
            "id": 144,
            "rationale": "Cloud-native storage and managed data services used to place workloads, choose durability tiers, and define platform boundaries. This is a coherent cluster because architects evaluate storage fit, access patterns, and managed service tradeoffs.",
            "slug": "cloud-storage-and-data-services",
            "source": "db"
          },
          "input_skill": "Data Lake",
          "llm_role": null,
          "roles_from_db": [
            {
              "display_name": "Cloud Architect",
              "id": 9,
              "rationale": null,
              "role_archetype": null,
              "slug": "cloud-architect",
              "source": "db"
            }
          ]
        },
        {
          "dimension": {
            "difficulty_hint": "well_known",
            "display_name": "React Frontend Development",
            "id": 96,
            "rationale": "Building interactive web user interfaces with React.js, including component composition, state management, hooks, and rendering patterns. React.js belongs here because it is a core library for client-side UI development in modern web applications.",
            "slug": "d_init_01",
            "source": "db"
          },
          "input_skill": "Data Lake",
          "llm_role": null,
          "roles_from_db": []
        }
      ],
      "input_skill": "Data Lake",
      "matched_via": "embedding_alias",
      "new_alias_persisted": false,
      "new_alias_text": null,
      "new_skill_meta": null,
      "source_tag": "db",
      "was_in_llm_skills": true
    },
    {
      "aliases_in_db": [],
      "canonical": null,
      "dimensions": [],
      "input_skill": "Data Warehouse",
      "matched_via": null,
      "new_alias_persisted": false,
      "new_alias_text": null,
      "new_skill_meta": {
        "derived": {
          "category": "Databases",
          "skill_nature": "CONCEPT",
          "sub_category": "general",
          "typical_lifespan": "EVERGREEN",
          "version_strategy": "UNVERSIONED",
          "volatility": "STABLE"
        },
        "enrichment": null,
        "keep_log": [],
        "locked_dimensions": [],
        "merge_log": [],
        "placed": null,
        "relationships": null,
        "skill_id": "data-warehouse",
        "split_log": [],
        "typed": null,
        "warnings": []
      },
      "source_tag": "llm",
      "was_in_llm_skills": true
    },
    {
      "aliases_in_db": [],
      "canonical": null,
      "dimensions": [],
      "input_skill": "Medallion Architecture",
      "matched_via": null,
      "new_alias_persisted": false,
      "new_alias_text": null,
      "new_skill_meta": {
        "derived": {
          "category": "Data Engineering Tools",
          "skill_nature": "CONCEPT",
          "sub_category": "general",
          "typical_lifespan": "MULTI_YEAR",
          "version_strategy": "UNVERSIONED",
          "volatility": "MEDIUM"
        },
        "enrichment": null,
        "keep_log": [],
        "locked_dimensions": [],
        "merge_log": [],
        "placed": null,
        "relationships": null,
        "skill_id": "medallion-architecture",
        "split_log": [],
        "typed": null,
        "warnings": []
      },
      "source_tag": "llm",
      "was_in_llm_skills": true
    },
    {
      "aliases_in_db": [],
      "canonical": null,
      "dimensions": [],
      "input_skill": "Databricks Workflows",
      "matched_via": null,
      "new_alias_persisted": false,
      "new_alias_text": null,
      "new_skill_meta": {
        "derived": {
          "category": "Data Engineering Tools",
          "skill_nature": "TOOL",
          "sub_category": "general",
          "typical_lifespan": "MULTI_YEAR",
          "version_strategy": "UNVERSIONED",
          "volatility": "MEDIUM"
        },
        "enrichment": null,
        "keep_log": [],
        "locked_dimensions": [],
        "merge_log": [],
        "placed": null,
        "relationships": null,
        "skill_id": "databricks-workflows",
        "split_log": [],
        "typed": null,
        "warnings": []
      },
      "source_tag": "llm",
      "was_in_llm_skills": true
    }
  ],
  "unmatched_skills": [
    "Spark SQL",
    "ETL",
    "Data Warehouse",
    "Medallion Architecture",
    "Databricks Workflows"
  ]
}
API 3 — final-role-output
{
  "chosen_role": {
    "display_name": "Data Engineer",
    "id": 2,
    "rationale": "Exact alias hit on data-engineer (1.0) \u2014 no other alias at this confidence; skill_top data-engineer 0.20 does not contradict",
    "role_archetype": null,
    "slug": "data-engineer",
    "source": "db"
  },
  "chosen_role_resolution": "in_db",
  "final_input_skills": [
    {
      "skill": "Databricks",
      "tag": "in_db"
    },
    {
      "skill": "PySpark",
      "tag": "in_db"
    },
    {
      "skill": "Apache Spark",
      "tag": "in_db"
    },
    {
      "skill": "Spark SQL",
      "tag": "new"
    },
    {
      "skill": "SQL",
      "tag": "in_db"
    },
    {
      "skill": "ETL",
      "tag": "new"
    },
    {
      "skill": "Data Lake",
      "tag": "in_db"
    },
    {
      "skill": "Data Warehouse",
      "tag": "new"
    },
    {
      "skill": "Medallion Architecture",
      "tag": "new"
    },
    {
      "skill": "Databricks Workflows",
      "tag": "new"
    }
  ],
  "llm_cost_api1_usd": null,
  "llm_cost_api2_usd": null,
  "llm_cost_api3_usd": null,
  "llm_cost_total_usd": null,
  "persistence": {
    "items": [
      {
        "chosen_role_id": 2,
        "dimension": {
          "difficulty_hint": "well_known",
          "display_name": "React Frontend Development",
          "id": 96,
          "rationale": "Building interactive web user interfaces with React.js, including component composition, state management, hooks, and rendering patterns. React.js belongs here because it is a core library for client-side UI development in modern web applications.",
          "slug": "d_init_01",
          "source": "db"
        },
        "dimension_id": 96,
        "input_skill": "Databricks",
        "llm_role": null,
        "matched_chosen_role": false,
        "outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
        "role_dimension_saved": false,
        "roles_from_db": [],
        "skill_dimension_saved": true,
        "skill_id": 1202,
        "skill_tag": "in_db",
        "skipped_reason": null
      },
      {
        "chosen_role_id": 2,
        "dimension": {
          "difficulty_hint": "well_known",
          "display_name": "ETL and ELT Tooling",
          "id": 24,
          "rationale": "Packaged tools for extracting, loading, and transforming data across systems. This dimension covers connector-based ingestion, transformation frameworks, and managed integration products.",
          "slug": "etl-and-elt-tooling",
          "source": "db"
        },
        "dimension_id": 24,
        "input_skill": "PySpark",
        "llm_role": null,
        "matched_chosen_role": true,
        "outcome_line": "Skipped \u2014 no persistable v3 meta for new skill",
        "role_dimension_saved": false,
        "roles_from_db": [
          {
            "display_name": "Data Engineer",
            "id": 2,
            "rationale": null,
            "role_archetype": null,
            "slug": "data-engineer",
            "source": "db"
          }
        ],
        "skill_dimension_saved": false,
        "skill_id": null,
        "skill_tag": "new",
        "skipped_reason": "skill_not_in_db_v3_proposed"
      },
      {
        "chosen_role_id": 2,
        "dimension": {
          "difficulty_hint": "well_known",
          "display_name": "ETL and ELT Tooling",
          "id": 24,
          "rationale": "Packaged tools for extracting, loading, and transforming data across systems. This dimension covers connector-based ingestion, transformation frameworks, and managed integration products.",
          "slug": "etl-and-elt-tooling",
          "source": "db"
        },
        "dimension_id": 24,
        "input_skill": "Apache Spark",
        "llm_role": null,
        "matched_chosen_role": true,
        "outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension saved",
        "role_dimension_saved": true,
        "roles_from_db": [
          {
            "display_name": "Data Engineer",
            "id": 2,
            "rationale": null,
            "role_archetype": null,
            "slug": "data-engineer",
            "source": "db"
          }
        ],
        "skill_dimension_saved": true,
        "skill_id": 1350,
        "skill_tag": "in_db",
        "skipped_reason": null
      },
      {
        "chosen_role_id": 2,
        "dimension": {
          "difficulty_hint": "well_known",
          "display_name": "Pega Programming Languages \u0026 DSLs",
          "id": 267,
          "rationale": "Programming languages and domain-specific languages used in Pega development.",
          "slug": "pega-programming-languages-dsls",
          "source": "db"
        },
        "dimension_id": 267,
        "input_skill": "SQL",
        "llm_role": null,
        "matched_chosen_role": false,
        "outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
        "role_dimension_saved": false,
        "roles_from_db": [
          {
            "display_name": "Pega Developer",
            "id": 24,
            "rationale": null,
            "role_archetype": null,
            "slug": "pega-developer",
            "source": "db"
          }
        ],
        "skill_dimension_saved": true,
        "skill_id": 101,
        "skill_tag": "in_db",
        "skipped_reason": null
      },
      {
        "chosen_role_id": 2,
        "dimension": {
          "difficulty_hint": "well_known",
          "display_name": "Programming Languages for Data Work",
          "id": 21,
          "rationale": "Languages used to implement data pipelines, transformations, and operational glue. This is the primary coding surface for building ingestion, enrichment, and automation logic in data engineering.",
          "slug": "programming-languages-for-data-work",
          "source": "db"
        },
        "dimension_id": 21,
        "input_skill": "SQL",
        "llm_role": null,
        "matched_chosen_role": true,
        "outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension saved",
        "role_dimension_saved": true,
        "roles_from_db": [
          {
            "display_name": "Data Engineer",
            "id": 2,
            "rationale": null,
            "role_archetype": null,
            "slug": "data-engineer",
            "source": "db"
          }
        ],
        "skill_dimension_saved": true,
        "skill_id": 101,
        "skill_tag": "in_db",
        "skipped_reason": null
      },
      {
        "chosen_role_id": 2,
        "dimension": {
          "difficulty_hint": "well_known",
          "display_name": "Cloud Storage and Data Services",
          "id": 144,
          "rationale": "Cloud-native storage and managed data services used to place workloads, choose durability tiers, and define platform boundaries. This is a coherent cluster because architects evaluate storage fit, access patterns, and managed service tradeoffs.",
          "slug": "cloud-storage-and-data-services",
          "source": "db"
        },
        "dimension_id": 144,
        "input_skill": "Data Lake",
        "llm_role": null,
        "matched_chosen_role": false,
        "outcome_line": "Skipped \u2014 no persistable v3 meta for new skill",
        "role_dimension_saved": false,
        "roles_from_db": [
          {
            "display_name": "Cloud Architect",
            "id": 9,
            "rationale": null,
            "role_archetype": null,
            "slug": "cloud-architect",
            "source": "db"
          }
        ],
        "skill_dimension_saved": false,
        "skill_id": null,
        "skill_tag": "new",
        "skipped_reason": "skill_not_in_db_v3_proposed"
      },
      {
        "chosen_role_id": 2,
        "dimension": {
          "difficulty_hint": "well_known",
          "display_name": "React Frontend Development",
          "id": 96,
          "rationale": "Building interactive web user interfaces with React.js, including component composition, state management, hooks, and rendering patterns. React.js belongs here because it is a core library for client-side UI development in modern web applications.",
          "slug": "d_init_01",
          "source": "db"
        },
        "dimension_id": 96,
        "input_skill": "Data Lake",
        "llm_role": null,
        "matched_chosen_role": false,
        "outcome_line": "Skipped \u2014 no persistable v3 meta for new skill",
        "role_dimension_saved": false,
        "roles_from_db": [],
        "skill_dimension_saved": false,
        "skill_id": null,
        "skill_tag": "new",
        "skipped_reason": "skill_not_in_db_v3_proposed"
      }
    ],
    "new_skills_created": 0,
    "role_dimension_saved": 0,
    "skill_dimension_saved": 0,
    "skipped": 3
  },
  "planner_output": null,
  "run_id": "c9b2985f-96f8-4315-8d64-ebc774498e2d"
}

LLM Calls

Every model call made for this run, in pipeline order. Click a card to see the model's response.
