Pipeline run
265899c9-6b42-43cb-a0f8-64ac64ac5a98
Client output enrichment
v2 Skill cluster · Nature of work · AI index · Tech stack maturity · Evidence · KRA description · vocab breakdown (legacy)
1 POST /skills/extract-from-jd
2 POST /skills/extract-details
3 POST /skills/final-role-output
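The three calls above run in the listed order. A minimal sketch of that sequencing follows. It assumes each response is passed on as the next request's payload, which this page does not confirm. The `job_description` field name, the `Sender` transport type, and `fake_send` are hypothetical names used only for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

# The three endpoints in the order shown for this run.
PIPELINE_STEPS: List[Tuple[str, str]] = [
    ("POST", "/skills/extract-from-jd"),
    ("POST", "/skills/extract-details"),
    ("POST", "/skills/final-role-output"),
]

# Hypothetical transport: takes (method, path, payload) and returns a response dict.
Sender = Callable[[str, str, Dict[str, Any]], Dict[str, Any]]


def run_pipeline(job_description: str, send: Sender) -> Dict[str, Any]:
    """Call each step in order, passing every response on as the next payload."""
    payload: Dict[str, Any] = {"job_description": job_description}  # assumed field name
    for method, path in PIPELINE_STEPS:
        payload = send(method, path, payload)
    return payload


def fake_send(method: str, path: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in sender that only records which paths were visited."""
    visited = list(payload.get("visited", [])) + [path]
    return {**payload, "visited": visited}


result = run_pipeline("Big data engineer ...", fake_send)
print(result["visited"])
```

Injecting the sender keeps the sketch runnable without a live service; a real client would replace `fake_send` with an HTTP call.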
Data Engineer
slug: data-engineer · id: 6 · source: db
The primary skills indicate a strong focus on data processing, SQL, and Azure technologies, aligning well with a Data Engineer's responsibilities.
Resolution: in_db. Role exists in library; skill↔dim and role↔dim links saved when applicable.
Job description
About the job

The corporation is seeking talented and ambitious big data engineers to join the AI Center of Excellence team. The team designs, develops, and deploys industry-leading data science and big data engineering solutions using Artificial Intelligence, Machine Learning, and big data platforms and technologies to increase efficiency in complex work processes, enable and empower data-driven decision making, planning, and execution throughout the lifecycle of projects, and improve outcomes to the organization and its customers.

Job Responsibilities
- Big data design and analysis, data modeling, development, deployment, and CI/CD operations of big data pipelines
- Collaborate with a team of data engineers, data scientists, and business subject matter experts to process data and prepare data sources
- Mentor other data engineers to develop a world-class data engineering team
- Ingest, process, and model data from heterogeneous data sources to support data science projects

Basic Qualifications
- Bachelor's degree or higher in Computer Science or equivalent degree and 3 to 10 years related working experience
- In-depth experience with a big data cloud platform, preferably Azure
- Strong grasp of programming languages such as Python, PySpark, or equivalent, and willingness to learn new ones
- Experience writing database-heavy services or APIs
- Experience building and optimizing data pipelines, architectures, and data sets
- Working knowledge of queueing, stream processing, and highly scalable data stores
- Experience working with and supporting cross-functional teams
- Strong understanding of structuring code for testability

Preferred Qualifications
- Professional experience implementing and maintaining MLOps pipelines in MLflow or AzureML
- Professional experience implementing data ingestion pipelines using Data Factory
- Professional experience with Databricks and coding with notebooks
- Professional experience processing and manipulating data using SQL and Python
- Professional experience with user training, customer support, and coordination with cross-functional teams
Skills from this JD
Each row merges API 1 extraction, API 2 library match / v3 orchestration (dimensions + locked dims), and API 3 persistence tags.
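The merge described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the record shape, the `skill` key, the `api1`/`api2`/`api3` labels, the `primary`/`category`/`outcome` fields, and the case-insensitive matching are all hypothetical.

```python
from typing import Any, Dict, List


def merge_skill_rows(
    extracted: List[Dict[str, Any]],
    details: List[Dict[str, Any]],
    persistence: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Merge per-skill records from the three stages into one row per skill."""
    rows: Dict[str, Dict[str, Any]] = {}
    stages = (("api1", extracted), ("api2", details), ("api3", persistence))
    for stage, records in stages:
        for record in records:
            # Assumed matching rule: skill names compare case-insensitively.
            key = record["skill"].strip().lower()
            row = rows.setdefault(key, {"skill": record["skill"]})
            row[stage] = {k: v for k, v in record.items() if k != "skill"}
    return list(rows.values())


rows = merge_skill_rows(
    [{"skill": "PySpark", "primary": True}],
    [{"skill": "PySpark", "category": "Library"}],
    [{"skill": "pyspark", "outcome": "New skill saved"}],
)
print(rows)
```

Keeping each stage under its own key makes it easy to see which API contributed which field when a row looks inconsistent.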
Aliases — catalog
- Compute right-sizing (CANONICAL) primary
Context tags (catalog)
Stored enrichment (catalog DB)
- Category: Methodology
- Sub-category: Capacity Planning Methodology
- Confidence: 0.78
- Version strategy: NOT_APPLICABLE
Maturity reasoning: Common cloud/capacity-planning practice; widely referenced in AWS/Azure/GCP cost-optimization docs and frequently appears in FinOps and SRE job descriptions focused on reducing overprovisioning.
Skill profile (library / DB)
- Skill nature: PLATFORM
- Volatility: STABLE
- Typical lifespan: EVERGREEN
- Category id: 13
- Sub-category id: 161
- Extractable: True
- Also category: False
Dimensions (API 2 worklist)
- Cloud Platform Operations · Catalog dimension db id 26 · Library dimension (catalog) · Roles linked in library: DevOps Engineer
- Cloud Security Platforms · Catalog dimension db id 332 · Library dimension (catalog) · Roles linked in library: Cybersecurity Engineer
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| Cloud Platform Operations (`cloud-platform-operations`) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
| Cloud Security Platforms (`cloud-security-platforms`) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
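The Role↔dim column follows a pattern across this run: the link is saved only when the chosen role (Data Engineer) appears among the roles linked to the dimension in the library, and skipped otherwise. A minimal sketch of that rule, inferred from the rows on this page rather than from the pipeline's source; `link_outcome` is a hypothetical name.

```python
from typing import Iterable


def link_outcome(chosen_role: str, roles_linked_to_dimension: Iterable[str]) -> str:
    """Return the Role↔dimension outcome text for one dimension."""
    # Inferred rule: save only when the dimension sits under the chosen role.
    if chosen_role in set(roles_linked_to_dimension):
        return "Role↔dimension saved"
    return "Role↔dimension skipped (dimension not under chosen role)"


skipped = link_outcome("Data Engineer", ["DevOps Engineer"])
saved = link_outcome("Data Engineer", ["Backend Engineer", "Data Engineer"])
unlinked = link_outcome("Data Engineer", [])
print(skipped)
print(saved)
```

A dimension with no linked roles at all falls into the skipped branch, which matches how Version Control Systems is reported elsewhere in this run.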
Aliases — catalog
- Cobalt Strike (CANONICAL) primary
Context tags (catalog)
Stored enrichment (catalog DB)
- Category: Tool
- Sub-category: Adversary Simulation Tool
- Vendor: Fortra
- License: proprietary
- Year introduced: 2012
- Confidence: 0.98
- Version strategy: NOT_APPLICABLE
Maturity reasoning: Appears in a limited set of red-team/pentest JDs and security vendor training, but far below mainstream devops tools; market signal is specialized adversary-simulation usage rather than broad hiring demand.
Skill profile (library / DB)
- Skill nature: LANGUAGE
- Volatility: STABLE
- Typical lifespan: EVERGREEN
- Category id: 5
- Sub-category id: 54
- Extractable: True
- Also category: False
Dimensions (API 2 worklist)
- Analytical Programming Languages · Catalog dimension db id 82 · Library dimension (catalog) · Roles linked in library: Data Analyst, Data Scientist
- Automation Scripting and CLI · Catalog dimension db id 48 · Library dimension (catalog) · Roles linked in library: Azure Cloud Engineer, Cloud Engineer
- Automation and Scripting for Operations · Catalog dimension db id 361 · Library dimension (catalog) · Roles linked in library: Virtualization Engineer
- Network Automation and Scripting · Catalog dimension db id 285 · Library dimension (catalog) · Roles linked in library: Network Engineer
- Programming Languages for AI Workflows · Catalog dimension db id 261 · Library dimension (catalog) · Roles linked in library: AI Engineer
- Programming Languages for Backend Systems · Catalog dimension db id 140 · Library dimension (catalog) · Roles linked in library: Backend Engineer
- Programming Languages for Data Work · Catalog dimension db id 67 · Library dimension (catalog) · Roles linked in library: Data Engineer
- Programming Languages for ML Systems · Catalog dimension db id 113 · Library dimension (catalog) · Roles linked in library: Machine Learning Engineer
- Programming Languages for Security Work · Catalog dimension db id 328 · Library dimension (catalog) · Roles linked in library: Cybersecurity Engineer
- Programming Languages for Test Automation · Catalog dimension db id 193 · Library dimension (catalog) · Roles linked in library: Automation Tester
- Security Automation and Scripting · Catalog dimension db id 258 · Library dimension (catalog) · Roles linked in library: Cybersecurity Engineer
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| Analytical Programming Languages (`analytical-programming-languages`) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
| Automation Scripting and CLI (`automation-scripting-and-cli`) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
| Automation and Scripting for Operations (`automation-and-scripting-for-operations`) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
| Network Automation and Scripting (`network-automation-and-scripting`) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
| Programming Languages for AI Workflows (`programming-languages-for-ai-workflows`) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
| Programming Languages for Backend Systems (`programming-languages-for-backend-systems`) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
| Programming Languages for Data Work (`programming-languages-for-data-work`) | ✓ | ✓ | Existing dimension (library) · Role↔dimension saved |
| Programming Languages for ML Systems (`programming-languages-for-ml-systems`) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
| Programming Languages for Security Work (`programming-languages-for-security-work`) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
| Programming Languages for Test Automation (`programming-languages-for-test-automation`) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
| Security Automation and Scripting (`security-automation-and-scripting`) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
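In the table above, each identifier matches the dimension name lowercased with its words joined by hyphens. A minimal sketch of that mapping, assuming a simple lowercase-and-hyphenate rule; `dimension_slug` is a hypothetical helper, and other rows in this run show identifiers such as `d_merge_01` or `d_init_01` that this rule does not produce.

```python
import re


def dimension_slug(name: str) -> str:
    """Lowercase the name and collapse runs of non-alphanumerics into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


slug_ai = dimension_slug("Programming Languages for AI Workflows")
slug_cli = dimension_slug("Automation Scripting and CLI")
print(slug_ai)
print(slug_cli)
```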
Skill enrichment (orchestrator / LLM)
PySpark appears in many data engineering and analytics job descriptions, especially for Spark-based ETL and ML pipelines; it remains a standard skill alongside Databricks and AWS EMR.
Apache Software Foundation ·apache_2 ·since 2010 (0.98)
PySpark is a specific Python API for Apache Spark and is usually named distinctly in JDs. It is unlikely to be reasonably confused with another catalog skill in typical job descriptions.
Not versioned
Library ·data_processing_library confidence 0.93
PySpark is best classified as a Library because it is a Python package imported and used from application code, rather than a hosted environment or a framework you build inside.
- Category: Library
- Sub-category: data_processing_library
- Skill nature: LIBRARY
- Volatility: STABLE
- Typical lifespan: EVERGREEN
- Version strategy: NOT_APPLICABLE
Dimensions (API 2 worklist)
- Analytical Programming and Notebook Languages · Proposed / LLM · Proposed / LLM dimension (no DB id yet)
- Version Control Systems · Catalog dimension db id 365 · Library dimension (catalog)
Locked dimensions (v3 placement)
- Analytical Programming and Notebook Languages · Pipeline tentative id
  Languages and notebook/script-based coding used to clean, transform, analyze, and prototype data workflows and models. Includes Python, pandas, SQL, PySpark, notebook scripting, dataframe manipulation, exploratory analysis, ETL/data transformation logic, and other reproducible analytical code.
- Distributed Data Processing · Pipeline tentative id
  Covers writing and optimizing distributed batch or streaming data transformations on large datasets. PySpark belongs here because it is a Spark-based API used to express parallel data processing jobs at scale.
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| Analytical Programming and Notebook Languages (`d_merge_01`) | ✓ | — | New skill saved · Existing dimension (reconciliation merge) · Role↔dimension skipped (dimension not under chosen role) |
| Version Control Systems (`d_init_01`) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
Aliases — from this run (catalog unavailable)
- SQL (CANONICAL)
Skill profile (library / DB)
- Skill nature: LANGUAGE
- Volatility: STABLE
- Typical lifespan: EVERGREEN
- Category id: 5
- Sub-category id: 55
- Extractable: True
- Also category: False
Dimensions (API 2 worklist)
- Relational Data Modeling · Catalog dimension db id 71 · Library dimension (catalog) · Roles linked in library: Backend Engineer, Data Engineer
- Version Control Systems · Catalog dimension db id 365 · Library dimension (catalog)
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| Relational Data Modeling (`relational-data-modeling`) | ✓ | ✓ | Existing dimension (library) · Role↔dimension saved |
| Version Control Systems (`d_init_01`) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
Aliases — catalog
- effects (CANONICAL) primary
Context tags (catalog)
Stored enrichment (catalog DB)
- Category: Concept
- Sub-category: State Side Effect Concept
- Confidence: 0.74
- Version strategy: NOT_APPLICABLE
Maturity reasoning: Effects are increasingly listed in modern frontend/state-management JDs and docs (e.g., React/Redux side-effect handling, RxJS, Effector), but there is no single universal standard or dominant hiring staple yet.
Skill profile (library / DB)
- Skill nature: TOOL
- Volatility: EMERGING
- Typical lifespan: EVERGREEN
- Category id: 11
- Sub-category id: 2151
- Extractable: True
- Also category: False
Dimensions (API 2 worklist)
- Model Serving Deployment and Runtime Packaging · Catalog dimension db id 52 · Library dimension (catalog) · Roles linked in library: MLOps Engineer, Machine Learning Engineer
- Project Delivery and Coordination · Catalog dimension db id 366 · Library dimension (catalog)
- Version Control Systems · Catalog dimension db id 365 · Library dimension (catalog)
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| Model Serving Deployment and Runtime Packaging (`model-serving-deployment-and-runtime-packaging`) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
| Project Delivery and Coordination (`d_init_02`) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
| Version Control Systems (`d_init_01`) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
Aliases — catalog
- Forcepoint (CANONICAL) primary
Context tags (catalog)
Stored enrichment (catalog DB)
- Category: Platform
- Sub-category: Data Security Platform
- Vendor: Forcepoint
- License: proprietary
- Year introduced: 2016
- Confidence: 0.95
- Version strategy: NOT_APPLICABLE
Maturity reasoning: Forcepoint appears in some security/data-loss-prevention job postings, but JD volume is far below mainstream platforms like Microsoft Purview or Palo Alto; it’s a specialized enterprise tool rather than a broad hiring staple.
Skill profile (library / DB)
- Skill nature: PLATFORM
- Volatility: STABLE
- Typical lifespan: EVERGREEN
- Category id: 13
- Sub-category id: 326
- Extractable: True
- Also category: False
Dimensions (API 2 worklist)
- Cloud ML Platform Operations · Catalog dimension db id 65 · Library dimension (catalog) · Roles linked in library: MLOps Engineer
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| Cloud ML Platform Operations (`cloud-ml-platform-operations`) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
Aliases — catalog
- Sign in with Apple (CANONICAL) primary
Context tags (catalog)
Stored enrichment (catalog DB)
- Category: Service
- Sub-category: Identity Service
- Vendor: Apple
- License: proprietary
- Year introduced: 2019
- Confidence: 0.90
- Version strategy: NOT_APPLICABLE
Maturity reasoning: Commonly listed in mobile/web auth JDs for iOS apps and Apple ecosystem integrations; Apple’s official docs and App Store requirements keep it a standard identity option rather than a niche add-on.
Skill profile (library / DB)
- Skill nature: CLOUD_SERVICE
- Volatility: STABLE
- Typical lifespan: EVERGREEN
- Category id: 14
- Sub-category id: 385
- Extractable: True
- Also category: False
Dimensions (API 2 worklist)
- Cloud Data Platform Services · Catalog dimension db id 81 · Library dimension (catalog) · Roles linked in library: Data Engineer
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| Cloud Data Platform Services (`cloud-data-platform-services`) | ✓ | ✓ | Existing dimension (library) · Role↔dimension saved |
Aliases — catalog
- data classification (CANONICAL) primary
Context tags (catalog)
Stored enrichment (catalog DB)
- Category: Methodology
- Sub-category: Data Governance Methodology
- Confidence: 0.88
- Version strategy: NOT_APPLICABLE
Maturity reasoning: Common in security/compliance JDs and vendor docs (e.g., Microsoft Purview, AWS Macie) as a core data-governance control for labeling and handling sensitive data.
Skill profile (library / DB)
- Skill nature: PLATFORM
- Volatility: STABLE
- Typical lifespan: EVERGREEN
- Category id: 13
- Sub-category id: 323
- Extractable: True
- Also category: False
Dimensions (API 2 worklist)
- Cloud ML Platform Operations · Catalog dimension db id 65 · Library dimension (catalog) · Roles linked in library: MLOps Engineer
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| Cloud ML Platform Operations (`cloud-ml-platform-operations`) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
Skill enrichment (orchestrator / LLM)
Notebook environments (e.g., Jupyter) appear in many data science and ML job descriptions and are a standard workflow in major cloud vendors’ managed notebook offerings.
Project Jupyter ·bsd ·since 2014 (0.93)
Could be confused with: jupyter_notebook, colab
“Notebooks” is a generic term and in JDs could mean Jupyter notebooks or Google Colab, both common catalog skills. The standalone name is too broad to be unambiguous.
Not versioned
Tool ·notebook_environment confidence 0.90
Notebooks are software you operate to write and run analyses interactively, so by the Tool vs Framework rule they are best classified as a tool rather than a framework or platform.
- Category: Tool
- Sub-category: notebook_environment
- Skill nature: TOOL
- Volatility: STABLE
- Typical lifespan: EVERGREEN
- Version strategy: NOT_APPLICABLE
Dimensions (API 2 worklist)
- Analytical Programming and Notebook-Based Data Analysis · Proposed / LLM · Proposed / LLM dimension (no DB id yet)
Locked dimensions (v3 placement)
- Analytical Programming and Notebook-Based Data Analysis · Pipeline tentative id
  Languages and notebook-friendly coding used to clean, transform, analyze, and prototype data and model workflows. This includes Python, R, SQL, and Scala used in notebooks or scripts for data wrangling, exploratory data analysis, statistical logic, feature engineering, and reproducible prototyping. It excludes production orchestration and scheduling, dashboard/report authoring, model deployment packaging, database administration, and UI development.
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| Analytical Programming and Notebook-Based Data Analysis (`d_merge_01`) | ✓ | — | New skill saved · Existing dimension (reconciliation merge) · Role↔dimension skipped (dimension not under chosen role) |
Aliases — from this run (catalog unavailable)
- CI/CD (CANONICAL)
Skill profile (library / DB)
- Skill nature: METHODOLOGY
- Volatility: STABLE
- Typical lifespan: EVERGREEN
- Category id: 7
- Sub-category id: 2102
- Extractable: True
- Also category: False
Dimensions (API 2 worklist)
- Version Control Systems · Catalog dimension db id 365 · Library dimension (catalog)
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| Version Control Systems (`d_init_01`) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
Skill enrichment (orchestrator / LLM)
Common in data/platform job descriptions across industries; JD volume remains high for Hadoop/Spark/streaming stacks, and cloud vendors market managed big-data services as standard offerings.
(0.99)
“Big Data” is a well-established domain term with a specific meaning in JDs. It is unlikely to be reasonably confused with another catalog skill in typical extraction contexts.
Not versioned
Domain ·data_intensive_computing confidence 0.93
Big Data is a vertical/problem-space body of knowledge rather than a tool, framework, or architecture, so it fits the Domain rule.
- Category: Domain
- Sub-category: data_intensive_computing
- Skill nature: CONCEPT
- Volatility: STABLE
- Typical lifespan: EVERGREEN
- Version strategy: NOT_APPLICABLE
Dimensions (API 2 worklist)
- Version Control Systems · Catalog dimension db id 365 · Library dimension (catalog)
- Messaging and Event Streaming · Catalog dimension db id 146 · Library dimension (catalog) · Roles linked in library: Backend Engineer
- Messaging and Event Streaming · Catalog dimension db id 146 · Library dimension (catalog) · Roles linked in library: Backend Engineer
Locked dimensions (v3 placement)
- Big Data Processing · Pipeline tentative id
  Large-scale data processing systems and techniques for storing, transforming, and analyzing high-volume, high-velocity datasets. Big Data belongs here because the term usually refers to the distributed data engineering stack rather than a single tool.
- Messaging and Event Streaming · Reuses catalog slug
  Asynchronous data movement and event-driven pipelines used to feed large-scale analytics systems. Big Data often overlaps with this area when the skill is used in streaming ingestion or pipeline orchestration.
- Messaging and Event Streaming · Reuses catalog slug
  Asynchronous communication patterns and systems for decoupled service interaction and background processing. This is a coherent backend cluster because many server-side workflows depend on queues, topics, and event streams.
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| Version Control Systems (`d_init_01`) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
| Messaging and Event Streaming (`messaging-and-event-streaming`) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
Skill enrichment (orchestrator / LLM)
Data modeling appears in many data engineer, DBA, and analytics JDs, and is a standard prerequisite alongside SQL and database design rather than a niche specialty.
(0.99)
“Data Modeling” is a standard, well-scoped concept in JDs and is unlikely to be confused with a different catalog skill in typical usage.
Not versioned
Concept ·data_modeling confidence 0.96
Data Modeling is fundamentally a knowledge unit about how to structure and relate data, so by the Concept vs Methodology rule it is a Concept rather than a process or tool.
- Category: Concept
- Sub-category: data_modeling
- Skill nature: CONCEPT
- Volatility: STABLE
- Typical lifespan: EVERGREEN
- Version strategy: NOT_APPLICABLE
Dimensions (API 2 worklist)
- Version Control Systems · Catalog dimension db id 365 · Library dimension (catalog)
Locked dimensions (v3 placement)
- Data Modeling · Pipeline tentative id
  Designing the logical and physical structure of data so it is consistent, queryable, and fit for downstream analytics or operational use. This belongs here because the skill centers on defining entities, relationships, keys, and schemas rather than storage tuning or pipeline execution.
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| Version Control Systems (`d_init_01`) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
Skill enrichment (orchestrator / LLM)
Data pipelines are a common requirement in cloud/data engineering JDs, with frequent mentions alongside Airflow, Spark, and ETL/ELT stacks; broad hiring demand signals mainstream adoption.
(0.99)
“Data Pipelines” is a fairly specific architecture term and is unlikely to be mistaken for a different catalog skill in a typical JD.
Not versioned
Architecture ·data_pipeline_architecture confidence 0.90
By the Architecture vs Concept rule, data pipelines describe a system-shape pattern for moving and transforming data across stages rather than a single knowledge unit or tool.
- Category: Architecture
- Sub-category: data_pipeline_architecture
- Skill nature: PATTERN
- Volatility: STABLE
- Typical lifespan: EVERGREEN
- Version strategy: NOT_APPLICABLE
Dimensions (API 2 worklist)
- Inference Data Pipelines for Serving and Batch Scoring · Proposed / LLM · Proposed / LLM dimension (no DB id yet)
- Version Control Systems · Catalog dimension db id 365 · Library dimension (catalog)
Locked dimensions (v3 placement)
- Inference Data Pipelines for Serving and Batch Scoring · Pipeline tentative id
  Operational data movement that prepares and delivers timely, reliable data to production inference systems. Includes batch scoring inputs, feature refresh jobs, inference-time preprocessing, scheduled extracts, data validation for serving, and online/offline feature synchronization. Excludes training dataset curation, model training workflows, experimentation-focused feature engineering, model evaluation, and serving infrastructure/routing.
- Data Pipeline Orchestration · Pipeline tentative id
  Designing, scheduling, and coordinating end-to-end data movement and transformation jobs. This is the best fit when Data Pipelines refers to building reliable multi-step workflows across sources, transforms, and sinks.
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| Inference Data Pipelines for Serving and Batch Scoring (`d_merge_01`) | ✓ | — | New skill saved · Existing dimension (reconciliation merge) · Role↔dimension skipped (dimension not under chosen role) |
| Version Control Systems (`d_init_01`) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
Skill enrichment (orchestrator / LLM)
Commonly appears in data/platform job descriptions and cloud vendor docs as a core pipeline capability; often paired with ETL/ELT, Kafka, and Airflow rather than treated as a niche specialty.
(0.99)
“Data Ingestion” is a standard, specific concept in data engineering and is unlikely to be mistaken for a different catalog skill in typical job descriptions.
Not versioned
Concept ·data_ingestion confidence 0.93
Data Ingestion is fundamentally a named knowledge unit about bringing data into systems, so it fits the Concept category rather than a tool, platform, or methodology.
- Category: Concept
- Sub-category: data_ingestion
- Skill nature: CONCEPT
- Volatility: STABLE
- Typical lifespan: EVERGREEN
- Version strategy: NOT_APPLICABLE
Dimensions (API 2 worklist)
- Asynchronous Messaging and Event Streaming · Proposed / LLM · Proposed / LLM dimension (no DB id yet)
- Version Control Systems · Catalog dimension db id 365 · Library dimension (catalog)
Locked dimensions (v3 placement)
- Asynchronous Messaging and Event Streaming · Pipeline tentative id
  Covers asynchronous communication and data movement through queues, topics, streams, event buses, and pub/sub systems for decoupled processing, background jobs, and event-driven integration. Includes continuous or event-driven data ingestion and change data capture pipelines, but excludes batch ETL orchestration, warehouse modeling, query optimization, model training data prep, and direct application API calls.
- Batch Data Ingestion Pipelines · Pipeline tentative id
  Covers scheduled or bulk loading of data from files, databases, and external systems into analytical or operational stores. Data Ingestion fits here when the emphasis is on landing, validating, and loading datasets rather than streaming transport.
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| Asynchronous Messaging and Event Streaming (`d_merge_01`) | ✓ | — | New skill saved · Existing dimension (reconciliation merge) · Role↔dimension skipped (dimension not under chosen role) |
| Version Control Systems (`d_init_01`) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
Skill enrichment (orchestrator / LLM)
Common in JDs for Kafka/Flink/Spark Streaming and cloud services like Kinesis/Pub/Sub; broad market adoption for real-time event pipelines.
(0.99)
The term is fairly specific in JDs and usually refers to event/data stream processing architecture, not a different catalog skill. It is unlikely to be confused with another skill name in typical job descriptions.
Not versioned
Architecture ·stream_processing_architecture confidence 0.90
Stream Processing is fundamentally a system-shape for handling continuous event flows, so by the Architecture vs Concept rule it fits Architecture rather than a tool or methodology.
- Category: Architecture
- Sub-category: stream_processing_architecture
- Skill nature: PATTERN
- Volatility: STABLE
- Typical lifespan: EVERGREEN
- Version strategy: NOT_APPLICABLE
Dimensions (API 2 worklist)
- Messaging and Event Streaming · Catalog dimension db id 146 · Library dimension (catalog) · Roles linked in library: Backend Engineer
Locked dimensions (v3 placement)
- Stream Processing · Reuses catalog slug
  Processing continuous event data as it arrives, using stream processors, windows, and stateful operators to transform and route records in near real time. This belongs here because stream processing is the core execution model for event-driven pipelines and low-latency data movement.
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| Messaging and Event Streaming (`messaging-and-event-streaming`) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
Skill enrichment (orchestrator / LLM)
Queueing theory is a standard CS/ops concept and appears in many systems, SRE, and performance-engineering job descriptions; it is not a sunset technology and remains a common interview/topic area.
(0.99)
Queueing is a fairly specific operations-research concept; in typical JDs it is unlikely to be mistaken for a different catalog skill.
Not versioned
Concept ·queueing_theory confidence 0.93
Queueing is fundamentally a knowledge unit about how waiting lines and work distribution behave, so by the Concept vs Methodology rule it is a Concept rather than an Architecture or Tool.
- Category: Concept
- Sub-category: queueing_theory
- Skill nature: CONCEPT
- Volatility: STABLE
- Typical lifespan: EVERGREEN
- Version strategy: NOT_APPLICABLE
Dimensions (API 2 worklist)
- Messaging, Queueing, and Event Streaming · Proposed / LLM · Proposed / LLM dimension (no DB id yet)
Locked dimensions (v3 placement)
- Messaging, Queueing, and Event Streaming · Pipeline tentative id
  Asynchronous communication patterns and systems that decouple producers and consumers, buffer and route work items, and support background processing and service-to-service integration. Includes queueing, message queues, pub/sub, brokers, topics, consumer groups, producers/consumers, dead-letter queues, retry handling, backpressure, and event streaming platforms such as Kafka, RabbitMQ, SQS, and Azure Service Bus.
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| Messaging, Queueing, and Event Streaming (`d_merge_01`) | ✓ | — | New skill saved · Existing dimension (reconciliation merge) · Role↔dimension skipped (dimension not under chosen role) |
Skill enrichment (orchestrator / LLM)
APIs are a hiring-pipeline staple across backend, mobile, and platform JDs; REST/GraphQL/API design appears in large-volume job postings and cloud vendor docs.
(0.99)
“APIs” is a standard, widely used term in JDs and usually refers unambiguously to application programming interfaces; it is not typically confused with a distinct catalog skill.
Not versioned
Protocol ·application_programming_interfaces confidence 0.91
APIs are a communication interface standard between systems, so by the Protocol vs Standard rule they fit best as a Protocol rather than a tool or platform.
- Category: Protocol
- Sub-category: application_programming_interfaces
- Skill nature: PROTOCOL
- Volatility: STABLE
- Typical lifespan: EVERGREEN
- Version strategy: NOT_APPLICABLE
Dimensions (API 2 worklist)
- API Integration, Request Orchestration, and Data Fetching · Proposed / LLM · Proposed / LLM dimension (no DB id yet)
- Cloud Service Integration Patterns · Catalog dimension db id 188 · Library dimension (catalog) · Roles linked in library: Cloud Architect
- Version Control Systems · Catalog dimension db id 365 · Library dimension (catalog)
- Cloud Service Integration Patterns · Catalog dimension db id 188 · Library dimension (catalog) · Roles linked in library: Cloud Architect
Locked dimensions (v3 placement)
- API Integration, Request Orchestration, and Data Fetching · Pipeline tentative id
  Connecting applications to internal or external services through request/response APIs. This includes consuming REST and GraphQL endpoints, orchestrating requests, handling payloads and response parsing, pagination, retries, error handling, and shaping remote data for downstream or UI consumption.
- Cloud Service Integration Patterns · Reuses catalog slug
  How services connect across boundaries using APIs, events, and shared interfaces. The target skill belongs here when APIs are treated as an integration mechanism between cloud services, pipelines, or platforms.
- API Design and Specification · Pipeline tentative id
  Defining API contracts, resource models, and request/response semantics for services. This dimension fits the target skill when APIs refers to designing or documenting interfaces rather than merely consuming them.
- Cloud Service Integration Patterns · Reuses catalog slug
  Covers how cloud services and workloads connect through APIs, events, shared services, and integration boundaries. This cluster is coherent because architects must define interaction patterns that preserve decoupling, security, and operability.
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| API Integration, Request Orchestration, and Data Fetching (d_merge_01) | ✓ | — | New skill saved · Existing dimension (reconciliation merge) · Role↔dimension skipped (dimension not under chosen role) |
| Cloud Service Integration Patterns (cloud-service-integration-patterns) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
| Version Control Systems (d_init_01) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
Skill enrichment (orchestrator / LLM)
Testability is a common requirement in software engineering JDs and interview rubrics, often paired with unit/integration testing, CI, and TDD; it’s a standard quality attribute rather than a niche tool.
(0.99)
“Testability” is a specific software engineering concept and is unlikely to be mistaken for a different catalog skill in typical job descriptions.
Not versioned
Concept · software_testability_concept · confidence 0.97
By the Concept vs Methodology rule, testability is a named knowledge unit about how easily software can be tested, not a process or tool.
- Category: Concept
- Sub-category: software_testability_concept
- Skill nature: CONCEPT
- Volatility: STABLE
- Typical lifespan: EVERGREEN
- Version strategy: NOT_APPLICABLE
Dimensions (API 2 worklist)
- Testing and Validation Practices · Catalog dimension db id 221 · Library dimension (catalog) · Roles linked in library: ServiceNOW Developer
- Testing and Validation Practices · Catalog dimension db id 221 · Library dimension (catalog) · Roles linked in library: ServiceNOW Developer
Locked dimensions (v3 placement)
- Testing and Validation Practices (Reuses catalog slug): Practices for verifying that software changes behave correctly before release, including test design, regression checks, and validation workflows. Testability belongs here because it describes how easily a system can be exercised and verified by tests.
- Testing and Validation Practices (Reuses catalog slug): Validating platform changes before release, including functional checks and regression verification. This cluster is coherent because ServiceNow developers must confirm workflows, scripts, and integrations behave as intended.
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| Testing and Validation Practices (testing-and-validation-practices) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
Aliases — catalog
- FormBuilder (CANONICAL) primary
Context tags (catalog)
Stored enrichment (catalog DB)
- Category: Library
- Sub-category: Forms Helper Library
- Vendor: null
- License: unknown
- Confidence: 0.88
- Version strategy: NOT_APPLICABLE
Maturity reasoning: FormBuilder appears in relatively low JD volume compared with mainstream form stacks; market usage is mostly in legacy/admin app codebases rather than broad hiring pipelines.
Skill profile (library / DB)
- Skill nature: METHODOLOGY
- Volatility: STABLE
- Typical lifespan: EVERGREEN
- Category id: 7
- Sub-category id: 2156
- Extractable: True
- Also category: False
Dimensions (API 2 worklist)
- Inference Data Pipelines · Catalog dimension db id 59 · Library dimension (catalog) · Roles linked in library: MLOps Engineer
- Model Serving Deployment and Runtime Packaging · Catalog dimension db id 52 · Library dimension (catalog) · Roles linked in library: MLOps Engineer, Machine Learning Engineer
API 3 link attempts (this skill)
| Dimension | Skill↔dim | Role↔dim | Outcome |
|---|---|---|---|
| Inference Data Pipelines (inference-data-pipelines) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
| Model Serving Deployment and Runtime Packaging (model-serving-deployment-and-runtime-packaging) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) |
All API 3 persistence rows
Same grid as the skill-extractor “Persistence items” table: one row per (skill × dimension) work item.
| Skill | Tag | Dimension | Skill↔dim | Role↔dim | Outcome | Notes |
|---|---|---|---|---|---|---|
| Azure | in_db | Cloud Platform Operations (cloud-platform-operations) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Azure | in_db | Cloud Security Platforms (cloud-security-platforms) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Python | in_db | Analytical Programming Languages (analytical-programming-languages) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Python | in_db | Automation Scripting and CLI (automation-scripting-and-cli) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Python | in_db | Automation and Scripting for Operations (automation-and-scripting-for-operations) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Python | in_db | Network Automation and Scripting (network-automation-and-scripting) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Python | in_db | Programming Languages for AI Workflows (programming-languages-for-ai-workflows) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Python | in_db | Programming Languages for Backend Systems (programming-languages-for-backend-systems) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Python | in_db | Programming Languages for Data Work (programming-languages-for-data-work) | ✓ | ✓ | Existing dimension (library) · Role↔dimension saved | |
| Python | in_db | Programming Languages for ML Systems (programming-languages-for-ml-systems) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Python | in_db | Programming Languages for Security Work (programming-languages-for-security-work) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Python | in_db | Programming Languages for Test Automation (programming-languages-for-test-automation) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Python | in_db | Security Automation and Scripting (security-automation-and-scripting) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| SQL | in_db | Relational Data Modeling (relational-data-modeling) | ✓ | ✓ | Existing dimension (library) · Role↔dimension saved | |
| SQL | in_db | Version Control Systems (d_init_01) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| MLflow | in_db | Model Serving Deployment and Runtime Packaging (model-serving-deployment-and-runtime-packaging) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| MLflow | in_db | Project Delivery and Coordination (d_init_02) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| MLflow | in_db | Version Control Systems (d_init_01) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Azure Machine Learning | in_db | Cloud ML Platform Operations (cloud-ml-platform-operations) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Azure Data Factory | in_db | Cloud Data Platform Services (cloud-data-platform-services) | ✓ | ✓ | Existing dimension (library) · Role↔dimension saved | |
| Databricks | in_db | Cloud ML Platform Operations (cloud-ml-platform-operations) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| CI/CD | in_db | Version Control Systems (d_init_01) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| MLOps | in_db | Inference Data Pipelines (inference-data-pipelines) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| MLOps | in_db | Model Serving Deployment and Runtime Packaging (model-serving-deployment-and-runtime-packaging) | ✓ | — | Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| PySpark | in_db | Analytical Programming and Notebook Languages (d_merge_01) | ✓ | — | New skill saved · Existing dimension (reconciliation merge) · Role↔dimension skipped (dimension not under chosen role) | |
| PySpark | in_db | Version Control Systems (d_init_01) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Notebooks | in_db | Analytical Programming and Notebook-Based Data Analysis (d_merge_01) | ✓ | — | New skill saved · Existing dimension (reconciliation merge) · Role↔dimension skipped (dimension not under chosen role) | |
| Big Data | in_db | Version Control Systems (d_init_01) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Big Data | in_db | Messaging and Event Streaming (messaging-and-event-streaming) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Data Modeling | in_db | Version Control Systems (d_init_01) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Data Pipelines | in_db | Inference Data Pipelines for Serving and Batch Scoring (d_merge_01) | ✓ | — | New skill saved · Existing dimension (reconciliation merge) · Role↔dimension skipped (dimension not under chosen role) | |
| Data Pipelines | in_db | Version Control Systems (d_init_01) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Data Ingestion | in_db | Asynchronous Messaging and Event Streaming (d_merge_01) | ✓ | — | New skill saved · Existing dimension (reconciliation merge) · Role↔dimension skipped (dimension not under chosen role) | |
| Data Ingestion | in_db | Version Control Systems (d_init_01) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Stream Processing | in_db | Messaging and Event Streaming (messaging-and-event-streaming) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Queueing | in_db | Messaging, Queueing, and Event Streaming (d_merge_01) | ✓ | — | New skill saved · Existing dimension (reconciliation merge) · Role↔dimension skipped (dimension not under chosen role) | |
| APIs | in_db | API Integration, Request Orchestration, and Data Fetching (d_merge_01) | ✓ | — | New skill saved · Existing dimension (reconciliation merge) · Role↔dimension skipped (dimension not under chosen role) | |
| APIs | in_db | Cloud Service Integration Patterns (cloud-service-integration-patterns) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| APIs | in_db | Version Control Systems (d_init_01) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
| Testability | in_db | Testing and Validation Practices (testing-and-validation-practices) | ✓ | — | New skill saved · Existing dimension (library) · Role↔dimension skipped (dimension not under chosen role) | |
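The persistence grid can be read as a flattened (skill × dimension) cross product with one outcome per pair. A minimal Python sketch of that expansion follows. It assumes, based only on the Outcome column and not on the pipeline's actual code, that the Role↔dim link is saved only when the dimension sits under the chosen role. `WorkItem`, `build_rows`, and the sample input are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkItem:
    skill: str
    dimension: str
    skill_dim_saved: bool
    role_dim_saved: bool


def build_rows(skill_dims, role_dims):
    """Expand {skill: [dimension slug, ...]} into one WorkItem per pair.

    role_dims is the set of dimension slugs under the chosen role.
    Assumed rule: Role<->dim is saved only for those dimensions.
    """
    rows = []
    for skill, dims in skill_dims.items():
        for dim in dims:
            rows.append(WorkItem(skill, dim, True, dim in role_dims))
    return rows


# Sample input copied from three rows of the grid (Data Engineer as chosen role).
skill_dims = {
    "Python": [
        "analytical-programming-languages",
        "programming-languages-for-data-work",
    ],
    "SQL": ["relational-data-modeling"],
}
role_dims = {"programming-languages-for-data-work", "relational-data-modeling"}

rows = build_rows(skill_dims, role_dims)
for row in rows:
    outcome = "saved" if row.role_dim_saved else "skipped"
    print(f"{row.skill} | {row.dimension} | Role<->dim {outcome}")
```

Under this reading, only dimensions already attached to the Data Engineer role produce a Role↔dim link, which is consistent with the three "saved" rows in the grid.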
Library artifacts (this run)
| Kind | Detail | DB id |
|---|---|---|
| canonical_skill_added | PySpark | 2684 |
| canonical_skill_added | Notebooks | 2685 |
| canonical_skill_added | Big Data | 2686 |
| canonical_skill_added | Data Modeling | 2687 |
| canonical_skill_added | Data Pipelines | 2688 |
| canonical_skill_added | Data Ingestion | 2689 |
| canonical_skill_added | Stream Processing | 2690 |
| canonical_skill_added | Queueing | 2691 |
| canonical_skill_added | APIs | 2692 |
| canonical_skill_added | Testability | 2693 |
| dimension_skill_link | PySpark ↔ Analytical Programming and Notebook Languages | 82 |
| dimension_skill_link | PySpark ↔ Version Control Systems | 365 |
| dimension_skill_link | Notebooks ↔ Analytical Programming and Notebook-Based Data Analysis | 82 |
| dimension_skill_link | Big Data ↔ Version Control Systems | 365 |
| dimension_skill_link | Big Data ↔ Messaging and Event Streaming | 146 |
| dimension_skill_link | Data Modeling ↔ Version Control Systems | 365 |
| dimension_skill_link | Data Pipelines ↔ Inference Data Pipelines for Serving and Batch Scoring | 59 |
| dimension_skill_link | Data Pipelines ↔ Version Control Systems | 365 |
| dimension_skill_link | Data Ingestion ↔ Asynchronous Messaging and Event Streaming | 146 |
| dimension_skill_link | Data Ingestion ↔ Version Control Systems | 365 |
| dimension_skill_link | Stream Processing ↔ Messaging and Event Streaming | 146 |
| dimension_skill_link | Queueing ↔ Messaging, Queueing, and Event Streaming | 146 |
| dimension_skill_link | APIs ↔ API Integration, Request Orchestration, and Data Fetching | 9 |
| dimension_skill_link | APIs ↔ Cloud Service Integration Patterns | 188 |
| dimension_skill_link | APIs ↔ Version Control Systems | 365 |
| dimension_skill_link | Testability ↔ Testing and Validation Practices | 221 |
nano JD Parser — gpt-4.1-nano
{
"JD_type": "pass",
"about_company": null,
"certifications": [],
"company_name": "The corporation",
"ctc": null,
"domain": {
"primary": {
"aliases": [],
"domain": "Other"
},
"secondary": null
},
"education": [
{
"level": "Bachelor\u0027s",
"qualification": "BTECH/BE/BSC - Computer Science (or equivalent)",
"raw": "Bachelors degree or higher in Computer Science or equivalent degree",
"requirement": "required"
}
],
"experience": {
"max": 10,
"min": 3,
"raw": "3 to 10 years related working experience"
},
"job_locations": [],
"role": "big data engineers",
"role_archetype": "Data",
"roles_and_responsibilities": [
{
"bullet_count": 0,
"heading": "Job Responsibilities",
"heading_was_present": true,
"source_marker": {
"first_5_words": "Big data design and analysis",
"last_5_words": "data science projects"
},
"text": "Big data design and analysis data modeling development deployment and CICD operations of big data pipelines\n\nCollaborate with a team of data engineers data scientists and business subject matter experts to process data and prepare data sources\n\nMentor other data engineers to develop a world class data engineering team\n\nIngest process and model data from heterogeneous data sources to support data science projects",
"word_count": 66
},
{
"bullet_count": 0,
"heading": "Basic Qualifications",
"heading_was_present": true,
"source_marker": {
"first_5_words": "Bachelors degree or higher in",
"last_5_words": "structuring code for testability"
},
"text": "Bachelors degree or higher in Computer Science or equivalent degree and 3 to 10 years related working experience\n\nIn depth experience with a big data cloud platform preferably Azure\n\nStrong grasp of programming languages such as Python PySpark or equivalent and willingness to learn new ones\n\nExperience writing database heavy services or APIs\n\nExperience building and optimizing data pipelines architectures and data sets\n\nWorking knowledge of queueing stream processing and highly scalable data stores\n\nExperience working with and supporting cross functional teams\n\nStrong understanding of structuring code for testability",
"word_count": 104
},
{
"bullet_count": 0,
"heading": "Preferred Qualifications",
"heading_was_present": true,
"source_marker": {
"first_5_words": "Professional experience implementing and",
"last_5_words": "with cross functional teams"
},
"text": "Professional experience implementing and maintaining MLOps pipelines in MLflow or AzureML\n\nProfessional experience implementing data ingestion pipelines using Data Factory\n\nProfessional experience with Databricks and coding with notebooks\n\nProfessional experience processing and manipulating data using SQL and Python\n\nProfessional experience with user training customer support and coordination with cross functional teams",
"word_count": 66
}
],
"urls": []
}
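The `source_marker` fields appear to anchor each section to the raw JD text. A minimal sketch of a consistency check follows, assuming the markers are meant to match the start and end of `text`. The `marker_matches` helper is hypothetical and not part of the parser. Because `last_5_words` holds fewer than five words in this output, the check compares it as a suffix instead of counting words.

```python
def marker_matches(section: dict) -> bool:
    """Return True if the section text starts and ends with its source markers."""
    # Collapse newlines and repeated spaces so paragraph breaks do not matter.
    text = " ".join(section["text"].split())
    marker = section["source_marker"]
    return text.startswith(marker["first_5_words"]) and text.endswith(
        marker["last_5_words"]
    )


# Abbreviated copy of the "Job Responsibilities" section from the parser output.
section = {
    "heading": "Job Responsibilities",
    "source_marker": {
        "first_5_words": "Big data design and analysis",
        "last_5_words": "data science projects",
    },
    "text": (
        "Big data design and analysis data modeling development deployment "
        "and CICD operations of big data pipelines\n\n"
        "Ingest process and model data from heterogeneous data sources "
        "to support data science projects"
    ),
}

print(marker_matches(section))
```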
API 1 — extract-from-jd
{
"final_skills": [
{
"is_primary": true,
"skill_name": "Azure"
},
{
"is_primary": true,
"skill_name": "Python"
},
{
"is_primary": true,
"skill_name": "PySpark"
},
{
"is_primary": true,
"skill_name": "SQL"
},
{
"is_primary": true,
"skill_name": "MLflow"
},
{
"is_primary": true,
"skill_name": "Azure Machine Learning"
},
{
"is_primary": true,
"skill_name": "Azure Data Factory"
},
{
"is_primary": true,
"skill_name": "Databricks"
},
{
"is_primary": false,
"skill_name": "Notebooks"
},
{
"is_primary": true,
"skill_name": "CI/CD"
},
{
"is_primary": true,
"skill_name": "Big Data"
},
{
"is_primary": true,
"skill_name": "Data Modeling"
},
{
"is_primary": true,
"skill_name": "Data Pipelines"
},
{
"is_primary": true,
"skill_name": "Data Ingestion"
},
{
"is_primary": true,
"skill_name": "Stream Processing"
},
{
"is_primary": false,
"skill_name": "Queueing"
},
{
"is_primary": true,
"skill_name": "APIs"
},
{
"is_primary": false,
"skill_name": "Testability"
},
{
"is_primary": true,
"skill_name": "MLOps"
}
],
"jd_role": {
"display_name": "big data engineers",
"rationale": null,
"role_archetype": "Data",
"slug": ""
},
"nano_parsed": {
"JD_type": "pass",
"about_company": null,
"certifications": [],
"company_name": "The corporation",
"ctc": null,
"domain": {
"primary": {
"aliases": [],
"domain": "Other"
},
"secondary": null
},
"education": [
{
"level": "Bachelor\u0027s",
"qualification": "BTECH/BE/BSC - Computer Science (or equivalent)",
"raw": "Bachelors degree or higher in Computer Science or equivalent degree",
"requirement": "required"
}
],
"experience": {
"max": 10,
"min": 3,
"raw": "3 to 10 years related working experience"
},
"job_locations": [],
"role": "big data engineers",
"role_archetype": "Data",
"roles_and_responsibilities": [
{
"bullet_count": 0,
"heading": "Job Responsibilities",
"heading_was_present": true,
"source_marker": {
"first_5_words": "Big data design and analysis",
"last_5_words": "data science projects"
},
"text": "Big data design and analysis data modeling development deployment and CICD operations of big data pipelines\n\nCollaborate with a team of data engineers data scientists and business subject matter experts to process data and prepare data sources\n\nMentor other data engineers to develop a world class data engineering team\n\nIngest process and model data from heterogeneous data sources to support data science projects",
"word_count": 66
},
{
"bullet_count": 0,
"heading": "Basic Qualifications",
"heading_was_present": true,
"source_marker": {
"first_5_words": "Bachelors degree or higher in",
"last_5_words": "structuring code for testability"
},
"text": "Bachelors degree or higher in Computer Science or equivalent degree and 3 to 10 years related working experience\n\nIn depth experience with a big data cloud platform preferably Azure\n\nStrong grasp of programming languages such as Python PySpark or equivalent and willingness to learn new ones\n\nExperience writing database heavy services or APIs\n\nExperience building and optimizing data pipelines architectures and data sets\n\nWorking knowledge of queueing stream processing and highly scalable data stores\n\nExperience working with and supporting cross functional teams\n\nStrong understanding of structuring code for testability",
"word_count": 104
},
{
"bullet_count": 0,
"heading": "Preferred Qualifications",
"heading_was_present": true,
"source_marker": {
"first_5_words": "Professional experience implementing and",
"last_5_words": "with cross functional teams"
},
"text": "Professional experience implementing and maintaining MLOps pipelines in MLflow or AzureML\n\nProfessional experience implementing data ingestion pipelines using Data Factory\n\nProfessional experience with Databricks and coding with notebooks\n\nProfessional experience processing and manipulating data using SQL and Python\n\nProfessional experience with user training customer support and coordination with cross functional teams",
"word_count": 66
}
],
"urls": []
},
"run_id": null
}
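One way to consume this response is to separate primary from non-primary skills using the `is_primary` flag. The sketch below assumes the response has already been parsed into a Python dict. `split_skills` is a hypothetical helper, and the sample is a four-item excerpt of `final_skills`.

```python
def split_skills(response: dict) -> tuple[list[str], list[str]]:
    """Split final_skills into (primary, secondary) name lists, preserving order."""
    primary, secondary = [], []
    for item in response.get("final_skills", []):
        target = primary if item["is_primary"] else secondary
        target.append(item["skill_name"])
    return primary, secondary


# Excerpt of the API 1 response; the full list has 19 entries.
response = {
    "final_skills": [
        {"is_primary": True, "skill_name": "Azure"},
        {"is_primary": True, "skill_name": "Python"},
        {"is_primary": False, "skill_name": "Notebooks"},
        {"is_primary": False, "skill_name": "Queueing"},
    ]
}

primary, secondary = split_skills(response)
print("primary:", primary)
print("secondary:", secondary)
```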
API 2 — extract-details
{
"alias_matches": [
{
"alias_persist_skipped_reason": "alias_text already exists for this canonical skill",
"alias_persisted": false,
"existing_alias_id": 349,
"existing_alias_text": "Azure",
"input_term": "Azure",
"matched_canonical": {
"category_id": 13,
"display_name": "Azure",
"id": 164,
"is_also_category": false,
"is_extractable": true,
"skill_nature": "PLATFORM",
"slug": "azure",
"sub_category_id": 161,
"typical_lifespan": "EVERGREEN",
"volatility": "STABLE"
},
"matched_via": "alias"
},
{
"alias_persist_skipped_reason": "alias_text already exists for this canonical skill",
"alias_persisted": false,
"existing_alias_id": 608,
"existing_alias_text": "Python",
"input_term": "Python",
"matched_canonical": {
"category_id": 5,
"display_name": "Python",
"id": 393,
"is_also_category": false,
"is_extractable": true,
"skill_nature": "LANGUAGE",
"slug": "python",
"sub_category_id": 54,
"typical_lifespan": "EVERGREEN",
"volatility": "STABLE"
},
"matched_via": "alias"
},
{
"alias_persist_skipped_reason": "alias_text already exists for this canonical skill",
"alias_persisted": false,
"existing_alias_id": 3398,
"existing_alias_text": "SQL",
"input_term": "SQL",
"matched_canonical": {
"category_id": 5,
"display_name": "SQL",
"id": 2601,
"is_also_category": false,
"is_extractable": true,
"skill_nature": "LANGUAGE",
"slug": "sql",
"sub_category_id": 55,
"typical_lifespan": "EVERGREEN",
"volatility": "STABLE"
},
"matched_via": "alias"
},
{
"alias_persist_skipped_reason": "alias_text already exists for this canonical skill",
"alias_persisted": false,
"existing_alias_id": 3593,
"existing_alias_text": "MLflow",
"input_term": "MLflow",
"matched_canonical": {
"category_id": 11,
"display_name": "MLflow",
"id": 2640,
"is_also_category": false,
"is_extractable": true,
"skill_nature": "TOOL",
"slug": "mlflow",
"sub_category_id": 2151,
"typical_lifespan": "EVERGREEN",
"volatility": "EMERGING"
},
"matched_via": "alias"
},
{
"alias_persist_skipped_reason": "alias_text already exists for this canonical skill",
"alias_persisted": false,
"existing_alias_id": 600,
"existing_alias_text": "Azure Machine Learning",
"input_term": "Azure Machine Learning",
"matched_canonical": {
"category_id": 13,
"display_name": "Azure Machine Learning",
"id": 385,
"is_also_category": false,
"is_extractable": true,
"skill_nature": "PLATFORM",
"slug": "azure-machine-learning",
"sub_category_id": 326,
"typical_lifespan": "EVERGREEN",
"volatility": "STABLE"
},
"matched_via": "alias"
},
{
"alias_persist_skipped_reason": "alias_text already exists for this canonical skill",
"alias_persisted": false,
"existing_alias_id": 731,
"existing_alias_text": "Azure Data Factory",
"input_term": "Azure Data Factory",
"matched_canonical": {
"category_id": 14,
"display_name": "Azure Data Factory",
"id": 467,
"is_also_category": false,
"is_extractable": true,
"skill_nature": "CLOUD_SERVICE",
"slug": "azure-data-factory",
"sub_category_id": 385,
"typical_lifespan": "EVERGREEN",
"volatility": "STABLE"
},
"matched_via": "alias"
},
{
"alias_persist_skipped_reason": "alias_text already exists for this canonical skill",
"alias_persisted": false,
"existing_alias_id": 601,
"existing_alias_text": "Databricks",
"input_term": "Databricks",
"matched_canonical": {
"category_id": 13,
"display_name": "Databricks",
"id": 386,
"is_also_category": false,
"is_extractable": true,
"skill_nature": "PLATFORM",
"slug": "databricks",
"sub_category_id": 323,
"typical_lifespan": "EVERGREEN",
"volatility": "STABLE"
},
"matched_via": "alias"
},
{
"alias_persist_skipped_reason": "alias_text already exists for this canonical skill",
"alias_persisted": false,
"existing_alias_id": 3376,
"existing_alias_text": "CI/CD",
"input_term": "CI/CD",
"matched_canonical": {
"category_id": 7,
"display_name": "CI/CD",
"id": 2579,
"is_also_category": false,
"is_extractable": true,
"skill_nature": "METHODOLOGY",
"slug": "ci-cd",
"sub_category_id": 2102,
"typical_lifespan": "EVERGREEN",
"volatility": "STABLE"
},
"matched_via": "alias"
},
{
"alias_persist_skipped_reason": "alias_text already exists for this canonical skill",
"alias_persisted": false,
"existing_alias_id": 3600,
"existing_alias_text": "MLOps",
"input_term": "MLOps",
"matched_canonical": {
"category_id": 7,
"display_name": "MLOps",
"id": 2643,
"is_also_category": false,
"is_extractable": true,
"skill_nature": "METHODOLOGY",
"slug": "mlops",
"sub_category_id": 2156,
"typical_lifespan": "EVERGREEN",
"volatility": "STABLE"
},
"matched_via": "alias"
}
],
"candidate_roles": [
{
"display_name": "DevOps Engineer",
"id": 1,
"rationale": null,
"role_archetype": "A DevOps Engineer enables reliable, repeatable delivery of software by designing and operating the processes that connect development and production. They focus on improving deployment flow, operational stability, and collaboration between teams through automation, standardization, and monitoring of delivery and runtime practices.",
"slug": "devops-engineer",
"source": "db"
},
{
"display_name": "Cybersecurity Engineer",
"id": 9,
"rationale": null,
"role_archetype": null,
"slug": "cybersecurity-engineer",
"source": "db"
},
{
"display_name": "Data Analyst",
"id": 20,
"rationale": null,
"role_archetype": null,
"slug": "data-analyst",
"source": "db"
},
{
"display_name": "Data Scientist",
"id": 7,
"rationale": null,
"role_archetype": null,
"slug": "data-scientist",
"source": "db"
},
{
"display_name": "Azure Cloud Engineer",
"id": 4,
"rationale": null,
"role_archetype": null,
"slug": "azure-cloud-engineer",
"source": "db"
},
{
"display_name": "Cloud Engineer",
"id": 18,
"rationale": null,
"role_archetype": null,
"slug": "cloud-engineer",
"source": "db"
},
{
"display_name": "Virtualization Engineer",
"id": 26,
"rationale": null,
"role_archetype": null,
"slug": "virtualization-engineer",
"source": "db"
},
{
"display_name": "Network Engineer",
"id": 21,
"rationale": null,
"role_archetype": null,
"slug": "network-engineer",
"source": "db"
},
{
"display_name": "AI Engineer",
"id": 12,
"rationale": null,
"role_archetype": null,
"slug": "ai-engineer",
"source": "db"
},
{
"display_name": "Backend Engineer",
"id": 14,
"rationale": null,
"role_archetype": null,
"slug": "backend-engineer",
"source": "db"
},
{
"display_name": "Data Engineer",
"id": 6,
"rationale": null,
"role_archetype": null,
"slug": "data-engineer",
"source": "db"
},
{
"display_name": "Machine Learning Engineer",
"id": 10,
"rationale": null,
"role_archetype": null,
"slug": "machine-learning-engineer",
"source": "db"
},
{
"display_name": "Automation Tester",
"id": 16,
"rationale": null,
"role_archetype": null,
"slug": "automation-tester",
"source": "db"
},
{
"display_name": "MLOps Engineer",
"id": 5,
"rationale": null,
"role_archetype": null,
"slug": "mlops-engineer",
"source": "db"
},
{
"display_name": "Cloud Architect",
"id": 11,
"rationale": null,
"role_archetype": null,
"slug": "cloud-architect",
"source": "db"
},
{
"display_name": "ServiceNOW Developer",
"id": 24,
"rationale": null,
"role_archetype": null,
"slug": "servicenow-developer",
"source": "db"
}
],
"chosen_role": {
"display_name": "Data Engineer",
"id": 6,
"rationale": "The primary skills indicate a strong focus on data processing, SQL, and Azure technologies, aligning well with a Data Engineer\u0027s responsibilities.",
"role_archetype": null,
"slug": "data-engineer",
"source": "db"
},
"dimensions": [
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Cloud Platform Operations",
"id": 26,
"rationale": "Uses cloud provider services to support delivery and runtime environments. The focus is on consumer-level operation of cloud services rather than deep cloud architecture ownership.",
"slug": "cloud-platform-operations",
"source": "db"
},
"input_skill": "Azure",
"llm_role": null,
"roles_from_db": [
{
"display_name": "DevOps Engineer",
"id": 1,
"rationale": null,
"role_archetype": "A DevOps Engineer enables reliable, repeatable delivery of software by designing and operating the processes that connect development and production. They focus on improving deployment flow, operational stability, and collaboration between teams through automation, standardization, and monitoring of delivery and runtime practices.",
"slug": "devops-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Cloud Security Platforms",
"id": 332,
"rationale": "Cloud-native security products used to assess posture, detect misconfigurations, and monitor workloads across AWS, Azure, and GCP. This is a distinct product family because the role often works across multiple CNAPP/CSPM/CWPP offerings and cloud-native detectors.",
"slug": "cloud-security-platforms",
"source": "db"
},
"input_skill": "Azure",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Cybersecurity Engineer",
"id": 9,
"rationale": null,
"role_archetype": null,
"slug": "cybersecurity-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Analytical Programming Languages",
"id": 82,
"rationale": "Languages used to clean, transform, analyze, and prototype models in notebooks and scripts. This is the core coding surface for expressing statistical logic and data manipulation in a reproducible way.",
"slug": "analytical-programming-languages",
"source": "db"
},
"input_skill": "Python",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Data Analyst",
"id": 20,
"rationale": null,
"role_archetype": null,
"slug": "data-analyst",
"source": "db"
},
{
"display_name": "Data Scientist",
"id": 7,
"rationale": null,
"role_archetype": null,
"slug": "data-scientist",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Automation Scripting and CLI",
"id": 48,
"rationale": "Uses scripts and command-line tooling to execute repeatable Azure operations and reduce manual work. This is a practical cluster because the role frequently automates provisioning, checks, and remediation tasks.",
"slug": "automation-scripting-and-cli",
"source": "db"
},
"input_skill": "Python",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Azure Cloud Engineer",
"id": 4,
"rationale": null,
"role_archetype": null,
"slug": "azure-cloud-engineer",
"source": "db"
},
{
"display_name": "Cloud Engineer",
"id": 18,
"rationale": null,
"role_archetype": null,
"slug": "cloud-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Automation and Scripting for Operations",
"id": 361,
"rationale": "Scripts and lightweight automation used to execute repetitive virtualization tasks and enforce operational consistency. This is the practical glue that reduces manual host and VM administration.",
"slug": "automation-and-scripting-for-operations",
"source": "db"
},
"input_skill": "Python",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Virtualization Engineer",
"id": 26,
"rationale": null,
"role_archetype": null,
"slug": "virtualization-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Network Automation and Scripting",
"id": 285,
"rationale": "Covers scripts and automation used to configure, validate, and audit network devices and services. This cluster is coherent because repeatable network operations increasingly depend on programmatic changes and checks.",
"slug": "network-automation-and-scripting",
"source": "db"
},
"input_skill": "Python",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Network Engineer",
"id": 21,
"rationale": null,
"role_archetype": null,
"slug": "network-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Programming Languages for AI Workflows",
"id": 261,
"rationale": "Languages used to implement AI feature logic, orchestration, and response handling inside product code. This is the core coding surface for turning prompts and model calls into reliable application behavior.",
"slug": "programming-languages-for-ai-workflows",
"source": "db"
},
"input_skill": "Python",
"llm_role": null,
"roles_from_db": [
{
"display_name": "AI Engineer",
"id": 12,
"rationale": null,
"role_archetype": null,
"slug": "ai-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Programming Languages for Backend Systems",
"id": 140,
"rationale": "Languages used to implement server-side business logic, request handlers, workers, and service integrations. This is the core coding surface for backend feature delivery and maintenance.",
"slug": "programming-languages-for-backend-systems",
"source": "db"
},
"input_skill": "Python",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Backend Engineer",
"id": 14,
"rationale": null,
"role_archetype": null,
"slug": "backend-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Programming Languages for Data Work",
"id": 67,
"rationale": "Languages used to implement data pipelines, transformations, and operational utilities. This is the code layer for expressing extraction, parsing, validation, and orchestration logic in data engineering workflows.",
"slug": "programming-languages-for-data-work",
"source": "db"
},
"input_skill": "Python",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Data Engineer",
"id": 6,
"rationale": null,
"role_archetype": null,
"slug": "data-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Programming Languages for ML Systems",
"id": 113,
"rationale": "Languages used to implement model integration code, inference services, and feature-processing logic. This is the core coding surface for turning trained models into product-facing software components.",
"slug": "programming-languages-for-ml-systems",
"source": "db"
},
"input_skill": "Python",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Machine Learning Engineer",
"id": 10,
"rationale": null,
"role_archetype": null,
"slug": "machine-learning-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Programming Languages for Security Work",
"id": 328,
"rationale": "Languages used to automate security tasks, write detection logic, and build analysis or remediation tooling. This is the core coding surface for a cybersecurity engineer across scripts, queries, and small utilities.",
"slug": "programming-languages-for-security-work",
"source": "db"
},
"input_skill": "Python",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Cybersecurity Engineer",
"id": 9,
"rationale": null,
"role_archetype": null,
"slug": "cybersecurity-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Programming Languages for Test Automation",
"id": 193,
"rationale": "Languages used to implement automated checks, helper utilities, and test harness code. This is the core coding surface for turning test ideas into maintainable automation.",
"slug": "programming-languages-for-test-automation",
"source": "db"
},
"input_skill": "Python",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Automation Tester",
"id": 16,
"rationale": null,
"role_archetype": null,
"slug": "automation-tester",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Security Automation and Scripting",
"id": 258,
"rationale": "Automating repeatable security checks, enrichment, and remediation workflows. This cluster is coherent because the role often needs lightweight automation to scale analysis and response.",
"slug": "security-automation-and-scripting",
"source": "db"
},
"input_skill": "Python",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Cybersecurity Engineer",
"id": 9,
"rationale": null,
"role_archetype": null,
"slug": "cybersecurity-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Relational Data Modeling",
"id": 71,
"rationale": "Designing tables, relationships, constraints, and transactional data shapes for operational backend systems. This cluster is coherent because backend services frequently own the canonical application data model.",
"slug": "relational-data-modeling",
"source": "db"
},
"input_skill": "SQL",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Backend Engineer",
"id": 14,
"rationale": null,
"role_archetype": null,
"slug": "backend-engineer",
"source": "db"
},
{
"display_name": "Data Engineer",
"id": 6,
"rationale": null,
"role_archetype": null,
"slug": "data-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"input_skill": "SQL",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Model Serving Deployment and Runtime Packaging",
"id": 52,
"rationale": "Operational deployment of trained models into online, batch, or streaming serving environments, including packaging models and model servers into containers or managed inference runtimes, coordinating rollout, and handing off to inference systems. Covers serving frameworks and platforms such as TensorFlow Serving, TorchServe, Triton Inference Server, BentoML, KServe, and Seldon Core, plus container/runtime concerns like Docker images, GPU-enabled containers, base image selection, container entrypoints, runtime dependencies, and image scanning for model services.",
"slug": "model-serving-deployment-and-runtime-packaging",
"source": "db"
},
"input_skill": "MLflow",
"llm_role": null,
"roles_from_db": [
{
"display_name": "MLOps Engineer",
"id": 5,
"rationale": null,
"role_archetype": null,
"slug": "mlops-engineer",
"source": "db"
},
{
"display_name": "Machine Learning Engineer",
"id": 10,
"rationale": null,
"role_archetype": null,
"slug": "machine-learning-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Project Delivery and Coordination",
"id": 366,
"rationale": "Coordination practices for organizing work, tracking progress, and aligning stakeholders across a delivery effort. Agile fits here when used as a team execution framework for managing scope, cadence, and collaboration.",
"slug": "d_init_02",
"source": "db"
},
"input_skill": "MLflow",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"input_skill": "MLflow",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Cloud ML Platform Operations",
"id": 65,
"rationale": "Consumer-level operation of managed ML services and cloud resources used to train and serve models. This covers the cloud platform surface that MLOps engineers use without owning the underlying cloud platform itself.",
"slug": "cloud-ml-platform-operations",
"source": "db"
},
"input_skill": "Azure Machine Learning",
"llm_role": null,
"roles_from_db": [
{
"display_name": "MLOps Engineer",
"id": 5,
"rationale": null,
"role_archetype": null,
"slug": "mlops-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Cloud Data Platform Services",
"id": 81,
"rationale": "Consumer-level use of cloud services that support data engineering workloads. This includes managed compute, storage, networking-adjacent services, and security primitives used to run pipelines and data platforms.",
"slug": "cloud-data-platform-services",
"source": "db"
},
"input_skill": "Azure Data Factory",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Data Engineer",
"id": 6,
"rationale": null,
"role_archetype": null,
"slug": "data-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Cloud ML Platform Operations",
"id": 65,
"rationale": "Consumer-level operation of managed ML services and cloud resources used to train and serve models. This covers the cloud platform surface that MLOps engineers use without owning the underlying cloud platform itself.",
"slug": "cloud-ml-platform-operations",
"source": "db"
},
"input_skill": "Databricks",
"llm_role": null,
"roles_from_db": [
{
"display_name": "MLOps Engineer",
"id": 5,
"rationale": null,
"role_archetype": null,
"slug": "mlops-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"input_skill": "CI/CD",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Inference Data Pipelines",
"id": 59,
"rationale": "Operational data movement for batch scoring, feature refresh, and inference-time data preparation. This is separate from model training because it focuses on getting the right data to the serving path reliably.",
"slug": "inference-data-pipelines",
"source": "db"
},
"input_skill": "MLOps",
"llm_role": null,
"roles_from_db": [
{
"display_name": "MLOps Engineer",
"id": 5,
"rationale": null,
"role_archetype": null,
"slug": "mlops-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Model Serving Deployment and Runtime Packaging",
"id": 52,
"rationale": "Operational deployment of trained models into online, batch, or streaming serving environments, including packaging models and model servers into containers or managed inference runtimes, coordinating rollout, and handing off to inference systems. Covers serving frameworks and platforms such as TensorFlow Serving, TorchServe, Triton Inference Server, BentoML, KServe, and Seldon Core, plus container/runtime concerns like Docker images, GPU-enabled containers, base image selection, container entrypoints, runtime dependencies, and image scanning for model services.",
"slug": "model-serving-deployment-and-runtime-packaging",
"source": "db"
},
"input_skill": "MLOps",
"llm_role": null,
"roles_from_db": [
{
"display_name": "MLOps Engineer",
"id": 5,
"rationale": null,
"role_archetype": null,
"slug": "mlops-engineer",
"source": "db"
},
{
"display_name": "Machine Learning Engineer",
"id": 10,
"rationale": null,
"role_archetype": null,
"slug": "machine-learning-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": null,
"display_name": "Analytical Programming and Notebook Languages",
"id": null,
"rationale": "Languages and notebook/script-based coding used to clean, transform, analyze, and prototype data workflows and models. Includes Python, pandas, SQL, PySpark, notebook scripting, dataframe manipulation, exploratory analysis, ETL/data transformation logic, and other reproducible analytical code.",
"slug": "d_merge_01",
"source": "llm"
},
"input_skill": "PySpark",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"input_skill": "PySpark",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": null,
"display_name": "Analytical Programming and Notebook-Based Data Analysis",
"id": null,
"rationale": "Languages and notebook-friendly coding used to clean, transform, analyze, and prototype data and model workflows. This includes Python, R, SQL, and Scala used in notebooks or scripts for data wrangling, exploratory data analysis, statistical logic, feature engineering, and reproducible prototyping. It excludes production orchestration and scheduling, dashboard/report authoring, model deployment packaging, database administration, and UI development.",
"slug": "d_merge_01",
"source": "llm"
},
"input_skill": "Notebooks",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"input_skill": "Big Data",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Messaging and Event Streaming",
"id": 146,
"rationale": "Asynchronous communication patterns and systems for decoupled service interaction and background processing. This is a coherent backend cluster because many server-side workflows depend on queues, topics, and event streams.",
"slug": "messaging-and-event-streaming",
"source": "db"
},
"input_skill": "Big Data",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Backend Engineer",
"id": 14,
"rationale": null,
"role_archetype": null,
"slug": "backend-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Messaging and Event Streaming",
"id": 146,
"rationale": "Asynchronous communication patterns and systems for decoupled service interaction and background processing. This is a coherent backend cluster because many server-side workflows depend on queues, topics, and event streams.",
"slug": "messaging-and-event-streaming",
"source": "db"
},
"input_skill": "Big Data",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Backend Engineer",
"id": 14,
"rationale": null,
"role_archetype": null,
"slug": "backend-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"input_skill": "Data Modeling",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": null,
"display_name": "Inference Data Pipelines for Serving and Batch Scoring",
"id": null,
"rationale": "Operational data movement that prepares and delivers timely, reliable data to production inference systems. Includes batch scoring inputs, feature refresh jobs, inference-time preprocessing, scheduled extracts, data validation for serving, and online/offline feature synchronization. Excludes training dataset curation, model training workflows, experimentation-focused feature engineering, model evaluation, and serving infrastructure/routing.",
"slug": "d_merge_01",
"source": "llm"
},
"input_skill": "Data Pipelines",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"input_skill": "Data Pipelines",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": null,
"display_name": "Asynchronous Messaging and Event Streaming",
"id": null,
"rationale": "Covers asynchronous communication and data movement through queues, topics, streams, event buses, and pub/sub systems for decoupled processing, background jobs, and event-driven integration. Includes continuous or event-driven data ingestion and change data capture pipelines, but excludes batch ETL orchestration, warehouse modeling, query optimization, model training data prep, and direct application API calls.",
"slug": "d_merge_01",
"source": "llm"
},
"input_skill": "Data Ingestion",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"input_skill": "Data Ingestion",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Messaging and Event Streaming",
"id": 146,
"rationale": "Asynchronous communication patterns and systems for decoupled service interaction and background processing. This is a coherent backend cluster because many server-side workflows depend on queues, topics, and event streams.",
"slug": "messaging-and-event-streaming",
"source": "db"
},
"input_skill": "Stream Processing",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Backend Engineer",
"id": 14,
"rationale": null,
"role_archetype": null,
"slug": "backend-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": null,
"display_name": "Messaging, Queueing, and Event Streaming",
"id": null,
"rationale": "Asynchronous communication patterns and systems that decouple producers and consumers, buffer and route work items, and support background processing and service-to-service integration. Includes queueing, message queues, pub/sub, brokers, topics, consumer groups, producers/consumers, dead-letter queues, retry handling, backpressure, and event streaming platforms such as Kafka, RabbitMQ, SQS, and Azure Service Bus.",
"slug": "d_merge_01",
"source": "llm"
},
"input_skill": "Queueing",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": null,
"display_name": "API Integration, Request Orchestration, and Data Fetching",
"id": null,
"rationale": "Connecting applications to internal or external services through request/response APIs. This includes consuming REST and GraphQL endpoints, orchestrating requests, handling payloads and response parsing, pagination, retries, error handling, and shaping remote data for downstream or UI consumption.",
"slug": "d_merge_01",
"source": "llm"
},
"input_skill": "APIs",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Cloud Service Integration Patterns",
"id": 188,
"rationale": "Covers how cloud services and workloads connect through APIs, events, shared services, and integration boundaries. This cluster is coherent because architects must define interaction patterns that preserve decoupling, security, and operability.",
"slug": "cloud-service-integration-patterns",
"source": "db"
},
"input_skill": "APIs",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Cloud Architect",
"id": 11,
"rationale": null,
"role_archetype": null,
"slug": "cloud-architect",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"input_skill": "APIs",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Cloud Service Integration Patterns",
"id": 188,
"rationale": "Covers how cloud services and workloads connect through APIs, events, shared services, and integration boundaries. This cluster is coherent because architects must define interaction patterns that preserve decoupling, security, and operability.",
"slug": "cloud-service-integration-patterns",
"source": "db"
},
"input_skill": "APIs",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Cloud Architect",
"id": 11,
"rationale": null,
"role_archetype": null,
"slug": "cloud-architect",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Testing and Validation Practices",
"id": 221,
"rationale": "Validating platform changes before release, including functional checks and regression verification. This cluster is coherent because ServiceNow developers must confirm workflows, scripts, and integrations behave as intended.",
"slug": "testing-and-validation-practices",
"source": "db"
},
"input_skill": "Testability",
"llm_role": null,
"roles_from_db": [
{
"display_name": "ServiceNOW Developer",
"id": 24,
"rationale": null,
"role_archetype": null,
"slug": "servicenow-developer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Testing and Validation Practices",
"id": 221,
"rationale": "Validating platform changes before release, including functional checks and regression verification. This cluster is coherent because ServiceNow developers must confirm workflows, scripts, and integrations behave as intended.",
"slug": "testing-and-validation-practices",
"source": "db"
},
"input_skill": "Testability",
"llm_role": null,
"roles_from_db": [
{
"display_name": "ServiceNOW Developer",
"id": 24,
"rationale": null,
"role_archetype": null,
"slug": "servicenow-developer",
"source": "db"
}
]
}
],
"input_final_skills": [
"Azure",
"Python",
"PySpark",
"SQL",
"MLflow",
"Azure Machine Learning",
"Azure Data Factory",
"Databricks",
"Notebooks",
"CI/CD",
"Big Data",
"Data Modeling",
"Data Pipelines",
"Data Ingestion",
"Stream Processing",
"Queueing",
"APIs",
"Testability",
"MLOps"
],
"input_llm_skills": [
"Azure",
"Python",
"PySpark",
"SQL",
"MLflow",
"Azure Machine Learning",
"Azure Data Factory",
"Databricks",
"Notebooks",
"CI/CD",
"Big Data",
"Data Modeling",
"Data Pipelines",
"Data Ingestion",
"Stream Processing",
"Queueing",
"APIs",
"Testability",
"MLOps"
],
"new_aliases_persisted": 0,
"run_id": "265899c9-6b42-43cb-a0f8-64ac64ac5a98",
"skills_detail": [
{
"aliases_in_db": [
{
"alias_text": "Azure",
"alias_type": "CANONICAL",
"id": 349,
"is_primary": true,
"match_strategy": "CASE_INSENSITIVE"
}
],
"canonical": {
"category_id": 13,
"display_name": "Azure",
"id": 164,
"is_also_category": false,
"is_extractable": true,
"skill_nature": "PLATFORM",
"slug": "azure",
"sub_category_id": 161,
"typical_lifespan": "EVERGREEN",
"volatility": "STABLE"
},
"dimensions": [
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Cloud Platform Operations",
"id": 26,
"rationale": "Uses cloud provider services to support delivery and runtime environments. The focus is on consumer-level operation of cloud services rather than deep cloud architecture ownership.",
"slug": "cloud-platform-operations",
"source": "db"
},
"input_skill": "Azure",
"llm_role": null,
"roles_from_db": [
{
"display_name": "DevOps Engineer",
"id": 1,
"rationale": null,
"role_archetype": "A DevOps Engineer enables reliable, repeatable delivery of software by designing and operating the processes that connect development and production. They focus on improving deployment flow, operational stability, and collaboration between teams through automation, standardization, and monitoring of delivery and runtime practices.",
"slug": "devops-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Cloud Security Platforms",
"id": 332,
"rationale": "Cloud-native security products used to assess posture, detect misconfigurations, and monitor workloads across AWS, Azure, and GCP. This is a distinct product family because the role often works across multiple CNAPP/CSPM/CWPP offerings and cloud-native detectors.",
"slug": "cloud-security-platforms",
"source": "db"
},
"input_skill": "Azure",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Cybersecurity Engineer",
"id": 9,
"rationale": null,
"role_archetype": null,
"slug": "cybersecurity-engineer",
"source": "db"
}
]
}
],
"input_skill": "Azure",
"matched_via": "alias",
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": null,
"source_tag": "db",
"was_in_llm_skills": true
},
{
"aliases_in_db": [
{
"alias_text": "Python",
"alias_type": "CANONICAL",
"id": 608,
"is_primary": true,
"match_strategy": "CASE_INSENSITIVE"
},
{
"alias_text": "Python 2",
"alias_type": "VERSION",
"id": 611,
"is_primary": false,
"match_strategy": "CASE_INSENSITIVE"
},
{
"alias_text": "Python 2.x",
"alias_type": "VERSION",
"id": 613,
"is_primary": false,
"match_strategy": "CASE_INSENSITIVE"
},
{
"alias_text": "Python 3",
"alias_type": "VERSION",
"id": 612,
"is_primary": false,
"match_strategy": "CASE_INSENSITIVE"
},
{
"alias_text": "Python 3.10",
"alias_type": "VERSION",
"id": 2330,
"is_primary": false,
"match_strategy": "CASE_INSENSITIVE"
},
{
"alias_text": "Python 3.11",
"alias_type": "VERSION",
"id": 2331,
"is_primary": false,
"match_strategy": "CASE_INSENSITIVE"
},
{
"alias_text": "Python 3.12",
"alias_type": "VERSION",
"id": 2332,
"is_primary": false,
"match_strategy": "CASE_INSENSITIVE"
},
{
"alias_text": "Python 3.x",
"alias_type": "VERSION",
"id": 614,
"is_primary": false,
"match_strategy": "CASE_INSENSITIVE"
},
{
"alias_text": "py2",
"alias_type": "VERSION",
"id": 609,
"is_primary": false,
"match_strategy": "CASE_INSENSITIVE"
},
{
"alias_text": "py3",
"alias_type": "VERSION",
"id": 610,
"is_primary": false,
"match_strategy": "CASE_INSENSITIVE"
},
{
"alias_text": "python 2",
"alias_type": "VERSION",
"id": 2152,
"is_primary": false,
"match_strategy": "CASE_INSENSITIVE"
},
{
"alias_text": "python 2.x",
"alias_type": "VERSION",
"id": 2154,
"is_primary": false,
"match_strategy": "CASE_INSENSITIVE"
},
{
"alias_text": "python 3",
"alias_type": "VERSION",
"id": 990,
"is_primary": false,
"match_strategy": "CASE_INSENSITIVE"
},
{
"alias_text": "python 3.10",
"alias_type": "VERSION",
"id": 992,
"is_primary": false,
"match_strategy": "CASE_INSENSITIVE"
},
{
"alias_text": "python 3.11",
"alias_type": "VERSION",
"id": 993,
"is_primary": false,
"match_strategy": "CASE_INSENSITIVE"
},
{
"alias_text": "python 3.12",
"alias_type": "VERSION",
"id": 994,
"is_primary": false,
"match_strategy": "CASE_INSENSITIVE"
},
{
"alias_text": "python 3.x",
"alias_type": "VERSION",
"id": 991,
"is_primary": false,
"match_strategy": "CASE_INSENSITIVE"
},
{
"alias_text": "python2",
"alias_type": "VERSION",
"id": 2150,
"is_primary": false,
"match_strategy": "CASE_INSENSITIVE"
},
{
"alias_text": "python3",
"alias_type": "VERSION",
"id": 989,
"is_primary": false,
"match_strategy": "CASE_INSENSITIVE"
}
],
"canonical": {
"category_id": 5,
"display_name": "Python",
"id": 393,
"is_also_category": false,
"is_extractable": true,
"skill_nature": "LANGUAGE",
"slug": "python",
"sub_category_id": 54,
"typical_lifespan": "EVERGREEN",
"volatility": "STABLE"
},
"dimensions": [
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Analytical Programming Languages",
"id": 82,
"rationale": "Languages used to clean, transform, analyze, and prototype models in notebooks and scripts. This is the core coding surface for expressing statistical logic and data manipulation in a reproducible way.",
"slug": "analytical-programming-languages",
"source": "db"
},
"input_skill": "Python",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Data Analyst",
"id": 20,
"rationale": null,
"role_archetype": null,
"slug": "data-analyst",
"source": "db"
},
{
"display_name": "Data Scientist",
"id": 7,
"rationale": null,
"role_archetype": null,
"slug": "data-scientist",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Automation Scripting and CLI",
"id": 48,
"rationale": "Uses scripts and command-line tooling to execute repeatable Azure operations and reduce manual work. This is a practical cluster because the role frequently automates provisioning, checks, and remediation tasks.",
"slug": "automation-scripting-and-cli",
"source": "db"
},
"input_skill": "Python",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Azure Cloud Engineer",
"id": 4,
"rationale": null,
"role_archetype": null,
"slug": "azure-cloud-engineer",
"source": "db"
},
{
"display_name": "Cloud Engineer",
"id": 18,
"rationale": null,
"role_archetype": null,
"slug": "cloud-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Automation and Scripting for Operations",
"id": 361,
"rationale": "Scripts and lightweight automation used to execute repetitive virtualization tasks and enforce operational consistency. This is the practical glue that reduces manual host and VM administration.",
"slug": "automation-and-scripting-for-operations",
"source": "db"
},
"input_skill": "Python",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Virtualization Engineer",
"id": 26,
"rationale": null,
"role_archetype": null,
"slug": "virtualization-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Network Automation and Scripting",
"id": 285,
"rationale": "Covers scripts and automation used to configure, validate, and audit network devices and services. This cluster is coherent because repeatable network operations increasingly depend on programmatic changes and checks.",
"slug": "network-automation-and-scripting",
"source": "db"
},
"input_skill": "Python",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Network Engineer",
"id": 21,
"rationale": null,
"role_archetype": null,
"slug": "network-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Programming Languages for AI Workflows",
"id": 261,
"rationale": "Languages used to implement AI feature logic, orchestration, and response handling inside product code. This is the core coding surface for turning prompts and model calls into reliable application behavior.",
"slug": "programming-languages-for-ai-workflows",
"source": "db"
},
"input_skill": "Python",
"llm_role": null,
"roles_from_db": [
{
"display_name": "AI Engineer",
"id": 12,
"rationale": null,
"role_archetype": null,
"slug": "ai-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Programming Languages for Backend Systems",
"id": 140,
"rationale": "Languages used to implement server-side business logic, request handlers, workers, and service integrations. This is the core coding surface for backend feature delivery and maintenance.",
"slug": "programming-languages-for-backend-systems",
"source": "db"
},
"input_skill": "Python",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Backend Engineer",
"id": 14,
"rationale": null,
"role_archetype": null,
"slug": "backend-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Programming Languages for Data Work",
"id": 67,
"rationale": "Languages used to implement data pipelines, transformations, and operational utilities. This is the code layer for expressing extraction, parsing, validation, and orchestration logic in data engineering workflows.",
"slug": "programming-languages-for-data-work",
"source": "db"
},
"input_skill": "Python",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Data Engineer",
"id": 6,
"rationale": null,
"role_archetype": null,
"slug": "data-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Programming Languages for ML Systems",
"id": 113,
"rationale": "Languages used to implement model integration code, inference services, and feature-processing logic. This is the core coding surface for turning trained models into product-facing software components.",
"slug": "programming-languages-for-ml-systems",
"source": "db"
},
"input_skill": "Python",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Machine Learning Engineer",
"id": 10,
"rationale": null,
"role_archetype": null,
"slug": "machine-learning-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Programming Languages for Security Work",
"id": 328,
"rationale": "Languages used to automate security tasks, write detection logic, and build analysis or remediation tooling. This is the core coding surface for a cybersecurity engineer across scripts, queries, and small utilities.",
"slug": "programming-languages-for-security-work",
"source": "db"
},
"input_skill": "Python",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Cybersecurity Engineer",
"id": 9,
"rationale": null,
"role_archetype": null,
"slug": "cybersecurity-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Programming Languages for Test Automation",
"id": 193,
"rationale": "Languages used to implement automated checks, helper utilities, and test harness code. This is the core coding surface for turning test ideas into maintainable automation.",
"slug": "programming-languages-for-test-automation",
"source": "db"
},
"input_skill": "Python",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Automation Tester",
"id": 16,
"rationale": null,
"role_archetype": null,
"slug": "automation-tester",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Security Automation and Scripting",
"id": 258,
"rationale": "Automating repeatable security checks, enrichment, and remediation workflows. This cluster is coherent because the role often needs lightweight automation to scale analysis and response.",
"slug": "security-automation-and-scripting",
"source": "db"
},
"input_skill": "Python",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Cybersecurity Engineer",
"id": 9,
"rationale": null,
"role_archetype": null,
"slug": "cybersecurity-engineer",
"source": "db"
}
]
}
],
"input_skill": "Python",
"matched_via": "alias",
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": null,
"source_tag": "db",
"was_in_llm_skills": true
},
{
"aliases_in_db": [],
"canonical": null,
"dimensions": [
{
"dimension": {
"difficulty_hint": null,
"display_name": "Analytical Programming and Notebook Languages",
"id": null,
"rationale": "Languages and notebook/script-based coding used to clean, transform, analyze, and prototype data workflows and models. Includes Python, pandas, SQL, PySpark, notebook scripting, dataframe manipulation, exploratory analysis, ETL/data transformation logic, and other reproducible analytical code.",
"slug": "d_merge_01",
"source": "llm"
},
"input_skill": "PySpark",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"input_skill": "PySpark",
"llm_role": null,
"roles_from_db": []
}
],
"input_skill": "PySpark",
"matched_via": null,
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": {
"derived": {
"category": "Library",
"skill_nature": "LIBRARY",
"sub_category": "data_processing_library",
"typical_lifespan": "EVERGREEN",
"version_strategy": "NOT_APPLICABLE",
"volatility": "STABLE"
},
"enrichment": {
"ambiguity": {
"ambiguity_flag": false,
"confused_with": [],
"reasoning": "PySpark is a specific Python API for Apache Spark and is usually named distinctly in JDs. It is unlikely to be reasonably confused with another catalog skill in typical job descriptions."
},
"context_keywords": {
"context_keywords": [
"Spark SQL",
"DataFrame",
"RDD",
"Spark Streaming",
"Structured Streaming",
"Delta Lake",
"Hive",
"Parquet",
"YARN",
"Databricks",
"EMR",
"AWS Glue",
"ETL",
"partitioning",
"broadcast join"
]
},
"maturity": {
"confidence": 0.93,
"maturity": "well_known",
"reasoning": "PySpark appears in many data engineering and analytics job descriptions, especially for Spark-based ETL and ML pipelines; it remains a standard skill alongside Databricks and AWS EMR."
},
"skill_id": "pyspark",
"vendor_license": {
"confidence": 0.98,
"license": "apache_2",
"vendor": "Apache Software Foundation",
"year_introduced": 2010
},
"versioning": {
"current_version": null,
"version_aliases": {},
"versioned": false
}
},
"keep_log": [],
"locked_dimensions": [
{
"description": "Languages and notebook/script-based coding used to clean, transform, analyze, and prototype data workflows and models. Includes Python, pandas, SQL, PySpark, notebook scripting, dataframe manipulation, exploratory analysis, ETL/data transformation logic, and other reproducible analytical code.",
"exemplar_skills": [
"Analytical Programming and Notebook Languages"
],
"in_scope": "Skills, tools, and practices that belong under Analytical Programming and Notebook Languages for the target role, including items implied by the dimension rationale.",
"name": "Analytical Programming and Notebook Languages",
"out_of_scope": "Adjacent clusters explicitly not owned by Analytical Programming and Notebook Languages, including unrelated platforms, roles, and skill families per library policy.",
"overlap_flags": [],
"tentative_id": "d_merge_01"
},
{
"description": "Covers writing and optimizing distributed batch or streaming data transformations on large datasets. PySpark belongs here because it is a Spark-based API used to express parallel data processing jobs at scale.",
"exemplar_skills": [
"PySpark",
"Spark DataFrame API",
"Spark SQL",
"RDDs",
"Spark joins",
"Spark window functions"
],
"in_scope": "PySpark, Spark DataFrame transformations, Spark SQL, RDD operations, joins, aggregations, partitioning, shuffles, window functions, UDFs, batch ETL jobs, distributed data cleansing",
"name": "Distributed Data Processing",
"out_of_scope": "Interactive BI dashboards, ad hoc SQL reporting, model training algorithms, low-level cluster administration, message broker configuration, which belong to analytics, ML, platform, or streaming infrastructure dimensions",
"overlap_flags": [
{
"reason": "Spark can consume streams, but this dimension is about distributed computation rather than brokered event transport.",
"with_dim_id": "messaging-and-event-streaming",
"with_dim_name": null,
"with_role": "Backend Engineer"
},
{
"reason": "PySpark is also a programming language surface, but the stronger fit here is distributed data processing on Spark.",
"with_dim_id": "analytical-programming-languages",
"with_dim_name": null,
"with_role": "Data Analyst, Data Scientist"
}
],
"tentative_id": "d_init_01"
}
],
"merge_log": [
{
"a_dim_id": "analytical-programming-languages",
"a_name": "Analytical Programming Languages",
"a_role": "__skill_focal__",
"b_dim_id": "analytical-programming-languages",
"b_name": "Analytical Programming Languages",
"b_role": "Data Scientist",
"into": "d_merge_01",
"into_name": "Analytical Programming and Notebook Languages",
"merged_from": [
"analytical-programming-languages",
"analytical-programming-languages"
],
"pair_kind": "cross_role",
"reasoning": "Both dims describe the same cluster: analytical coding in notebooks/scripts for data cleaning, transformation, analysis, and prototyping. Dim A lists PySpark, Python data wrangling, pandas, SQL for analysis, notebook scripting, ETL logic, and dataframe manipulation. Dim B describes the same core surface as languages used to clean, transform, analyze, and prototype models in notebooks and scripts, i.e. reproducible statistical/data-manipulation code. The cross-role difference is only framing; the underlying skills overlap heavily.",
"similarity": 0.8286169942979785
}
],
"placed": {
"name": "PySpark",
"placement_confidence": 0.92,
"primary_dimension": "d_merge_01",
"reasoning": "Deterministic JD placement: locked_dimensions has 2 dimension(s) from skill-driven dimension generation after reconciliation; primary_dimension is the first locked dim.",
"secondary_dimensions": [
"d_init_01"
],
"skill_id": "pyspark"
},
"relationships": {
"child_skills": [],
"parent_skills": [
"python"
],
"related_to": [
"sql",
"postgresql",
"nosql",
"amazon-athena",
"amazon-sagemaker",
"aws-data-pipeline",
"aws-lambda",
"kubeflow",
"machine-learning",
"elasticsearch"
],
"requires": [],
"skill_id": "pyspark",
"suppress_on_match": []
},
"skill_id": "pyspark",
"split_log": [],
"typed": {
"alternatives_considered": [],
"confidence": 0.93,
"name": "PySpark",
"reasoning": "PySpark is best classified as a Library because it is a Python package imported and used from application code, rather than a hosted environment or a framework you build inside.",
"skill_id": "pyspark",
"subtype": "data_processing_library",
"type": "Library"
},
"warnings": [
"stage3_post_filter_dropped_catalog_only_locked_dims:41-\u003e2"
]
},
"source_tag": "llm",
"was_in_llm_skills": true
},
{
"aliases_in_db": [
{
"alias_text": "SQL",
"alias_type": "CANONICAL",
"id": 3398,
"is_primary": false,
"match_strategy": "CASE_INSENSITIVE"
}
],
"canonical": {
"category_id": 5,
"display_name": "SQL",
"id": 2601,
"is_also_category": false,
"is_extractable": true,
"skill_nature": "LANGUAGE",
"slug": "sql",
"sub_category_id": 55,
"typical_lifespan": "EVERGREEN",
"volatility": "STABLE"
},
"dimensions": [
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Relational Data Modeling",
"id": 71,
"rationale": "Designing tables, relationships, constraints, and transactional data shapes for operational backend systems. This cluster is coherent because backend services frequently own the canonical application data model.",
"slug": "relational-data-modeling",
"source": "db"
},
"input_skill": "SQL",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Backend Engineer",
"id": 14,
"rationale": null,
"role_archetype": null,
"slug": "backend-engineer",
"source": "db"
},
{
"display_name": "Data Engineer",
"id": 6,
"rationale": null,
"role_archetype": null,
"slug": "data-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"input_skill": "SQL",
"llm_role": null,
"roles_from_db": []
}
],
"input_skill": "SQL",
"matched_via": "alias",
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": null,
"source_tag": "db",
"was_in_llm_skills": true
},
{
"aliases_in_db": [
{
"alias_text": "MLflow",
"alias_type": "CANONICAL",
"id": 3593,
"is_primary": false,
"match_strategy": "CASE_INSENSITIVE"
}
],
"canonical": {
"category_id": 11,
"display_name": "MLflow",
"id": 2640,
"is_also_category": false,
"is_extractable": true,
"skill_nature": "TOOL",
"slug": "mlflow",
"sub_category_id": 2151,
"typical_lifespan": "EVERGREEN",
"volatility": "EMERGING"
},
"dimensions": [
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Model Serving Deployment and Runtime Packaging",
"id": 52,
"rationale": "Operational deployment of trained models into online, batch, or streaming serving environments, including packaging models and model servers into containers or managed inference runtimes, coordinating rollout, and handing off to inference systems. Covers serving frameworks and platforms such as TensorFlow Serving, TorchServe, Triton Inference Server, BentoML, KServe, and Seldon Core, plus container/runtime concerns like Docker images, GPU-enabled containers, base image selection, container entrypoints, runtime dependencies, and image scanning for model services.",
"slug": "model-serving-deployment-and-runtime-packaging",
"source": "db"
},
"input_skill": "MLflow",
"llm_role": null,
"roles_from_db": [
{
"display_name": "MLOps Engineer",
"id": 5,
"rationale": null,
"role_archetype": null,
"slug": "mlops-engineer",
"source": "db"
},
{
"display_name": "Machine Learning Engineer",
"id": 10,
"rationale": null,
"role_archetype": null,
"slug": "machine-learning-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Project Delivery and Coordination",
"id": 366,
"rationale": "Coordination practices for organizing work, tracking progress, and aligning stakeholders across a delivery effort. Agile fits here when used as a team execution framework for managing scope, cadence, and collaboration.",
"slug": "d_init_02",
"source": "db"
},
"input_skill": "MLflow",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"input_skill": "MLflow",
"llm_role": null,
"roles_from_db": []
}
],
"input_skill": "MLflow",
"matched_via": "alias",
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": null,
"source_tag": "db",
"was_in_llm_skills": true
},
{
"aliases_in_db": [
{
"alias_text": "Azure Machine Learning",
"alias_type": "CANONICAL",
"id": 600,
"is_primary": true,
"match_strategy": "CASE_INSENSITIVE"
}
],
"canonical": {
"category_id": 13,
"display_name": "Azure Machine Learning",
"id": 385,
"is_also_category": false,
"is_extractable": true,
"skill_nature": "PLATFORM",
"slug": "azure-machine-learning",
"sub_category_id": 326,
"typical_lifespan": "EVERGREEN",
"volatility": "STABLE"
},
"dimensions": [
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Cloud ML Platform Operations",
"id": 65,
"rationale": "Consumer-level operation of managed ML services and cloud resources used to train and serve models. This covers the cloud platform surface that MLOps engineers use without owning the underlying cloud platform itself.",
"slug": "cloud-ml-platform-operations",
"source": "db"
},
"input_skill": "Azure Machine Learning",
"llm_role": null,
"roles_from_db": [
{
"display_name": "MLOps Engineer",
"id": 5,
"rationale": null,
"role_archetype": null,
"slug": "mlops-engineer",
"source": "db"
}
]
}
],
"input_skill": "Azure Machine Learning",
"matched_via": "alias",
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": null,
"source_tag": "db",
"was_in_llm_skills": true
},
{
"aliases_in_db": [
{
"alias_text": "Azure Data Factory",
"alias_type": "CANONICAL",
"id": 731,
"is_primary": true,
"match_strategy": "CASE_INSENSITIVE"
}
],
"canonical": {
"category_id": 14,
"display_name": "Azure Data Factory",
"id": 467,
"is_also_category": false,
"is_extractable": true,
"skill_nature": "CLOUD_SERVICE",
"slug": "azure-data-factory",
"sub_category_id": 385,
"typical_lifespan": "EVERGREEN",
"volatility": "STABLE"
},
"dimensions": [
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Cloud Data Platform Services",
"id": 81,
"rationale": "Consumer-level use of cloud services that support data engineering workloads. This includes managed compute, storage, networking-adjacent services, and security primitives used to run pipelines and data platforms.",
"slug": "cloud-data-platform-services",
"source": "db"
},
"input_skill": "Azure Data Factory",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Data Engineer",
"id": 6,
"rationale": null,
"role_archetype": null,
"slug": "data-engineer",
"source": "db"
}
]
}
],
"input_skill": "Azure Data Factory",
"matched_via": "alias",
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": null,
"source_tag": "db",
"was_in_llm_skills": true
},
{
"aliases_in_db": [
{
"alias_text": "Databricks",
"alias_type": "CANONICAL",
"id": 601,
"is_primary": true,
"match_strategy": "CASE_INSENSITIVE"
}
],
"canonical": {
"category_id": 13,
"display_name": "Databricks",
"id": 386,
"is_also_category": false,
"is_extractable": true,
"skill_nature": "PLATFORM",
"slug": "databricks",
"sub_category_id": 323,
"typical_lifespan": "EVERGREEN",
"volatility": "STABLE"
},
"dimensions": [
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Cloud ML Platform Operations",
"id": 65,
"rationale": "Consumer-level operation of managed ML services and cloud resources used to train and serve models. This covers the cloud platform surface that MLOps engineers use without owning the underlying cloud platform itself.",
"slug": "cloud-ml-platform-operations",
"source": "db"
},
"input_skill": "Databricks",
"llm_role": null,
"roles_from_db": [
{
"display_name": "MLOps Engineer",
"id": 5,
"rationale": null,
"role_archetype": null,
"slug": "mlops-engineer",
"source": "db"
}
]
}
],
"input_skill": "Databricks",
"matched_via": "alias",
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": null,
"source_tag": "db",
"was_in_llm_skills": true
},
{
"aliases_in_db": [],
"canonical": null,
"dimensions": [
{
"dimension": {
"difficulty_hint": null,
"display_name": "Analytical Programming and Notebook-Based Data Analysis",
"id": null,
"rationale": "Languages and notebook-friendly coding used to clean, transform, analyze, and prototype data and model workflows. This includes Python, R, SQL, and Scala used in notebooks or scripts for data wrangling, exploratory data analysis, statistical logic, feature engineering, and reproducible prototyping. It excludes production orchestration and scheduling, dashboard/report authoring, model deployment packaging, database administration, and UI development.",
"slug": "d_merge_01",
"source": "llm"
},
"input_skill": "Notebooks",
"llm_role": null,
"roles_from_db": []
}
],
"input_skill": "Notebooks",
"matched_via": null,
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": {
"derived": {
"category": "Tool",
"skill_nature": "TOOL",
"sub_category": "notebook_environment",
"typical_lifespan": "EVERGREEN",
"version_strategy": "NOT_APPLICABLE",
"volatility": "STABLE"
},
"enrichment": {
"ambiguity": {
"ambiguity_flag": true,
"confused_with": [
"jupyter_notebook",
"colab"
],
"reasoning": "\u201cNotebooks\u201d is a generic term and in JDs could mean Jupyter notebooks or Google Colab, both common catalog skills. The standalone name is too broad to be unambiguous."
},
"context_keywords": {
"context_keywords": [
"Jupyter",
"JupyterLab",
"IPython",
"nbconvert",
"nbformat",
"kernel",
"Markdown",
"code cells",
"data visualization",
"pandas",
"NumPy",
"Matplotlib",
"interactive analysis",
"reproducible research",
"collaboration"
]
},
"maturity": {
"confidence": 0.93,
"maturity": "well_known",
"reasoning": "Notebook environments (e.g., Jupyter) appear in many data science and ML job descriptions and are a standard workflow in major cloud vendors\u2019 managed notebook offerings."
},
"skill_id": "notebooks",
"vendor_license": {
"confidence": 0.93,
"license": "bsd",
"vendor": "Project Jupyter",
"year_introduced": 2014
},
"versioning": {
"current_version": null,
"version_aliases": {},
"versioned": false
}
},
"keep_log": [],
"locked_dimensions": [
{
"description": "Languages and notebook-friendly coding used to clean, transform, analyze, and prototype data and model workflows. This includes Python, R, SQL, and Scala used in notebooks or scripts for data wrangling, exploratory data analysis, statistical logic, feature engineering, and reproducible prototyping. It excludes production orchestration and scheduling, dashboard/report authoring, model deployment packaging, database administration, and UI development.",
"exemplar_skills": [
"Analytical Programming and Notebook-Based Data Analysis"
],
"in_scope": "Skills, tools, and practices that belong under Analytical Programming and Notebook-Based Data Analysis for the target role, including items implied by the dimension rationale.",
"name": "Analytical Programming and Notebook-Based Data Analysis",
"out_of_scope": "Adjacent clusters explicitly not owned by Analytical Programming and Notebook-Based Data Analysis, including unrelated platforms, roles, and skill families per library policy.",
"overlap_flags": [],
"tentative_id": "d_merge_01"
}
],
"merge_log": [
{
"a_dim_id": "analytical-programming-languages",
"a_name": "Analytical Programming Languages",
"a_role": "__skill_focal__",
"b_dim_id": "analytical-programming-languages",
"b_name": "Analytical Programming Languages",
"b_role": "Data Scientist",
"into": "d_merge_01",
"into_name": "Analytical Programming and Notebook-Based Data Analysis",
"merged_from": [
"analytical-programming-languages",
"analytical-programming-languages"
],
"pair_kind": "cross_role",
"reasoning": "Both dims describe the same analytical coding cluster: notebook/script-based use of Python, R, SQL, and Scala to clean, transform, analyze, and prototype data or models. Dim A\u2019s exemplars (Notebooks, Python, R, SQL, Scala, Data wrangling, Exploratory data analysis, Feature engineering) match Dim B\u2019s description of languages for statistical logic and data manipulation in notebooks and scripts. The extra role label does not change the substance; the overlap is real, not just similar wording.",
"similarity": 0.8752619088294021
}
],
"placed": {
"name": "Notebooks",
"placement_confidence": 0.92,
"primary_dimension": "d_merge_01",
"reasoning": "Deterministic JD placement: locked_dimensions has 1 dimension(s) from skill-driven dimension generation after reconciliation; primary_dimension is the first locked dim.",
"secondary_dimensions": [],
"skill_id": "notebooks"
},
"relationships": {
"child_skills": [],
"parent_skills": [],
"related_to": [
"runbooks",
"codex",
"document-intelligence",
"document-processing",
"ocr",
"ml",
"python",
"bash",
"linux",
"data-structures"
],
"requires": [],
"skill_id": "notebooks",
"suppress_on_match": []
},
"skill_id": "notebooks",
"split_log": [],
"typed": {
"alternatives_considered": [],
"confidence": 0.9,
"name": "Notebooks",
"reasoning": "Notebooks are software you operate to write and run analyses interactively, so by the Tool vs Framework rule they are best classified as a tool rather than a framework or platform.",
"skill_id": "notebooks",
"subtype": "notebook_environment",
"type": "Tool"
},
"warnings": [
"stage3_post_filter_dropped_catalog_only_locked_dims:40-\u003e1"
]
},
"source_tag": "llm",
"was_in_llm_skills": true
},
{
"aliases_in_db": [
{
"alias_text": "CI/CD",
"alias_type": "CANONICAL",
"id": 3376,
"is_primary": false,
"match_strategy": "CASE_INSENSITIVE"
}
],
"canonical": {
"category_id": 7,
"display_name": "CI/CD",
"id": 2579,
"is_also_category": false,
"is_extractable": true,
"skill_nature": "METHODOLOGY",
"slug": "ci-cd",
"sub_category_id": 2102,
"typical_lifespan": "EVERGREEN",
"volatility": "STABLE"
},
"dimensions": [
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"input_skill": "CI/CD",
"llm_role": null,
"roles_from_db": []
}
],
"input_skill": "CI/CD",
"matched_via": "alias",
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": null,
"source_tag": "db",
"was_in_llm_skills": true
},
{
"aliases_in_db": [],
"canonical": null,
"dimensions": [
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"input_skill": "Big Data",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Messaging and Event Streaming",
"id": 146,
"rationale": "Asynchronous communication patterns and systems for decoupled service interaction and background processing. This is a coherent backend cluster because many server-side workflows depend on queues, topics, and event streams.",
"slug": "messaging-and-event-streaming",
"source": "db"
},
"input_skill": "Big Data",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Backend Engineer",
"id": 14,
"rationale": null,
"role_archetype": null,
"slug": "backend-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Messaging and Event Streaming",
"id": 146,
"rationale": "Asynchronous communication patterns and systems for decoupled service interaction and background processing. This is a coherent backend cluster because many server-side workflows depend on queues, topics, and event streams.",
"slug": "messaging-and-event-streaming",
"source": "db"
},
"input_skill": "Big Data",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Backend Engineer",
"id": 14,
"rationale": null,
"role_archetype": null,
"slug": "backend-engineer",
"source": "db"
}
]
}
],
"input_skill": "Big Data",
"matched_via": null,
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": {
"derived": {
"category": "Domain",
"skill_nature": "CONCEPT",
"sub_category": "data_intensive_computing",
"typical_lifespan": "EVERGREEN",
"version_strategy": "NOT_APPLICABLE",
"volatility": "STABLE"
},
"enrichment": {
"ambiguity": {
"ambiguity_flag": false,
"confused_with": [],
"reasoning": "\u201cBig Data\u201d is a well-established domain term with a specific meaning in JDs. It is unlikely to be reasonably confused with another catalog skill in typical extraction contexts."
},
"context_keywords": {
"context_keywords": [
"Hadoop",
"Spark",
"Hive",
"Kafka",
"HDFS",
"MapReduce",
"ETL",
"data lake",
"data warehouse",
"NoSQL",
"Parquet",
"Airflow",
"Flink",
"Scala",
"YARN"
]
},
"maturity": {
"confidence": 0.92,
"maturity": "well_known",
"reasoning": "Common in data/platform job descriptions across industries; JD volume remains high for Hadoop/Spark/streaming stacks, and cloud vendors market managed big-data services as standard offerings."
},
"skill_id": "big-data",
"vendor_license": {
"confidence": 0.99,
"license": null,
"vendor": null,
"year_introduced": null
},
"versioning": {
"current_version": null,
"version_aliases": {},
"versioned": false
}
},
"keep_log": [
{
"a_dim_id": "messaging-and-event-streaming",
"a_name": "Messaging and Event Streaming",
"a_role": "__skill_focal__",
"b_dim_id": "messaging-and-event-streaming",
"b_name": "Messaging and Event Streaming",
"b_role": "Backend Engineer",
"pair_kind": "cross_role",
"reasoning": "Dim A is analytics-oriented: its description says it is for \"data movement and event-driven pipelines used to feed large-scale analytics systems,\" with exemplars like Kafka, Spark Streaming, Apache Flink, stream ingestion, and near-real-time analytics. Dim B is backend-oriented: it covers \"asynchronous communication patterns and systems for decoupled service interaction and background processing,\" i.e., queues/topics/event streams for server-side workflows. Same technology words, different conceptual anchors and role usage.",
"similarity": 0.7224413551673322
}
],
"locked_dimensions": [
{
"description": "Large-scale data processing systems and techniques for storing, transforming, and analyzing high-volume, high-velocity datasets. Big Data belongs here because the term usually refers to the distributed data engineering stack rather than a single tool.",
"exemplar_skills": [
"Big Data",
"Hadoop",
"Apache Spark",
"Hive",
"MapReduce",
"distributed ETL",
"data lake processing"
],
"in_scope": "Big Data, Hadoop, Spark, Hive, MapReduce, distributed batch processing, large-scale ETL, data lakes, cluster-based data processing",
"name": "Big Data Processing",
"out_of_scope": "Data warehouse modeling and BI reporting, ad hoc SQL optimization, model training and serving, network or storage hardware operations",
"overlap_flags": [
{
"reason": "Big data platforms often ingest from streams, but this dimension is about the processing stack rather than asynchronous messaging itself.",
"with_dim_id": "messaging-and-event-streaming",
"with_dim_name": null,
"with_role": "Backend Engineer"
},
{
"reason": "Big data workloads rely on query tuning, but that catalog dimension is narrower and focused on access-path performance.",
"with_dim_id": "data-access-and-query-optimization",
"with_dim_name": null,
"with_role": "Data Engineer"
}
],
"tentative_id": "d_init_01"
},
{
"description": "Asynchronous data movement and event-driven pipelines used to feed large-scale analytics systems. Big Data often overlaps with this area when the skill is used in streaming ingestion or pipeline orchestration.",
"exemplar_skills": [
"Big Data",
"Kafka",
"Spark Streaming",
"Apache Flink",
"event-driven pipelines",
"stream ingestion",
"real-time analytics"
],
"in_scope": "Big Data, Kafka, Spark Streaming, Flink, event hubs, pub-sub pipelines, stream ingestion, near-real-time analytics",
"name": "Messaging and Event Streaming",
"out_of_scope": "Batch-only distributed computation, storage layout tuning, SQL query optimization, dashboarding and reporting, model deployment",
"overlap_flags": [
{
"reason": "Many big data solutions combine batch and streaming processing, so the boundary depends on whether the emphasis is computation or event transport.",
"with_dim_id": "d_init_01",
"with_dim_name": null,
"with_role": null
}
],
"tentative_id": "messaging-and-event-streaming"
},
{
"description": "Asynchronous communication patterns and systems for decoupled service interaction and background processing. This is a coherent backend cluster because many server-side workflows depend on queues, topics, and event streams.",
"exemplar_skills": [
"Messaging and Event Streaming"
],
"in_scope": "Skills, tools, and practices that belong under Messaging and Event Streaming for the target role, including items implied by the dimension rationale.",
"name": "Messaging and Event Streaming",
"out_of_scope": "Adjacent clusters explicitly not owned by Messaging and Event Streaming, including unrelated platforms, roles, and skill families per library policy.",
"overlap_flags": [],
"tentative_id": "messaging-and-event-streaming"
}
],
"merge_log": [],
"placed": {
"name": "Big Data",
"placement_confidence": 0.92,
"primary_dimension": "d_init_01",
"reasoning": "Deterministic JD placement: locked_dimensions has 3 dimension(s) from skill-driven dimension generation after reconciliation; primary_dimension is the first locked dim.",
"secondary_dimensions": [
"messaging-and-event-streaming"
],
"skill_id": "big-data"
},
"relationships": {
"child_skills": [],
"parent_skills": [],
"related_to": [
"machine-learning",
"nosql",
"ai-ml",
"devops",
"amazon-athena",
"elasticsearch",
"sql",
"mysql",
"postgresql",
"artificial-intelligence"
],
"requires": [],
"skill_id": "big-data",
"suppress_on_match": []
},
"skill_id": "big-data",
"split_log": [],
"typed": {
"alternatives_considered": [],
"confidence": 0.93,
"name": "Big Data",
"reasoning": "Big Data is a vertical/problem-space body of knowledge rather than a tool, framework, or architecture, so it fits the Domain rule.",
"skill_id": "big-data",
"subtype": "data_intensive_computing",
"type": "Domain"
},
"warnings": [
"stage3_post_filter_dropped_catalog_only_locked_dims:42-\u003e3"
]
},
"source_tag": "llm",
"was_in_llm_skills": true
},
{
"aliases_in_db": [],
"canonical": null,
"dimensions": [
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"input_skill": "Data Modeling",
"llm_role": null,
"roles_from_db": []
}
],
"input_skill": "Data Modeling",
"matched_via": null,
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": {
"derived": {
"category": "Concept",
"skill_nature": "CONCEPT",
"sub_category": "data_modeling",
"typical_lifespan": "EVERGREEN",
"version_strategy": "NOT_APPLICABLE",
"volatility": "STABLE"
},
"enrichment": {
"ambiguity": {
"ambiguity_flag": false,
"confused_with": [],
"reasoning": "\u201cData Modeling\u201d is a standard, well-scoped concept in JDs and is unlikely to be confused with a different catalog skill in typical usage."
},
"context_keywords": {
"context_keywords": [
"ER diagrams",
"normalization",
"denormalization",
"star schema",
"snowflake schema",
"fact table",
"dimension table",
"OLTP",
"OLAP",
"entity-relationship",
"schema design",
"data warehouse",
"dimensional modeling",
"primary key",
"foreign key"
]
},
"maturity": {
"confidence": 0.93,
"maturity": "well_known",
"reasoning": "Data modeling appears in many data engineer, DBA, and analytics JDs, and is a standard prerequisite alongside SQL and database design rather than a niche specialty."
},
"skill_id": "data-modeling",
"vendor_license": {
"confidence": 0.99,
"license": null,
"vendor": null,
"year_introduced": null
},
"versioning": {
"current_version": null,
"version_aliases": {},
"versioned": false
}
},
"keep_log": [],
"locked_dimensions": [
{
"description": "Designing the logical and physical structure of data so it is consistent, queryable, and fit for downstream analytics or operational use. This belongs here because the skill centers on defining entities, relationships, keys, and schemas rather than storage tuning or pipeline execution.",
"exemplar_skills": [
"Data Modeling",
"Schema Design",
"Dimensional Modeling",
"Entity-Relationship Modeling",
"Normalization",
"Star Schema Design",
"Snowflake Schema Design"
],
"in_scope": "Data Modeling, conceptual/logical/physical schema design, entities and relationships, normalization and denormalization, primary and foreign keys, dimensional modeling, star and snowflake schemas, fact and dimension tables, schema evolution",
"name": "Data Modeling",
"out_of_scope": "Data Access and Query Optimization, file layout and partitioning choices, ETL orchestration and data movement, dashboard/report design, database performance tuning, application API payload shaping",
"overlap_flags": [
{
"reason": "Data models influence query performance, but this dimension owns the structural design rather than access-path tuning.",
"with_dim_id": "data-access-and-query-optimization",
"with_dim_name": null,
"with_role": "Data Engineer"
},
{
"reason": "Event schemas and contracts can be modeled here, but the messaging dimension owns transport and delivery semantics.",
"with_dim_id": "messaging-and-event-streaming",
"with_dim_name": null,
"with_role": "Backend Engineer"
}
],
"tentative_id": "d_init_01"
}
],
"merge_log": [],
"placed": {
"name": "Data Modeling",
"placement_confidence": 0.92,
"primary_dimension": "d_init_01",
"reasoning": "Deterministic JD placement: locked_dimensions has 1 dimension(s) from skill-driven dimension generation after reconciliation; primary_dimension is the first locked dim.",
"secondary_dimensions": [],
"skill_id": "data-modeling"
},
"relationships": {
"child_skills": [],
"parent_skills": [],
"related_to": [
"data-structures",
"sql",
"nosql",
"storage-layout",
"derived-views",
"metadata-json",
"postgresql",
"document-processing",
"failure-analysis",
"capacity-forecasting"
],
"requires": [],
"skill_id": "data-modeling",
"suppress_on_match": []
},
"skill_id": "data-modeling",
"split_log": [],
"typed": {
"alternatives_considered": [],
"confidence": 0.96,
"name": "Data Modeling",
"reasoning": "Data Modeling is fundamentally a knowledge unit about how to structure and relate data, so by the Concept vs Methodology rule it is a Concept rather than a process or tool.",
"skill_id": "data-modeling",
"subtype": "data_modeling",
"type": "Concept"
},
"warnings": []
},
"source_tag": "llm",
"was_in_llm_skills": true
},
{
"aliases_in_db": [],
"canonical": null,
"dimensions": [
{
"dimension": {
"difficulty_hint": null,
"display_name": "Inference Data Pipelines for Serving and Batch Scoring",
"id": null,
"rationale": "Operational data movement that prepares and delivers timely, reliable data to production inference systems. Includes batch scoring inputs, feature refresh jobs, inference-time preprocessing, scheduled extracts, data validation for serving, and online/offline feature synchronization. Excludes training dataset curation, model training workflows, experimentation-focused feature engineering, model evaluation, and serving infrastructure/routing.",
"slug": "d_merge_01",
"source": "llm"
},
"input_skill": "Data Pipelines",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"input_skill": "Data Pipelines",
"llm_role": null,
"roles_from_db": []
}
],
"input_skill": "Data Pipelines",
"matched_via": null,
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": {
"derived": {
"category": "Architecture",
"skill_nature": "PATTERN",
"sub_category": "data_pipeline_architecture",
"typical_lifespan": "EVERGREEN",
"version_strategy": "NOT_APPLICABLE",
"volatility": "STABLE"
},
"enrichment": {
"ambiguity": {
"ambiguity_flag": false,
"confused_with": [],
"reasoning": "\u201cData Pipelines\u201d is a fairly specific architecture term and is unlikely to be mistaken for a different catalog skill in a typical JD."
},
"context_keywords": {
"context_keywords": [
"ETL",
"ELT",
"Apache Airflow",
"Apache NiFi",
"Kafka",
"Spark",
"dbt",
"orchestration",
"batch processing",
"stream processing",
"data ingestion",
"data warehouse",
"data lake",
"schema evolution",
"data quality"
]
},
"maturity": {
"confidence": 0.93,
"maturity": "well_known",
"reasoning": "Data pipelines are a common requirement in cloud/data engineering JDs, with frequent mentions alongside Airflow, Spark, and ETL/ELT stacks; broad hiring demand signals mainstream adoption."
},
"skill_id": "data-pipelines",
"vendor_license": {
"confidence": 0.99,
"license": null,
"vendor": null,
"year_introduced": null
},
"versioning": {
"current_version": null,
"version_aliases": {},
"versioned": false
}
},
"keep_log": [],
"locked_dimensions": [
{
"description": "Operational data movement that prepares and delivers timely, reliable data to production inference systems. Includes batch scoring inputs, feature refresh jobs, inference-time preprocessing, scheduled extracts, data validation for serving, and online/offline feature synchronization. Excludes training dataset curation, model training workflows, experimentation-focused feature engineering, model evaluation, and serving infrastructure/routing.",
"exemplar_skills": [
"Inference Data Pipelines for Serving and Batch Scoring"
],
"in_scope": "Skills, tools, and practices that belong under Inference Data Pipelines for Serving and Batch Scoring for the target role, including items implied by the dimension rationale.",
"name": "Inference Data Pipelines for Serving and Batch Scoring",
"out_of_scope": "Adjacent clusters explicitly not owned by Inference Data Pipelines for Serving and Batch Scoring, including unrelated platforms, roles, and skill families per library policy.",
"overlap_flags": [],
"tentative_id": "d_merge_01"
},
{
"description": "Designing, scheduling, and coordinating end-to-end data movement and transformation jobs. This is the best fit when Data Pipelines refers to building reliable multi-step workflows across sources, transforms, and sinks.",
"exemplar_skills": [
"Data Pipelines",
"workflow orchestration",
"ETL orchestration",
"DAG scheduling",
"backfills",
"Airflow",
"Dagster",
"Prefect"
],
"in_scope": "Data Pipelines, workflow scheduling, DAG orchestration, dependency management, retries and backfills, ETL/ELT job coordination, Airflow, Dagster, Prefect, dbt orchestration",
"name": "Data Pipeline Orchestration",
"out_of_scope": "Low-level query tuning, storage layout optimization, message broker internals, model serving, dashboard/report generation",
"overlap_flags": [
{
"reason": "Streaming systems can trigger pipeline steps, but this dimension is about orchestration rather than asynchronous messaging semantics.",
"with_dim_id": "messaging-and-event-streaming",
"with_dim_name": null,
"with_role": "Backend Engineer"
},
{
"reason": "Pipeline jobs often include SQL, but query tuning and physical access-path optimization belong to the analytical storage/query dimension.",
"with_dim_id": "data-access-and-query-optimization",
"with_dim_name": null,
"with_role": "Data Engineer"
}
],
"tentative_id": "d_init_01"
}
],
"merge_log": [
{
"a_dim_id": "inference-data-pipelines",
"a_name": "Inference Data Pipelines",
"a_role": "__skill_focal__",
"b_dim_id": "inference-data-pipelines",
"b_name": "Inference Data Pipelines",
"b_role": "MLOps Engineer",
"into": "d_merge_01",
"into_name": "Inference Data Pipelines for Serving and Batch Scoring",
"merged_from": [
"inference-data-pipelines",
"inference-data-pipelines"
],
"pair_kind": "cross_role",
"reasoning": "Both dims describe the same serving-oriented data movement cluster. Dim A covers batch scoring inputs, feature refresh pipelines, inference-time preprocessing, and online/offline feature sync, while Dim B uses the same language and adds that it is separate from model training. Dim A\u2019s out_of_scope already excludes training and serving infra, so the overlap is substantive, not just name similarity.",
"similarity": 0.9616676657656642
}
],
"placed": {
"name": "Data Pipelines",
"placement_confidence": 0.92,
"primary_dimension": "d_merge_01",
"reasoning": "Deterministic JD placement: locked_dimensions has 2 dimension(s) from skill-driven dimension generation after reconciliation; primary_dimension is the first locked dim.",
"secondary_dimensions": [
"d_init_01"
],
"skill_id": "data-pipelines"
},
"relationships": {
"child_skills": [],
"parent_skills": [],
"related_to": [
"workflow-automation",
"ci-cd",
"devops",
"kubeflow",
"mlops",
"ai-ml",
"rest-apis",
"sql",
"document-processing",
"agentic-workflows"
],
"requires": [],
"skill_id": "data-pipelines",
"suppress_on_match": []
},
"skill_id": "data-pipelines",
"split_log": [],
"typed": {
"alternatives_considered": [],
"confidence": 0.9,
"name": "Data Pipelines",
"reasoning": "By the Architecture vs Concept rule, data pipelines describe a system-shape pattern for moving and transforming data across stages rather than a single knowledge unit or tool.",
"skill_id": "data-pipelines",
"subtype": "data_pipeline_architecture",
"type": "Architecture"
},
"warnings": [
"stage3_post_filter_dropped_catalog_only_locked_dims:41-\u003e2"
]
},
"source_tag": "llm",
"was_in_llm_skills": true
},
{
"aliases_in_db": [],
"canonical": null,
"dimensions": [
{
"dimension": {
"difficulty_hint": null,
"display_name": "Asynchronous Messaging and Event Streaming",
"id": null,
"rationale": "Covers asynchronous communication and data movement through queues, topics, streams, event buses, and pub/sub systems for decoupled processing, background jobs, and event-driven integration. Includes continuous or event-driven data ingestion and change data capture pipelines, but excludes batch ETL orchestration, warehouse modeling, query optimization, model training data prep, and direct application API calls.",
"slug": "d_merge_01",
"source": "llm"
},
"input_skill": "Data Ingestion",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"input_skill": "Data Ingestion",
"llm_role": null,
"roles_from_db": []
}
],
"input_skill": "Data Ingestion",
"matched_via": null,
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": {
"derived": {
"category": "Concept",
"skill_nature": "CONCEPT",
"sub_category": "data_ingestion",
"typical_lifespan": "EVERGREEN",
"version_strategy": "NOT_APPLICABLE",
"volatility": "STABLE"
},
"enrichment": {
"ambiguity": {
"ambiguity_flag": false,
"confused_with": [],
"reasoning": "\u201cData Ingestion\u201d is a standard, specific concept in data engineering and is unlikely to be mistaken for a different catalog skill in typical job descriptions."
},
"context_keywords": {
"context_keywords": [
"ETL",
"ELT",
"batch processing",
"streaming",
"Kafka",
"Apache NiFi",
"Airflow",
"CDC",
"S3",
"schema validation",
"data pipeline",
"message queue",
"Parquet",
"JSON",
"API ingestion"
]
},
"maturity": {
"confidence": 0.86,
"maturity": "well_known",
"reasoning": "Commonly appears in data/platform job descriptions and cloud vendor docs as a core pipeline capability; often paired with ETL/ELT, Kafka, and Airflow rather than treated as a niche specialty."
},
"skill_id": "data-ingestion",
"vendor_license": {
"confidence": 0.99,
"license": null,
"vendor": null,
"year_introduced": null
},
"versioning": {
"current_version": null,
"version_aliases": {},
"versioned": false
}
},
"keep_log": [],
"locked_dimensions": [
{
"description": "Covers asynchronous communication and data movement through queues, topics, streams, event buses, and pub/sub systems for decoupled processing, background jobs, and event-driven integration. Includes continuous or event-driven data ingestion and change data capture pipelines, but excludes batch ETL orchestration, warehouse modeling, query optimization, model training data prep, and direct application API calls.",
"exemplar_skills": [
"Asynchronous Messaging and Event Streaming"
],
"in_scope": "Skills, tools, and practices that belong under Asynchronous Messaging and Event Streaming for the target role, including items implied by the dimension rationale.",
"name": "Asynchronous Messaging and Event Streaming",
"out_of_scope": "Adjacent clusters explicitly not owned by Asynchronous Messaging and Event Streaming, including unrelated platforms, roles, and skill families per library policy.",
"overlap_flags": [],
"tentative_id": "d_merge_01"
},
{
"description": "Covers scheduled or bulk loading of data from files, databases, and external systems into analytical or operational stores. Data Ingestion fits here when the emphasis is on landing, validating, and loading datasets rather than streaming transport.",
"exemplar_skills": [
"Data Ingestion",
"ETL Ingestion",
"ELT",
"Bulk Load",
"Incremental Load",
"Schema Validation",
"File Import Pipelines",
"JDBC Extracts"
],
"in_scope": "Data Ingestion, ETL ingestion, ELT landing, file imports, S3/GCS/Azure Blob loads, JDBC extracts, bulk loads, schema validation, deduplication, incremental loads",
"name": "Batch Data Ingestion Pipelines",
"out_of_scope": "Real-time event streaming, message brokers, API request orchestration, warehouse query tuning, downstream analytics modeling",
"overlap_flags": [
{
"reason": "Some ingestion pipelines pull from cloud services, but this dimension focuses on bulk movement and landing mechanics.",
"with_dim_id": "cloud-service-integration-patterns",
"with_dim_name": null,
"with_role": "Cloud Architect"
},
{
"reason": "Ingested data often lands in analytical stores, but query tuning is about how data is read after ingestion.",
"with_dim_id": "data-access-and-query-optimization",
"with_dim_name": null,
"with_role": "Data Engineer"
}
],
"tentative_id": "d_init_01"
}
],
"merge_log": [
{
"a_dim_id": "messaging-and-event-streaming",
"a_name": "Messaging and Event Streaming",
"a_role": "__skill_focal__",
"b_dim_id": "messaging-and-event-streaming",
"b_name": "Messaging and Event Streaming",
"b_role": "Backend Engineer",
"into": "d_merge_01",
"into_name": "Asynchronous Messaging and Event Streaming",
"merged_from": [
"messaging-and-event-streaming",
"messaging-and-event-streaming"
],
"pair_kind": "cross_role",
"reasoning": "Both dims describe the same cluster: asynchronous, decoupled communication via queues, topics, streams, and event buses. Dim A frames it as moving data through messaging/streaming systems and explicitly includes Kafka, Kinesis, RabbitMQ, Event Streaming, and Change Data Capture. Dim B frames the same mechanisms as backend asynchronous communication and background processing. The role difference is only emphasis, not substance, so the overlap is a true match.",
"similarity": 0.7307849947371147
}
],
"placed": {
"name": "Data Ingestion",
"placement_confidence": 0.92,
"primary_dimension": "d_merge_01",
"reasoning": "Deterministic JD placement: locked_dimensions has 2 dimension(s) from skill-driven dimension generation after reconciliation; primary_dimension is the first locked dim.",
"secondary_dimensions": [
"d_init_01"
],
"skill_id": "data-ingestion"
},
"relationships": {
"child_skills": [],
"parent_skills": [],
"related_to": [
"document-processing",
"document-intelligence",
"retrieval",
"hybrid-retrieval",
"aws-data-pipeline",
"devops",
"observability",
"containers",
"storage-layout",
"multimodal-document-understanding"
],
"requires": [],
"skill_id": "data-ingestion",
"suppress_on_match": []
},
"skill_id": "data-ingestion",
"split_log": [],
"typed": {
"alternatives_considered": [],
"confidence": 0.93,
"name": "Data Ingestion",
"reasoning": "Data Ingestion is fundamentally a named knowledge unit about bringing data into systems, so it fits the Concept category rather than a tool, platform, or methodology.",
"skill_id": "data-ingestion",
"subtype": "data_ingestion",
"type": "Concept"
},
"warnings": [
"stage3_post_filter_dropped_catalog_only_locked_dims:41-\u003e2"
]
},
"source_tag": "llm",
"was_in_llm_skills": true
},
{
"aliases_in_db": [],
"canonical": null,
"dimensions": [
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Messaging and Event Streaming",
"id": 146,
"rationale": "Asynchronous communication patterns and systems for decoupled service interaction and background processing. This is a coherent backend cluster because many server-side workflows depend on queues, topics, and event streams.",
"slug": "messaging-and-event-streaming",
"source": "db"
},
"input_skill": "Stream Processing",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Backend Engineer",
"id": 14,
"rationale": null,
"role_archetype": null,
"slug": "backend-engineer",
"source": "db"
}
]
}
],
"input_skill": "Stream Processing",
"matched_via": null,
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": {
"derived": {
"category": "Architecture",
"skill_nature": "PATTERN",
"sub_category": "stream_processing_architecture",
"typical_lifespan": "EVERGREEN",
"version_strategy": "NOT_APPLICABLE",
"volatility": "STABLE"
},
"enrichment": {
"ambiguity": {
"ambiguity_flag": false,
"confused_with": [],
"reasoning": "The term is fairly specific in JDs and usually refers to event/data stream processing architecture, not a different catalog skill. It is unlikely to be confused with another skill name in typical job descriptions."
},
"context_keywords": {
"context_keywords": [
"Apache Kafka",
"Apache Flink",
"Apache Spark Streaming",
"Apache Storm",
"event-driven architecture",
"pub/sub",
"message broker",
"consumer group",
"windowing",
"checkpointing",
"exactly-once semantics",
"backpressure",
"event time",
"watermarking",
"CDC"
]
},
"maturity": {
"confidence": 0.93,
"maturity": "well_known",
"reasoning": "Common in JDs for Kafka/Flink/Spark Streaming and cloud services like Kinesis/Pub/Sub; broad market adoption for real-time event pipelines."
},
"skill_id": "stream-processing",
"vendor_license": {
"confidence": 0.99,
"license": null,
"vendor": null,
"year_introduced": null
},
"versioning": {
"current_version": null,
"version_aliases": {},
"versioned": false
}
},
"keep_log": [],
"locked_dimensions": [
{
"description": "Processing continuous event data as it arrives, using stream processors, windows, and stateful operators to transform and route records in near real time. This belongs here because stream processing is the core execution model for event-driven pipelines and low-latency data movement.",
"exemplar_skills": [
"Stream Processing",
"Apache Flink",
"Spark Structured Streaming",
"Kafka Streams",
"Apache Beam",
"event-time processing",
"windowing",
"watermarking"
],
"in_scope": "Stream Processing, Apache Kafka Streams, Apache Flink, Spark Structured Streaming, Apache Beam, event-time processing, windowing, watermarking, stateful transforms, joins on streams, exactly-once processing, checkpointing, backpressure handling",
"name": "Stream Processing",
"out_of_scope": "Batch ETL jobs, offline warehouse transformations, model training pipelines, message broker administration, low-level network transport, dashboard/reporting logic",
"overlap_flags": [
{
"reason": "Streaming pipelines can feed inference features or scoring jobs, but this dimension is about the real-time processing mechanics rather than model data movement.",
"with_dim_id": "inference-data-pipelines",
"with_dim_name": null,
"with_role": "MLOps Engineer"
},
{
"reason": "Stream processors often use partitioning and state stores, but query tuning is a separate concern from stream execution semantics.",
"with_dim_id": "data-access-and-query-optimization",
"with_dim_name": null,
"with_role": "Data Engineer"
}
],
"tentative_id": "messaging-and-event-streaming"
}
],
"merge_log": [],
"placed": {
"name": "Stream Processing",
"placement_confidence": 0.92,
"primary_dimension": "messaging-and-event-streaming",
"reasoning": "Deterministic JD placement: locked_dimensions has 1 dimension(s) from skill-driven dimension generation after reconciliation; primary_dimension is the first locked dim.",
"secondary_dimensions": [],
"skill_id": "stream-processing"
},
"relationships": {
"child_skills": [],
"parent_skills": [],
"related_to": [
"document-processing",
"layout-parsing",
"workflow-automation",
"aws-data-pipeline",
"event-logs",
"snapshot",
"state-transitions",
"proxy-patterns",
"agentic-workflows",
"event-emission"
],
"requires": [],
"skill_id": "stream-processing",
"suppress_on_match": []
},
"skill_id": "stream-processing",
"split_log": [],
"typed": {
"alternatives_considered": [],
"confidence": 0.9,
"name": "Stream Processing",
"reasoning": "Stream Processing is fundamentally a system-shape for handling continuous event flows, so by the Architecture vs Concept rule it fits Architecture rather than a tool or methodology.",
"skill_id": "stream-processing",
"subtype": "stream_processing_architecture",
"type": "Architecture"
},
"warnings": []
},
"source_tag": "llm",
"was_in_llm_skills": true
},
{
"aliases_in_db": [],
"canonical": null,
"dimensions": [
{
"dimension": {
"difficulty_hint": null,
"display_name": "Messaging, Queueing, and Event Streaming",
"id": null,
"rationale": "Asynchronous communication patterns and systems that decouple producers and consumers, buffer and route work items, and support background processing and service-to-service integration. Includes queueing, message queues, pub/sub, brokers, topics, consumer groups, producers/consumers, dead-letter queues, retry handling, backpressure, and event streaming platforms such as Kafka, RabbitMQ, SQS, and Azure Service Bus.",
"slug": "d_merge_01",
"source": "llm"
},
"input_skill": "Queueing",
"llm_role": null,
"roles_from_db": []
}
],
"input_skill": "Queueing",
"matched_via": null,
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": {
"derived": {
"category": "Concept",
"skill_nature": "CONCEPT",
"sub_category": "queueing_theory",
"typical_lifespan": "EVERGREEN",
"version_strategy": "NOT_APPLICABLE",
"volatility": "STABLE"
},
"enrichment": {
"ambiguity": {
"ambiguity_flag": false,
"confused_with": [],
"reasoning": "Queueing is a fairly specific operations-research concept; in typical JDs it is unlikely to be mistaken for a different catalog skill."
},
"context_keywords": {
"context_keywords": [
"Little\u0027s Law",
"M/M/1",
"M/M/c",
"Poisson process",
"service rate",
"arrival rate",
"waiting time",
"throughput",
"utilization",
"backlog",
"buffering",
"congestion",
"discrete-event simulation",
"priority queue",
"SLA"
]
},
"maturity": {
"confidence": 0.86,
"maturity": "well_known",
"reasoning": "Queueing theory is a standard CS/ops concept and appears in many systems, SRE, and performance-engineering job descriptions; it is not a sunset technology and remains a common interview/topic area."
},
"skill_id": "queueing",
"vendor_license": {
"confidence": 0.99,
"license": null,
"vendor": null,
"year_introduced": null
},
"versioning": {
"current_version": null,
"version_aliases": {},
"versioned": false
}
},
"keep_log": [],
"locked_dimensions": [
{
"description": "Asynchronous communication patterns and systems that decouple producers and consumers, buffer and route work items, and support background processing and service-to-service integration. Includes queueing, message queues, pub/sub, brokers, topics, consumer groups, producers/consumers, dead-letter queues, retry handling, backpressure, and event streaming platforms such as Kafka, RabbitMQ, SQS, and Azure Service Bus.",
"exemplar_skills": [
"Messaging, Queueing, and Event Streaming"
],
"in_scope": "Skills, tools, and practices that belong under Messaging, Queueing, and Event Streaming for the target role, including items implied by the dimension rationale.",
"name": "Messaging, Queueing, and Event Streaming",
"out_of_scope": "Adjacent clusters explicitly not owned by Messaging, Queueing, and Event Streaming, including unrelated platforms, roles, and skill families per library policy.",
"overlap_flags": [],
"tentative_id": "d_merge_01"
}
],
"merge_log": [
{
"a_dim_id": "messaging-and-event-streaming",
"a_name": "Messaging and Event Streaming",
"a_role": "__skill_focal__",
"b_dim_id": "messaging-and-event-streaming",
"b_name": "Messaging and Event Streaming",
"b_role": "Backend Engineer",
"into": "d_merge_01",
"into_name": "Messaging, Queueing, and Event Streaming",
"merged_from": [
"messaging-and-event-streaming",
"messaging-and-event-streaming"
],
"pair_kind": "cross_role",
"reasoning": "Both dims describe the same backend messaging cluster: asynchronous communication via queues, brokers, pub/sub, and event streams for decoupled service interaction and background processing. Dim A is the more detailed version, explicitly listing queueing, message queues, Kafka, RabbitMQ, SQS, dead-letter queues, retry handling, and backpressure. Dim B states the same substance in broader terms and adds no distinct skill area. Cross-role similarity is expected here; the underlying cluster is identical.",
"similarity": 0.8270280068284861
}
],
"placed": {
"name": "Queueing",
"placement_confidence": 0.92,
"primary_dimension": "d_merge_01",
"reasoning": "Deterministic JD placement: locked_dimensions has 1 dimension(s) from skill-driven dimension generation after reconciliation; primary_dimension is the first locked dim.",
"secondary_dimensions": [],
"skill_id": "queueing"
},
"relationships": {
"child_skills": [],
"parent_skills": [],
"related_to": [
"capacity-forecasting",
"capacity-alerts",
"workflow-automation",
"event-emission",
"retrieval",
"reranking",
"context-management",
"scrum",
"devops",
"containers"
],
"requires": [],
"skill_id": "queueing",
"suppress_on_match": []
},
"skill_id": "queueing",
"split_log": [],
"typed": {
"alternatives_considered": [],
"confidence": 0.93,
"name": "Queueing",
"reasoning": "Queueing is fundamentally a knowledge unit about how waiting lines and work distribution behave, so by the Concept vs Methodology rule it is a Concept rather than an Architecture or Tool.",
"skill_id": "queueing",
"subtype": "queueing_theory",
"type": "Concept"
},
"warnings": [
"stage3_post_filter_dropped_catalog_only_locked_dims:40-\u003e1"
]
},
"source_tag": "llm",
"was_in_llm_skills": true
},
{
"aliases_in_db": [],
"canonical": null,
"dimensions": [
{
"dimension": {
"difficulty_hint": null,
"display_name": "API Integration, Request Orchestration, and Data Fetching",
"id": null,
"rationale": "Connecting applications to internal or external services through request/response APIs. This includes consuming REST and GraphQL endpoints, orchestrating requests, handling payloads and response parsing, pagination, retries, error handling, and shaping remote data for downstream or UI consumption.",
"slug": "d_merge_01",
"source": "llm"
},
"input_skill": "APIs",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Cloud Service Integration Patterns",
"id": 188,
"rationale": "Covers how cloud services and workloads connect through APIs, events, shared services, and integration boundaries. This cluster is coherent because architects must define interaction patterns that preserve decoupling, security, and operability.",
"slug": "cloud-service-integration-patterns",
"source": "db"
},
"input_skill": "APIs",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Cloud Architect",
"id": 11,
"rationale": null,
"role_archetype": null,
"slug": "cloud-architect",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"input_skill": "APIs",
"llm_role": null,
"roles_from_db": []
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Cloud Service Integration Patterns",
"id": 188,
"rationale": "Covers how cloud services and workloads connect through APIs, events, shared services, and integration boundaries. This cluster is coherent because architects must define interaction patterns that preserve decoupling, security, and operability.",
"slug": "cloud-service-integration-patterns",
"source": "db"
},
"input_skill": "APIs",
"llm_role": null,
"roles_from_db": [
{
"display_name": "Cloud Architect",
"id": 11,
"rationale": null,
"role_archetype": null,
"slug": "cloud-architect",
"source": "db"
}
]
}
],
"input_skill": "APIs",
"matched_via": null,
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": {
"derived": {
"category": "Protocol",
"skill_nature": "PROTOCOL",
"sub_category": "application_programming_interfaces",
"typical_lifespan": "EVERGREEN",
"version_strategy": "NOT_APPLICABLE",
"volatility": "STABLE"
},
"enrichment": {
"ambiguity": {
"ambiguity_flag": false,
"confused_with": [],
"reasoning": "\u201cAPIs\u201d is a standard, widely used term in JDs and usually refers unambiguously to application programming interfaces; it is not typically confused with a distinct catalog skill."
},
"context_keywords": {
"context_keywords": [
"REST",
"GraphQL",
"OpenAPI",
"Swagger",
"JSON",
"XML",
"OAuth 2.0",
"API gateway",
"endpoint",
"webhook",
"rate limiting",
"pagination",
"versioning",
"SDK",
"microservices"
]
},
"maturity": {
"confidence": 0.98,
"maturity": "well_known",
"reasoning": "APIs are a hiring-pipeline staple across backend, mobile, and platform JDs; REST/GraphQL/API design appears in large-volume job postings and cloud vendor docs."
},
"skill_id": "apis",
"vendor_license": {
"confidence": 0.99,
"license": null,
"vendor": null,
"year_introduced": null
},
"versioning": {
"current_version": null,
"version_aliases": {},
"versioned": false
}
},
"keep_log": [
{
"a_dim_id": "cloud-service-integration-patterns",
"a_name": "Cloud Service Integration Patterns",
"a_role": "__skill_focal__",
"b_dim_id": "cloud-service-integration-patterns",
"b_name": "Cloud Service Integration Patterns",
"b_role": "Cloud Architect",
"pair_kind": "cross_role",
"reasoning": "Same wording, different level. Dim A is implementation-facing: APIs as an integration mechanism between cloud services/pipelines/platforms, with examples like RESTful integration, webhook consumers, shared service contracts, and cross-system orchestration; it excludes client parsing, broker internals, and auth-only concerns. Dim B is architect-facing: defining interaction patterns that preserve decoupling, security, and operability across cloud services and workloads. A covers how to integrate; B covers how to design the integration boundary. Different skills belong under each.",
"similarity": 0.8418532624733192
}
],
"locked_dimensions": [
{
"description": "Connecting applications to internal or external services through request/response APIs. This includes consuming REST and GraphQL endpoints, orchestrating requests, handling payloads and response parsing, pagination, retries, error handling, and shaping remote data for downstream or UI consumption.",
"exemplar_skills": [
"API Integration, Request Orchestration, and Data Fetching"
],
"in_scope": "Skills, tools, and practices that belong under API Integration, Request Orchestration, and Data Fetching for the target role, including items implied by the dimension rationale.",
"name": "API Integration, Request Orchestration, and Data Fetching",
"out_of_scope": "Adjacent clusters explicitly not owned by API Integration, Request Orchestration, and Data Fetching, including unrelated platforms, roles, and skill families per library policy.",
"overlap_flags": [],
"tentative_id": "d_merge_01"
},
{
"description": "How services connect across boundaries using APIs, events, and shared interfaces. The target skill belongs here when APIs are treated as an integration mechanism between cloud services, pipelines, or platforms.",
"exemplar_skills": [
"APIs",
"service integration",
"RESTful services",
"webhooks",
"integration patterns",
"service contracts"
],
"in_scope": "APIs, service-to-service calls, RESTful integration, webhook consumers, integration boundaries, shared service contracts, cross-system orchestration",
"name": "Cloud Service Integration Patterns",
"out_of_scope": "API client parsing details, UI data fetching, message broker internals, database access optimization, authentication-only concerns",
"overlap_flags": [
{
"reason": "Both dimensions cover API-based communication, but this one is broader and architecture-oriented rather than client-fetch focused.",
"with_dim_id": "api-integration-and-data-fetching",
"with_dim_name": null,
"with_role": "Frontend Engineer, Full Stack Developer"
},
{
"reason": "Integration architectures often combine APIs with asynchronous messaging, so boundary decisions can overlap.",
"with_dim_id": "messaging-and-event-streaming",
"with_dim_name": null,
"with_role": "Backend Engineer"
}
],
"tentative_id": "cloud-service-integration-patterns"
},
{
"description": "Defining API contracts, resource models, and request/response semantics for services. This dimension fits the target skill when APIs refers to designing or documenting interfaces rather than merely consuming them.",
"exemplar_skills": [
"APIs",
"REST API design",
"OpenAPI",
"Swagger",
"endpoint design",
"API versioning"
],
"in_scope": "APIs, REST resource design, endpoint naming, versioning, OpenAPI, Swagger, request/response schemas, status codes, idempotency, contract design",
"name": "API Design and Specification",
"out_of_scope": "Consuming APIs from client code, pagination handling in fetch logic, event streaming, database schema design, authentication implementation",
"overlap_flags": [
{
"reason": "API design overlaps with API consumption, but this dimension is about defining contracts rather than calling them.",
"with_dim_id": "api-integration-and-data-fetching",
"with_dim_name": null,
"with_role": "Frontend Engineer, Full Stack Developer"
},
{
"reason": "Well-designed APIs are often part of broader integration patterns across services and platforms.",
"with_dim_id": "cloud-service-integration-patterns",
"with_dim_name": null,
"with_role": "Cloud Architect"
}
],
"tentative_id": "d_init_01"
},
{
"description": "Covers how cloud services and workloads connect through APIs, events, shared services, and integration boundaries. This cluster is coherent because architects must define interaction patterns that preserve decoupling, security, and operability.",
"exemplar_skills": [
"Cloud Service Integration Patterns"
],
"in_scope": "Skills, tools, and practices that belong under Cloud Service Integration Patterns for the target role, including items implied by the dimension rationale.",
"name": "Cloud Service Integration Patterns",
"out_of_scope": "Adjacent clusters explicitly not owned by Cloud Service Integration Patterns, including unrelated platforms, roles, and skill families per library policy.",
"overlap_flags": [],
"tentative_id": "cloud-service-integration-patterns"
}
],
"merge_log": [
{
"a_dim_id": "api-integration-and-data-fetching",
"a_name": "API Integration and Data Fetching",
"a_role": "__skill_focal__",
"b_dim_id": "api-integration-and-data-fetching",
"b_name": "API Integration and Data Fetching",
"b_role": "Full Stack Developer",
"into": "d_merge_01",
"into_name": "API Integration, Request Orchestration, and Data Fetching",
"merged_from": [
"api-integration-and-data-fetching",
"api-integration-and-data-fetching"
],
"pair_kind": "cross_role",
"reasoning": "Both dimensions describe the same skill cluster: consuming APIs and fetching remote data via request/response calls. Dim A covers external/internal services, REST, GraphQL, pagination, retries, payload handling, and contract-aware data shaping; Dim B covers frontend-to-backend and third-party endpoints, request orchestration, error handling, pagination, and shaping remote data for UI use. The exemplar skills in A (REST APIs, GraphQL, HTTP request handling, response parsing) match B\u2019s described work, so this is a wording difference, not a distinct cluster.",
"similarity": 0.8030311798440094
}
],
"placed": {
"name": "APIs",
"placement_confidence": 0.92,
"primary_dimension": "d_merge_01",
"reasoning": "Deterministic JD placement: locked_dimensions has 4 dimension(s) from skill-driven dimension generation after reconciliation; primary_dimension is the first locked dim.",
"secondary_dimensions": [
"cloud-service-integration-patterns",
"d_init_01"
],
"skill_id": "apis"
},
"relationships": {
"child_skills": [],
"parent_skills": [],
"related_to": [
"rest-apis",
"amazon-api-gateway",
"infura",
"aws-lambda",
"proxy-patterns",
"workflow-automation",
"agent-tooling",
"aws-data-pipeline",
"minio",
"authentication"
],
"requires": [],
"skill_id": "apis",
"suppress_on_match": []
},
"skill_id": "apis",
"split_log": [],
"typed": {
"alternatives_considered": [],
"confidence": 0.91,
"name": "APIs",
"reasoning": "APIs are a communication interface standard between systems, so by the Protocol vs Standard rule they fit best as a Protocol rather than a tool or platform.",
"skill_id": "apis",
"subtype": "application_programming_interfaces",
"type": "Protocol"
},
"warnings": [
"stage3_post_filter_dropped_catalog_only_locked_dims:42-\u003e4"
]
},
"source_tag": "llm",
"was_in_llm_skills": true
},
{
"aliases_in_db": [],
"canonical": null,
"dimensions": [
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Testing and Validation Practices",
"id": 221,
"rationale": "Validating platform changes before release, including functional checks and regression verification. This cluster is coherent because ServiceNow developers must confirm workflows, scripts, and integrations behave as intended.",
"slug": "testing-and-validation-practices",
"source": "db"
},
"input_skill": "Testability",
"llm_role": null,
"roles_from_db": [
{
"display_name": "ServiceNOW Developer",
"id": 24,
"rationale": null,
"role_archetype": null,
"slug": "servicenow-developer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Testing and Validation Practices",
"id": 221,
"rationale": "Validating platform changes before release, including functional checks and regression verification. This cluster is coherent because ServiceNow developers must confirm workflows, scripts, and integrations behave as intended.",
"slug": "testing-and-validation-practices",
"source": "db"
},
"input_skill": "Testability",
"llm_role": null,
"roles_from_db": [
{
"display_name": "ServiceNOW Developer",
"id": 24,
"rationale": null,
"role_archetype": null,
"slug": "servicenow-developer",
"source": "db"
}
]
}
],
"input_skill": "Testability",
"matched_via": null,
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": {
"derived": {
"category": "Concept",
"skill_nature": "CONCEPT",
"sub_category": "software_testability_concept",
"typical_lifespan": "EVERGREEN",
"version_strategy": "NOT_APPLICABLE",
"volatility": "STABLE"
},
"enrichment": {
"ambiguity": {
"ambiguity_flag": false,
"confused_with": [],
"reasoning": "\u201cTestability\u201d is a specific software engineering concept and is unlikely to be mistaken for a different catalog skill in typical job descriptions."
},
"context_keywords": {
"context_keywords": [
"unit tests",
"integration tests",
"test coverage",
"mocking",
"dependency injection",
"assertions",
"test harness",
"automated testing",
"regression testing",
"test doubles",
"stubs",
"fixtures",
"TDD",
"CI/CD",
"code coverage"
]
},
"maturity": {
"confidence": 0.93,
"maturity": "well_known",
"reasoning": "Testability is a common requirement in software engineering JDs and interview rubrics, often paired with unit/integration testing, CI, and TDD; it\u2019s a standard quality attribute rather than a niche tool."
},
"skill_id": "testability",
"vendor_license": {
"confidence": 0.99,
"license": null,
"vendor": null,
"year_introduced": null
},
"versioning": {
"current_version": null,
"version_aliases": {},
"versioned": false
}
},
"keep_log": [
{
"a_dim_id": "testing-and-validation-practices",
"a_name": "Testing and Validation Practices",
"a_role": "__skill_focal__",
"b_dim_id": "testing-and-validation-practices",
"b_name": "Testing and Validation Practices",
"b_role": "ServiceNOW Developer",
"pair_kind": "cross_role",
"reasoning": "Both dims share the label, but A is general software testing practice: test design, regression checks, harnesses, mocks/stubs, assertions, fixtures, and testability. B is ServiceNow-specific release validation: checking workflows, scripts, and integrations behave as intended. A\u2019s exemplars (unit/integration testing, mocking and stubbing) are about building test infrastructure; B\u2019s description is about platform change verification. Same umbrella word, different skill clusters.",
"similarity": 0.6642010668546267
}
],
"locked_dimensions": [
{
"description": "Practices for verifying that software changes behave correctly before release, including test design, regression checks, and validation workflows. Testability belongs here because it describes how easily a system can be exercised and verified by tests.",
"exemplar_skills": [
"Testability",
"unit testing",
"integration testing",
"regression testing",
"test harness design",
"mocking and stubbing",
"validation checks"
],
"in_scope": "Testability, unit tests, integration tests, regression testing, test harnesses, mocks and stubs, assertions, test fixtures, validation checks",
"name": "Testing and Validation Practices",
"out_of_scope": "Test reporting and defect triage, manual test evidence capture, performance benchmarking, production monitoring, these belong to quality reporting or operations rather than making code easier to test",
"overlap_flags": [
{
"reason": "Testability can influence coverage and pass rates, but that dimension focuses on reporting outcomes rather than designing testable systems.",
"with_dim_id": "test-reporting-and-quality-metrics",
"with_dim_name": null,
"with_role": "Automation Tester"
}
],
"tentative_id": "testing-and-validation-practices"
},
{
"description": "Validating platform changes before release, including functional checks and regression verification. This cluster is coherent because ServiceNow developers must confirm workflows, scripts, and integrations behave as intended.",
"exemplar_skills": [
"Testing and Validation Practices"
],
"in_scope": "Skills, tools, and practices that belong under Testing and Validation Practices for the target role, including items implied by the dimension rationale.",
"name": "Testing and Validation Practices",
"out_of_scope": "Adjacent clusters explicitly not owned by Testing and Validation Practices, including unrelated platforms, roles, and skill families per library policy.",
"overlap_flags": [],
"tentative_id": "testing-and-validation-practices"
}
],
"merge_log": [],
"placed": {
"name": "Testability",
"placement_confidence": 0.92,
"primary_dimension": "testing-and-validation-practices",
"reasoning": "Deterministic JD placement: locked_dimensions has 2 dimension(s) from skill-driven dimension generation after reconciliation; primary_dimension is the first locked dim.",
"secondary_dimensions": [],
"skill_id": "testability"
},
"relationships": {
"child_skills": [],
"parent_skills": [],
"related_to": [
"observability",
"failure-analysis",
"restore-testing",
"eval-design",
"evaluation",
"evaluation-design",
"code-review",
"ci-cd",
"devops",
"agile"
],
"requires": [],
"skill_id": "testability",
"suppress_on_match": []
},
"skill_id": "testability",
"split_log": [],
"typed": {
"alternatives_considered": [],
"confidence": 0.97,
"name": "Testability",
"reasoning": "By the Concept vs Methodology rule, testability is a named knowledge unit about how easily software can be tested, not a process or tool.",
"skill_id": "testability",
"subtype": "software_testability_concept",
"type": "Concept"
},
"warnings": [
"stage3_post_filter_dropped_catalog_only_locked_dims:41-\u003e2"
]
},
"source_tag": "llm",
"was_in_llm_skills": true
},
{
"aliases_in_db": [
{
"alias_text": "MLOps",
"alias_type": "CANONICAL",
"id": 3600,
"is_primary": false,
"match_strategy": "CASE_INSENSITIVE"
}
],
"canonical": {
"category_id": 7,
"display_name": "MLOps",
"id": 2643,
"is_also_category": false,
"is_extractable": true,
"skill_nature": "METHODOLOGY",
"slug": "mlops",
"sub_category_id": 2156,
"typical_lifespan": "EVERGREEN",
"volatility": "STABLE"
},
"dimensions": [
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Inference Data Pipelines",
"id": 59,
"rationale": "Operational data movement for batch scoring, feature refresh, and inference-time data preparation. This is separate from model training because it focuses on getting the right data to the serving path reliably.",
"slug": "inference-data-pipelines",
"source": "db"
},
"input_skill": "MLOps",
"llm_role": null,
"roles_from_db": [
{
"display_name": "MLOps Engineer",
"id": 5,
"rationale": null,
"role_archetype": null,
"slug": "mlops-engineer",
"source": "db"
}
]
},
{
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Model Serving Deployment and Runtime Packaging",
"id": 52,
"rationale": "Operational deployment of trained models into online, batch, or streaming serving environments, including packaging models and model servers into containers or managed inference runtimes, coordinating rollout, and handing off to inference systems. Covers serving frameworks and platforms such as TensorFlow Serving, TorchServe, Triton Inference Server, BentoML, KServe, and Seldon Core, plus container/runtime concerns like Docker images, GPU-enabled containers, base image selection, container entrypoints, runtime dependencies, and image scanning for model services.",
"slug": "model-serving-deployment-and-runtime-packaging",
"source": "db"
},
"input_skill": "MLOps",
"llm_role": null,
"roles_from_db": [
{
"display_name": "MLOps Engineer",
"id": 5,
"rationale": null,
"role_archetype": null,
"slug": "mlops-engineer",
"source": "db"
},
{
"display_name": "Machine Learning Engineer",
"id": 10,
"rationale": null,
"role_archetype": null,
"slug": "machine-learning-engineer",
"source": "db"
}
]
}
],
"input_skill": "MLOps",
"matched_via": "alias",
"new_alias_persisted": false,
"new_alias_text": null,
"new_skill_meta": null,
"source_tag": "db",
"was_in_llm_skills": true
}
],
"unmatched_skills": [
"PySpark",
"Notebooks",
"Big Data",
"Data Modeling",
"Data Pipelines",
"Data Ingestion",
"Stream Processing",
"Queueing",
"APIs",
"Testability"
]
}
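Note on the "placed" blocks in the API 2 response above: each "reasoning" string says that primary_dimension is the first locked dimension. The sketch below reproduces the primary/secondary split seen for "APIs" and "Testability". The function name place_skill is hypothetical, and the de-duplication of secondary ids is an assumption inferred from those two blocks only, not confirmed pipeline logic.

```python
from typing import List, Optional, Tuple


def place_skill(locked_dimension_ids: List[str]) -> Tuple[Optional[str], List[str]]:
    """Split an ordered list of locked dimension ids into (primary, secondaries).

    Assumption: the first id is the primary dimension; the remaining ids become
    secondaries, skipping repeats and anything equal to the primary.
    """
    if not locked_dimension_ids:
        return None, []
    primary = locked_dimension_ids[0]
    secondaries: List[str] = []
    for dim_id in locked_dimension_ids[1:]:
        # Keep first-seen order while dropping duplicates of earlier ids.
        if dim_id != primary and dim_id not in secondaries:
            secondaries.append(dim_id)
    return primary, secondaries


# tentative_id values copied in order from the locked_dimensions arrays above.
apis_ids = [
    "d_merge_01",
    "cloud-service-integration-patterns",
    "d_init_01",
    "cloud-service-integration-patterns",
]
testability_ids = [
    "testing-and-validation-practices",
    "testing-and-validation-practices",
]

apis_placement = place_skill(apis_ids)
testability_placement = place_skill(testability_ids)
```

Under these assumptions, apis_placement matches the "APIs" block (primary d_merge_01, two secondaries) and testability_placement matches the "Testability" block (one primary, no secondaries).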
API 3 — final-role-output
{
"chosen_role": {
"display_name": "Data Engineer",
"id": 6,
"rationale": "The primary skills indicate a strong focus on data processing, SQL, and Azure technologies, aligning well with a Data Engineer\u0027s responsibilities.",
"role_archetype": null,
"slug": "data-engineer",
"source": "db"
},
"chosen_role_resolution": "in_db",
"final_input_skills": [
{
"skill": "Azure",
"tag": "in_db"
},
{
"skill": "Python",
"tag": "in_db"
},
{
"skill": "PySpark",
"tag": "new"
},
{
"skill": "SQL",
"tag": "in_db"
},
{
"skill": "MLflow",
"tag": "in_db"
},
{
"skill": "Azure Machine Learning",
"tag": "in_db"
},
{
"skill": "Azure Data Factory",
"tag": "in_db"
},
{
"skill": "Databricks",
"tag": "in_db"
},
{
"skill": "Notebooks",
"tag": "new"
},
{
"skill": "CI/CD",
"tag": "in_db"
},
{
"skill": "Big Data",
"tag": "new"
},
{
"skill": "Data Modeling",
"tag": "new"
},
{
"skill": "Data Pipelines",
"tag": "new"
},
{
"skill": "Data Ingestion",
"tag": "new"
},
{
"skill": "Stream Processing",
"tag": "new"
},
{
"skill": "Queueing",
"tag": "new"
},
{
"skill": "APIs",
"tag": "new"
},
{
"skill": "Testability",
"tag": "new"
},
{
"skill": "MLOps",
"tag": "in_db"
}
],
"persistence": {
"items": [
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Cloud Platform Operations",
"id": 26,
"rationale": "Uses cloud provider services to support delivery and runtime environments. The focus is on consumer-level operation of cloud services rather than deep cloud architecture ownership.",
"slug": "cloud-platform-operations",
"source": "db"
},
"dimension_id": 26,
"input_skill": "Azure",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [
{
"display_name": "DevOps Engineer",
"id": 1,
"rationale": null,
"role_archetype": "A DevOps Engineer enables reliable, repeatable delivery of software by designing and operating the processes that connect development and production. They focus on improving deployment flow, operational stability, and collaboration between teams through automation, standardization, and monitoring of delivery and runtime practices.",
"slug": "devops-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 164,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Cloud Security Platforms",
"id": 332,
"rationale": "Cloud-native security products used to assess posture, detect misconfigurations, and monitor workloads across AWS, Azure, and GCP. This is a distinct product family because the role often works across multiple CNAPP/CSPM/CWPP offerings and cloud-native detectors.",
"slug": "cloud-security-platforms",
"source": "db"
},
"dimension_id": 332,
"input_skill": "Azure",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [
{
"display_name": "Cybersecurity Engineer",
"id": 9,
"rationale": null,
"role_archetype": null,
"slug": "cybersecurity-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 164,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Analytical Programming Languages",
"id": 82,
"rationale": "Languages used to clean, transform, analyze, and prototype models in notebooks and scripts. This is the core coding surface for expressing statistical logic and data manipulation in a reproducible way.",
"slug": "analytical-programming-languages",
"source": "db"
},
"dimension_id": 82,
"input_skill": "Python",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [
{
"display_name": "Data Analyst",
"id": 20,
"rationale": null,
"role_archetype": null,
"slug": "data-analyst",
"source": "db"
},
{
"display_name": "Data Scientist",
"id": 7,
"rationale": null,
"role_archetype": null,
"slug": "data-scientist",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 393,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Automation Scripting and CLI",
"id": 48,
"rationale": "Uses scripts and command-line tooling to execute repeatable Azure operations and reduce manual work. This is a practical cluster because the role frequently automates provisioning, checks, and remediation tasks.",
"slug": "automation-scripting-and-cli",
"source": "db"
},
"dimension_id": 48,
"input_skill": "Python",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [
{
"display_name": "Azure Cloud Engineer",
"id": 4,
"rationale": null,
"role_archetype": null,
"slug": "azure-cloud-engineer",
"source": "db"
},
{
"display_name": "Cloud Engineer",
"id": 18,
"rationale": null,
"role_archetype": null,
"slug": "cloud-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 393,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Automation and Scripting for Operations",
"id": 361,
"rationale": "Scripts and lightweight automation used to execute repetitive virtualization tasks and enforce operational consistency. This is the practical glue that reduces manual host and VM administration.",
"slug": "automation-and-scripting-for-operations",
"source": "db"
},
"dimension_id": 361,
"input_skill": "Python",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [
{
"display_name": "Virtualization Engineer",
"id": 26,
"rationale": null,
"role_archetype": null,
"slug": "virtualization-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 393,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Network Automation and Scripting",
"id": 285,
"rationale": "Covers scripts and automation used to configure, validate, and audit network devices and services. This cluster is coherent because repeatable network operations increasingly depend on programmatic changes and checks.",
"slug": "network-automation-and-scripting",
"source": "db"
},
"dimension_id": 285,
"input_skill": "Python",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [
{
"display_name": "Network Engineer",
"id": 21,
"rationale": null,
"role_archetype": null,
"slug": "network-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 393,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Programming Languages for AI Workflows",
"id": 261,
"rationale": "Languages used to implement AI feature logic, orchestration, and response handling inside product code. This is the core coding surface for turning prompts and model calls into reliable application behavior.",
"slug": "programming-languages-for-ai-workflows",
"source": "db"
},
"dimension_id": 261,
"input_skill": "Python",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [
{
"display_name": "AI Engineer",
"id": 12,
"rationale": null,
"role_archetype": null,
"slug": "ai-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 393,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Programming Languages for Backend Systems",
"id": 140,
"rationale": "Languages used to implement server-side business logic, request handlers, workers, and service integrations. This is the core coding surface for backend feature delivery and maintenance.",
"slug": "programming-languages-for-backend-systems",
"source": "db"
},
"dimension_id": 140,
"input_skill": "Python",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [
{
"display_name": "Backend Engineer",
"id": 14,
"rationale": null,
"role_archetype": null,
"slug": "backend-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 393,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Programming Languages for Data Work",
"id": 67,
"rationale": "Languages used to implement data pipelines, transformations, and operational utilities. This is the code layer for expressing extraction, parsing, validation, and orchestration logic in data engineering workflows.",
"slug": "programming-languages-for-data-work",
"source": "db"
},
"dimension_id": 67,
"input_skill": "Python",
"llm_role": null,
"matched_chosen_role": true,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension saved",
"role_dimension_saved": true,
"roles_from_db": [
{
"display_name": "Data Engineer",
"id": 6,
"rationale": null,
"role_archetype": null,
"slug": "data-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 393,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Programming Languages for ML Systems",
"id": 113,
"rationale": "Languages used to implement model integration code, inference services, and feature-processing logic. This is the core coding surface for turning trained models into product-facing software components.",
"slug": "programming-languages-for-ml-systems",
"source": "db"
},
"dimension_id": 113,
"input_skill": "Python",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [
{
"display_name": "Machine Learning Engineer",
"id": 10,
"rationale": null,
"role_archetype": null,
"slug": "machine-learning-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 393,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Programming Languages for Security Work",
"id": 328,
"rationale": "Languages used to automate security tasks, write detection logic, and build analysis or remediation tooling. This is the core coding surface for a cybersecurity engineer across scripts, queries, and small utilities.",
"slug": "programming-languages-for-security-work",
"source": "db"
},
"dimension_id": 328,
"input_skill": "Python",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [
{
"display_name": "Cybersecurity Engineer",
"id": 9,
"rationale": null,
"role_archetype": null,
"slug": "cybersecurity-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 393,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Programming Languages for Test Automation",
"id": 193,
"rationale": "Languages used to implement automated checks, helper utilities, and test harness code. This is the core coding surface for turning test ideas into maintainable automation.",
"slug": "programming-languages-for-test-automation",
"source": "db"
},
"dimension_id": 193,
"input_skill": "Python",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [
{
"display_name": "Automation Tester",
"id": 16,
"rationale": null,
"role_archetype": null,
"slug": "automation-tester",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 393,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Security Automation and Scripting",
"id": 258,
"rationale": "Automating repeatable security checks, enrichment, and remediation workflows. This cluster is coherent because the role often needs lightweight automation to scale analysis and response.",
"slug": "security-automation-and-scripting",
"source": "db"
},
"dimension_id": 258,
"input_skill": "Python",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [
{
"display_name": "Cybersecurity Engineer",
"id": 9,
"rationale": null,
"role_archetype": null,
"slug": "cybersecurity-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 393,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Relational Data Modeling",
"id": 71,
"rationale": "Designing tables, relationships, constraints, and transactional data shapes for operational backend systems. This cluster is coherent because backend services frequently own the canonical application data model.",
"slug": "relational-data-modeling",
"source": "db"
},
"dimension_id": 71,
"input_skill": "SQL",
"llm_role": null,
"matched_chosen_role": true,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension saved",
"role_dimension_saved": true,
"roles_from_db": [
{
"display_name": "Backend Engineer",
"id": 14,
"rationale": null,
"role_archetype": null,
"slug": "backend-engineer",
"source": "db"
},
{
"display_name": "Data Engineer",
"id": 6,
"rationale": null,
"role_archetype": null,
"slug": "data-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 2601,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"dimension_id": 365,
"input_skill": "SQL",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [],
"skill_dimension_saved": true,
"skill_id": 2601,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Model Serving Deployment and Runtime Packaging",
"id": 52,
"rationale": "Operational deployment of trained models into online, batch, or streaming serving environments, including packaging models and model servers into containers or managed inference runtimes, coordinating rollout, and handing off to inference systems. Covers serving frameworks and platforms such as TensorFlow Serving, TorchServe, Triton Inference Server, BentoML, KServe, and Seldon Core, plus container/runtime concerns like Docker images, GPU-enabled containers, base image selection, container entrypoints, runtime dependencies, and image scanning for model services.",
"slug": "model-serving-deployment-and-runtime-packaging",
"source": "db"
},
"dimension_id": 52,
"input_skill": "MLflow",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [
{
"display_name": "MLOps Engineer",
"id": 5,
"rationale": null,
"role_archetype": null,
"slug": "mlops-engineer",
"source": "db"
},
{
"display_name": "Machine Learning Engineer",
"id": 10,
"rationale": null,
"role_archetype": null,
"slug": "machine-learning-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 2640,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Project Delivery and Coordination",
"id": 366,
"rationale": "Coordination practices for organizing work, tracking progress, and aligning stakeholders across a delivery effort. Agile fits here when used as a team execution framework for managing scope, cadence, and collaboration.",
"slug": "d_init_02",
"source": "db"
},
"dimension_id": 366,
"input_skill": "MLflow",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [],
"skill_dimension_saved": true,
"skill_id": 2640,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"dimension_id": 365,
"input_skill": "MLflow",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [],
"skill_dimension_saved": true,
"skill_id": 2640,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Cloud ML Platform Operations",
"id": 65,
"rationale": "Consumer-level operation of managed ML services and cloud resources used to train and serve models. This covers the cloud platform surface that MLOps engineers use without owning the underlying cloud platform itself.",
"slug": "cloud-ml-platform-operations",
"source": "db"
},
"dimension_id": 65,
"input_skill": "Azure Machine Learning",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [
{
"display_name": "MLOps Engineer",
"id": 5,
"rationale": null,
"role_archetype": null,
"slug": "mlops-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 385,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Cloud Data Platform Services",
"id": 81,
"rationale": "Consumer-level use of cloud services that support data engineering workloads. This includes managed compute, storage, networking-adjacent services, and security primitives used to run pipelines and data platforms.",
"slug": "cloud-data-platform-services",
"source": "db"
},
"dimension_id": 81,
"input_skill": "Azure Data Factory",
"llm_role": null,
"matched_chosen_role": true,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension saved",
"role_dimension_saved": true,
"roles_from_db": [
{
"display_name": "Data Engineer",
"id": 6,
"rationale": null,
"role_archetype": null,
"slug": "data-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 467,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Cloud ML Platform Operations",
"id": 65,
"rationale": "Consumer-level operation of managed ML services and cloud resources used to train and serve models. This covers the cloud platform surface that MLOps engineers use without owning the underlying cloud platform itself.",
"slug": "cloud-ml-platform-operations",
"source": "db"
},
"dimension_id": 65,
"input_skill": "Databricks",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [
{
"display_name": "MLOps Engineer",
"id": 5,
"rationale": null,
"role_archetype": null,
"slug": "mlops-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 386,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"dimension_id": 365,
"input_skill": "CI/CD",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [],
"skill_dimension_saved": true,
"skill_id": 2579,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Inference Data Pipelines",
"id": 59,
"rationale": "Operational data movement for batch scoring, feature refresh, and inference-time data preparation. This is separate from model training because it focuses on getting the right data to the serving path reliably.",
"slug": "inference-data-pipelines",
"source": "db"
},
"dimension_id": 59,
"input_skill": "MLOps",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [
{
"display_name": "MLOps Engineer",
"id": 5,
"rationale": null,
"role_archetype": null,
"slug": "mlops-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 2643,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Model Serving Deployment and Runtime Packaging",
"id": 52,
"rationale": "Operational deployment of trained models into online, batch, or streaming serving environments, including packaging models and model servers into containers or managed inference runtimes, coordinating rollout, and handing off to inference systems. Covers serving frameworks and platforms such as TensorFlow Serving, TorchServe, Triton Inference Server, BentoML, KServe, and Seldon Core, plus container/runtime concerns like Docker images, GPU-enabled containers, base image selection, container entrypoints, runtime dependencies, and image scanning for model services.",
"slug": "model-serving-deployment-and-runtime-packaging",
"source": "db"
},
"dimension_id": 52,
"input_skill": "MLOps",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [
{
"display_name": "MLOps Engineer",
"id": 5,
"rationale": null,
"role_archetype": null,
"slug": "mlops-engineer",
"source": "db"
},
{
"display_name": "Machine Learning Engineer",
"id": 10,
"rationale": null,
"role_archetype": null,
"slug": "machine-learning-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 2643,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": null,
"display_name": "Analytical Programming and Notebook Languages",
"id": null,
"rationale": "Languages and notebook/script-based coding used to clean, transform, analyze, and prototype data workflows and models. Includes Python, pandas, SQL, PySpark, notebook scripting, dataframe manipulation, exploratory analysis, ETL/data transformation logic, and other reproducible analytical code.",
"slug": "d_merge_01",
"source": "llm"
},
"dimension_id": 82,
"input_skill": "PySpark",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "New skill saved \u00b7 Existing dimension (reconciliation merge) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [],
"skill_dimension_saved": true,
"skill_id": 2684,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"dimension_id": 365,
"input_skill": "PySpark",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "New skill saved \u00b7 Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [],
"skill_dimension_saved": true,
"skill_id": 2684,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": null,
"display_name": "Analytical Programming and Notebook-Based Data Analysis",
"id": null,
"rationale": "Languages and notebook-friendly coding used to clean, transform, analyze, and prototype data and model workflows. This includes Python, R, SQL, and Scala used in notebooks or scripts for data wrangling, exploratory data analysis, statistical logic, feature engineering, and reproducible prototyping. It excludes production orchestration and scheduling, dashboard/report authoring, model deployment packaging, database administration, and UI development.",
"slug": "d_merge_01",
"source": "llm"
},
"dimension_id": 82,
"input_skill": "Notebooks",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "New skill saved \u00b7 Existing dimension (reconciliation merge) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [],
"skill_dimension_saved": true,
"skill_id": 2685,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"dimension_id": 365,
"input_skill": "Big Data",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "New skill saved \u00b7 Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [],
"skill_dimension_saved": true,
"skill_id": 2686,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Messaging and Event Streaming",
"id": 146,
"rationale": "Asynchronous communication patterns and systems for decoupled service interaction and background processing. This is a coherent backend cluster because many server-side workflows depend on queues, topics, and event streams.",
"slug": "messaging-and-event-streaming",
"source": "db"
},
"dimension_id": 146,
"input_skill": "Big Data",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "New skill saved \u00b7 Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [
{
"display_name": "Backend Engineer",
"id": 14,
"rationale": null,
"role_archetype": null,
"slug": "backend-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 2686,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"dimension_id": 365,
"input_skill": "Data Modeling",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "New skill saved \u00b7 Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [],
"skill_dimension_saved": true,
"skill_id": 2687,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": null,
"display_name": "Inference Data Pipelines for Serving and Batch Scoring",
"id": null,
"rationale": "Operational data movement that prepares and delivers timely, reliable data to production inference systems. Includes batch scoring inputs, feature refresh jobs, inference-time preprocessing, scheduled extracts, data validation for serving, and online/offline feature synchronization. Excludes training dataset curation, model training workflows, experimentation-focused feature engineering, model evaluation, and serving infrastructure/routing.",
"slug": "d_merge_01",
"source": "llm"
},
"dimension_id": 59,
"input_skill": "Data Pipelines",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "New skill saved \u00b7 Existing dimension (reconciliation merge) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [],
"skill_dimension_saved": true,
"skill_id": 2688,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"dimension_id": 365,
"input_skill": "Data Pipelines",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "New skill saved \u00b7 Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [],
"skill_dimension_saved": true,
"skill_id": 2688,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": null,
"display_name": "Asynchronous Messaging and Event Streaming",
"id": null,
"rationale": "Covers asynchronous communication and data movement through queues, topics, streams, event buses, and pub/sub systems for decoupled processing, background jobs, and event-driven integration. Includes continuous or event-driven data ingestion and change data capture pipelines, but excludes batch ETL orchestration, warehouse modeling, query optimization, model training data prep, and direct application API calls.",
"slug": "d_merge_01",
"source": "llm"
},
"dimension_id": 146,
"input_skill": "Data Ingestion",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "New skill saved \u00b7 Existing dimension (reconciliation merge) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [],
"skill_dimension_saved": true,
"skill_id": 2689,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"dimension_id": 365,
"input_skill": "Data Ingestion",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "New skill saved \u00b7 Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [],
"skill_dimension_saved": true,
"skill_id": 2689,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Messaging and Event Streaming",
"id": 146,
"rationale": "Asynchronous communication patterns and systems for decoupled service interaction and background processing. This is a coherent backend cluster because many server-side workflows depend on queues, topics, and event streams.",
"slug": "messaging-and-event-streaming",
"source": "db"
},
"dimension_id": 146,
"input_skill": "Stream Processing",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "New skill saved \u00b7 Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [
{
"display_name": "Backend Engineer",
"id": 14,
"rationale": null,
"role_archetype": null,
"slug": "backend-engineer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 2690,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": null,
"display_name": "Messaging, Queueing, and Event Streaming",
"id": null,
"rationale": "Asynchronous communication patterns and systems that decouple producers and consumers, buffer and route work items, and support background processing and service-to-service integration. Includes queueing, message queues, pub/sub, brokers, topics, consumer groups, producers/consumers, dead-letter queues, retry handling, backpressure, and event streaming platforms such as Kafka, RabbitMQ, SQS, and Azure Service Bus.",
"slug": "d_merge_01",
"source": "llm"
},
"dimension_id": 146,
"input_skill": "Queueing",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "New skill saved \u00b7 Existing dimension (reconciliation merge) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [],
"skill_dimension_saved": true,
"skill_id": 2691,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": null,
"display_name": "API Integration, Request Orchestration, and Data Fetching",
"id": null,
"rationale": "Connecting applications to internal or external services through request/response APIs. This includes consuming REST and GraphQL endpoints, orchestrating requests, handling payloads and response parsing, pagination, retries, error handling, and shaping remote data for downstream or UI consumption.",
"slug": "d_merge_01",
"source": "llm"
},
"dimension_id": 9,
"input_skill": "APIs",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "New skill saved \u00b7 Existing dimension (reconciliation merge) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [],
"skill_dimension_saved": true,
"skill_id": 2692,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Cloud Service Integration Patterns",
"id": 188,
"rationale": "Covers how cloud services and workloads connect through APIs, events, shared services, and integration boundaries. This cluster is coherent because architects must define interaction patterns that preserve decoupling, security, and operability.",
"slug": "cloud-service-integration-patterns",
"source": "db"
},
"dimension_id": 188,
"input_skill": "APIs",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "New skill saved \u00b7 Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [
{
"display_name": "Cloud Architect",
"id": 11,
"rationale": null,
"role_archetype": null,
"slug": "cloud-architect",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 2692,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Version Control Systems",
"id": 365,
"rationale": "Tools and workflows for tracking source changes, branching, merging, and collaborating on code history. Git belongs here because it is the canonical distributed version control system used to manage revisions and coordinate team development.",
"slug": "d_init_01",
"source": "db"
},
"dimension_id": 365,
"input_skill": "APIs",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "New skill saved \u00b7 Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [],
"skill_dimension_saved": true,
"skill_id": 2692,
"skill_tag": "in_db",
"skipped_reason": null
},
{
"chosen_role_id": 6,
"dimension": {
"difficulty_hint": "well_known",
"display_name": "Testing and Validation Practices",
"id": 221,
"rationale": "Validating platform changes before release, including functional checks and regression verification. This cluster is coherent because ServiceNow developers must confirm workflows, scripts, and integrations behave as intended.",
"slug": "testing-and-validation-practices",
"source": "db"
},
"dimension_id": 221,
"input_skill": "Testability",
"llm_role": null,
"matched_chosen_role": false,
"outcome_line": "New skill saved \u00b7 Existing dimension (library) \u00b7 Role\u2194dimension skipped (dimension not under chosen role)",
"role_dimension_saved": false,
"roles_from_db": [
{
"display_name": "ServiceNOW Developer",
"id": 24,
"rationale": null,
"role_archetype": null,
"slug": "servicenow-developer",
"source": "db"
}
],
"skill_dimension_saved": true,
"skill_id": 2693,
"skill_tag": "in_db",
"skipped_reason": null
}
],
"new_skills_created": 10,
"role_dimension_saved": 0,
"skill_dimension_saved": 16,
"skipped": 0
},
"planner_output": null,
"run_id": "265899c9-6b42-43cb-a0f8-64ac64ac5a98"
}
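The per-skill records above are easier to audit when grouped by whether the role-dimension link was saved or skipped. The following is a minimal sketch, not part of the pipeline itself: the function name `summarize_links` and the `records` parameter are hypothetical, and it assumes each record carries the `input_skill`, `dimension_id`, and `role_dimension_saved` fields exactly as they appear in the output above. The sample values are copied from the SQL and MLflow records in that output.

```python
from collections import Counter, defaultdict


def summarize_links(records):
    """Group skill-to-dimension records by whether the role link was saved.

    `records` is assumed to be the list of per-skill objects shown above,
    already parsed from JSON into Python dicts.
    """
    saved = defaultdict(set)
    skipped = defaultdict(set)
    for rec in records:
        # A record counts as saved only when role_dimension_saved is true.
        target = saved if rec.get("role_dimension_saved") else skipped
        target[rec["input_skill"]].add(rec["dimension_id"])
    return {
        "saved": {skill: sorted(ids) for skill, ids in saved.items()},
        "skipped": {skill: sorted(ids) for skill, ids in skipped.items()},
        # How many records point at each dimension id.
        "dimension_counts": Counter(rec["dimension_id"] for rec in records),
    }


# Three records reduced to the fields used here (values taken from the output above).
sample = [
    {"input_skill": "SQL", "dimension_id": 71, "role_dimension_saved": True},
    {"input_skill": "SQL", "dimension_id": 365, "role_dimension_saved": False},
    {"input_skill": "MLflow", "dimension_id": 365, "role_dimension_saved": False},
]
summary = summarize_links(sample)
```

Run over the full record list, a tally like this can be compared against the summary counters at the end of the output (`role_dimension_saved`, `skill_dimension_saved`, `skipped`), and it shows which dimension ids attract the most skills.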
LLM Calls
Every model call made for this run, in pipeline order.