Building a Task Dependency System with DAGs and JGraphT

You're building a deployment pipeline. Task A needs to finish before Task B starts. Task C depends on both A and B. Task D only cares about B. Simple enough, right? Then someone asks, "What if we add Task E that depends on D, which depends on B, which depends on A?" Suddenly, your nicely organized if-statements turn into spaghetti code that would make an Italian chef weep.
This is where Directed Acyclic Graph, or DAG for short is useful. Despite sounding like something a computer science professor would drone about for three hours, DAGs are surprisingly practical for solving real-world problems like task scheduling, build systems, and data pipelines.
What Makes a DAG Actually Useful
A DAG is just a graph where edges have direction (A → B means A comes before B) and there are no cycles (you can't eventually loop back to where you started). The "no cycles" part is what makes them brilliant for dependencies. If Task A depends on Task B, which depends on Task C, which depends on Task A, you've got an infinite loop of existential dread. DAGs prevent that.
Think about your morning routine: you can't put on shoes before socks, but you can make coffee while the shower heats up. That's a DAG. Or consider a build system: compile the utils library before the main application, but tests can run in parallel once compilation finishes.
A Real Use Case: Data Processing Pipeline
Let's build something concrete: a data processing pipeline for an e-commerce analytics system. We need to:
Extract raw sales data from the database
Clean and validate the data
Calculate daily metrics (depends on cleaned data)
Calculate weekly aggregates (depends on daily metrics)
Generate reports (depends on weekly aggregates)
Send notifications (depends on reports)
But we also want to:
Calculate customer segments (depends on cleaned data, independent of metrics)
Update recommendation models (depends on customer segments)
This is perfect for a DAG. Some tasks can run in parallel, others must wait for dependencies.
JGraphT to the rescue
JGraphT is a Java library that handles graphs elegantly without forcing you to implement Dijkstra's algorithm for the thousandth time. It's mature, well-documented, and doesn't require a PhD to use.
First, add it to your Maven dependencies:
<dependency>
<groupId>org.jgrapht</groupId>
<artifactId>jgrapht-core</artifactId>
<version>1.5.2</version>
</dependency>
The Implementation
Here's a complete example of our data pipeline using JGraphT:
import org.jgrapht.Graph;
import org.jgrapht.graph.DefaultEdge;
import org.jgrapht.graph.DirectedAcyclicGraph;
import org.jgrapht.traverse.TopologicalOrderIterator;
import java.util.*;
import java.util.concurrent.*;
public class DataPipelineExecutor {
// Represents a task in our pipeline
static class Task {
private final String id;
private final Runnable action;
public Task(String id, Runnable action) {
this.id = id;
this.action = action;
}
public void execute() {
System.out.println("Starting: " + id);
action.run();
System.out.println("Completed: " + id);
}
public String getId() { return id; }
@Override
public String toString() { return id; }
}
public static void main(String[] args) throws Exception {
// Create the DAG
DirectedAcyclicGraph<Task, DefaultEdge> dag =
new DirectedAcyclicGraph<>(DefaultEdge.class);
// Define our tasks
Task extractData = new Task("extract_sales_data", () -> {
sleep(1000); // Simulate database query
});
Task cleanData = new Task("clean_data", () -> {
sleep(800); // Simulate data cleaning
});
Task calcDailyMetrics = new Task("calculate_daily_metrics", () -> {
sleep(1500); // Simulate calculations
});
Task calcWeeklyMetrics = new Task("calculate_weekly_metrics", () -> {
sleep(1200);
});
Task generateReports = new Task("generate_reports", () -> {
sleep(900);
});
Task sendNotifications = new Task("send_notifications", () -> {
sleep(500);
});
Task calcCustomerSegments = new Task("calculate_customer_segments", () -> {
sleep(1100);
});
Task updateRecommendations = new Task("update_recommendations", () -> {
sleep(700);
});
// Add all tasks to the graph
dag.addVertex(extractData);
dag.addVertex(cleanData);
dag.addVertex(calcDailyMetrics);
dag.addVertex(calcWeeklyMetrics);
dag.addVertex(generateReports);
dag.addVertex(sendNotifications);
dag.addVertex(calcCustomerSegments);
dag.addVertex(updateRecommendations);
// Define dependencies (edges)
dag.addEdge(extractData, cleanData);
dag.addEdge(cleanData, calcDailyMetrics);
dag.addEdge(calcDailyMetrics, calcWeeklyMetrics);
dag.addEdge(calcWeeklyMetrics, generateReports);
dag.addEdge(generateReports, sendNotifications);
// Parallel branch for customer segmentation
dag.addEdge(cleanData, calcCustomerSegments);
dag.addEdge(calcCustomerSegments, updateRecommendations);
// Execute the pipeline
System.out.println("Starting pipeline execution...\n");
executePipeline(dag);
}
private static void executePipeline(DirectedAcyclicGraph<Task, DefaultEdge> dag)
throws InterruptedException, ExecutionException {
// Topological sort gives us valid execution order
TopologicalOrderIterator<Task, DefaultEdge> iterator =
new TopologicalOrderIterator<>(dag);
ExecutorService executor = Executors.newFixedThreadPool(4);
Map<Task, Future<?>> futures = new HashMap<>();
Set<Task> completed = ConcurrentHashMap.newKeySet();
while (iterator.hasNext()) {
Task task = iterator.next();
// Submit task that waits for dependencies
Future<?> future = executor.submit(() -> {
// Wait for all dependencies to complete
for (DefaultEdge edge : dag.incomingEdgesOf(task)) {
Task dependency = dag.getEdgeSource(edge);
try {
futures.get(dependency).get(); // Wait for dependency
} catch (Exception e) {
e.printStackTrace();
}
}
// Execute the task
task.execute();
completed.add(task);
});
futures.put(task, future);
}
// Wait for all tasks to complete
for (Future<?> future : futures.values()) {
future.get();
}
executor.shutdown();
System.out.println("\nPipeline completed successfully!");
System.out.println("Total tasks executed: " + completed.size());
}
private static void sleep(long millis) {
try {
Thread.sleep(millis);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}
}
What's Actually Happening Here
The DirectedAcyclicGraph class does the heavy lifting. When you add an edge from Task A to Task B, it automatically checks for cycles. Try adding dag.addEdge(sendNotifications, extractData) and watch it throw a GraphCycleProhibitedException. No infinite loops allowed.
The TopologicalOrderIterator is where the magic happens. It gives us a valid execution order where all dependencies are satisfied. Task A will always come before Task B if B depends on A. Tasks without dependencies between them (like calculating metrics and customer segments) can run in parallel.
Our executor uses this order and a thread pool to run tasks as soon as their dependencies complete. The result? Maximum parallelism while respecting all dependencies.
Why This Beats Writing Your Own Scheduler
You could write your own task scheduler with a bunch of if-statements and thread coordination. You could also write your own HashMap implementation. Both are bad ideas for the same reason: the edge cases will eat you alive.
What happens when someone adds a circular dependency? What if two tasks can run in parallel but you're running them sequentially? What about debugging which task is blocking the pipeline? JGraphT handles all this. You get cycle detection, topological sorting, and even graph visualization tools for free.
Beyond Data Pipelines
Once you have a DAG-based system, you can use it for all sorts of things:
Build Systems: Compile source files in dependency order, rebuild only what changed.
Microservice Orchestration: Service A needs data from Services B and C before processing.
ETL Workflows: Extract from multiple sources, transform in stages, load to different destinations.
Recipe Execution: Making pasta? Boil water, cook pasta, make sauce. Some steps wait, others happen simultaneously.
The pattern is the same: nodes are tasks, edges are dependencies, topological sort gives you execution order.
The Bottom Line
DAGs aren't just academic abstractions. They're practical tools for managing complex dependencies in real systems. JGraphT makes implementing them straightforward enough that you can focus on your actual problem instead of graph theory.



