The goal of this part is to generate a Big Data workflow.
Input: # of modules, # of links
Output:
- Dependencies (size = n*n, int), where n = # of modules
- Map_data_function (size = n*n, String)
- Reduce_data_function (size = n*n, String)
- Map_workload_function (size = n*n, String)
- Reduce_workload_function (size = n*n, String)
Descriptions:
- Dependencies[i][j] = 1 means that there is a directed edge from module i to module j
- Map_data_function[i][j] = "linear" means that when module j takes input data from module i (i.e., Dependencies[i][j] = 1), the output data size of module j is a linear function of the input data size it receives from module i.
- The remaining three function matrices are defined analogously.
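A minimal sketch of such a generator is given below, in Python. It keeps the graph acyclic by only allowing edges from a lower module index to a higher one, which amounts to assuming the modules are topologically ordered. The function name generate_workflow and the vocabulary of function types beyond "linear" are illustrative assumptions, not part of the spec above.

```python
import random

def generate_workflow(num_modules, num_links, seed=None):
    """Sketch: generate a random DAG plus the four n*n function matrices."""
    rng = random.Random(seed)
    n = num_modules
    dependencies = [[0] * n for _ in range(n)]

    # Candidate edges (i, j) with i < j; restricting to i < j keeps the
    # graph acyclic under the assumed topological ordering of modules.
    candidates = [(i, j) for i in range(n) for j in range(i + 1, n)]
    for i, j in rng.sample(candidates, min(num_links, len(candidates))):
        dependencies[i][j] = 1

    # Assumed vocabulary of function types; the spec only names "linear".
    function_types = ["linear", "constant", "quadratic"]

    def random_functions():
        # A function type is assigned only where an edge exists;
        # entries without an edge are left empty.
        return [[rng.choice(function_types) if dependencies[i][j] else ""
                 for j in range(n)] for i in range(n)]

    map_data_function = random_functions()
    reduce_data_function = random_functions()
    map_workload_function = random_functions()
    reduce_workload_function = random_functions()
    return (dependencies, map_data_function, reduce_data_function,
            map_workload_function, reduce_workload_function)
```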
The objective of this part is to generate a hierarchically structured cluster consisting of heterogeneous computing nodes.
Input: # of racks, # of nodes for each rack
Output:
- A two-dimensional matrix: each row represents a node, and each column is a property
- for each node, the rack it belongs to (1 dimension)
- for each node, upstream and downstream bandwidth (2 dimensions)
- for each node, # of CPU cores, CPU speed, memory size, storage capacity (4 dimensions)
- A two-dimensional matrix
- for each pair of racks, the inter-rack bandwidth
Descriptions:
- Node_info, of size (# of racks * # of nodes per rack) * 7.
- the 1st column is the index of the rack the node belongs to
- the 2nd column is its upstream bandwidth
- the 3rd column is its downstream bandwidth
- the 4th column is its number of CPU cores
- the 5th column is its CPU speed
- the 6th column is its memory size
- the 7th column is its storage capacity
- Inter_rack_bw[i][j] represents the bandwidth between rack i and rack j.
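A minimal sketch of the cluster generator, again in Python, is below. The value ranges and units (Mbps, GHz, GB, TB) are illustrative assumptions, as is treating Inter_rack_bw as symmetric; the function name generate_cluster is hypothetical.

```python
import random

def generate_cluster(num_racks, nodes_per_rack, seed=None):
    """Sketch: build Node_info ((num_racks * nodes_per_rack) x 7)
    and Inter_rack_bw (num_racks x num_racks)."""
    rng = random.Random(seed)

    node_info = []
    for rack in range(num_racks):
        for _ in range(nodes_per_rack):
            node_info.append([
                rack,                        # column 1: rack index
                rng.randint(100, 1000),      # column 2: upstream bw (Mbps, assumed unit)
                rng.randint(100, 1000),      # column 3: downstream bw (Mbps, assumed unit)
                rng.choice([4, 8, 16, 32]),  # column 4: # of CPU cores
                rng.uniform(2.0, 4.0),       # column 5: CPU speed (GHz, assumed unit)
                rng.choice([16, 32, 64]),    # column 6: memory size (GB, assumed unit)
                rng.choice([1, 2, 4, 8]),    # column 7: storage capacity (TB, assumed unit)
            ])

    # Symmetric inter-rack bandwidth (assumption); the diagonal entry
    # could be read as intra-rack bandwidth.
    inter_rack_bw = [[0] * num_racks for _ in range(num_racks)]
    for i in range(num_racks):
        for j in range(i, num_racks):
            bw = rng.randint(1000, 10000)
            inter_rack_bw[i][j] = bw
            inter_rack_bw[j][i] = bw
    return node_info, inter_rack_bw
```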