An Intermediate Data-driven Methodology for Scientific Workflow Management System to Support Reusability
Automatic processing of different logical sub-tasks by a set of rules is a workflow. A workflow management system (WfMS) is a system that helps us accomplish a complex scientific task through making a sequential arrangement of sub-tasks available as tools. Workflows are formed with modules from various domains in a WfMS, and many collaborators of the domains are involved in the workflow design process. Workflow Management Systems (WfMSs) have been gained popularity in recent years for managing various tools in a system and ensuring dependencies while building a sequence of executions for scientific analyses. As a result of heterogeneous tools involvement and collaboration requirement, Collaborative Scientific Workflow Management Systems (CSWfMS) have gained significant interest in the scientific analysis community. In such systems, big data explosion issues exist with massive velocity and variety characteristics for the heterogeneous large amount of data from different domains. Therefore a large amount of heterogeneous data need to be managed in a Scientific Workflow Management System (SWfMS) with a proper decision mechanism. Although a number of studies addressed the cost management of data, none of the existing studies are related to real- time decision mechanism or reusability mechanism. Besides, frequent execution of workflows in a SWfMS generates a massive amount of data and characteristics of such data are always incremental. Input data or module outcomes of a workflow in a SWfMS are usually large in size. Processing of such data-intensive workflows is usually time-consuming where modules are computationally expensive for their respective inputs. Besides, lack of data reusability, limitation of error recovery, inefficient workflow processing, inefficient storing of derived data, lacking in metadata association and lacking in validation of the effectiveness of a technique of existing systems need to be addressed in a SWfMS for efficient workflow building by maintaining the big data explosion. To address the issues, in this thesis first we propose an intermediate data management scheme for a SWfMS. In our second attempt, we explored the possibilities and introduced an automatic recommendation technique for a SWfMS from real-world workflow data (i.e Galaxy  workflows) where our investigations show that the proposed technique can facilitate 51% of workflow building in a SWfMS by reusing intermediate data of previous workflows and can reduce 74% execution time of workflow buildings in a SWfMS. Later we propose an adaptive version of our technique by considering the states of tools in a SWfMS, which shows around 40% reusability for workflows. Consequently, in our fourth study, We have done several experiments for analyzing the performance and exploring the effectiveness of the technique in a SWfMS for various environments. The technique is introduced to emphasize on storing cost reduction, increase data reusability, and faster workflow execution, to the best of our knowledge, which is the first of its kind. Detail architecture and evaluation of the technique are presented in this thesis. We believe our findings and developed system will contribute significantly to the research domain of SWfMSs.
Plant phenotyping, Association rules, Workflow, Intermediate states suggestion, Pipeline design, Data management
Master of Science (M.Sc.)