Welcome to flowrunner

Introduction

flowrunner is a lightweight package for organizing and representing Data Engineering/Science workflows. It's designed to integrate with any pre-existing framework, such as pandas or PySpark.

Concepts

  • Flow: A collection of all the functions you want to run, organized in the order you want them to run, subclassed from BaseFlow

  • DAG: A Directed Acyclic Graph is a graph whose edges each have a direction and which contains no cycles, so following the edges never leads back to a node already visited and there is a clear start node and end node

  • flowrunner DAG: The DAG flowrunner uses to track and visualize the order of execution of a Flow
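To make the DAG concept concrete, here is a minimal sketch of how an execution order can be derived from a set of dependencies. It uses only Python's standard-library graphlib module and does not show flowrunner's own API; the step names and the execution_order helper are hypothetical, chosen purely for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical step names. Each key maps to the set of steps it depends on.
dependencies = {
    "extract": set(),
    "transform_a": {"extract"},
    "transform_b": {"extract"},
    "load": {"transform_a", "transform_b"},
}


def execution_order(deps):
    """Return one valid run order for the graph; raise CycleError if it has a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)  # "extract" comes first and "load" last; the two transforms may swap

# A graph with a cycle is not a DAG and has no valid execution order.
try:
    execution_order({"a": {"b"}, "b": {"a"}})
except CycleError:
    print("cycle detected: not a DAG")
```

Because the graph has no cycles, every step can be placed after all the steps it depends on, which is what gives a Flow a well-defined order of execution.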

What is flowrunner

flowrunner is, in essence, a way to quickly write ETL (Extract, Transform, Load) and other data workflows in the form of a Directed Acyclic Graph called a Flow.

Why flowrunner

flowrunner is simple and lightweight, and it fits on top of any existing framework such as PySpark or pandas. It also addresses something Airflow makes awkward: sharing data between tasks or DAGs, which goes through XCom and is limited to small, serializable values.

What flowrunner is not

flowrunner is not an orchestrator: it handles no part of scheduling. It is recommended to run it inside a scheduler or orchestrator such as Airflow.

Features

  • Lazy evaluation of the DAG: flowrunner does not execute your DAG when you define it; the DAG runs only when you explicitly ask it to run

  • Easy syntax to build new Flows

  • Easy data sharing between methods in a Flow using attributes

  • Data store that keeps the output of a function (in case it returns a value) for later use

  • Param store to easily pass reusable parameters to a Flow

  • Visualizing your flow as a DAG
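Several of the features above (lazy evaluation, data sharing through attributes, a data store for return values, and a param store) can be illustrated with a small plain-Python sketch. This is not flowrunner's API: the SketchFlow class, its method names, and the params and data_store attributes are all hypothetical stand-ins, assumed here only to show the ideas.

```python
class SketchFlow:
    """Illustrative stand-in for a Flow; NOT flowrunner's real BaseFlow."""

    def __init__(self, **params):
        self.params = params      # reusable parameters (a toy "param store")
        self.data_store = {}      # return values of steps, keyed by step name

    # Steps share data by setting attributes on self.
    def extract(self):
        self.rows = list(range(self.params["n_rows"]))

    def transform(self):
        self.doubled = [row * 2 for row in self.rows]
        return len(self.doubled)  # a returned value is kept in the data store

    def load(self):
        self.total = sum(self.doubled)

    def run(self):
        """Execute the steps in order; nothing runs until this is called."""
        for name in ("extract", "transform", "load"):
            result = getattr(self, name)()
            if result is not None:
                self.data_store[name] = result


flow = SketchFlow(n_rows=4)   # defining the flow executes no steps
print(hasattr(flow, "rows"))  # False: extract has not run yet
flow.run()                    # steps execute only on an explicit run
print(flow.total)             # 12
print(flow.data_store)        # {'transform': 4}
```

The design point is that constructing the object and running it are separate actions, so a flow can be defined, inspected, or visualized before any work is done.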