This article looks at technical diagrams in Data Engineering projects, reviewing solutions such as Google Slides and draw.io. It highlights the alignment challenges they pose and concludes with a discussion of how to simplify the process.

Introduction

A Data Engineering project is not fully documented without a technical diagram summarizing the entire pipeline. How do you quickly and easily create one? The question comes up every time you need to present your pipeline. You can draw it directly in Google Slides by inserting shapes and arrows, but this quickly becomes painful when you have to resize the different parts, move them together, or align the text. Tools like draw.io make diagramming easier by linking the different parts; however, alignment problems persist and the pipeline can only be built from shapes.

To avoid these problems and limitations, you can use the Diagrams package, which produces an easily readable technical diagram in a few lines of code. In addition, a diagram written as code can be reused, and when several people work on the same diagram, it can be managed with a version control tool.

We will see in detail that Diagrams is a flexible tool that makes it easy to produce technical diagrams that stay clear for readers. We are going to take a step-by-step look at how to use this package and its features.

Prerequisites

To use the diagrams package, you need Python 3.6 or higher. You also need to install Graphviz, because it is what renders the diagrams. A link to Graphviz is given in the “Getting Started” section of the Diagrams project on GitHub.
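Before installing anything, it can help to confirm both prerequisites from Python itself. The following is a minimal sketch using only the standard library. The helper name check_prerequisites is hypothetical, and the sketch assumes that Graphviz makes its usual dot executable available on the PATH.

```python
import shutil
import sys


def check_prerequisites(min_version=(3, 6)):
    """Return a dict describing whether Python and Graphviz look usable."""
    return {
        # The article states that Python 3.6 or higher is required.
        "python_ok": sys.version_info >= min_version,
        # Assumption: Graphviz is reachable through its `dot` executable.
        # shutil.which returns None when it is not on the PATH.
        "dot_path": shutil.which("dot"),
    }


if __name__ == "__main__":
    status = check_prerequisites()
    print("Python version OK:", status["python_ok"])
    print("Graphviz dot found at:", status["dot_path"] or "not found")
```

If dot_path comes back empty, Graphviz is probably either not installed or not on the PATH of the environment that runs Python.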

Then you can install the diagrams library with your package manager, and then you'll be ready to start creating beautiful diagrams.

For my part, I installed the package with pip:

pip install diagrams

The basics

In this package there are 4 different elements:

  • Diagram
  • The groups (Cluster)
  • The links (Edge)
  • The nodes (Node)

The first 3 elements each have their own class. For nodes, there are many classes, grouped by provider, such as AWS, Azure or GCP for cloud platforms, or Kubernetes. You can find all the classes in the official package documentation.

Finally, these 4 elements are related: a diagram consists of nodes that can be grouped into clusters and that are connected by edges. You will therefore have to import the necessary classes in order to represent your architecture diagram correctly.

Now, let's try to code a first diagram to understand the basics of the package.

from diagrams import Diagram
from diagrams.aws.analytics import Glue, Quicksight
from diagrams.aws.database import RDS
from diagrams.aws.management import Cloudwatch
from diagrams.aws.storage import S3

with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_Go', show=True, direction='LR'):
    # One node per AWS service used in the pipeline
    db = RDS('RDS Database')
    jobs = Glue('ETL')
    log = Cloudwatch('Logging')
    bucket = S3('S3 Buckets')
    dashboard = Quicksight('Dashboard')

    # Main data flow, then the logging branch
    db >> jobs >> bucket >> dashboard
    jobs >> log

This diagram describes a data engineering pipeline, using a table contained in an RDS database, processed using AWS Glue ETL. The processing results are stored in an S3 bucket and the logs are stored in Cloudwatch. Finally, a Quicksight dashboard is connected to the S3 bucket.

Let's look at this first piece of code in detail. First, we import the Diagram class, which is required to produce a diagram. Then we import a few node classes from the AWS provider, for example RDS, Glue, etc.

We then create a new Diagram with the name 'Pipeline - Global Overview'. Because we filled in the filename parameter, the diagram will be saved at the location indicated. Note that this is a relative path, not an absolute one: it is resolved from the location where the code is executed, so if the code is launched from the desktop, the diagram will be saved on the desktop. With the show parameter set to True, Python opens the diagram immediately after the code is executed. The direction parameter indicates in which direction the diagram is built; here it is left to right (LR), which is the default setting. The other options are right to left (RL), top to bottom (TB), and bottom to top (BT).

Inside the diagram, we create several nodes, using the classes we imported. To create a link between two nodes, put '>>' between them if you want the arrow to go from left to right, or '<<' if you want it to go from right to left.
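To make the relative-path behaviour concrete, here is a small sketch that predicts where the image should land. It relies only on pathlib. The helper expected_output_path is hypothetical, and it assumes that the rendered file is the filename plus the output format extension (PNG being the package default), resolved against the current working directory.

```python
from pathlib import Path


def expected_output_path(filename, outformat="png", cwd=None):
    """Predict where a diagram saved with a relative `filename` should land.

    Assumption: the rendered file is `filename` plus the output format
    extension, resolved against the current working directory.
    """
    base = Path(cwd) if cwd is not None else Path.cwd()
    return base / f"{filename}.{outformat}"


if __name__ == "__main__":
    # Launched from the desktop, the file should end up under the desktop.
    print(expected_output_path("Diagramm/Pipeline_Go"))
```

Printing this path before running the diagram script is a quick way to check that you are launching the code from the directory you think you are.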

To cover the remaining classes of the package, let's try a slightly more complex diagram that includes clusters.

from diagrams import Diagram, Cluster
from diagrams.aws.analytics import Glue, Quicksight
from diagrams.aws.database import RDS
from diagrams.aws.management import Cloudwatch
from diagrams.aws.storage import S3

with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_Go', show=True, direction='LR'):
    db = RDS('RDS Database')

    # Group the two Glue jobs in a single cluster
    with Cluster('AWS Glue (ETL)\nData Engineering\n(Filter, join, rename...)'):
        jobs = [Glue('Job1'), Glue('Job2')]

    log = Cloudwatch('Logging')
    bucket = S3('S3 Buckets')
    dashboard = Quicksight('Dashboard')

    # A list of nodes can be linked like a single node
    db >> jobs >> bucket >> dashboard
    jobs >> log

This diagram describes the same pipeline as the previous one; the only difference is that two jobs are now represented in AWS Glue.
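The line db >> jobs >> bucket works even though jobs is a Python list. The toy model below sketches how operator overloading can produce this fan-out. It is an illustration only, not the Diagrams implementation: the ToyNode class and its edges list are hypothetical names introduced for the example.

```python
class ToyNode:
    """Toy stand-in for a diagram node; NOT the Diagrams implementation."""

    edges = []  # shared record of (source label, target label) pairs

    def __init__(self, label):
        self.label = label

    def __rshift__(self, other):
        # node >> node  or  node >> [nodes]: draw an arrow to each target.
        targets = other if isinstance(other, list) else [other]
        for target in targets:
            ToyNode.edges.append((self.label, target.label))
        return other  # returning the right-hand side allows chaining

    def __rrshift__(self, others):
        # [nodes] >> node: a list has no >> operator of its own, so Python
        # falls back to the right operand's reflected method.
        for source in others:
            ToyNode.edges.append((source.label, self.label))
        return self


db, bucket = ToyNode("db"), ToyNode("bucket")
jobs = [ToyNode("Job1"), ToyNode("Job2")]

db >> jobs >> bucket
print(ToyNode.edges)
```

In this toy version, the chain records four links: db to each job, then each job to bucket, which mirrors the arrows drawn in the diagram above.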

The main purpose of Clusters is to group similar elements into the same subset.

The second use that I find for clusters is to be able to delineate the different parts of the pipeline even more clearly, as shown in the following example:

from diagrams import Diagram, Cluster
from diagrams.aws.analytics import Glue, Quicksight
from diagrams.aws.database import RDS
from diagrams.aws.management import Cloudwatch
from diagrams.aws.storage import S3

with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_Go', show=True, direction='LR'):
    # One cluster per AWS service to delineate each part of the pipeline
    with Cluster('RDS'):
        db = RDS('PostgreSQL BDD\nstored in RDS')

    with Cluster('AWS Glue (ETL)'):
        jobs = Glue('Data Engineering\n(Filter,\njoin,\nrename..)')

    with Cluster('Cloudwatch'):
        log = Cloudwatch('Monitoring Scripts')

    with Cluster('S3'):
        bucket = S3('S3 Buckets\nto store\nAWS Glue outputs')

    with Cluster('Quicksight'):
        dashboard = Quicksight('Dashboard\nfor monitoring')

    db >> jobs >> bucket >> dashboard
    jobs >> log

This code produces the same diagram again, but with a different appearance: each AWS service that was used is now even more clearly identified.

Advanced settings

Now that we know how to use diagrams, clusters, edges, and nodes, let's focus on customization. There are two customizable objects: nodes and edges.

Customizing Edges

First, let's look at how to customize the edges. There are 3 customization parameters: color, style and label. The default color is gray, but you can set another one using named colors such as those listed in the matplotlib documentation.

Next, you can change the style; 4 are available:

  • The default style which is a solid line
  • Bold, a solid line in bold
  • Dashed, the line is made of dashes
  • Dotted, the line is made of dots

It is not possible to combine different styles, i.e. have a line made of bold dashes.

Finally, it is possible to add a label to an Edge if you want to explain what this Edge represents.

Let's see what that looks like in code.

from diagrams import Diagram, Cluster, Edge
from diagrams.aws.analytics import Glue, Quicksight
from diagrams.aws.database import RDS
from diagrams.aws.management import Cloudwatch
from diagrams.aws.storage import S3

def arrow(color='black', style='solid', label=None):
    """
    Function to define the edge between the parts of the diagram
    :param color: the color of the edge, could be any color
    :type color: str
    :param style: the style of the edge, could be dashed, dotted, bold or solid (default)
    :type style: str
    :param label: the text you want to show on the edge
    :type label: str
    :return: Edge object with the different parameters we set up
    :rtype: Edge
    """

    return Edge(color=color, style=style, label=label)

with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_Go', show=True, direction='LR'):
    with Cluster('RDS'):
        db = RDS('PostgreSQL BDD\nstored in RDS')

    with Cluster('AWS Glue (ETL)'):
        jobs = Glue('Data Engineering\n(Filter,\njoin,\nrename..)')

    with Cluster('Cloudwatch'):
        log = Cloudwatch('Monitoring Scripts')

    with Cluster('S3'):
        bucket = S3('S3 Buckets\nto store\nAWS Glue outputs')

    with Cluster('Quicksight'):
        dashboard = Quicksight('Dashboard\nfor monitoring')

    # Main pipeline: bold red arrows
    db >> arrow(color='red', style='bold') >> jobs >> arrow(color='red', style='bold') >>\
        bucket >> arrow(color='red', style='bold') >> dashboard

    # Logging branch: dashed pink arrow
    jobs >> arrow(color='hotpink', style='dashed') >> log

Here, I created a function, arrow(), which by default produces a solid black arrow, which I prefer to the package's default arrow. I then use this function to define the various arrows I want in my diagram. When you want to customize an Edge, you must place it explicitly in the diagram between the two nodes concerned. Here I wanted the pipeline edges to be red and bold, except for the arrow to the logs, which is pink and dashed. You can see that in the last lines of the code.
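The arrow() helper accepts any string as a style. One possible variant, sketched below, validates the style against the four described earlier before building the edge. The names ALLOWED_STYLES and edge_options are hypothetical, and the sketch assumes the default solid line is called "solid".

```python
# Assumption: the four styles listed in this article, with the default
# solid line named "solid".
ALLOWED_STYLES = {"solid", "bold", "dashed", "dotted"}


def edge_options(color="black", style="solid", label=None):
    """Validate edge settings and return them as keyword arguments."""
    if style not in ALLOWED_STYLES:
        raise ValueError(
            f"unknown style {style!r}; expected one of {sorted(ALLOWED_STYLES)}"
        )
    options = {"color": color, "style": style}
    if label:  # only add a label when there is text to show
        options["label"] = label
    return options


if __name__ == "__main__":
    print(edge_options(color="hotpink", style="dashed"))
```

The returned dictionary can then be unpacked into the edge, for example Edge(**edge_options(color='red', style='bold')), so a typo in a style name fails early instead of silently producing an unexpected arrow.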

Node Customization

Let's talk about the second point, node customization. What does that mean? Customizing nodes means displaying a Node with an image that is not already pre-registered in the package image bank and therefore creating a Node that does not exist.

Do you want to represent the sending of an email in case of error? This is not among the options available in the package, but all you need to do is provide an image representing an email and, using the Custom node class, you can integrate this new Node into your diagram. So there are a lot of Node possibilities and the only limit is your imagination.

from diagrams import Diagram, Cluster, Edge
from diagrams.aws.analytics import Glue, Quicksight
from diagrams.aws.database import RDS
from diagrams.aws.management import Cloudwatch
from diagrams.aws.storage import S3
from diagrams.custom import Custom

# arrow() is the helper function defined in the previous example

with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_Go', show=True, direction='LR'):
    with Cluster('RDS'):
        db = RDS('PostgreSQL BDD\nstored in RDS')

    with Cluster('AWS Glue (ETL)'):
        jobs = Glue('Data Engineering\n(Filter,\njoin,\nrename..)')

    with Cluster('Cloudwatch'):
        log = Cloudwatch('Monitoring Scripts')

    with Cluster('S3'):
        bucket = S3('S3 Buckets\nto store\nAWS Glue outputs')

    with Cluster('Quicksight'):
        dashboard = Quicksight('Dashboard\nfor monitoring')

    # Custom nodes: a label plus the path to an image file
    # (paths as given in the source; adjust them to your own image location)
    with Cluster('Devs'):
        houcem = Custom('Houcem\nLead DS', '... /Custom/houcem.png ')
        nico = Custom('Nico\nDS', '... /Custom/nico.png ')
        dev = [nico, houcem]

    db >> arrow(color='red', style='bold') >> jobs >> arrow(color='red', style='bold') >> bucket >>\
        arrow(color='red', style='bold') >> dashboard
    jobs >> arrow(color='hotpink', style='dashed') >> log

    # Which developer worked on which part of the pipeline
    houcem >> arrow(color='sandybrown', style='dotted') >> jobs
    houcem >> arrow(color='sandybrown', style='dotted') >> log
    nico >> arrow(color='blue', style='dotted') >> jobs
    nico >> arrow(color='blue', style='dotted') >> log
    nico >> arrow(color='blue', style='dotted') >> bucket
    nico >> arrow(color='blue', style='dotted') >> dashboard

Here, I chose to represent the developers who worked on this project by specifying which parts of the pipeline they worked on. To make it more visual, I created two new Nodes with the photos of the developers and so, it is clear who to contact in case of problems with the pipeline.
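Because a custom node depends on an image file on disk, a wrong path is an easy mistake to make. The sketch below shows one way to check the file before handing it to Custom. It uses only the standard library; resolve_icon and the file and folder names in the usage lines are hypothetical.

```python
from pathlib import Path


def resolve_icon(icon_name, icon_dir):
    """Return the absolute path of an icon, or raise if the file is missing."""
    path = (Path(icon_dir) / icon_name).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"icon not found: {path}")
    return str(path)


if __name__ == "__main__":
    # Hypothetical file and folder names, for illustration only.
    try:
        print(resolve_icon("email.png", "icons"))
    except FileNotFoundError as error:
        print(error)
```

The returned string can be used as the second argument of Custom, for example Custom('Email alert', resolve_icon('email.png', 'icons')), so a missing image is reported with its full path before the diagram is built.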

Finally, once you have mastered the various functionalities of the package, you can produce very rich diagrams. Here is an example:

from diagrams import Diagram, Cluster, Edge
from diagrams.aws.analytics import Glue, GlueCrawlers, GlueDataCatalog, Quicksight
from diagrams.aws.database import RDS
from diagrams.aws.management import Cloudwatch
from diagrams.aws.storage import S3, SimpleStorageServiceS3BucketWithObjects, SimpleStorageServiceS3Object
from diagrams.custom import Custom

# arrow() is the helper function defined in the edge customization example

with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_Go', show=True, direction='LR'):
    with Cluster('RDS'):
        db = RDS('PostgreSQL BDD\nstored in RDS')

    with Cluster('AWS Glue'):
        crawler = GlueCrawlers('Glue\nCrawler')
        data_catalog = GlueDataCatalog('Glue\nDataCatalog')
        jobs = Glue('Glue Jobs')

        # Nested cluster: the individual jobs inside AWS Glue
        with Cluster('Jobs'):
            job = [Glue('Job for\nActivity\ntransformation'),
                   Glue('Job for\nAppointment\ntransformation')]

    with Cluster('Cloudwatch'):
        log = Cloudwatch('\n\n\nMonitoring Scripts')

    with Cluster('S3'):
        bucket = SimpleStorageServiceS3BucketWithObjects('S3 Buckets\nto store\nAWS Glue outputs')

        # Nested cluster: the objects written to the bucket
        with Cluster('Objects within S3 bucket'):
            obj = [SimpleStorageServiceS3Object('Output from\nActivity\nTransformation Job'),
                   SimpleStorageServiceS3Object('Output from\nAppointment\nTransformation Job')]

    with Cluster('Quicksight'):
        dashboard = Quicksight('Dashboard\nfor monitoring')

    # Paths as given in the source; adjust them to your own image location
    with Cluster('Devs'):
        houcem = Custom('Houcem\nLead DS', '... /Custom/houcem.png ')
        nico = Custom('Nico\nDS', '... /Custom/nico2.png ')
        dev = [nico, houcem]

    db >> arrow(color='red') >> data_catalog

    # The crawler connects the database to the Glue data catalog
    db << arrow(color='purple', style='dotted', label='Connect DB') << crawler >>\
        arrow(color='purple', style='dotted', label='to AWS Glue') >> data_catalog >> arrow(color='red') >>\
        jobs >> arrow(color='purple') >> job

    job >> arrow(color='hotpink', style='dashed') >> log
    job >> arrow(color='red') >> bucket >> arrow(color='darkgreen') >>\
        obj >> arrow(color='red') >> dashboard

    houcem >> arrow(color='sandybrown', style='dotted') >> jobs
    houcem >> arrow(color='sandybrown', style='dotted') >> log
    nico >> arrow(color='blue', style='dotted') >> jobs
    nico >> arrow(color='blue', style='dotted') >> log
    nico >> arrow(color='blue', style='dotted') >> bucket
    nico >> arrow(color='blue', style='dotted') >> dashboard

First of all, note that this code is more complex than in the preceding diagrams. As for the diagram itself, it gives a detailed overview of the data processing pipeline from the database to the dashboard, all within the AWS environment.

Conclusion

Diagrams is a package that allows pipelines to be represented through diagrams with ease and flexibility. If you want more information or more advanced controls, I recommend that you look at the GitHub repository of the Diagrams project, especially the Issues section.

NB: This article was freely inspired by the article "Create Beautiful Architecture Diagrams with Python" written by Dylan Roy.
