This article is about technical diagrams in Data Engineering projects, looking at solutions such as Google Slides and draw.io. It highlights the alignment challenges and concludes with a discussion of how to simplify the process.
In a Data Engineering project, a technical diagram summarizing the entire pipeline is necessary to give a complete representation of the project. How do you quickly and easily create such a diagram? The question comes up every time you have to present your pipeline. You can do it directly in Google Slides by inserting shapes and arrows, but it quickly becomes a pain when you have to resize the different parts, move them together, or align the text. There are also tools, like draw.io, that make diagrams easier to build by linking the different parts; however, alignment problems persist and the pipeline can only be built from generic shapes.
To avoid all these problems and limitations, you can use the Diagrams package which, in a few lines of code, produces an easily readable technical diagram. In addition, creating a technical diagram with code makes it possible to reuse what has already been done, and if several people work on the same diagram, it is easy to put it under version control.
We will see in detail that Diagrams is a flexible tool that makes it easy to produce technical diagrams that remain clear for readers. We are going to take a step-by-step look at how to use this package and its features.
To use the diagrams package, you need Python 3.6 or higher. You then have to install Graphviz, because it is what renders the diagrams. You can find the Graphviz installation link in the “Getting Started” section of the Diagrams project on GitHub.
Then you can install the diagrams library with your package manager, and then you'll be ready to start creating beautiful diagrams.
For my part, I installed the package with pip:
pip install diagrams
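Before writing a first diagram, it can help to confirm that the prerequisites listed above are in place. The following is a minimal sketch using only the standard library. The function name check_prerequisites is made up for this article, and the sketch assumes that a Graphviz installation exposes its usual dot executable on the PATH.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(min_python=(3, 6)):
    """Return a dict saying whether each prerequisite looks satisfied.

    Hypothetical helper for illustration; it is not part of the diagrams package.
    """
    return {
        # The article states that Python 3.6 or higher is required.
        "python": sys.version_info[:2] >= min_python,
        # Assumption: installing Graphviz puts the `dot` executable on the PATH.
        "graphviz": shutil.which("dot") is not None,
        # True when `import diagrams` would find the package.
        "diagrams": importlib.util.find_spec("diagrams") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

If graphviz or diagrams is reported as missing, the install steps above are the place to go back to.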
In this package there are 4 different elements: diagrams, clusters, edges, and nodes.
Each of the first 3 elements is represented by a single class (Diagram, Cluster and Edge). For nodes, there are many classes, grouped by provider: cloud providers such as AWS, Azure or GCP, as well as Kubernetes. You can find all the classes in the official package documentation.
Finally, these 4 elements are linked: a diagram consists of nodes that can be grouped together in clusters and that are connected to each other by edges. You will therefore have to import the necessary classes in order to represent your architecture diagram correctly.
Now, let's try to code a first diagram to understand the basics of the package.
from diagrams import Diagram
from diagrams.aws.analytics import Glue, Quicksight
from diagrams.aws.database import RDS
from diagrams.aws.management import Cloudwatch
from diagrams.aws.storage import S3
# Everything created inside the "with" block belongs to this diagram
with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_Go', show=True, direction='LR'):
    # Nodes
    db = RDS('RDS Database')
    jobs = Glue('ETL')
    log = Cloudwatch('Logging')
    bucket = S3('S3 Buckets')
    dashboard = Quicksight('Dashboard')
    # Edges
    db >> jobs >> bucket >> dashboard
    jobs >> log
This diagram describes a data engineering pipeline: a table stored in an RDS database is processed by an AWS Glue ETL job. The processing results are stored in an S3 bucket and the logs are sent to Cloudwatch. Finally, a Quicksight dashboard is connected to the S3 bucket.
Let's look at this first piece of code in detail. First, we import the Diagram class, which is required to produce a diagram. Then we import a few node classes from the AWS provider, for example RDS, Glue, etc.
We then create a new Diagram named 'Pipeline - Global Overview'. Because we set the filename parameter, the diagram is saved at the given location. Note that this is a relative path, not an absolute one: it is resolved from the directory where the code is executed, so if the code is launched from the desktop, the diagram is saved on the desktop. Since the show parameter is True, Python opens the diagram as soon as the code has run. The direction parameter indicates in which direction the diagram is laid out; here it is left to right (LR), which is the default setting. The other options are right to left (RL), top to bottom (TB), and bottom to top (BT). Inside the diagram, we create several nodes using the classes we imported. To create a link between two nodes, put '>>' between them if you want the arrow to go from left to right, or '<<' if you want it to go from right to left.
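The path and direction rules described above can be sketched with the standard library alone. This is only an illustration: the helper names expected_output_path and check_direction are invented for this article, and the sketch assumes that the image extension (png here) is appended to filename and that a relative path is resolved from the working directory.

```python
from pathlib import Path

# The four layout directions mentioned in the text.
VALID_DIRECTIONS = {"LR", "RL", "TB", "BT"}


def expected_output_path(filename, outformat="png", cwd=None):
    """Guess where a diagram saved with a relative `filename` would end up.

    Hypothetical helper: assumes the extension is appended to `filename`
    and that relative paths are resolved from the working directory.
    """
    base = Path(cwd) if cwd is not None else Path.cwd()
    return base / f"{filename}.{outformat}"


def check_direction(direction):
    """Raise early if `direction` is not one of the four values listed above."""
    if direction not in VALID_DIRECTIONS:
        raise ValueError(
            f"direction must be one of {sorted(VALID_DIRECTIONS)}, got {direction!r}"
        )
    return direction


if __name__ == "__main__":
    # Shows the location for the directory this script is launched from.
    print(expected_output_path("Diagramm/Pipeline_Go"))
    print(check_direction("LR"))
```

Running it from two different directories makes the point about relative paths concrete: the printed location follows the directory you launch the script from.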
In order to finish with all the classes in the package, let's try a slightly more complex diagram integrating clusters.
from diagrams import Diagram, Cluster
from diagrams.aws.analytics import Glue, Quicksight
from diagrams.aws.database import RDS
from diagrams.aws.management import Cloudwatch
from diagrams.aws.storage import S3
with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_Go', show=True, direction='LR'):
    db = RDS('RDS Database')
    # Nodes created inside the Cluster block are drawn inside the same box
    with Cluster('AWS Glue (ETL)\nData Engineering\n(Filter, join, rename...)'):
        jobs = [Glue('Job1'), Glue('Job2')]
    log = Cloudwatch('Logging')
    bucket = S3('S3 Buckets')
    dashboard = Quicksight('Dashboard')
    # Connecting to a list links every node of the list
    db >> jobs >> bucket >> dashboard
    jobs >> log
This diagram describes the same pipeline as the previous one; the only difference is that two jobs are now represented in AWS Glue.
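To get an intuition for why a chain such as db >> jobs >> bucket works when jobs is a list, here is a toy model built on Python's operator overloading. It is a simplified sketch and not the code of the Diagrams package: the ToyNode class and the edges list are invented for this illustration.

```python
edges = []  # (source label, target label) pairs recorded by the toy nodes


class ToyNode:
    """Tiny stand-in for a diagram node; not the Diagrams implementation."""

    def __init__(self, label):
        self.label = label

    def __rshift__(self, other):
        # Handles `node >> node` and `node >> [node, node]`.
        targets = other if isinstance(other, list) else [other]
        for target in targets:
            edges.append((self.label, target.label))
        return other  # hand back the right-hand side so the chain keeps going

    def __rrshift__(self, other):
        # Handles `[node, node] >> node`: a list has no `>>` of its own,
        # so Python falls back to the right operand's reflected method.
        for source in other:
            edges.append((source.label, self.label))
        return self


db = ToyNode("RDS Database")
jobs = [ToyNode("Job1"), ToyNode("Job2")]
bucket = ToyNode("S3 Buckets")

db >> jobs >> bucket
print(edges)
```

The database is linked to both jobs, and both jobs are linked to the bucket, which matches the fan-out and fan-in visible in the diagram.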
The main purpose of Clusters is to group similar elements into the same subset.
The second use that I find for clusters is to be able to delineate the different parts of the pipeline even more clearly, as shown in the following example:
from diagrams import Diagram, Cluster
from diagrams.aws.analytics import Glue, Quicksight
from diagrams.aws.database import RDS
from diagrams.aws.management import Cloudwatch
from diagrams.aws.storage import S3
with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_Go', show=True, direction='LR'):
    # One cluster per AWS service used in the pipeline
    with Cluster('RDS'):
        db = RDS('PostgreSQL DB\nstored in RDS')
    with Cluster('AWS Glue (ETL)'):
        jobs = Glue('Data Engineering\n(Filter,\njoin,\nrename...)')
    with Cluster('Cloudwatch'):
        log = Cloudwatch('Monitoring Scripts')
    with Cluster('S3'):
        bucket = S3('S3 Buckets\nto store\nAWS Glue outputs')
    with Cluster('Quicksight'):
        dashboard = Quicksight('Dashboard\nfor monitoring')
    db >> jobs >> bucket >> dashboard
    jobs >> log
The code produces the same pipeline again, but with a different appearance: each AWS service that was used is now even more clearly identified.
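To see what a cluster amounts to once the diagram reaches Graphviz, here is a small standard-library sketch that writes DOT text by hand. It relies on the Graphviz convention that a subgraph whose name starts with cluster is drawn as a box around its nodes. The function cluster_dot and the node names are made up for this illustration, and this is not how you would normally use Diagrams.

```python
def cluster_dot(graph_name, clusters, edges):
    """Build Graphviz DOT text with one `subgraph cluster_N` per group.

    `clusters` maps a cluster label to the node names it contains;
    `edges` is a list of (source, target) node names.
    """
    lines = [f'digraph "{graph_name}" {{', "  rankdir=LR;"]
    for index, (label, nodes) in enumerate(clusters.items()):
        lines.append(f"  subgraph cluster_{index} {{")
        lines.append(f'    label="{label}";')
        for node in nodes:
            lines.append(f'    "{node}";')
        lines.append("  }")
    for source, target in edges:
        lines.append(f'  "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(cluster_dot(
        "pipeline",
        {"RDS": ["db"], "AWS Glue (ETL)": ["jobs"], "S3": ["bucket"]},
        [("db", "jobs"), ("jobs", "bucket")],
    ))
```

Writing this by hand for a whole pipeline would be tedious, which is precisely the work the Cluster class saves you.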
Now that we know how to use diagrams, clusters, edges, and nodes, let's focus on customization. There are two customizable objects: nodes and edges.
First, let's look at how to customize the edges. There are 3 customization parameters: color, style and label. The default color is gray; if you want another color, you can use the named colors of the matplotlib package, which you can find in the matplotlib documentation.
Then, it is possible to play on the style, and there are 4 available: dashed, dotted, bold, and the default plain line.
It is not possible to combine different styles, i.e. have a line made of bold dashes.
Finally, it is possible to add a label to an Edge if you want to explain what this Edge represents.
Let's see what that looks like in code.
from diagrams import Diagram, Cluster, Edge
from diagrams.aws.analytics import Glue, Quicksight
from diagrams.aws.database import RDS
from diagrams.aws.management import Cloudwatch
from diagrams.aws.storage import S3
def arrow(color='black', style='line', label=None):
    """
    Function to define the edge between two parts of the diagram
    :param color: the color of the edge, could be any color
    :type color: str
    :param style: the style of the edge, could be dashed, dotted, bold or line (default)
    :type style: str
    :param label: the text you want to show on the edge
    :type label: str
    :return: Edge object with the different parameters we set up
    :rtype: Edge
    """
    return Edge(color=color, style=style, label=label)
with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_Go', show=True, direction='LR'):
    with Cluster('RDS'):
        db = RDS('PostgreSQL DB\nstored in RDS')
    with Cluster('AWS Glue (ETL)'):
        jobs = Glue('Data Engineering\n(Filter,\njoin,\nrename...)')
    with Cluster('Cloudwatch'):
        log = Cloudwatch('Monitoring Scripts')
    with Cluster('S3'):
        bucket = S3('S3 Buckets\nto store\nAWS Glue outputs')
    with Cluster('Quicksight'):
        dashboard = Quicksight('Dashboard\nfor monitoring')
    # Main pipeline: bold red arrows
    db >> arrow(color='red', style='bold') >> jobs >> arrow(color='red', style='bold') >> \
        bucket >> arrow(color='red', style='bold') >> dashboard
    # Logs: dashed pink arrow
    jobs >> arrow(color='hotpink', style='dashed') >> log
Here, I created a function, arrow(), which by default produces a black arrow, which I prefer to the package's default gray arrow. I then use this function to define the various arrows I want in my diagram. When you want to customize an Edge, you must place it explicitly in the diagram between the two nodes concerned. Here I wanted the pipeline Edges to be red and bold, except for the arrow to the logs, which is pink and dashed. You can see that in the last lines of the code.
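Since the same color and style pairs come back several times in the code above, one possible refinement is to name them once and look them up by role. The sketch below uses only plain dictionaries. The helper edge_options and the role names "data" and "logs" are invented for this illustration and are not part of the package.

```python
# Named edge styles, so each color/style pair is written only once.
EDGE_PRESETS = {
    "data": {"color": "red", "style": "bold"},
    "logs": {"color": "hotpink", "style": "dashed"},
}


def edge_options(role, label=None):
    """Return the keyword arguments for an edge of the given role.

    Hypothetical helper; the role names are made up for this example.
    """
    options = dict(EDGE_PRESETS[role])  # copy so the preset is never mutated
    if label is not None:
        options["label"] = label
    return options


if __name__ == "__main__":
    print(edge_options("data"))
    print(edge_options("logs", label="errors"))
```

With the arrow() function from the example above, a pipeline link could then be written as db >> arrow(**edge_options('data')) >> jobs, and changing the look of every pipeline arrow becomes a one-line edit.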
Let's talk about the second point, node customization. What does that mean? Customizing nodes means displaying a Node with an image that is not already pre-registered in the package image bank and therefore creating a Node that does not exist.
Do you want to represent the sending of an email in case of error? This is not among the options available in the package, but all you need to do is download an image representing an email and, using the Custom node class, you can integrate this new Node into your diagram. So there are a lot of Node possibilities and the only limit is your imagination.
from diagrams import Diagram, Cluster, Edge
from diagrams.aws.analytics import Glue, Quicksight
from diagrams.aws.database import RDS
from diagrams.aws.management import Cloudwatch
from diagrams.aws.storage import S3
from diagrams.custom import Custom
# Same helper as in the previous example
def arrow(color='black', style='line', label=None):
    return Edge(color=color, style=style, label=label)
with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_Go', show=True, direction='LR'):
    with Cluster('RDS'):
        db = RDS('PostgreSQL DB\nstored in RDS')
    with Cluster('AWS Glue (ETL)'):
        jobs = Glue('Data Engineering\n(Filter,\njoin,\nrename...)')
    with Cluster('Cloudwatch'):
        log = Cloudwatch('Monitoring Scripts')
    with Cluster('S3'):
        bucket = S3('S3 Buckets\nto store\nAWS Glue outputs')
    with Cluster('Quicksight'):
        dashboard = Quicksight('Dashboard\nfor monitoring')
    # Custom nodes: a label and the path to an image file
    with Cluster('Devs'):
        houcem = Custom('Houcem\nLead DS', '.../Custom/houcem.png')
        nico = Custom('Nico\nDS', '.../Custom/nico.png')
        dev = [nico, houcem]
    db >> arrow(color='red', style='bold') >> jobs >> arrow(color='red', style='bold') >> bucket >> \
        arrow(color='red', style='bold') >> dashboard
    jobs >> arrow(color='hotpink', style='dashed') >> log
    # Which developer worked on which part of the pipeline
    houcem >> arrow(color='sandybrown', style='dotted') >> jobs
    houcem >> arrow(color='sandybrown', style='dotted') >> log
    nico >> arrow(color='blue', style='dotted') >> jobs
    nico >> arrow(color='blue', style='dotted') >> log
    nico >> arrow(color='blue', style='dotted') >> bucket
    nico >> arrow(color='blue', style='dotted') >> dashboard
Here, I chose to represent the developers who worked on this project by specifying which parts of the pipeline they worked on. To make it more visual, I created two new Nodes with the photos of the developers and so, it is clear who to contact in case of problems with the pipeline.
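Because a custom node depends on an image file, a wrong path is an easy mistake to make. A small guard like the one below can report the problem before the diagram is built. This is a sketch with the standard library only: the helper icon_path and its icon_dir argument are hypothetical names, not part of the package.

```python
from pathlib import Path


def icon_path(name, icon_dir):
    """Return the path of `name` inside `icon_dir` as a string.

    Raises FileNotFoundError when the image is missing.
    Hypothetical helper for illustration; not part of the diagrams package.
    """
    path = Path(icon_dir) / name
    if not path.is_file():
        raise FileNotFoundError(f"icon not found: {path}")
    return str(path)
```

Inside the diagram, a node could then be declared as Custom('Nico\nDS', icon_path('nico.png', icon_dir)), where icon_dir is whichever folder holds your images.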
Finally, once you have mastered the various functionalities of the package, you can produce very rich diagrams. Here is an example:
from diagrams import Diagram, Cluster, Edge
from diagrams.aws.analytics import Glue, GlueCrawlers, GlueDataCatalog, Quicksight
from diagrams.aws.database import RDS
from diagrams.aws.management import Cloudwatch
from diagrams.aws.storage import S3, SimpleStorageServiceS3BucketWithObjects, SimpleStorageServiceS3Object
from diagrams.custom import Custom
# Same helper as in the edge customization example
def arrow(color='black', style='line', label=None):
    return Edge(color=color, style=style, label=label)
with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_Go', show=True, direction='LR'):
    with Cluster('RDS'):
        db = RDS('PostgreSQL DB\nstored in RDS')
    with Cluster('AWS Glue'):
        crawler = GlueCrawlers('Glue\nCrawler')
        data_catalog = GlueDataCatalog('Glue\nDataCatalog')
        jobs = Glue('Glue Jobs')
        # Clusters can be nested
        with Cluster('Jobs'):
            job = [Glue('Job for\nActivity\ntransformation'),
                   Glue('Job for\nAppointment\ntransformation')]
    with Cluster('Cloudwatch'):
        log = Cloudwatch('\n\n\nMonitoring Scripts')
    with Cluster('S3'):
        bucket = SimpleStorageServiceS3BucketWithObjects('S3 Buckets\nto store\nAWS Glue outputs')
        with Cluster('Objects within S3 bucket'):
            obj = [SimpleStorageServiceS3Object('Output from\nActivity\nTransformation Job'),
                   SimpleStorageServiceS3Object('Output from\nAppointment\nTransformation Job')]
    with Cluster('Quicksight'):
        dashboard = Quicksight('Dashboard\nfor monitoring')
    with Cluster('Devs'):
        houcem = Custom('Houcem\nLead DS', '.../Custom/houcem.png')
        nico = Custom('Nico\nDS', '.../Custom/nico2.png')
        dev = [nico, houcem]
    db >> arrow(color='red') >> data_catalog
    # The crawler connects the database to the Glue data catalog
    db << arrow(color='purple', style='dotted', label='Connect DB') << crawler >> \
        arrow(color='purple', style='dotted', label='to AWS Glue') >> data_catalog >> arrow(color='red') >> \
        jobs >> arrow(color='purple') >> job
    job >> arrow(color='hotpink', style='dashed') >> log
    job >> arrow(color='red') >> bucket
    bucket >> arrow(color='darkgreen') >> obj >> arrow(color='red') >> dashboard
    houcem >> arrow(color='sandybrown', style='dotted') >> jobs
    houcem >> arrow(color='sandybrown', style='dotted') >> log
    nico >> arrow(color='blue', style='dotted') >> jobs
    nico >> arrow(color='blue', style='dotted') >> log
    nico >> arrow(color='blue', style='dotted') >> bucket
    nico >> arrow(color='blue', style='dotted') >> dashboard
First of all, note that the code is noticeably more complex than in the preceding diagrams. The diagram itself gives a detailed overview of the data processing pipeline, from the database to the dashboard, all within the AWS environment.
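One way to keep that complexity in check is to describe the repetitive developer arrows as data and loop over them. The sketch below only prepares the list of arrows to draw. The names OWNERSHIP, DEV_COLORS and ownership_edges are invented for this illustration, while the developers, components and colors are the ones used in the code above.

```python
# Which developer worked on which part of the pipeline (taken from the code above).
OWNERSHIP = {
    "houcem": ["jobs", "log"],
    "nico": ["jobs", "log", "bucket", "dashboard"],
}
DEV_COLORS = {"houcem": "sandybrown", "nico": "blue"}


def ownership_edges(ownership, colors):
    """Yield (developer, component, color) triples, one per arrow to draw."""
    for dev, components in ownership.items():
        for component in components:
            yield dev, component, colors[dev]


if __name__ == "__main__":
    for dev, component, color in ownership_edges(OWNERSHIP, DEV_COLORS):
        print(f"{dev} -> {component} ({color})")
```

Inside the with Diagram block, and assuming a dictionary nodes that maps these names to the node objects, each triple could then be drawn with nodes[dev] >> arrow(color=color, style='dotted') >> nodes[component].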
Diagrams is a package that allows pipelines to be represented through diagrams with ease and flexibility. If you want more information or want to use more advanced controls, I recommend looking at the GitHub repository of the Diagrams project, especially the Issues section.
NB: This article was loosely inspired by the article Create Beautiful Architecture Diagrams with Python, written by Dylan Roy.