How do you create beautiful architecture diagrams with Python?

This article is about technical diagrams in Data Engineering projects. It looks at solutions such as Google Slides and Draw.io, highlights their alignment challenges, and concludes with a discussion of how to simplify the process.

Introduction

In a Data Engineering project, a complete representation of the project requires a technical diagram summarizing the entire pipeline. How do you quickly and easily create one? The question comes up every time you need to present your pipeline. You can draw it directly in Google Slides by inserting shapes and arrows, but that quickly becomes a pain when you have to resize the different parts, move them together, or align the text. There are also tools, like draw.io, that make diagrams easier to create by linking the different parts together. However, the alignment problems persist and the pipeline can only be built from generic shapes.


To avoid all these problems and limitations, you can use the Diagrams package, which produces an easily readable technical diagram in a few lines of code. In addition, creating a diagram with code makes it possible to reuse what has already been done, and when several people work on the same diagram, it can easily be tracked with a version control tool.

We will see in detail that Diagrams is a flexible tool that makes it easy to produce technical diagrams that stay clear for readers. We are going to take a step-by-step look at how to use this package and its features.

Prerequisites

To use the diagrams package, you need Python 3.6 or higher. You also have to install Graphviz, because it is what renders the diagrams. You can find Graphviz from the “Getting Started” section of the Diagrams project on GitHub.

Then install the diagrams library with your package manager, and you'll be ready to start creating beautiful diagrams.

For my part, I installed the package with pip:

    pip install diagrams
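
Before going further, it can be worth checking both prerequisites from Python itself. The following is only a minimal sketch: check_prerequisites is a hypothetical helper name, and it assumes that Graphviz can be detected by looking for its dot executable on the PATH.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 6)):
    """Return a list of problems; an empty list means the setup looks ready."""
    problems = []
    if sys.version_info < min_python:
        required = ".".join(str(part) for part in min_python)
        problems.append(f"Python {required} or higher is required")
    # Graphviz ships the 'dot' layout program, which is used for rendering.
    if shutil.which("dot") is None:
        problems.append("Graphviz 'dot' executable not found on the PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing prerequisite:", problem)
```

If the script prints nothing, both prerequisites were found.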

The basics

In this package there are 4 different elements:

Diagram
The groups (Cluster)
The links (Edge)
The nodes (Node)

Each of the first 3 elements corresponds to a single class. For nodes, there are many classes, grouped by provider: cloud platforms such as AWS, Azure or GCP, as well as Kubernetes. You can find all the classes in the official package documentation.
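
Node classes live in modules named after the provider and the category, such as diagrams.aws.storage in the examples of this article. As a rough sketch, the standard library can list what such a module offers. Note that public_classes is a hypothetical helper written for this article, not part of the package.

```python
import importlib
import inspect


def public_classes(module_name):
    """Return the sorted names of the public classes exposed by a module."""
    module = importlib.import_module(module_name)
    return sorted(
        name
        for name, obj in inspect.getmembers(module, inspect.isclass)
        if not name.startswith("_")
    )


# With the diagrams package installed, this would list the AWS storage nodes:
# print(public_classes("diagrams.aws.storage"))
```

This is mostly handy for finding the exact spelling of a class name before importing it.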

Finally, these 4 elements fit together: a diagram consists of nodes, which can be grouped into clusters and which are connected by edges. You will therefore have to import the necessary classes to represent your architecture diagram correctly.

Now, let's try to code a first diagram to understand the basics of the package.

    from diagrams import Diagram
    from diagrams.aws.analytics import Glue, Quicksight
    from diagrams.aws.database import RDS
    from diagrams.aws.management import Cloudwatch
    from diagrams.aws.storage import S3

    with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_Go', show=True, direction="LR"):
        db = RDS('RDS Database')
        jobs = Glue('ETL')
        log = Cloudwatch('Logging')
        bucket = S3('S3 Buckets')
        dashboard = Quicksight('Dashboard')

        db >> jobs >> bucket >> dashboard
        jobs >> log

This diagram describes a data engineering pipeline that reads a table from an RDS database and processes it with an AWS Glue ETL job. The processing results are stored in an S3 bucket and the logs are stored in Cloudwatch. Finally, a Quicksight dashboard is connected to the S3 bucket.


Let's look at this first piece of code in detail. First, we import the Diagram class, which is required to produce a diagram. Then we import a few node classes from the AWS provider, for example RDS, Glue, etc.

We then create a new Diagram named 'Pipeline - Global Overview'. Because we set the filename parameter, the diagram is saved at the indicated location. Be careful: this is a relative path, not an absolute one. It is resolved from the directory in which the code is executed, so if the code is launched from the desktop, the diagram is saved under the desktop. With the show parameter set to True, Python opens the diagram immediately after the code is executed. The direction parameter indicates in which direction the graph is built. Here it is left to right (LR), which is the default setting. The other options are right to left (RL), top to bottom (TB), and bottom to top (BT).

Inside the diagram, we create several nodes, using the classes we imported. To create a link between two nodes, add '>>' between them if you want the arrow to go from left to right, or '<<' if you want it to go the other way.
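
To make the filename and direction behavior concrete, here is a small sketch that uses only the standard library. The helpers output_path and check_direction are hypothetical and so is the desktop directory in the last line. The sketch also assumes the default PNG output, for which the extension is appended to filename.

```python
from pathlib import Path

VALID_DIRECTIONS = ("LR", "RL", "TB", "BT")


def output_path(filename, outformat="png", cwd=None):
    """Return where a relative filename lands for a given working directory."""
    base = Path(cwd) if cwd is not None else Path.cwd()
    return base / f"{filename}.{outformat}"


def check_direction(direction):
    """Return the direction in upper case, or raise ValueError if unknown."""
    normalized = direction.upper()
    if normalized not in VALID_DIRECTIONS:
        raise ValueError(
            f"direction must be one of {VALID_DIRECTIONS}, got {direction!r}"
        )
    return normalized


# Hypothetical working directory, used only for illustration.
print(output_path("Diagramm/Pipeline_Go", cwd="/home/user/Desktop"))
```

Running the same script from another directory changes the result, which is exactly the pitfall of the relative path.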

To cover the remaining classes of the package, let's try a slightly more complex diagram that includes clusters.

    from diagrams import Diagram, Cluster
    from diagrams.aws.analytics import Glue, Quicksight
    from diagrams.aws.database import RDS
    from diagrams.aws.management import Cloudwatch
    from diagrams.aws.storage import S3

    with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_Go', show=True, direction="LR"):
        db = RDS('RDS Database')
        with Cluster('AWS Glue (ETL)\nData Engineering\n(Filter, join, rename...)'):
            jobs = [Glue('Job1'), Glue('Job2')]
        log = Cloudwatch('Logging')
        bucket = S3('S3 Buckets')
        dashboard = Quicksight('Dashboard')

        db >> jobs >> bucket >> dashboard
        jobs >> log

This diagram describes the same pipeline as the previous one. The only difference is that two jobs are now represented inside AWS Glue.
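
The chain db >> jobs >> bucket works even though jobs is a plain Python list. The toy model below illustrates the operator-overloading idea behind that. It is a simplified sketch written for this article, not the package's actual implementation, and ToyNode and EDGES are hypothetical names.

```python
EDGES = []  # collected (source, target) label pairs


class ToyNode:
    """A stand-in for a diagram node that records edges created with >>."""

    def __init__(self, label):
        self.label = label

    def _connect(self, target):
        EDGES.append((self.label, target.label))

    def __rshift__(self, other):
        # Handles node >> node and node >> [nodes].
        targets = other if isinstance(other, list) else [other]
        for target in targets:
            self._connect(target)
        return other  # returning the right-hand side lets the chain continue

    def __rrshift__(self, others):
        # Handles [nodes] >> node: lists have no >> operator,
        # so Python falls back to the right-hand operand.
        for source in others:
            source._connect(self)
        return self


db, bucket = ToyNode("db"), ToyNode("bucket")
jobs = [ToyNode("job1"), ToyNode("job2")]
db >> jobs >> bucket
print(EDGES)
# prints [('db', 'job1'), ('db', 'job2'), ('job1', 'bucket'), ('job2', 'bucket')]
```

Each element of the list gets its own arrow, which is why both jobs end up connected to the database and to the bucket.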


The main purpose of Clusters is to group similar elements into the same subset.

The second use that I find for clusters is to be able to delineate the different parts of the pipeline even more clearly, as shown in the following example:

    from diagrams import Diagram, Cluster
    from diagrams.aws.analytics import Glue, Quicksight
    from diagrams.aws.database import RDS
    from diagrams.aws.management import Cloudwatch
    from diagrams.aws.storage import S3

    with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_Go', show=True, direction="LR"):
        with Cluster('RDS'):
            db = RDS('PostgreSQL BDD\nstored in RDS')
        with Cluster('AWS Glue (ETL)'):
            jobs = Glue('Data Engineering\n(Filter,\njoin,\nrename..)')
        with Cluster('Cloudwatch'):
            log = Cloudwatch('Monitoring Scripts')
        with Cluster('S3'):
            bucket = S3('S3 Buckets\nto store\nAWS Glue outputs')
        with Cluster('Quicksight'):
            dashboard = Quicksight('Dashboard\nfor monitoring')

        db >> jobs >> bucket >> dashboard
        jobs >> log

This code produces the same pipeline again, but with a different appearance: each AWS service that was used is now identified even more clearly.


Advanced settings

Now that we know how to use diagrams, clusters, edges, and nodes, let's focus on customization. There are two customizable objects: nodes and edges.

Customizing Edges

First, let's look at how to customize the edges. There are 3 customization parameters: color, style and label. The default color is gray, but you can set another one by name. The names are interpreted by Graphviz, and they are largely the same as the named colors of the matplotlib package.

Next comes the style, and there are 4 available:

The default style which is a solid line
Bold, a solid line in bold
Dashed, the line is made of dashes
Dotted, the line is made of dots

It is not possible to combine different styles, for example to have a line made of bold dashes.

Finally, it is possible to add a label to an Edge if you want to explain what this Edge represents.

Let's see what that looks like in code.

    from diagrams import Diagram, Cluster, Edge
    from diagrams.aws.analytics import Glue, Quicksight
    from diagrams.aws.database import RDS
    from diagrams.aws.management import Cloudwatch
    from diagrams.aws.storage import S3


    def arrow(color='black', style='solid', label=None):
        """
        Function to define the edge between the parts of the diagram
        :param color: the color of the edge, could be any color
        :type color: str
        :param style: the style of the edge, could be dashed, dotted, bold or solid (default)
        :type style: str
        :param label: the text you want to show on the edge
        :type label: str
        :return: Edge object with the different parameters we set up
        :rtype: Edge
        """
        return Edge(color=color, style=style, label=label)


    with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_Go', show=True, direction="LR"):
        with Cluster('RDS'):
            db = RDS('PostgreSQL BDD\nstored in RDS')
        with Cluster('AWS Glue (ETL)'):
            jobs = Glue('Data Engineering\n(Filter,\njoin,\nrename..)')
        with Cluster('Cloudwatch'):
            log = Cloudwatch('Monitoring Scripts')
        with Cluster('S3'):
            bucket = S3('S3 Buckets\nto store\nAWS Glue outputs')
        with Cluster('Quicksight'):
            dashboard = Quicksight('Dashboard\nfor monitoring')

        # Main pipeline: bold red arrows
        db >> arrow(color='red', style='bold') >> jobs >> arrow(color='red', style='bold') >> \
            bucket >> arrow(color='red', style='bold') >> dashboard
        # Logging: dashed pink arrow
        jobs >> arrow(color='hotpink', style='dashed') >> log

Here, I created a function, arrow(), which by default produces a solid black arrow, which I prefer to the package's default arrow. I then use this function to define the various arrows I want to have in my diagram. When you want to customize an Edge, you must place it explicitly in the chain, between the two nodes concerned. Here I wanted the pipeline Edges to be red and bold, except for the arrow to the logs, which is pink and dashed. You can see that at the end of the code.
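
A mistyped style name is easy to miss, so a variant of this helper could reject unknown styles up front. The sketch below is only one possible approach: edge_attrs and VALID_STYLES are hypothetical names, and the four values are the styles listed above, with 'solid' standing for the default solid line.

```python
VALID_STYLES = ("solid", "bold", "dashed", "dotted")


def edge_attrs(color="black", style="solid", label=""):
    """Build keyword arguments for an edge, rejecting unknown style names."""
    if style not in VALID_STYLES:
        raise ValueError(f"style must be one of {VALID_STYLES}, got {style!r}")
    attrs = {"color": color, "style": style}
    if label:
        # Only pass a label when there is actually some text to show.
        attrs["label"] = label
    return attrs


print(edge_attrs(color="red", style="bold"))
# prints {'color': 'red', 'style': 'bold'}

# With the package installed, the result can be unpacked into an Edge:
# Edge(**edge_attrs(color="red", style="bold"))
```

The check runs before anything is drawn, so the error points at the offending call rather than at the rendering step.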


Node Customization

Let's talk about the second point, node customization. What does that mean? Customizing nodes means displaying a Node with an image that is not already pre-registered in the package image bank and therefore creating a Node that does not exist.

Do you want to represent the sending of an email in case of an error? This is not among the nodes available in the package, but all you need to do is get an image representing an email and, using the Custom node class, you can integrate this new Node into your diagram. So there are a lot of Node possibilities and the only limit is your imagination.

    from diagrams import Diagram, Cluster, Edge
    from diagrams.aws.analytics import Glue, Quicksight
    from diagrams.aws.database import RDS
    from diagrams.aws.management import Cloudwatch
    from diagrams.aws.storage import S3
    from diagrams.custom import Custom


    def arrow(color='black', style='solid', label=None):
        """Same helper as in the edge customization example."""
        return Edge(color=color, style=style, label=label)


    with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_Go', show=True, direction="LR"):
        with Cluster('RDS'):
            db = RDS('PostgreSQL BDD\nstored in RDS')
        with Cluster('AWS Glue (ETL)'):
            jobs = Glue('Data Engineering\n(Filter,\njoin,\nrename..)')
        with Cluster('Cloudwatch'):
            log = Cloudwatch('Monitoring Scripts')
        with Cluster('S3'):
            bucket = S3('S3 Buckets\nto store\nAWS Glue outputs')
        with Cluster('Quicksight'):
            dashboard = Quicksight('Dashboard\nfor monitoring')
        with Cluster('Devs'):
            # Custom nodes take a label and the path to a local image file
            houcem = Custom('Houcem\nLead DS', '.../Custom/houcem.png')
            nico = Custom('Nico\nDS', '.../Custom/nico.png')
            dev = [nico, houcem]

        # Main pipeline and logging
        db >> arrow(color='red', style='bold') >> jobs >> arrow(color='red', style='bold') >> bucket >> \
            arrow(color='red', style='bold') >> dashboard
        jobs >> arrow(color='hotpink', style='dashed') >> log

        # Parts of the pipeline each developer worked on
        houcem >> arrow(color='sandybrown', style='dotted') >> jobs
        houcem >> arrow(color='sandybrown', style='dotted') >> log
        nico >> arrow(color='blue', style='dotted') >> jobs
        nico >> arrow(color='blue', style='dotted') >> log
        nico >> arrow(color='blue', style='dotted') >> bucket
        nico >> arrow(color='blue', style='dotted') >> dashboard

Here, I chose to represent the developers who worked on this project by specifying which parts of the pipeline they worked on. To make it more visual, I created two new Nodes with the photos of the developers, so it is clear who to contact in case of problems with the pipeline.
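
The email alert mentioned earlier could be sketched along the same lines. This is a minimal sketch under assumptions: resolve_icon and build_alert_diagram are hypothetical helpers, the labels and the output filename are made up, and the image has to be provided by you.

```python
from pathlib import Path


def resolve_icon(icon_path):
    """Return the absolute path of an icon file, or raise if it is missing."""
    path = Path(icon_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Icon image not found: {path}")
    return str(path)


def build_alert_diagram(icon_path, filename="email_alert"):
    """Draw a Glue job that points to an email alert node on error."""
    # Imported here so resolve_icon stays usable without the package installed.
    from diagrams import Diagram, Edge
    from diagrams.aws.analytics import Glue
    from diagrams.custom import Custom

    icon = resolve_icon(icon_path)
    with Diagram("Error alerting", filename=filename, show=False):
        job = Glue("ETL")
        email = Custom("Email alert", icon)
        job >> Edge(color="red", style="dashed", label="on error") >> email
```

Checking the image first gives a clear error message when the file is missing, instead of a failure somewhere in the rendering step.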


Finally, once you have mastered the various functionalities of the package, you can produce very rich diagrams. Here is an example:

    from diagrams import Diagram, Cluster, Edge
    from diagrams.aws.analytics import Glue, GlueCrawlers, GlueDataCatalog, Quicksight
    from diagrams.aws.database import RDS
    from diagrams.aws.management import Cloudwatch
    from diagrams.aws.storage import S3, SimpleStorageServiceS3BucketWithObjects, SimpleStorageServiceS3Object
    from diagrams.custom import Custom


    def arrow(color='black', style='solid', label=None):
        """Same helper as in the edge customization example."""
        return Edge(color=color, style=style, label=label)


    with Diagram('Pipeline - Global Overview', filename='Diagramm/Pipeline_Go', show=True, direction="LR"):
        with Cluster('RDS'):
            db = RDS('PostgreSQL BDD\nstored in RDS')
        with Cluster('AWS Glue'):
            crawler = GlueCrawlers('Glue\nCrawler')
            data_catalog = GlueDataCatalog('Glue\nDataCatalog')
            jobs = Glue('Glue Jobs')
            with Cluster('Jobs'):
                job = [Glue('Job for\nActivity\ntransformation'),
                       Glue('Job for\nAppointment\ntransformation')]
        with Cluster('Cloudwatch'):
            log = Cloudwatch('\n\n\nMonitoring Scripts')
        with Cluster('S3'):
            bucket = SimpleStorageServiceS3BucketWithObjects('S3 Buckets\nto store\nAWS Glue outputs')
            with Cluster('Objects within S3 bucket'):
                obj = [SimpleStorageServiceS3Object('Output from\nActivity\nTransformation Job'),
                       SimpleStorageServiceS3Object('Output from\nAppointment\nTransformation Job')]
        with Cluster('Quicksight'):
            dashboard = Quicksight('Dashboard\nfor monitoring')
        with Cluster('Devs'):
            houcem = Custom('Houcem\nLead DS', '.../Custom/houcem.png')
            nico = Custom('Nico\nDS', '.../Custom/nico2.png')
            dev = [nico, houcem]

        # Database, crawler, data catalog and jobs
        db >> arrow(color='red') >> data_catalog
        db << arrow(color='purple', style='dotted', label='Connect DB') << crawler >> \
            arrow(color='purple', style='dotted', label='to AWS Glue') >> data_catalog >> arrow(color='red') >> \
            jobs >> arrow(color='purple') >> job

        # Logging, storage and dashboard
        job >> arrow(color='hotpink', style='dashed') >> log
        job >> arrow(color='red') >> bucket
        bucket >> arrow(color='darkgreen') >> \
            obj >> arrow(color='red') >> dashboard

        # Parts of the pipeline each developer worked on
        houcem >> arrow(color='sandybrown', style='dotted') >> jobs
        houcem >> arrow(color='sandybrown', style='dotted') >> log
        nico >> arrow(color='blue', style='dotted') >> jobs
        nico >> arrow(color='blue', style='dotted') >> log
        nico >> arrow(color='blue', style='dotted') >> bucket
        nico >> arrow(color='blue', style='dotted') >> dashboard

First of all, note that the code is noticeably more complex than in the preceding diagrams. As for the diagram itself, it gives a genuinely detailed overview of the data processing pipeline, from the database to the dashboard, all in the AWS environment.
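
One way to keep such a script manageable is to describe the repeated developer arrows as data and loop over them. This is a minimal sketch in which OWNERSHIP and ownership_edges are hypothetical names, and the colors and targets are the ones used for the developers in the example above.

```python
OWNERSHIP = {
    "houcem": {"color": "sandybrown", "targets": ["jobs", "log"]},
    "nico": {"color": "blue", "targets": ["jobs", "log", "bucket", "dashboard"]},
}


def ownership_edges(ownership):
    """Yield (developer, target, color) triples, one per arrow to draw."""
    for developer, spec in ownership.items():
        for target in spec["targets"]:
            yield developer, target, spec["color"]


for developer, target, color in ownership_edges(OWNERSHIP):
    print(f"{developer} -> {target} ({color}, dotted)")
```

Inside the diagram, each triple would then become one line of the form source >> arrow(color=color, style='dotted') >> target, with the names looked up in a dictionary of node objects that you would have to build yourself.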


Conclusion

Diagrams is a package that lets you represent pipelines as diagrams with ease and flexibility. If you want more information or more advanced controls, I recommend looking at the Diagrams project on GitHub, especially its Issues section.

NB: This article was freely inspired by the article Create Beautiful Architecture Diagrams with Python, written by Dylan Roy.
