Part I - How to create your own publishing site like Medium?

SATURDAY, JUNE 20, 2020    

Have you ever wondered how the big technology companies deal with scalability issues while delivering high-quality, lightning-fast services to people all over the world? A look at a company’s technology stack gives us a glimpse into the foundation of its success. We want to explore and learn from the giants who have gone before us, including the mistakes they made on the way to the success they enjoy today. In this article we will look at Medium, an online publishing platform.


Medium was chosen because of the scalability issues it has had to overcome and the way it went about overcoming them. As you read this article, it is crucial to remember that technologies are only tools for businesses to achieve their goals and objectives. No matter how fancy a tool may be, if it does not effectively meet the business goals, it is just a bad tool. Here’s a simple picture to illustrate this point:


Image: a nail and a screw. Source: Frits Ahlefeldt


Choose the right tool for the job


In this article, once we have learnt how a technology giant sets up its architecture for a global audience, we will talk about how we can create our own high-performance content site. We will be covering two distinct topics:


#1. What is Medium


I will go through what Medium is and the technology stack that enables it to scale to millions of viewers per day. In this portion, we will delve into the different aspects of that stack, from the virtual servers chosen and the frontend tools to the CICD pipeline used for deployment.


#2. What is Medium made up of?


Imagine a layered cake. Every layer has its own flavour and ingredients. More importantly, the layers together give you the overall texture and taste in a single mouthful. What we will do is take the layers apart and savour each of them.


Source: Tasting Table


We will be breaking down its architecture into a few layers:

  • Physical server layer
  • Frontend layer
  • Backend layer
  • Data layer
  • Image Service layer
  • Testing Service layer
  • Continuous Integration Continuous Deployment layer
  • Configuration management layer

Note: We are splitting this post into two separate articles. This first part helps us understand how this tech giant sets up its infrastructure to deliver such a high-performance, high-speed site. In the second part, we will go through how we can actually create our own online publishing site using static site generators. If you wish to jump right into the second part, you can check it out here.




#1. What is Medium?


Medium is an online publishing platform built for people like you and me to write about what we have learnt, and it seeks to provide a platform for content creation and sharing. If you are reading this post, there is a good chance you know what Medium is. I would say the posts tagged @alltopstartups on Medium are really insightful, and most of them are purposeful as well.


To reach millions of daily users like you and me, a massive amount of work and tooling is required to get the whole architecture in place. Remember what we mentioned about using the right tool for the job? When you truly understand the different components of a scalable architecture, you will be intentional about optimising every part of the workflow, and I believe Medium did just that. I do wonder whether having so many individual components costs more in maintenance than it returns in benefits, but that is another topic to delve into as we go along.



#2. What is Medium made up of?


Listed here is a non-exhaustive set of technologies Medium has used in its stack. In every world-class application, there is always one top priority: speed.


To achieve speed, you have to find your bottlenecks and learn how to leverage your tools to overcome a wide range of issues. Medium is a classic example of using a specific, specialized tool for each particular part of the work process. We will delve deeper into the use cases of each of these tools as well.


#    Technology                    Purpose
1    GraphicsMagick
2    SQS queue processor           Background tasks
3    Urban Airship                 Notifications
4    SQS                           Queue processing
5    Bloomd                        Bloom filters
6    PubSubHubbub & Superfeedr     RSS
7    S3                            Static assets
8    Nginx + HAProxy               Reverse proxy and load balancing
9    Datadog                       Monitoring
10   PagerDuty                     Alerting
11   ELK                           Debugging production issues
12   TinyMCE                       Editing content
13   XCTest and OCMock             Testing
14   Mockito + Robolectric         Mocking in tests
15   Jenkins                       CICD
16   CloudFront                    Content Delivery Network
17   Cloudflare                    Static files and DDoS protection
18   Amazon Redshift               Data warehouse
19   Go                            Image server
20   Node.js                       Backend logic
21   Amazon EC2                    Backend servers
22   DynamoDB                      Database
23   Redis cache                   Database caching
24   Amazon Aurora                 Querying



Physical server layer


  • EC2 as their servers

As most of you know, EC2 (Elastic Compute Cloud) is the AWS service that provides virtual servers in the cloud. In my previous [article](/link to second half of may), I curated the free services available to developers, including some good cloud platforms. AWS has some free-tier services as well, so do check them out!




Frontend layer


  • TinyMCE for editing

Have you used WordPress or Blogspot before? Whenever you write a post, you use a rich text editor like this one that lets you style your words.


Source: TinyMCE GitHub


  • Closure compiler library

It is a tool for making JavaScript download and run faster. Instead of compiling from a source language to machine code, it compiles from JavaScript to better JavaScript. It parses your JavaScript, analyzes it, removes dead code and rewrites and minimizes what’s left. It also checks syntax, variable references, and types, and warns about common JavaScript pitfalls.




Backend layer


  • Node.js

In short, Node.js is a JavaScript runtime built on Chrome’s V8 JavaScript engine. It is an open-source, cross-platform runtime environment that executes JavaScript code outside a web browser. Don’t confuse it with a programming language: Node.js is the environment that runs the code, and JavaScript is the language the code is written in.




Data layer


  • Amazon DynamoDB (NoSQL database)

Generally, you want to use a NoSQL database for large volumes of data that have little to no structure. Amazon itself has published a comprehensive comparison between relational and non-relational databases, and it shows that NoSQL databases are designed to scale out, which is a necessary feature for a company that seeks to serve a global audience.


However, every decision and tool comes with its own limitations and shortcomings. One of the most common issues with DynamoDB, or any similar big data database, is that applications rarely access their data in a uniform pattern. Some keys are requested far more often than others, and this creates an issue called hotspots.


As an illustration, imagine a database server as a receptionist in a mall. The receptionist only needs to answer queries about where the different stores are.


“Hi receptionist, may I know where the nearest toilet is?”
(Takes some time to process the question, then points the customer in the right direction.)
“Hi receptionist, may I know where McDonald’s is?”
(Takes some time to process the question, then points the customer in the right direction.)


Now, as more customers start asking your receptionist questions, you find that the throughput of the server is rather limited, like the sloth in Zootopia.


Source: Pinterest and Disney


So you decide to spend a bit more to hire a more competent receptionist, and the time taken to process each query drops significantly. But as even more customers arrive, the waiting line starts to pile up no matter how fast your new receptionist is. So, as the mall’s crowd manager, you decide to open another counter, staffed by a receptionist just as competent as the first. Now you are serving twice as many queries!


In the same way, in the technical sense, whenever you hit a resource limit on your server, you have two options:


a. Scale vertically


As described in the example, this is like hiring a more competent receptionist. It gives you the capacity to deal with more queries per second. And, as in real life, that usually translates to a higher cost as well.


In the technical sense, this means adding more memory or CPU to an existing server so that it has more compute power and can handle more queries. But this approach has its limits: however many resources you allocate, a single database server will still have a ceiling on the number of queries it can serve per unit of time. This is why many modern databases also rely on the next option.


b. Scale horizontally


In the example, scaling horizontally is the decision to open another counter. As you start to hit the ceiling of a single database server, you open more counters to serve more customers just as efficiently.


In the technical sense, you are increasing the number of database servers so you can serve more requests per second. This act of adding servers is called scaling horizontally. In Naruto, they call this Kage Bunshin no Jutsu.



Now, here’s the problem faced by the servers!


Assume your server can handle up to 1,000 queries per second (QPS). If you split the data into 2 sets, each on its own server, you can get up to 2,000 QPS, as long as the data is accessed uniformly.


As a more realistic example, imagine you are storing people’s names in a database. You have a lot of names to store and want fast access to them, so you decide to scale your database out to 26 servers, one per letter, and you partition the data by the first letter of each person’s last name. You can potentially get to 26,000 QPS, if the queries are spread uniformly across all letters. However, if 99% of the requests are for “Jonathan Yap”, almost all of them will hit the server holding the letter Y, which means your actual throughput will stay at roughly 1,000 QPS, right where the initial problem started. This is what a hotspot, or hot key, is.
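To put numbers on the name-sharding example above, here is a minimal Python sketch. It is only a back-of-the-envelope model, not how DynamoDB actually partitions data: the function names, the sample names other than “Jonathan Yap”, and the idea that the busiest server alone limits the whole cluster are all assumptions made for illustration.

```python
from collections import Counter

PER_SERVER_QPS = 1_000  # capacity of one server, taken from the example above


def shard_for(full_name):
    """Pick a shard using the first letter of the last name."""
    last_name = full_name.split()[-1]
    return last_name[0].upper()


def max_sustainable_qps(requests):
    """Estimate the total QPS the cluster can serve for this traffic mix.

    The busiest shard saturates first, so the cluster tops out at
    PER_SERVER_QPS divided by the share of traffic hitting that shard.
    """
    load = Counter(shard_for(name) for name in requests)
    hottest_share = max(load.values()) / len(requests)
    return PER_SERVER_QPS / hottest_share


# Uniform traffic: one made-up last name per letter, requested equally often.
uniform = [f"Person {chr(ord('A') + i)}son" for i in range(26)]

# Skewed traffic: 99 out of every 100 requests ask for the same person.
skewed = ["Jonathan Yap"] * 99 + ["Alice Tan"]

print(round(max_sustainable_qps(uniform)))  # 26000
print(round(max_sustainable_qps(skewed)))   # 1010
```

With uniform traffic all 26 servers share the work. With the skewed mix, 25 servers sit almost idle and the cluster manages barely more than a single server would.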


In the world of big data, a DynamoDB table might be served by 20-30 servers, but accessing the same key over and over will create hotspots on a small subset of them, hurting the overall performance of your service. Here’s an article that talks about this problem. Check it out here.


I personally think data warehousing, and data-related problems in general, are really intriguing and a domain of their own. We will not go through much of it here, but we will definitely revisit these topics in the near future. If you are a software engineer who works with data, this is a space you have to understand and get familiar with.


  • Redis cache in front of the database

  • Amazon Aurora for querying

  • Amazon Redshift for data warehouse

  • Neo4j for saving relationships between the data nodes


You can find out more about Neo4j here.


This tool is really interesting, and its value can only be properly appreciated once we understand the nature of the problem it is trying to solve. While servers and databases are essential components of any application, Neo4j lets you run quick queries over connected data by addressing the limitations of relational databases. I will definitely delve deeper into this tool when the opportunity arises.


Essentially, it is a native graph database: the edges themselves are stored in the database, so it can follow a relationship directly from one node to the next instead of going through an index lookup.


In the case of Medium, people, posts, tags, and collections are nodes in the graph. Edges are created on entity creation and when people perform actions such as follow, recommend, and highlight. For curious readers, I’ve added a YouTube video that gives a deeper understanding of Neo4j, with the disclaimer that it can be really technical and difficult for non-technical readers to follow.
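To make the idea of nodes and edges concrete, here is a toy in-memory graph in Python. It is not Neo4j’s API or query language, and the class, node names, and relationship labels are all made up for illustration. It only shows why storing edges next to each node makes “follow the relationship” queries cheap.

```python
from collections import defaultdict


class TinyGraph:
    """A toy in-memory graph: each node keeps its own list of outgoing edges."""

    def __init__(self):
        # node -> list of (relationship, target node)
        self.edges = defaultdict(list)

    def add_edge(self, source, relationship, target):
        self.edges[source].append((relationship, target))

    def neighbours(self, source, relationship):
        """Follow edges of one type directly from a node, without scanning other nodes."""
        return [target for rel, target in self.edges[source] if rel == relationship]


graph = TinyGraph()
graph.add_edge("user:alice", "FOLLOWS", "user:bob")
graph.add_edge("user:bob", "WROTE", "post:scaling-101")
graph.add_edge("user:alice", "RECOMMENDS", "post:scaling-101")
graph.add_edge("post:scaling-101", "TAGGED", "tag:architecture")

# Which posts were written by the people Alice follows?
feed = [
    post
    for person in graph.neighbours("user:alice", "FOLLOWS")
    for post in graph.neighbours(person, "WROTE")
]
print(feed)  # ['post:scaling-101']
```

In a relational database the same question would typically need joins across a users table, a follows table, and a posts table. Here each hop only touches the edges of the node it starts from.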





Image service layer


  • GoLang

Go is easy to build, package, and deploy. The Medium team are fans of opinionated languages in a team environment, since they improve consistency, reduce ambiguity, and ultimately give you less rope to hang yourself.


Medium’s image servers are now written in Go and use a waterfall strategy for serving processed images. The servers use groupcache, which provides a memcache alternative while helping to reduce duplicated work across the fleet. The in-memory cache is backed by a persistent S3 cache; if an image is in neither, it is processed on demand. This gives Medium’s designers the flexibility to change image presentation and optimize for different platforms without having to run large batch jobs to generate resized images.
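The waterfall strategy described above can be sketched in a few lines. Medium’s real servers are written in Go and use groupcache and S3; the Python below swaps those for two plain dictionaries and a placeholder resize function, all of which are assumptions made purely to show the order of the lookups.

```python
memory_cache = {}      # stand-in for the in-memory cache (groupcache at Medium)
persistent_cache = {}  # stand-in for the persistent S3 cache


def resize(image_id, width):
    """Placeholder for the expensive image-processing step."""
    return f"{image_id}@{width}px"


def get_image(image_id, width):
    """Return (image, source), trying the cheapest layer first."""
    key = (image_id, width)
    if key in memory_cache:
        return memory_cache[key], "memory"
    if key in persistent_cache:
        memory_cache[key] = persistent_cache[key]  # promote to the fast layer
        return memory_cache[key], "persistent"
    image = resize(image_id, width)  # nothing cached: process on demand
    persistent_cache[key] = image
    memory_cache[key] = image
    return image, "processed"


print(get_image("cover.jpg", 800))  # ('cover.jpg@800px', 'processed')
print(get_image("cover.jpg", 800))  # ('cover.jpg@800px', 'memory')
```

Because a new size is generated the first time someone asks for it, changing the image presentation does not require regenerating every image in advance.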


  • CloudFront for Content Delivery Network (CDN)

Cloudflare is used for serving static files and for DDoS protection. Medium sends 5% of traffic to Fastly and CloudFront to keep their caches warm in case of an emergency. Generally, you use a CDN to improve page load time by serving content from locations geographically closer to your users. I’ve attached a short YouTube video that I find useful for clarifying the use case of this tool.





Testing


  • XCTest and OCMock for testing

  • Mockito and Robolectric for writing high-level tests that spin up activities

If you want a good-quality product, you have to iterate on its quality many times. No good product reaches world-class quality on the first try; it has been iterated on and refined over and over to get there. In the case of software, if you want to iterate fast, you’ve got to have test coverage.


How do you know that changing one part of the software’s code does not break another part of the product? Manual testing takes so much time that fast iteration becomes impossible. If you can write code to automate your services and processes, why not write code to automate your testing as well?
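As a small illustration of what “code that tests code” looks like, here is a sketch in Python. Medium’s mobile tests use XCTest, OCMock, Mockito, and Robolectric; this example is not taken from their codebase, and the reading-time function is a hypothetical feature invented for the demo.

```python
def reading_time_minutes(word_count, words_per_minute=200):
    """Estimate reading time, rounding up and never returning less than 1."""
    return max(1, -(-word_count // words_per_minute))


def test_short_post_takes_one_minute():
    assert reading_time_minutes(50) == 1


def test_partial_minutes_round_up():
    assert reading_time_minutes(401) == 3


def test_empty_post_still_shows_one_minute():
    assert reading_time_minutes(0) == 1


# A test runner such as pytest would discover the test_ functions on its own;
# this loop just lets the file run without one.
for test in (
    test_short_post_takes_one_minute,
    test_partial_minutes_round_up,
    test_empty_post_still_shows_one_minute,
):
    test()
print("all tests passed")
```

If someone later changes the rounding logic and breaks the empty-post case, the third test fails within seconds instead of waiting for a user to notice.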


According to Medium’s own article, here is the iteration story behind their high-quality product:


Every commit is automatically pushed to the Play Store as an alpha build, which goes out to Medium staff right away. This allows for quick feedback loops. Every Friday they move the alpha build to beta so that employees can play with things over the weekend. On Monday, they promote it from beta to production. Since their code is always production-ready, when a bug is found they get the fix out to production immediately. If they are worried about a new feature, they let the beta soak a little longer. When they are excited, they release even more frequently.




CICD pipeline


  • Jenkins for the Continuous Integration Continuous Deployment (CICD) process. Make was used as the build system, but they have since migrated to Pants.

I personally find CICD amazing and revolutionary in the space of software development. Many people have already covered what CICD is and the concepts under its umbrella.


Here’s one Medium article that covers this topic quite comprehensively. Check it out here.




Configuration management


  • Ansible for system management, keeping configuration under source control

I really love Ansible as a config management tool. It is typically used by DevOps engineers, or practically anyone who has to manage a fleet of servers. Changes can be rolled out to every server with a single command, and Ansible focuses on achieving the desired state of a server instead of blindly running scripts, which could easily leave duplicated state behind if a script were executed twice. For example, say I want to run a script that adds 127.0.0.1 localhost to my hosts file at /etc/hosts:


## /etc/hosts
127.0.0.1		localhost

What if I run this script twice? I will get the undesirable outcome of a duplicated entry, as seen below.


## /etc/hosts
127.0.0.1		localhost
127.0.0.1		localhost

Ansible lets us avoid this by checking the desired state of the server’s hosts file. It checks whether the 127.0.0.1 entry is already in the file. If so, it skips the step, because the desired state of having one localhost entry already exists. If not, it adds the line to reach that desired state. This is a high-level view of what is possible with Ansible.
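The “check the desired state first” idea can be sketched in plain Python. Ansible ships a lineinfile module for this kind of task; the function below is not that module, just a hand-rolled illustration with a hypothetical name, and it works on a temporary file rather than the real /etc/hosts.

```python
import tempfile
from pathlib import Path


def ensure_line(path, line):
    """Make sure `line` appears in the file; return True only if the file changed."""
    file = Path(path)
    existing = file.read_text().splitlines() if file.exists() else []
    if line in existing:
        return False  # desired state already reached, nothing to do
    existing.append(line)
    file.write_text("\n".join(existing) + "\n")
    return True


# Use a scratch file instead of the real /etc/hosts, which needs root access.
with tempfile.TemporaryDirectory() as tmp:
    hosts = Path(tmp) / "hosts"
    print(ensure_line(hosts, "127.0.0.1\tlocalhost"))  # True: line added
    print(ensure_line(hosts, "127.0.0.1\tlocalhost"))  # False: already present
    print(hosts.read_text().count("localhost"))        # 1
```

Running the function a second time changes nothing, which is exactly the property the naive append script was missing.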


We have now gone through Medium’s technology stack fairly thoroughly and tried to understand the use case each tool addresses. Some may see it as simply a mash of technologies thrown into the mix and stirred. But if we truly understand the work involved in delivering a high-speed online publishing service to people all around the world, we can also appreciate the problems the different tools seek to address.




TL;DR


There is no magic behind good software. There will always be a lot of plumbing work required to ensure that the service is optimized and served to end users at its best performance. Doing that requires you to understand the workflow of the business in depth, then find the right components to serve that purpose. For most common applications, we reach for multi-purpose tools that cover several layers, since it is more cost-efficient for a smaller team to deal with fewer technologies. As you seek to enter the global space, that is where the split into specialized tools has to happen!


In our next part, we will be going through how you can actually go about creating your own online publishing site! Check it out here.




Resources: