
Companies that scale think through design patterns
An introduction to a company's technology stack provides a glimpse into the foundation of its success story. We want to explore and learn from the giants who have gone before us, and from the mistakes they made on the way to the success they enjoy today. In this article we will look at these companies:
- Medium
- Netflix
- Line
- Spotify
These companies were chosen for the scalability issues they had to overcome and how they went about solving them. As you read this article, it is crucial to understand that technology is only a tool for businesses to achieve their goals and objectives. No matter how fancy a tool may be, if it does not effectively meet the business goals, it is just a bad tool. Here's a simple photo to illustrate this point:
source: Frits Ahlefeldt
In short, choose the right tool for the job.
Medium’s architecture
| # | Technology | Purpose |
|---|---|---|
| 1 | GraphicsMagick | Image processing |
| 2 | SQS queue processor | Background tasks |
| 3 | Urban Airship | Notifications |
| 4 | SQS | Queue processing |
| 5 | Bloomd | Bloom filters |
| 6 | PubSubHubbub & Superfeedr | RSS |
| 7 | S3 | Static assets |
| 8 | Nginx + HAProxy | Reverse proxy and load balancing |
| 9 | Datadog | Monitoring |
| 10 | PagerDuty | Alerting |
| 11 | ELK | Debugging production issues |
| 12 | TinyMCE | Editing content |
| 13 | XCTest and OCMock | Testing (iOS) |
| 14 | Mockito + Robolectric | Mocking and tests (Android) |
| 15 | Jenkins | CI/CD |
| 16 | CloudFront | Content delivery network |
| 17 | Cloudflare | Static files and DDoS protection |
| 18 | Amazon Redshift | Data warehouse |
| 19 | Go | Image server |
| 20 | Node.js | Backend logic |
| 21 | Amazon EC2 | Backend servers |
| 22 | DynamoDB | Database |
| 23 | Redis | Database caching |
| 24 | Amazon Aurora | Querying |
- EC2 as their servers
- Ansible for system management, keeping configuration under source control
- Service-oriented architecture
- DynamoDB (NoSQL database)
One of the most common issues faced when using DynamoDB, or any similar big-data database, is that your data is not accessed in a uniform pattern.
Scaling vertically — adding more memory, more CPU — has its limits, so all modern databases get around these problems by scaling horizontally: adding more servers.
Assume your server can handle up to 1,000 queries per second (QPS). If you break the data into two sets, you can get up to 2,000 QPS, as long as the data is accessed uniformly.
Imagine you are storing people's names in a database. You have a lot of names to store and want fast access to them, so you decide to scale your database to 26 servers, one server per letter, splitting the data by the first letter of each person's last name. You can potentially reach 26,000 QPS, if queries are spread across all letters uniformly. However, if 99% of the requests are for "Jonathan Yap", you will mostly hit the server holding the letter Y, which means your actual QPS will not rise much above 1,000, right back where the problem started. This is what a hotspot, or hot key, is.
In the world of big data, a DynamoDB table might be served by 20 to 30 servers, but accessing the same key over and over will create hotspots on a smaller subset of them, negatively impacting the overall performance of your service.
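To make the arithmetic above concrete, here is a small sketch (the numbers and helper names are hypothetical, chosen to match the example) of how a hot key caps the throughput of a sharded store:

```javascript
// Hypothetical sketch: sharding names across 26 servers by the first
// letter of the last name, and seeing how skewed traffic hurts throughput.
const SHARDS = 26;
const QPS_PER_SHARD = 1000;

// Map a last name to a shard index (0 = 'A' ... 25 = 'Z').
function shardFor(lastName) {
  return lastName.toUpperCase().charCodeAt(0) - 65;
}

// Given a batch of queries, effective QPS is limited by the hottest shard:
// the shard receiving fraction p of traffic saturates at QPS_PER_SHARD / p.
function effectiveQps(queries) {
  const counts = new Array(SHARDS).fill(0);
  for (const name of queries) counts[shardFor(name)]++;
  const hottestShare = Math.max(...counts) / queries.length;
  return Math.floor(QPS_PER_SHARD / hottestShare);
}

// Uniform traffic across all 26 letters approaches 26,000 QPS; 99% of
// queries hitting "Yap" collapses throughput back toward 1,000 QPS.
```

Running `effectiveQps` on a uniform batch returns 26,000, while a batch that is 99% "Yap" returns roughly 1,010, which is the hotspot effect described above.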
- Redis cache in front of the database
- Amazon Aurora for querying
- Neo4j for saving relationships between nodes of data
A native graph database does not need to run through an index, because the edges are stored directly in the database.
People, posts, tags, and collections are nodes in the graph. Edges are created on entity creation and when people perform actions such as follow, recommend, and highlight.
- Node.js for backend logic
Performance becomes an issue when the event loop is blocked.
The event loop is what allows Node.js to perform non-blocking I/O operations — despite the fact that JavaScript is single-threaded — by offloading operations to the system kernel whenever possible.
Their way to deal with the performance issue is to run multiple instances per machine, and route expensive endpoints to specific instances, isolating them.
They hooked into the V8 runtime to get insight into which ticks were taking a long time; object reification during JSON deserialization turned out to be a culprit. This can be done using Node.js's built-in profiler.
Application Performance Management (APM) is a way to monitor the performance and availability of software applications, typically presented as charts of performance metrics such as request count, response times, CPU usage, and memory utilization.
https://sematext.com/integrations/nodejs-monitoring/
Sematext is one such tool for Node.js monitoring.
Monitor resource utilization:
It should give you a way of improving your application so you don't spend money, time, and resources on unnecessary infrastructure to compensate for inefficient code.
Node.js API Latency
What you can do to notice slow services is to collect data on a service level. If you have several APIs, make sure to analyze latency for each of them. This will give you more insight into the real reason why your services are slow.
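As a sketch of that idea, here is a minimal per-route latency collector; the function names and the choice of the 95th percentile are illustrative, not taken from any particular APM product:

```javascript
// Hypothetical sketch: collect latency per API route so slow endpoints
// stand out, instead of one aggregate number for the whole service.
const latencies = new Map(); // route -> array of durations in ms

function recordLatency(route, ms) {
  if (!latencies.has(route)) latencies.set(route, []);
  latencies.get(route).push(ms);
}

// Wrap any async handler so its duration is recorded under its route name.
function timed(route, handler) {
  return async (...args) => {
    const start = process.hrtime.bigint();
    try {
      return await handler(...args);
    } finally {
      const ms = Number(process.hrtime.bigint() - start) / 1e6;
      recordLatency(route, ms);
    }
  };
}

// 95th-percentile latency for a route: a more honest signal than the
// average, since a few very slow requests dominate user experience.
function p95(route) {
  const sorted = [...(latencies.get(route) || [])].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length * 0.95)];
}
```

Comparing `p95` across routes points you at the endpoint worth profiling first.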
https://nodejs.org/en/docs/guides/simple-profiling/
## Run the express server first
NODE_ENV=production node --prof app.js
## Try to load test it
curl -X GET "http://localhost:8080/newUser?username=matt&password=password"
ab -k -c 20 -n 250 "http://localhost:8080/auth?username=matt&password=password"
## process the tick log file that was generated
node --prof-process isolate-0xnnnnnnnnnnnn-v8.log > processed.txt
Use this output to find the target of your optimization.
Thankfully, you’ve fully internalized the benefits of asynchronous programming and you realize that the work to generate a hash from the user’s password is being done in a synchronous way and thus tying down the event loop.
By changing the function to be asynchronous, you are no longer tying down the event loop, and the worker can proceed to handle other requests while the hash is being computed.
Go
Easy to build, package, and deploy. They are fans of using opinionated languages in a team environment; it improves consistency, reduces ambiguity, and ultimately gives you less rope to hang yourself with.
Our image server is now written in Go and uses a waterfall strategy for serving processed images. The servers use groupcache, which provides a memcache alternative while helping to reduce duplicated work across the fleet. The in-memory cache is backed by a persistent S3 cache; then images are processed on demand. This gives our designers the flexibility to change image presentation and optimize for different platforms without having to do large batch jobs to generate resized images.
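A rough sketch of that waterfall strategy follows; the cache layers and `processImage` are stand-ins for illustration, not Medium's actual Go code:

```javascript
// Hypothetical sketch of the waterfall: in-memory cache -> persistent
// cache (standing in for S3) -> process the image on demand.
async function serveImage(key, memCache, s3Cache, processImage) {
  if (memCache.has(key)) return memCache.get(key); // fastest tier

  if (s3Cache.has(key)) { // persistent tier
    const img = s3Cache.get(key);
    memCache.set(key, img); // promote so the next hit is in-memory
    return img;
  }

  const img = await processImage(key); // slowest tier: generate on demand
  s3Cache.set(key, img); // back-fill both caches
  memCache.set(key, img);
  return img;
}
```

Because images are generated on demand and back-filled, designers can change image presentation without batch-regenerating every size, exactly the flexibility described above.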
CloudFront for CDN
Cloudflare for serving static files and DDoS protection. They send 5% of traffic to Fastly and CloudFront to keep those caches warm in case of an emergency.
Amazon Redshift for the data warehouse, providing scalable storage and processing systems that other tools are built on.
TinyMCE for editing
A single-page application framework that uses Closure as its standard library.
Closure Compiler for optimizing JavaScript.
The Closure Compiler is a tool for making JavaScript download and run faster. Instead of compiling from a source language to machine code, it compiles from JavaScript to better JavaScript. It parses your JavaScript, analyzes it, removes dead code and rewrites and minimizes what’s left. It also checks syntax, variable references, and types, and warns about common JavaScript pitfalls.
XCTest and OCMock for testing
Mockito and Robolectric for writing high-level tests that spin up activities.
Every commit is automatically pushed to the Play Store as an alpha build, which goes out to Medium staff right away, allowing quick feedback loops. Every Friday they move alpha features into beta so employees can play with them over the weekend. On Monday, they promote beta to production. Since their code is production ready, when a bug is found they get the fix out to production immediately. If they are worried about a new feature, that feature stays in beta a little longer; when they are excited, they release even more frequently.
All clients use server-supplied feature flags, called variants, for A/B testing and guarding unfinished features.
Jenkins for the CI/CD process. Make was used as the build system, but they have since migrated to Pants.
A combination of unit tests and HTTP-level functional tests.
They worked with the team at Box to use ClusterRunner to distribute the tests and keep the testing process fast.
Deploy to staging every 15 minutes - and successful builds are then used as candidates for production. Main app servers normally deploy around five times a day, but sometimes as many as 10 times.
We do blue/green deploys. For production we send traffic to a canary instance, and the release process monitors error rates before proceeding with the deploy. Rollbacks are internal DNS flips.
Netflix’s architecture
Highly integrated right from the start
The monolith allowed for rapid development and quick changes while knowledge of the space was still limited. At one point, there were 30 engineers working on the application, with more than 300 data tables.
Once the application evolved from broad service offerings towards being highly specialized, it was time to decompose the monolith into specific services. The decision was not driven by performance issues, but by the need to set boundaries between different domains, enabling dedicated teams to develop domain-specific services independently. Being ready for swappable data sources was crucial to ensure applications are not tightly coupled to a particular data source. That decoupling of data sources led them to the Hexagonal Architecture.
Please pause to read the Hexagonal Architecture that I’ve covered in my other article.
At Netflix, they identified three key concepts in their domain logic:
- Entities, which are your movies and shooting locations
- Repositories, the interfaces that give you the methods needed to interact with the entities. A UserRepository, for example, contains a list of methods for communicating with the data source.
- Interactors, classes that perform the domain actions. They implement complex business rules, such as onboarding a production.
The data source adapters take the interfaces defined by the repositories and provide their implementations for the different data sources. Whatever the source is, the entities do not need to know about it.
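A minimal sketch of that layering, with the repository interface duck-typed in JavaScript; the class names are illustrative, not Netflix's actual code:

```javascript
// Interactor: a domain action, written only against the repository
// interface (any object exposing getById). It never sees the data source.
class OnboardProduction {
  constructor(productionRepository) {
    this.repo = productionRepository;
  }
  async execute(id) {
    const production = await this.repo.getById(id);
    return { ...production, onboarded: true }; // the "business rule"
  }
}

// Data source adapter #1: a JSON-API-backed repository.
class JsonApiProductionRepository {
  async getById(id) {
    return { id, title: `json-${id}` }; // stand-in for an HTTP call
  }
}

// Data source adapter #2: a GraphQL-backed repository. Swapping this in
// requires no change to the interactor; the core logic never knew the
// previous source was a JSON API.
class GraphqlProductionRepository {
  async getById(id) {
    return { id, title: `graphql-${id}` }; // stand-in for a GraphQL query
  }
}
```

The interactor works unchanged with either adapter, which is exactly what makes the data source swappable.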
With this in place, Netflix was able to change a data source from a JSON API to a GraphQL data source within two hours. That makes sense: since all the technical implementation of the JSON API resides in the data source adapter, the business logic is completely ignorant of whichever implementation is in place.
Instead of pinging actual services, they relied on contract testing.
At the end of the day, the Hexagonal Architecture puts Netflix in a great position to consider alternative data sources and enjoy the flexibility of swapping them without worrying about the core logic of the application. At the beginning of a project, when the team has the least information about the system they are building, they should not lock themselves into an architecture through uninformed decisions, leading to a project paradox.
The best part of the Hexagonal Architecture is that it keeps the Netflix application flexible for future requirements. As with all projects, a well thought through architecture is half the battle won.
Line’s live stream chat Architecture
https://engineering.linecorp.com/en/blog/detail/85/
The challenge is the overwhelming volume of comments and chat messages in a single stream. The solution they chose was to segment users into separate "chat rooms", so that users can only chat with other users in the same room. As the chat rooms are distributed among multiple servers, the users' connections are also distributed among multiple servers, even for users in the same chat room.
Three key components to achieve such a high throughput demand:
- WebSocket to communicate between the client and server
Since a WebSocket connection is persistent once established, an HTTP request is no longer required every time a user sends a comment, allowing resources to be used more efficiently.
- High-speed parallel processing between servers using the Akka toolkit
https://www.youtube.com/watch?v=efoRiFrkTAA

Understanding the Akka actor system is important to understand how this tool is used to meet the technical situation at hand.
You cannot guarantee the sequence of messages:

An actor can have different addresses, and one address need not point to one single actor.
Supervision = the state of an actor is monitored and managed by another actor (the Supervisor)
- Constantly monitors running state of actor
- Can perform actions based on the state of the actor (e.g. an unhandled error)
- Errors are escalated to the supervisor to take action, much like an operations supervision hierarchy.
Transparent lifecycle management
There is a queue, the mailbox, sitting in front of the actor and located together with the actor under the same address. When the actor dies, the mailbox is retained so it can be referenced once the actor restarts.
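A hand-rolled sketch of that mailbox idea, for illustration only (this is not Akka, which runs actors on the JVM with full supervision and distribution):

```javascript
// Minimal actor: a mailbox queue sits in front of the behavior, and
// messages are processed strictly one at a time, never concurrently.
class Actor {
  constructor(behavior) {
    this.behavior = behavior; // async function handling one message
    this.mailbox = [];
    this.processing = false;
  }

  // Fire-and-forget send: enqueue the message and kick off draining.
  send(message) {
    this.mailbox.push(message);
    if (!this.processing) this.drain();
  }

  // Drain the mailbox sequentially; new messages arriving mid-drain are
  // appended to the queue and picked up in order.
  async drain() {
    this.processing = true;
    while (this.mailbox.length > 0) {
      await this.behavior(this.mailbox.shift());
    }
    this.processing = false;
  }
}
```

Because the mailbox serializes access, the behavior can mutate its own state freely, which is the shared-state use case described below.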
Use cases:
- Processing pipeline. Some processing is required; once it is done, the actor pushes out a message.
- Streaming data. You don't need a response for every message you receive; you can filter.
- Multi-user concurrency. Actors are persistent.
- Systems with high uptime requirements. Thanks to the transparent lifecycle, once an actor encounters an error the supervisor simply restarts it and the system self-heals.
- Applications with shared state. Since there is a queue in front of each actor, requests are processed sequentially.
Anti use cases:
- You are working on a non-concurrent system
- Performance-critical applications. Actors are fast, but they are abstractions on top of processes and threads; they hide those details from you and get the work done when resources are available.
- Non-concurrent communication is involved, where communication has to happen sequentially with a dependency on another step.
- There is no mutable state.
Drawbacks to be aware:
- Too many actors. Actors are synchronous, dealing with the messages in their mailbox one at a time.
- Testing can be tough, since you have to be clear about which actor should be executing what. A black-box test may make more sense.
- Debugging can be tough with state and concurrency, especially since actors are just abstractions. In practice, when a thread goes crazy, you do not actually know which actor is misbehaving; and even if you bring that thread down, you cannot be certain the actor will not simply be recreated on another thread.
Node.js is not well suited to the actor model, given its single-threaded nature.
Related reading: "Don't use actors for concurrency".
- Comment synchronisation between servers by using Redis
The Redis cluster is used for comment synchronization between servers and as temporary storage for comments and metrics data. Users from the same chat room are connected to different servers, and all the comments generated in the same chat room are synchronized using the Redis Pub/Sub feature.
Once the stream ends, all the data is migrated into MySQL for permanent storage.
---
Spotify's architecture
Spotify's CDN solution with Akamai and AWS for business-critical content, such as audio streaming, was performing well and had been honed to achieve low latency and high bandwidth.
What is it about Fastly's edge cloud platform that makes it suitable for delivering audio streams? Fastly uses its own configuration language, the Varnish Configuration Language (VCL), to perform intelligent caching, pushing application logic to the network edge. It can tailor the user experience based on factors such as location, language, and device type. Fastly's Edge Dictionaries maintain key-value stores that are not human-readable and are only referenced in VCL.
Consider using smoke tests for end-to-end testing, to ensure your logic is working as intended at every step; otherwise you can always roll back and debug.
Services that handle personal data are tagged to ensure they can better maintain GDPR compliance, and promote appropriate defaults with respect to caching and purging.