Quickly building a highly scalable and customizable real-time chat platform on AWS
In today’s world, almost every other app needs real-time chat. Here, I will explain how I built such a system for a product company using AWS.
I won’t be using MQTT, XMPP, or any other chat solutions, as all of them have a learning curve and system restrictions.

Quick scalability is required for any social app in today’s world
Chat platform
Primary requirements
- There are users with different roles who can communicate in a P2P manner or in a group
- Messaging should be delivered instantaneously if the receiver is online
- If the receiver is offline, he should get the message immediately when he comes online, if not already while the app is not in use
Scalable
Which parameters should be scalable?
- Users
- User sessions
- Chat Groups
- Concurrent connections
- Stored messages
- Real-time messages
Customizable
As a startup, you never know how the requirements might change, and the system should be prepared to change in the blink of an eye. Hence, I did not opt for ready-made solutions such as MQTT and XMPP.
Such solutions have their own learning curve and cannot be modified as much as we want them to be.
My requirements
- Group chat system: two or more members in a chat
- Timer: can be started, played, paused, or stopped
- Payment: the user can be charged
Solution
Without further ado, I will get into the exact architecture that I built with the reasons behind it
Cloud Solution
AWS: it was a pretty straightforward answer at that point in time, as other platforms such as GCP and Azure were quite immature. Even today, I would prefer AWS over the others from a stability point of view. (However, GCP’s ML solutions are way better than AWS’s.)
DB
DynamoDB: it was obvious that I needed a NoSQL DB, since at a startup you sometimes need to add 10 attributes to a class within a day.
I used DynamoDB over MongoDB because AWS provided Dynamo as an out-of-the-box managed solution, so I did not have to worry about uptime, backups, capacity, and so on. It provided almost infinite reads and writes per second (well, not literally).
All I had to do was increase or decrease the read and write capacity of each index.
Now, there was a constraint in Dynamo that irritated me at first: I could have only 6 indexes per table. But thanks to such constraints, I ended up designing my DB in a much more optimized manner than if I had been free to create as many indexes as I wanted.
Controller
EC2 with Node.js: well, that was a straightforward solution, to use EC2 instances behind their auto-scaling solution. (Though if you are just getting started, I wouldn’t recommend looking into auto-scaling right away. Just start off with a normal EC2 instance; introducing auto-scaling later on is a piece of cake.)
However, if I were to start off today, I would recommend using Lambda functions over EC2, as they add one more layer of managed system. (I did have to wake up at night several times because the EC2 instance was down, which would not be the case with Lambda functions.)
On a side note, we had a Redis DB on each EC2 instance to cache recurring DB queries, such as fetching the user ID based on the token.
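The caching idea above can be sketched as a cache-aside lookup. This is a minimal in-memory illustration, not the original Redis setup: the `TokenCache` class, the `fake_db_lookup` function, and the TTL value are all hypothetical names and numbers chosen for the example, and a plain dictionary stands in for Redis.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TokenCache:
    """Cache-aside lookup of user ids by session token, with a TTL per entry."""

    def __init__(self, load_user_id: Callable[[str], Optional[str]],
                 ttl_seconds: float = 300.0) -> None:
        self._load = load_user_id  # hypothetical: the slow database lookup
        self._ttl = ttl_seconds    # assumed expiry, not from the article
        self._entries: Dict[str, Tuple[str, float]] = {}  # token -> (user_id, expires_at)

    def user_id_for(self, token: str) -> Optional[str]:
        entry = self._entries.get(token)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]              # cache hit: no database call
        user_id = self._load(token)      # cache miss: ask the database
        if user_id is not None:
            self._entries[token] = (user_id, time.monotonic() + self._ttl)
        return user_id


# Usage with a fake "database" that records every query it receives.
db_calls = []


def fake_db_lookup(token: str) -> Optional[str]:
    db_calls.append(token)
    return {"tok-1": "user-42"}.get(token)


cache = TokenCache(fake_db_lookup, ttl_seconds=60)
first = cache.user_id_for("tok-1")   # goes to the database
second = cache.user_id_for("tok-1")  # served from the cache
```

Only the first lookup reaches the database; repeated lookups for the same token within the TTL are answered from memory.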
View
Android: native Java Android with SQLite to store messages
iOS: Objective-C (Swift was not stable in 2014), Core Data
Web: AngularJS, Web Storage, IndexedDB
DB Design
I will cover the basic DB design from a chat app point of view:
- A user table (that’s where you always start)
 Secondary user tables, as we cannot have more than 6 indexes per table in DynamoDB
- A user session token table: this will be used to identify the user’s device as well
- A chat group table: here the primary key will be the chat group id and the secondary key will be the id of a user in that chat group
- A messages table: the primary key will be the chat group id (kind of a foreign key to the table above) and the secondary key will be a random message id
Well, that’s all you need for a basic chat.
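To make the key layout of the chat group and messages tables concrete, here is a minimal in-memory sketch. It does not talk to DynamoDB; the class and attribute names (`GroupMember`, `Message`, `MessagesTable`, `sender_id`, `body`) are assumptions made for illustration, and a dictionary keyed by (primary key, secondary key) stands in for the table.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class GroupMember:
    chat_group_id: str  # primary key
    user_id: str        # secondary key


@dataclass(frozen=True)
class Message:
    chat_group_id: str  # primary key, points at the chat group table
    message_id: str     # secondary key, random
    sender_id: str      # hypothetical attribute
    body: str           # hypothetical attribute


class MessagesTable:
    """In-memory stand-in: items are addressed by (chat_group_id, message_id)."""

    def __init__(self) -> None:
        self._items: Dict[Tuple[str, str], Message] = {}

    def put(self, chat_group_id: str, sender_id: str, body: str) -> Message:
        msg = Message(chat_group_id, uuid.uuid4().hex, sender_id, body)
        self._items[(msg.chat_group_id, msg.message_id)] = msg
        return msg

    def query_group(self, chat_group_id: str) -> List[Message]:
        # All items that share the same primary key value.
        return [m for (gid, _), m in self._items.items() if gid == chat_group_id]


table = MessagesTable()
table.put("g1", "alice", "hi")
table.put("g1", "bob", "hello")
table.put("g2", "carol", "hey")
g1_messages = table.query_group("g1")
```

Note that a random message id keeps items unique but does not order them by time, so chronological ordering would need a separate attribute such as a timestamp (an assumption, not something the design above specifies).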
Messaging(pub-sub)
I used AWS SQS to send and receive messages for each user.
- There is a parent queue for each and every user; let’s call it the parent-user queue
- Each session token that a user has gets its own queue corresponding to that user session (SQS lets you create a practically unlimited number of queues, and SQS is available in AWS regions around the world)
 The name of this queue is simply created by concatenating the user id with the session id
We can consider each queue to be a message bucket for that user session.
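The queue naming can be sketched as below. The exact format is an assumption: the `parent-` prefix, the hyphen separator, and the function names are hypothetical. The validation reflects the SQS rule that a queue name may be up to 80 characters long and contain only alphanumeric characters, hyphens, and underscores.

```python
import re

# SQS queue names: 1 to 80 characters; letters, digits, hyphens, underscores.
_VALID_NAME = re.compile(r"[A-Za-z0-9_-]{1,80}")


def _checked(name: str) -> str:
    if _VALID_NAME.fullmatch(name) is None:
        raise ValueError(f"not a valid queue name: {name!r}")
    return name


def parent_queue_name(user_id: str) -> str:
    # Hypothetical format for the per-user parent queue.
    return _checked(f"parent-{user_id}")


def session_queue_name(user_id: str, session_id: str) -> str:
    # User id concatenated with session id; the separator is an assumption.
    return _checked(f"{user_id}-{session_id}")


name = session_queue_name("42", "abc123")  # "42-abc123"
```

If ids can themselves contain hyphens, a different separator or a fixed-width id would be needed to keep names unambiguous.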
A noteworthy feature of SQS is long polling, wherein we can long poll SQS for up to 20 seconds. In this case, if there is a message already present in the queue (bucket), it will respond immediately; otherwise, it will wait up to 20 seconds for the next message to appear. E.g., if the client started long polling at time t = 0 seconds and a message appeared in the bucket at t = 12 seconds, SQS will respond with that message immediately at the 12th second.
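The long-polling behaviour can be imitated locally with a blocking queue. This is a sketch of the semantics only, not SQS itself: `long_poll` and `bucket` are hypothetical names, and the times are scaled down (a 2-second window and a message arriving after 0.3 seconds) so the example finishes quickly. With SQS, the wait is requested through the `WaitTimeSeconds` parameter of `ReceiveMessage`, which accepts at most 20.

```python
import queue
import threading
import time
from typing import List


def long_poll(bucket: "queue.Queue[str]", wait_seconds: float) -> List[str]:
    """Return waiting messages at once, or block up to wait_seconds for the first one."""
    try:
        messages = [bucket.get(timeout=wait_seconds)]
    except queue.Empty:
        return []  # nothing arrived within the window
    while True:    # drain anything else that is already queued
        try:
            messages.append(bucket.get_nowait())
        except queue.Empty:
            return messages


bucket: "queue.Queue[str]" = queue.Queue()

# A message shows up 0.3 s after the client starts polling.
threading.Timer(0.3, bucket.put, args=["hello"]).start()
started = time.monotonic()
received = long_poll(bucket, wait_seconds=2.0)  # returns as soon as "hello" arrives
elapsed = time.monotonic() - started

# Nothing else is queued, so this poll waits out its window and returns empty.
empty = long_poll(bucket, wait_seconds=0.2)
```

The first call returns well before its 2-second window ends, which mirrors the t = 12 seconds example above.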
Architecture
Now, let’s discuss the messaging flow for each chat group. Each chat group can have one or more users (in my case, mostly 2 users).
Message flow: here is how the messages flow in the system
- Whenever a user is on the app and logged in, he can send a message to a group
- First, the message is stored in the messages table
- Then we look up all the users in that group
- Based on the user id, we look up all the valid session tokens corresponding to each user
- The message is then sent to each and every user-session queue (bucket). E.g., if a user has 3 devices, implying 3 session tokens, the message is sent to each of the 3 user-session queues
- Now, on the client side, the client will be long polling the server every 20 seconds or so
- In case (1), there is already a message in the queue: SQS will respond immediately, and all the messages will be delivered to the client
- In case (2), no message is present: SQS will respond saying there are no messages in the queue and the connection will end after 20 seconds, after which the client will make a new connection
- In case (3), a message appears within the 20-second time frame: SQS responds immediately with the message, which will be shown to the user
- Hence the message is delivered to all the users in the group on all their devices
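The message flow above can be sketched end to end with in-memory stand-ins. Everything here is illustrative: `ChatBackend`, its attribute names, and the queue-key format are assumptions, dictionaries replace the DynamoDB tables, and plain lists replace the SQS queues (so `poll` returns immediately instead of long polling).

```python
from collections import defaultdict
from typing import Dict, List, Set


class ChatBackend:
    def __init__(self) -> None:
        self.messages: List[dict] = []                               # messages table
        self.group_members: Dict[str, Set[str]] = defaultdict(set)   # chat group table
        self.sessions: Dict[str, Set[str]] = defaultdict(set)        # session token table
        self.queues: Dict[str, List[dict]] = defaultdict(list)       # one bucket per user session

    def send(self, group_id: str, sender_id: str, body: str) -> int:
        message = {"group": group_id, "from": sender_id, "body": body}
        self.messages.append(message)                  # store the message first
        delivered = 0
        for user_id in self.group_members[group_id]:   # every user in the group
            for session_id in self.sessions[user_id]:  # every valid session of that user
                # Fan out to the user-session queue (the sender's devices included).
                self.queues[f"{user_id}-{session_id}"].append(message)
                delivered += 1
        return delivered

    def poll(self, user_id: str, session_id: str) -> List[dict]:
        # Hand over everything in the bucket and leave it empty.
        key = f"{user_id}-{session_id}"
        pending, self.queues[key] = self.queues[key], []
        return pending


backend = ChatBackend()
backend.group_members["g1"] = {"alice", "bob"}
backend.sessions["alice"] = {"phone", "laptop", "tablet"}  # 3 devices, 3 sessions
backend.sessions["bob"] = {"phone"}

copies = backend.send("g1", "bob", "hi")       # 3 queues for alice + 1 for bob
alice_phone = backend.poll("alice", "phone")   # the message, delivered once
alice_again = backend.poll("alice", "phone")   # bucket is now empty
```

One stored message becomes one queued copy per session, which is why a user with 3 devices sees the message on all of them.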
Pros and cons of this system
Pros
- As all the services used are managed, we do not have to worry about service up-time at all
- Highly customizable: one can add any number of components, as and where needed
Cons
- As this is completely based on managed solutions, there is a heavy platform dependency. Moving from one cloud service to another (say, from AWS to GCP) would mean a lot of work. One way to resolve this is to use in-house solutions on EC2 machines (e.g. Kafka instead of SQS and SNS), but that would mean a lot of overhead for a lean team
- Pricing can get expensive, but only at a very late stage
TL;DR: use as many managed services as possible to speed up development; you can solve message polling and queuing with a queuing service like AWS SQS

