Message ordering is a serious issue to consider when architecting a message-based application, as the RabbitMQ, ActiveMQ, and Amazon SQS messaging platforms cannot guarantee global message ordering with parallel workers. In fact, Amazon SQS is known for unpredictable ordering of messages because its infrastructure is heavily distributed and message ordering is not supported.
You can learn more about some interesting ways of dealing with message
ordering.w14,w52
Message Requeueing
As previously mentioned, messages can be requeued in some failure scenarios.
Dealing with this problem can be easy or difficult, depending on the application
needs. A strategy worth considering is to depend on at-least-once delivery instead
of exactly-once delivery. By allowing messages to be delivered to your consumers
more than once, you make your system more robust and reduce constraints put
on the message queue and its workers. For this approach to work, you need to
make all of your consumers idempotent, which may be difficult or even impossible
in some cases.
An idempotent consumer is a consumer that can process the same
message multiple times without affecting the final result. An example of
an idempotent operation would be setting a price to $55. An example
of a nonidempotent operation would be to “increase price by $5.” The
difference is that increasing the price by $5 twice would increase it by a
total of $10. Processing such a message twice affects the final result. In
contrast, setting the price to $55 once or twice leaves the system in the
same state.
Unfortunately, making all consumers idempotent may not be an easy thing to
do. Sending e-mails is, by nature, not an idempotent operation, because sending
two e-mails to the customer does not produce the same result as sending just a
single e-mail. Adding an extra layer of tracking and persistence could help, but
it would add a lot of complexity and may not be able to handle all of the failure
scenarios. Instead, make consumers idempotent whenever it is practical, but
remember that enforcing it across the system may not always be worth the effort.
Finally, idempotent consumers may be more sensitive to messages being
processed out of order. If we had two messages, one to set the product’s price to
$55 and another one to set the price of the same product to $60, we could end up
with different results based on their processing order. Having two nonidempotent
consumers increasing the price by $5 each would be sensitive to message
requeueing (redelivery), but not to out-of-order delivery.
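To make the difference concrete, here is a minimal sketch (the in-memory products store and the message shapes are made up for illustration) contrasting an idempotent "set price" handler with a nonidempotent "increase price" handler; redelivering the first message is harmless, while redelivering the second changes the final result.

```python
# Hypothetical in-memory data store standing in for your database.
products = {"sku-123": {"price": 50.00}}

def handle_set_price(message):
    # Idempotent: processing the same message twice leaves the same state.
    products[message["sku"]]["price"] = message["price"]

def handle_increase_price(message):
    # Not idempotent: each redelivery changes the final result.
    products[message["sku"]]["price"] += message["amount"]

set_msg = {"sku": "sku-123", "price": 55.00}
inc_msg = {"sku": "sku-123", "amount": 5.00}

# Simulate a redelivery: the same message is processed twice.
handle_set_price(set_msg)
handle_set_price(set_msg)
assert products["sku-123"]["price"] == 55.00  # still $55

handle_increase_price(inc_msg)
handle_increase_price(inc_msg)
print(products["sku-123"]["price"])  # 65.00, not the intended 60.00
```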
Race Conditions Become More Likely
One of the biggest challenges related to asynchronous systems is that things that
would happen in a well-defined order in a traditional programming model can
suddenly happen in a much more unexpected order. As a result, asynchronous programming is more unpredictable by nature and more prone to race conditions,
as work is broken down into much smaller chunks and there are more possible
orders of execution.
Since asynchronous calls are made in a nonblocking way, message producers
can continue execution without waiting for the results of the asynchronous call.
Different message consumers may also execute in a different order because there
is no built-in synchronization mechanism. Different parts of an asynchronous
system, especially a distributed one, can have different throughput, causing
uneven latency in message propagation throughout the system.
Especially when a system is under heavy load, during failure conditions and
deployments, code execution may become slower in different parts of the system.
This, in turn, makes things more likely to happen in unexpected order. Some
consumers may get their messages much later than others, causing hard-to-
reproduce bugs.
HINT
You could say that asynchronous programming is programming without a call stack.w11 Things simply execute as soon as they are able to, instead of proceeding step by step as in traditional programming.
The increased risk of race conditions is mainly caused by the message-ordering
issue discussed earlier. Get into a habit of careful code review, with an explicit
search for race conditions and out-of-order processing bugs. Doing so will increase
your chance of mitigating issues and building more robust solutions. The less you
assume about the state of an asynchronous system, the better.
Risk of Increased Complexity
Systems built as hybrids of traditional imperative and message-oriented code
can become more complex because their message flow is not explicitly declared
anywhere. When you look at the producer, there is no way of telling where the
consumers are or what they do. When you look at the consumer, you cannot be sure
under what conditions messages are published. As the system grows and messaging
is added ad hoc through the code, without considering the overall architecture,
it may become more and more difficult to understand the dependencies.
When integrating applications using a message broker, you must be very diligent
in documenting dependencies and the overarching message flow. Remember the
discussion about levels of abstraction and how you should be able to build the mental
picture of the system (Chapter 2). Without good documentation of the message
routes and visibility of how the messages flow through the system, you may increase
the complexity and make it much harder for developers to understand how the
system works.
Keep things simple and automate documentation creation so it will be generated
based on the code itself. If you manage to keep documentation of your messaging in
sync with your code, you should be able to find your way through the dependencies.
Message Queue–Related Anti-Patterns
In addition to message queue–related challenges, I would like to highlight a few
common design anti-patterns. Engineers tend to think alike, and they often create
similar solutions to similar problems. When the solution proves to be successful
over and over again, we call it a pattern, but when the solution is repeatedly
difficult to maintain or extend, we call it an anti-pattern. A typical anti-pattern is
a solution that seems like a good idea at first, but the longer you use it, the more
issues you discover with it. By getting familiar with anti-patterns, you should be
able to easily avoid them in the future—it is like getting a vaccination against a
common design bug.
Treating the Message Queue as a TCP Socket
Some message brokers allow you to create return channels. A return channel
becomes a way for the consumer to send a message back to the producer. If
you use it a lot, you may end up with an application that is more synchronous
than asynchronous. Ideally, you would want your messages to be truly one-way
requests (fire-and-forget). Opening a response channel and waiting for response
messages makes messaging components more coupled and undermines some
of the benefits of messaging. Response channels may also mean that failures
of different components on different sides of the message broker may have an
impact on one another. When building scalable systems, avoid return channels, as
they usually lead to synchronous processing and excessive resource consumption.
Treating the Message Queue as a Database
You should not allow random access to elements of the queue. You should not
allow deleting messages or updating them, as this will lead to increased complexity.
It is best to think of a message queue as an append-only stream (FIFO). It is most
common to see such deformations when the message queue is built on top of a
relational database or NoSQL engine because this allows secondary indexes and
random access to messages. Using random access to modify and delete messages
may prevent you from scaling out and migrating to a different messaging broker.
If you have to delete or update messages in flight (when they are in the middle
of the queue), you are probably doing something wrong or applying messaging to the wrong use case.
Coupling Message Producers with Consumers
As I mentioned before, it is best to avoid explicit dependency between producers
and consumers. You should not hardcode class names or expect messages to
be produced or consumed by any particular piece of code. It is best to think of
the message broker as being the endpoint and the message body as being the
contract. There should be no assumptions or any additional knowledge necessary.
If something is not declared explicitly in the message contract, it should be an
implementation detail, and it should not matter to the other side of the contract.
For example, a flawed implementation I saw involved serializing an entire
object and adding it to the message body. This meant that the consumer had to
have this particular class available, and it was not able to process the message
without executing the serialized object’s code. Even worse, it meant that the
consumer had to be implemented in the same technology as the producer and its
deployment had to be coordinated to prevent class mismatches. Messages should
not have “logic” or executable code within. A message should be a data transfer object,10 or, simply put, a string of bytes that can be written and read by both consumer and producer.
Treat the format of the message as a contract that both sides need to understand,
and disallow any other type of coupling.
Lack of Poison Message Handling
When working with message queues, you have to be able to handle broken messages
and bugs in consumer code. A common anti-pattern is to assume that messages
are always valid. A message of death (also known as a poison message) is a
message that causes a consumer to crash or fail in some unexpected way. If your
messaging system is not able to handle such cases gracefully, you can freeze your
entire message-processing pipeline, as every time a consumer crashes, the broker
will requeue the message and resend it to another consumer. Even with auto-
respawning consumer processes, you would freeze the pipeline, as all of your consumers would keep crashing and reprocessing the same message indefinitely.
To prevent that scenario, you need to plan for failures. You have to assume that
components of your messaging platform will crash, go offline, stall, and fail in
unexpected ways. You also have to assume that messages may be corrupt or even
malicious. Assuming that everything will work as expected is the quickest way to build an unavailable system.
HINT
Hope for the best, prepare for the worst.
You can deal with a poison message in different ways depending on which
message broker you use. In ActiveMQ you can use dead-letter queue policies out
of the box.25 All you need to do is set limits for your messages, and they will be
automatically removed from the queue after a certain number of failures. If you
use Amazon SQS, you can implement poison message handling in your own code
by using an approximate delivery counter. Every time a message is redelivered,
SQS increments its approximate delivery counter so that your application can easily recognize messages of death and route them to a custom dead-letter queue or simply discard them. Similarly, in RabbitMQ you get a boolean flag telling you whether a message has been delivered before, which can be used to build dead-letter queue functionality. Unfortunately, it is not as convenient as having a counter or out-of-the-box support.
Whenever you use message queues, you simply have to implement poison
message handling.
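As a rough illustration of the SQS approach described above, the following sketch uses the boto3 client and the ApproximateReceiveCount message attribute; the queue URLs, the delivery limit, and the process callback are placeholders, and a production consumer would need more careful error handling and logging.

```python
import boto3

sqs = boto3.client("sqs")

# Placeholder queue URLs -- substitute your own.
MAIN_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/main-queue"
DEAD_LETTER_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/dead-letter-queue"
MAX_DELIVERIES = 5  # give up on a message after this many delivery attempts

def consume_forever(process):
    """Poll the main queue and pass message bodies to the process callback."""
    while True:
        response = sqs.receive_message(
            QueueUrl=MAIN_QUEUE_URL,
            AttributeNames=["ApproximateReceiveCount"],
            MaxNumberOfMessages=10,
            WaitTimeSeconds=10,  # long polling to avoid paying for empty polls
        )
        for message in response.get("Messages", []):
            deliveries = int(message["Attributes"]["ApproximateReceiveCount"])
            if deliveries > MAX_DELIVERIES:
                # Likely a poison message: park it instead of crashing again.
                sqs.send_message(QueueUrl=DEAD_LETTER_QUEUE_URL,
                                 MessageBody=message["Body"])
            else:
                try:
                    process(message["Body"])
                except Exception:
                    # Leave it in the queue; SQS will redeliver it after the
                    # visibility timeout and the counter will keep growing.
                    continue
            sqs.delete_message(QueueUrl=MAIN_QUEUE_URL,
                               ReceiptHandle=message["ReceiptHandle"])
```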
Quick Comparison of Selected Messaging Platforms
Choosing a message broker is similar to choosing a database management system.
Most of them work for most use cases, but it always pays to know what you are
dealing with before making a commitment. This section is a quick overview of
the three most common message brokers: Amazon Simple Queue Service (SQS),
RabbitMQ, and ActiveMQ.
Unfortunately, there is no way to recommend a messaging platform without
knowing details of the application use cases, so you may have to do some more
research before making your final decision. I recommend reading more25,12,L1–L3 to
learn specific details about selected platforms. Here, let’s focus on the strengths
and best use cases of each platform, which should empower you with the
knowledge necessary to begin your own selection.
Amazon Simple Queue Service
Amazon SQS is known for its simplicity and pragmatic approach. SQS is a cloud-
based service provided by Amazon with a public application programming
interface (API) and software development kit (SDK) libraries available for most
programming languages. It is hosted and managed by Amazon, and users are
charged pro rata for the number of messages they publish and the number of service calls they issue.
If you are hosting your application on Amazon EC2, Amazon SQS, which is
a hosted messaging platform, is certainly worth considering. The main benefit
of using SQS is that you do not have to manage anything yourself. You do not
have to scale it, you do not need to hire additional experts, you do not need to
worry about failures. You do not even need to pay for additional virtual server
instances that would need to run your message brokers. SQS takes care of the
infrastructure, availability, and scalability, making sure that messages can be
published and consumed all the time.
If you work for a startup following the Lean Startup methodology, you should
consider leveraging SQS to your advantage. Lean Startup advocates minimum viable product (MVP) development and a quick feedback loop.30,9 If SQS
functionality is enough for your needs, you benefit in the following ways:
▶▶ Deliver your MVP faster because there is no setup, no configuration, no
maintenance, no surprises.
▶▶ Focus on the product and customers instead of spending time on the
infrastructure and resolving technical issues.
▶▶ Save money by using SQS rather than managing message brokers yourself.
Saving time and money in early development stages (first 6 to 12 months) is
critical, because your startup may change direction very rapidly. Startup reality
is so unpredictable that a few months after the MVP release, you may realize that
you don’t need the messaging component at all, and then all the time invested
into it would become a waste!
If you do not prioritize every dollar and every minute spent, your startup may
run out of money before ever finding product-market fit (offering the right service
to the right people). SQS is often a great fit for early-stage startups, as it has the
lowest up-front time and money cost.
HINT
Any up-front cost, whether it is money or time, may become a waste. The higher the chance of
changes, the higher the risk of investment becoming a waste.
To demonstrate the competitiveness of Amazon SQS, let’s have a look at
a simple cost comparison. To deploy a highly available message broker using
ActiveMQ or RabbitMQ, you will need at least two servers. If you are using
Amazon EC2, at the time of writing, two medium-sized reserved instances would
cost you roughly $2,000 a year. In comparison, if you used SQS and needed, on
average, four requests per message, you would be able to publish and process one
billion messages per year for the same amount of money. That is 32 messages per
second, on average, throughout the entire year.
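For the curious, here is a back-of-the-envelope check of those numbers; the per-request price is inferred from the figures above (roughly $0.50 per million requests at the time of writing) and will not match current AWS pricing.

```python
# Sanity-check of the cost comparison above. The per-request price is an
# assumption inferred from the book's figures, not current AWS pricing.
yearly_budget_usd = 2000.0            # ~ two medium reserved EC2 instances per year
price_per_request = 0.50 / 1_000_000  # assumed ~$0.50 per million SQS requests
requests_per_message = 4              # publish, receive, delete, plus polling overhead

requests_per_year = yearly_budget_usd / price_per_request      # 4 billion requests
messages_per_year = requests_per_year / requests_per_message   # 1 billion messages
messages_per_second = messages_per_year / (365 * 24 * 3600)    # ~31.7 msg/s

print(round(messages_per_year), round(messages_per_second, 1))
```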
In addition, by using SQS you can save hours needed to develop, deploy,
manage, upgrade, and configure your own message brokers, which can easily add
up to thousands of dollars per year. Even if you assumed that initial time effort
to get message brokers set up and integrated would take you a week of up-front
work, plus an hour a week of ongoing maintenance effort, you would end up with
at least two weeks of time spent looking after your broker rather than looking
after your customers’ needs.
Simply put, if you don’t expect large message volumes, or you don’t know
what to expect at all, you are better off using SQS. SQS offers just the most basic
functionality, so even if you decide to use your own messaging broker later on,
you should have no problems migrating away from it. All you need to do when
integrating with SQS is to make sure your publishers and consumers are not
coupled directly to SQS SDK code. I recommend using thin wrappers and your
own interfaces together with design patterns like Dependency Injection, Factory,
façade, and Strategy.1,7,10 Figure 7-20 shows how your infrastructure becomes
simplified by removing custom messaging brokers and using SQS.
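A minimal sketch of such a wrapper might look like the following; the interface and class names are made up, and the point is only that producers depend on your own MessageQueue abstraction rather than on the SQS SDK directly.

```python
import json
from abc import ABC, abstractmethod

import boto3


class MessageQueue(ABC):
    """The application's own queue abstraction -- the only thing producers see."""

    @abstractmethod
    def publish(self, payload: dict) -> None: ...


class SqsMessageQueue(MessageQueue):
    """Thin adapter hiding the SQS SDK behind the MessageQueue interface."""

    def __init__(self, queue_url: str):
        self._sqs = boto3.client("sqs")
        self._queue_url = queue_url

    def publish(self, payload: dict) -> None:
        self._sqs.send_message(QueueUrl=self._queue_url,
                               MessageBody=json.dumps(payload))


# Producers depend on MessageQueue only; the concrete class is injected, so a
# later migration to another broker means writing another adapter, not
# touching producer code.
def place_order(queue: MessageQueue, order: dict) -> None:
    queue.publish({"type": "OrderPlaced", "order": order})
```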
When it comes to scalability, SQS performs very well. It scales automatically
according to your needs and provides really impressive throughput without any
preparation or capacity planning. You should be able to publish and consume
tens of thousands of messages per second, per queue (assuming multiple
concurrent clients). Adding more queues, producers, and consumers should allow you to scale without limits.

Figure 7-20 Simplified infrastructure depending on Amazon SQS (application servers and queue workers use the hosted Amazon SQS service instead of self-managed message brokers)
It is important to remember that SQS is not a golden hammer, though. It scales
well, but it has its limitations. Let’s quickly discuss its disadvantages.
First of all, Amazon had to sacrifice some features and guarantees to be able to
scale SQS easily. For example, SQS does not provide any complex routing mechanisms and is generally less flexible than RabbitMQ or ActiveMQ.12,25,L3
If you decide to use SQS, you will not be able to deploy your own logic into it or
modify it in any way, as it is a hosted service. You either use it as is, or you don’t use
it at all.
Second, SQS has limits on message size, and you may be charged extra if you
publish messages with large bodies (tens of kilobytes).
Another important thing to remember is that messages will be delivered out of
order using SQS and that you may see occasional redeliveries. Even if you have a
single producer, single queue, and single consumer, there is no message-ordering
guarantee whatsoever.
Finally, you pay per service call, which means that polling for nonexistent messages counts as a service call; it also means that sending thousands of messages
per second may become more expensive than using your own message broker.
If your company is a well-established business and you are not dealing with
a huge amount of uncertainty, it may be worth performing a deeper analysis of
available platforms and choose a self-managed messaging broker, which could
give you more flexibility and advanced features. Although SQS is great from a
scalability and up-front cost point of view, it has a very limited feature set. Let’s
see now what self-managed brokers can offer.
RabbitMQ
RabbitMQ is a high-performance platform created initially for financial institutions.
It provides a lot of valuable features out of the box, it is relatively simple to operate,
and it is extremely flexible. Flexibility is actually the thing that makes RabbitMQ
really stand out.
RabbitMQ supports two main messaging protocols—AMQP and STOMP—
and it is designed as a general-purpose messaging platform, with no preference for Java or any other programming language.
The most attractive feature of RabbitMQ is the ability to dynamically configure
routes and completely decouple publishers from consumers. In regular messaging, producers and consumers have to be coupled by a shared queue name or topic name. This means that different parts of the system have to be aware of one another to some extent.
In RabbitMQ, publishers and consumers are completely separated because they
interact with separate endpoint types. RabbitMQ introduces the concept of an exchange.
An exchange is just an abstract named endpoint to which publishers
address their messages. Publishers do not have to know topic names or
queue names as they publish messages to exchanges. Consumers, on
the other hand, consume messages from queues.
Publishers have to know the location of the message broker and the name
of the exchange, but they do not have to know anything else. Once a
message is published to an exchange, RabbitMQ applies routing rules
and sends copies of the message to all applicable queues. Once messages
appear in queues, consumers can consume them without knowing anything
about exchanges.
Figure 7-21 shows how RabbitMQ takes care of routing and insulates publishers
from consumers, both physically and logically. The trick is that routing rules
can be defined externally using a web administration interface, AMQP protocol,
or RabbitMQ’s REST API. You can declare routing rules in the publisher’s or
consumer’s code, but you are not required to do so. Your routing configuration
can be managed externally by a separate set of components.
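To give a feel for how little each side needs to know, here is a minimal sketch using the pika client; the exchange, queue, and routing key names are invented, publisher and consumer would normally live in separate processes, and the bindings shown in code could just as well be created through the management UI or the REST API.

```python
import pika

# Single local connection for brevity; in practice the publisher and consumer
# would be separate processes with their own connections.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# The publisher only needs to know the exchange name.
channel.exchange_declare(exchange="orders", exchange_type="topic")
channel.basic_publish(exchange="orders",
                      routing_key="order.created",
                      body='{"order_id": 1234}')

# The consumer only needs to know its queue; the binding wires the two together.
channel.queue_declare(queue="invoicing")
channel.queue_bind(queue="invoicing", exchange="orders", routing_key="order.*")

def handle_message(ch, method, properties, body):
    print("invoicing received:", body)

channel.basic_consume(queue="invoicing",
                      on_message_callback=handle_message,
                      auto_ack=True)
channel.start_consuming()  # blocks, waiting for routed messages
```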
Figure 7-21 RabbitMQ fully decoupling publishers from consumers (publishers A and B publish to exchanges A and B; routing delivers copies of the messages to queues 1 and 2, which are consumed by consumers X and Y)
If you think about message routing this way, you move closer towards
service-oriented architecture (SOA). In SOA, you create highly decoupled and
autonomous services that are fairly generic and that can be wired together
to build more complex applications using service orchestration and service
policies.31 In the context of RabbitMQ, you can think of it as an external
component that can be used to decide which parts of the system should
communicate with each other and how messages should flow throughout the
queues. The important thing about RabbitMQ routing is that you can change
these routing rules remotely, and you can do it on the fly, without the need to
restart any components.
It is worth noting that RabbitMQ can provide complex routing based on
custom routing key patterns and simpler schemas like direct queue publishing
and publish/subscribe.
Another important benefit of using RabbitMQ is that you can fully configure,
monitor, and control the message broker using its remote REST API. You can
use it to create any of the internal resources like hosts, nodes, queues, exchanges,
users, and routing rules. Basically, you can dynamically reconfigure any aspect
of the message broker without the need to restart anything or run custom code
on the broker machine. To make things even better, the REST API provided
by RabbitMQ is really well structured and documented. Figure 7-22 shows
RabbitMQ’s self-documenting endpoint, so you don’t even need to search for the
documentation of the API version you are running to learn all about it.
Figure 7-22 Fragment of RabbitMQ REST API documentation within the endpoint
When it comes to feature comparison, RabbitMQ is much richer than SQS and
supports more flexible routing than ActiveMQ. On the other hand, it does miss
a few nice-to-have features like scheduled message delivery. The only important
drawbacks of RabbitMQ are the lack of partial message ordering and poor poison
message support.
From a scalability point of view, RabbitMQ is similar to ActiveMQ. Its
performance is comparable to ActiveMQ as well. It supports different clustering
and replication topologies, but unfortunately, it does not scale horizontally out of
the box, and you would need to partition your messages across multiple brokers
to be able to scale horizontally. It is not very difficult, but it is not as easy as when
using SQS, which simply does it for you.
If you are not hosted on Amazon EC2 or you need more flexibility, RabbitMQ
is a good option for a message broker. If you are using scripting languages like
PHP, Python, Ruby, or Node.js, RabbitMQ will allow you to leverage its flexibility
and configure it at runtime using AMQP and RabbitMQ’s REST API.
ActiveMQ
The last message broker I would like to introduce is ActiveMQ. Its functionality
is similar to RabbitMQ and it has similar performance and scalability abilities.
The main difference is that it is written in Java and it can be run as an embedded
message broker within your application. This offers some advantages and may be
an important decision factor if you develop mainly in Java. Let’s go through some
of the ActiveMQ strengths first and then discuss some of its drawbacks.
Being able to run your application code within the message broker or run the
message broker within your application process allows you to use the same code
on both sides of the border. It also allows you to achieve much lower latency
because publishing messages within the same Java process is basically a memory
copy operation, which is orders of magnitude faster than sending data over a
network.
ActiveMQ does not provide advanced message routing like RabbitMQ,
but you can achieve the same level of sophistication by using Camel. Camel
is an integration framework designed to implement enterprise integration
patterns,10,31–32 and it is a great tool in extending ActiveMQ capabilities.
Camel allows you to define routes, filters, and message processors using XML
configuration and allows you to wire your own implementations of different
components. If you decide to use Camel, you will add extra technology to your
stack, increasing the complexity, but you will gain many advanced messaging
features.
In addition to being Java based, ActiveMQ implements a common messaging
interface called JMS (Java Message Service) and allows the creation of plugins, which are also written in Java.
Finally, ActiveMQ implements the message groups mentioned earlier, which allow you to partially guarantee ordered message delivery. This feature is quite unique; neither RabbitMQ nor SQS has anything like it. If you desperately need
FIFO-style messaging, you may want to use ActiveMQ.
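As a rough sketch of how message groups can be used even outside Java, the following example publishes over ActiveMQ's STOMP connector using the stomp.py client; the host, port, credentials, and queue name are placeholders. Messages sharing a JMSXGroupID header are dispatched to a single consumer in order.

```python
import stomp

# Connect to ActiveMQ's STOMP connector (61613 is its default port); the host,
# credentials, and queue name below are placeholders.
conn = stomp.Connection([("localhost", 61613)])
conn.connect("admin", "admin", wait=True)

# All messages sharing a JMSXGroupID are dispatched to the same consumer,
# in order, which gives you partial (per-group) FIFO ordering.
for event in ["created", "paid", "shipped"]:
    conn.send(destination="/queue/orders",
              body='{"order_id": 42, "event": "%s"}' % event,
              headers={"JMSXGroupID": "order-42"})

conn.disconnect()
```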
We went through some of the most important strengths of ActiveMQ, so now
it is time to mention some of its drawbacks.
First, ActiveMQ has much less flexible routing than RabbitMQ. You could use
Camel, but if you are not developing in Java, it would add to the burden for your
team. Also, Camel is not a simple technology to use, and I would recommend
using it only if you have some experienced engineers on the team. There are a
few features allowing you to build direct worker queues and persistent fan-out
queues, but you don’t have the ability to route messages based on more complex
criteria.
The second major drawback in comparison to RabbitMQ is that ActiveMQ
cannot be fully controlled using its remote API. In contrast, RabbitMQ can be
fully configured and monitored using a REST API. When dealing with ActiveMQ,
you can control some aspects of the message broker using the JMX (Java
Management Extensions) protocol, but it is not something you would like to use
when developing in languages other than Java.
Finally, ActiveMQ can be sensitive to large spikes of messages being published.
During load tests, I have seen ActiveMQ crash multiple times when overwhelmed by high message rates for extended periods of time. Although it is a stable platform, it does not have access to low-level functions like memory allocation and I/O control because it runs within the JVM.
It is still possible to run out of memory and crash the broker if you publish too
many messages too fast.
Final Comparison Notes
Comparing ActiveMQ and RabbitMQ based on Google Trends,L4 we can see that
RabbitMQ has gained a lot of popularity in recent years and both message brokers
are pretty much going head to head now (as of this writing). Figure 7-23 shows
ActiveMQ and RabbitMQ over the course of the last five years.
These trends may also be caused by the fact that RabbitMQ was acquired
by SpringSource, which is one of the top players in the world of Java, and that
ActiveMQ is being redeveloped from scratch under a new name, Apollo.
Figure 7-23 ActiveMQ and RabbitMQ search popularity according to Google Trends (2008–2014)
Another way to compare brokers is by looking at their high-availability focus
and how they handle extreme conditions. In this comparison, ActiveMQ scores
the worst of all three systems. It is relatively easy to stall or even crash ActiveMQ
by simply publishing messages faster than they can be routed or persisted.
Initially, ActiveMQ buffers messages in memory, but as soon as you run out of
RAM, it either stalls or crashes completely.
RabbitMQ performs better in such a scenario, as it has a built-in backpressure
feature. If messages are published faster than they can be processed or persisted,
RabbitMQ begins to throttle producers to avoid message loss and running out of
memory. The benefit of that approach is increased stability and reliability, but it
can cause unexpected delays on the publisher side, as publishing messages slows
down significantly whenever backpressure is triggered.
In this comparison, SQS performs better than both ActiveMQ and RabbitMQ,
as it supports very high throughput and Amazon is responsible for enforcing
high availability of the service. Although SQS is a hosted platform, you can still
experience throttling in some rare situations and you need to make sure that your
publishers can handle failures correctly. You do not have to worry about crashing
brokers, recovery procedures, or scalability of SQS, though, as it is managed by
Amazon.
No matter which of the three technologies you choose, throughput is always
finite and the best way to scale is by partitioning messages among multiple broker
instances (or queues in the case of SQS).
If you decide to use SQS, you should be able to publish tens of thousands of
messages per second, per queue, which is more than enough for most startups.
If you find yourself reaching that limit, you would need to create multiple
queue instances and distribute messages among them to scale out your overall
throughput. Since SQS does not preserve message ordering and has very few
advanced features, distributing messages among multiple SQS queues should be
as easy as picking one of the queues at random and publishing messages to it. On
the consumer side, you would need similar numbers of consumers subscribing
to each of the queues and similar hardware resources to provide even consumer
power.
If you decide to use ActiveMQ or RabbitMQ, your throughput per machine is
going to depend on many factors. Primarily you will be limited by CPU and RAM
of machines used (hardware or virtual servers), average message size, message
fan-out ratio (how many queues/consumers each message is delivered to), and
whether your messages are persisted to disk or not. Regardless of how many
messages per second you can process using a single broker instance, as you need
to scale out, your brokers need to be able to scale out horizontally as well.
As I mentioned before, neither ActiveMQ nor RabbitMQ supports horizontal
scalability out of the box, and you will need to implement application-level
partitioning to distribute messages among multiple broker instances. You would
do it in a similar way as you would deal with application-level data partitioning
described in Chapter 5. You would deploy multiple brokers and distribute
messages among them. Each broker would have the exact same configuration
with the same queues (or exchanges and routing). Each of the brokers would also
have a pool of dedicated consumers.
If you use ActiveMQ and depend on its message groups for partial message
ordering, you would need to use the message group ID as a sharding key so that
all of the messages would be published to the same broker, allowing it to enforce
ordering. Otherwise, assuming no message-ordering guarantees, you could select
brokers at random when publishing messages because from the publisher’s point
of view, each of them would be functionally equal.
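A minimal sketch of such application-level partitioning could look like this; the broker addresses are placeholders, and the hashing scheme is just one possible choice.

```python
import hashlib
import random

# Placeholder broker addresses -- each would run the same queue/exchange setup.
BROKERS = ["amqp://broker-a", "amqp://broker-b", "amqp://broker-c"]

def pick_broker_at_random() -> str:
    # Good enough when there are no ordering guarantees: brokers are equivalent.
    return random.choice(BROKERS)

def pick_broker_for_group(group_id: str) -> str:
    # Use the message group ID as the sharding key so every message of a group
    # goes to the same broker, which can then enforce per-group ordering.
    digest = hashlib.md5(group_id.encode("utf-8")).hexdigest()
    return BROKERS[int(digest, 16) % len(BROKERS)]

print(pick_broker_for_group("order-42"))  # always the same broker for order-42
```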
Messaging platforms are too complex to capture all their differences and
gotchas on just a few pages. Having said that, you will need to get to know your
tools before you can make really well-informed choices. In this section, I only
mentioned the most popular messaging platforms, but there are more message
brokers out there to choose from. I believe messaging is still an undervalued
technology and it is worth getting to know more platforms. I recommend starting
the process by reading about RabbitMQ12 and ActiveMQ,25 as well as a fantastic
paper on Kafka.w52
Introduction to Event-Driven Architecture
We have gone a long way since the beginning of this chapter, but there is
one more exciting concept I would like to introduce, which is event-driven
architecture (EDA). In this section I will explain the core difference between the
traditional programming model and EDA. I will also present some of its benefits
and how you can use it within a larger non-EDA system.
First of all, to understand EDA, you need to stop thinking about software in
terms of requests and responses. Instead, you have to think about components
announcing things that have already happened. This subtle difference in the
way you think about interactions has a profound impact on the architecture and
scalability. Let’s start off slowly by defining some basic terms and comparing how
EDA is different from the traditional request/response model.
Event-driven architecture (EDA) is an architecture style where most
interactions between different components are realized by announcing
events that have already happened instead of requesting work to
be done. On the consumer side, EDA is about responding to events
that have happened somewhere in the system or outside of it. EDA
consumers do not behave as services; they do not do things for others.
They just react to things happening elsewhere.
An event is an object or a message that indicates something has
happened. For example, an event could be announced or emitted
whenever an order in an online store has been placed. In such case, an
event would probably contain information about the buyer and items
purchased. An event is an entity holding the data necessary to describe
what has happened. It does not have any logic; instead, it is best to think
of an event as a piece of data describing something that has happened in the
real world or within the application.
So far the difference between EDA and messaging can still be quite blurry. Let’s
have a closer look at the differences between the following interaction patterns:
request/response, messaging, and EDA.
Request/Response Interaction
This is the traditional model, resembling the synchronous method or function
invocation in traditional programming languages like C or Java. A caller sends
a request and waits for the receiver to process the message and return with a
response. I described this model in detail earlier in this chapter, so we won’t go
into more detail here. The important things to remember are that the caller has
to be able to locate the receiver, it has to know the receiver’s contract, and it is
temporally coupled to the receiver.
Temporal coupling is another term for synchronous invocation and means
that the caller cannot continue without the response from the receiver. This
dependency on the receiver to finish its work is where coupling comes
from. In other words, the weakest link in the entire call stack dictates the
overall latency. (You can read more about temporal coupling.w10,31)
In the case of request/response interactions, the contract includes the location
of the service, the definition of the request message, and the definition of the
response message. Clients of the service need to know at least this much to be
able to use the service. Knowing things about the service implies coupling, as we discussed in Chapter 2—the more you need to know about a component, the stronger your coupling to it.
Direct Worker Queue Interaction
In this interaction model, the caller publishes messages into the queue or a
topic for consumers to react to. Even though this is much more similar to the
event-driven model, it still leaves opportunities for closer coupling. In this
model, the caller would usually send a message to a queue named something like
OrderProcessingQueue, indicating that the caller knows what needs to be done
next (an order needs to be processed).
The good side of this approach is that it is asynchronous and there is no
temporal coupling between the producer and consumer. Unfortunately, it usually
happens that the producer knows something about the consumer and that the
message sent to the queue is still a request to do something. If the producer
knows what has to be done, it is still coupled to the service doing the actual
work—it may not be coupled by the contract, but it is still coupled logically.
In the case of queue-based interaction, the contract consists of the queue
location, the definition of the message sent to the queue, and quite often, the
expectation about the result of the message being processed. As I already
mentioned, there is no temporal coupling and since we are not expecting a
response, we also reduce the contract’s scope because the response message is not
part of it any more.
Event-Based Interaction
Finally, we get to the event-driven interaction model, where the event publisher
has no idea about any consumers being present. The event publisher creates an
instance of an event, for example, NewOrderCreated, and announces it to the
event-driven framework. The framework can use an ESB, it can be a built-in
component, or it can even use a messaging broker like RabbitMQ. The important
thing is that events can be published without having to know their destination.
Event publishers do not care who reacts or how they react to events.
By its nature, all event-driven interactions are asynchronous, and it is assumed
that the publisher continues without needing to know anything about consumers.
The main advantage of this approach is that you can achieve a very high level of
decoupling. Your producers and consumers do not have to know each other. Since
the event-driven framework wires consumers and producers together, producers
do not need to know where to publish their event—they just announce them. On
the other hand, consumers do not need to know how to get to the events they are
interested in either—they just declare which types of events they are interested in,
and the event-driven framework is responsible for routing them to the consumer.
It is worth pointing out that the contract between producer and consumers
is reduced to just the event message definition. There are no endpoints, so there
is no need to know their locations. Also, since the publisher does not expect
responses, the contract does not include them either. All that publishers and
consumers have in common is the format and meaning of the event message.
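To illustrate the shape of this interaction, here is a toy in-process event bus; in a real system the routing would be handled by a message broker or an ESB, and the event type and handlers are invented for the example.

```python
from collections import defaultdict

# A toy in-process event-driven framework; in a real system this role would be
# played by a message broker or an ESB.
subscribers = defaultdict(list)

def subscribe(event_type, handler):
    # Consumers declare which event types they are interested in.
    subscribers[event_type].append(handler)

def announce(event_type, event):
    # The publisher has no idea who (if anyone) will react to the event.
    for handler in subscribers[event_type]:
        handler(event)

# Consumers never call the publisher; they only react to events.
subscribe("NewOrderCreated", lambda e: print("send confirmation e-mail", e))
subscribe("NewOrderCreated", lambda e: print("update sales statistics", e))

# The publisher just announces what has already happened.
announce("NewOrderCreated", {"order_id": 1234, "total": 99.99})
```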
To visualize it better, let’s consider two more diagrams. Figure 7-24 shows
how the client and service are coupled to each other in the request/response
interaction model. It shows all the pieces of information that the client and
service need to share to be able to work together. The total size of the contract is
called the contract surface area.
Figure 7-24 Coupling surface area between the service and its clients (the client depends on the service's location, its request and response definitions, its side effects, and its response time; the service is constrained by each of these in return)
Contract surface area is a measurement of coupling. The more information components need to know about each other to collaborate, the higher the surface area. The term comes from diagrams and UML modeling: the more lines you have between two components, the stronger the coupling.
In the request/response interaction model, clients are coupled to the service in many ways. They need to be able to locate the service and understand its messages. The contract of the service includes both request and response messages.
The client is also coupled temporally, as it has to wait for the service to respond.
Finally, clients often assume a lot about the service’s methods. For example,
clients of the createUser service method could assume that a user object gets
created somewhere in the service’s database.
On the other side of the contract, the service does not have an easy job adapting to changing business needs, as it needs to keep the contract intact. The service is coupled to its clients by every action it has ever exposed and by every piece of information ever included in its request or response messages. The service is also responsible for supporting the agreed-upon SLA (service level agreement), which means responding quickly and not going offline too often. Finally, the service is constrained by the way it is exposed to its clients, which may prevent you from partitioning the service into smaller services to scale better.
In comparison, Figure 7-25 shows EDA interactions. We can see that many
coupling factors are removed and that the overall coupling surface area is much
smaller. Components do not have to know much about each other, and the only
point of coupling is the event definition itself. Both the publisher and consumer
have to establish a shared understanding of the event type body and its meaning.
In addition, the event consumer may be constrained by the event message,
because if certain data was not included in the event definition, the consumer
may need to consult a shared source of truth, or it may not have access to a piece
of information at all.
In a pure EDA, all the interactions are based on events. This leads to an interesting conclusion: if all of the interactions are asynchronous and all the
interactions are carried out using events, you could use events to re-create the
state of the entire system by simply replaying events. This is exactly what event
sourcing allows us to do.L6–L7,24
Figure 7-25 Coupling surface area between EDA components (the only shared dependency is the event message format and meaning; the consumer is additionally constrained by the data included in the event definition)
Event sourcing is a technique where every change to the application
state is persisted in the form of an event. Events are usually stored on
disk in the form of event log files or some data store. At the same time,
an application is built of event consumers, which process events passed
to them. As a result, you can restore the system to an old state (for
example, using a daily snapshot) and replay events to reach the same
end state.
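A minimal event-sourcing sketch, with an in-memory log and invented event types, might look like this: every change is appended as an event, and the state can be rebuilt at any time by replaying the log.

```python
# Minimal event-sourcing sketch: every state change is an event appended to a
# log, and the current state can always be rebuilt by replaying that log.
event_log = []

def apply_event(state, event):
    # Pure function: given a state and an event, produce the next state.
    if event["type"] == "PriceSet":
        state[event["sku"]] = event["price"]
    elif event["type"] == "ProductRemoved":
        state.pop(event["sku"], None)
    return state

def record(event):
    event_log.append(event)            # persist first (here: in memory)...
    apply_event(current_state, event)  # ...then update the live state

current_state = {}
record({"type": "PriceSet", "sku": "sku-123", "price": 55.00})
record({"type": "PriceSet", "sku": "sku-123", "price": 60.00})

# Recovery after a crash: start from an empty (or snapshot) state and replay.
recovered = {}
for event in event_log:
    apply_event(recovered, event)
assert recovered == current_state
```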
I have seen EDA with event sourcing in action handling 150,000 concurrently
connected clients performing transactions with financial ramifications. If
there was ever a crash, the entire system could be recovered to the most recent
consistent state by replaying the event log. It also allowed engineers to copy
the event log and debug live issues in the development environment by simply
replaying the event logs. It was a very cool sight.
In fact, asynchronous replication of distributed systems is often done in a similar way. For example, in MySQL replication, every data modification is recorded in the binary log right after the change is made on the master server. Since all state changes are in the binary log, the state of
the slave replica server can be synchronized by replaying the binary log.16 The
only difference is that consumers of these events are replicating slaves. Having
all events persisted in a log means that you can add a new event consumer and
process historical events, so it would look like it was running from the beginning
of time.
The important limitation of event sourcing is the need for a centralized state
and event log. To be able to reconstruct the state of the application based on
event log alone, you need to be processing them in the same order. You could
say that you need to assume a Newtonian perception of time with an absolute
ordering of events and a global “now.” Unfortunately, in distributed systems
that are spanning the globe, it becomes much harder because events may be
happening simultaneously on different servers in different parts of the world.
You can read more about the complexity of event sourcing and reasoning about
time,L7,39 but for simplicity, you can just remember that event sourcing requires
sequential processing of all events.
Whether you use event sourcing or not, you can still benefit from EDA and
you can benefit from it even in pre-existing systems. If you are building a new
application from scratch, you have more freedom of choice regarding which parts
should be developed in EDA style, but even if you are maintaining or extending
an existing application, there are many cases where EDA will come in handy. The
only trick is to start thinking of the software in terms of events. If you want to add
new functionality and existing components do not have to know the results of the
operation, you have a candidate for an event-driven workflow.
For example, you could develop a core of your online shopping cart in a
traditional way and then extend it by publishing events from the core of the
system. By publishing events, you would not make the core depend on external
components, you would not jeopardize its availability or responsiveness, yet
you could add new features by adding new event consumers later on. The
EDA approach would also let you scale out, as you could host different event
consumers on different servers.
Summary
We covered a lot of material in this chapter, discussing asynchronous processing,
messaging, different brokers, and EDA. To cover these topics in depth would
warrant a book dedicated to each. Our discussion here has been simple and fairly
high level. The subject matter is quite different from the traditional programming
model, but it is really worth learning. The important thing to remember is that
messaging, EDA, and asynchronous processing are just tools. They can be great
when applied to the right problem, but they can also be a nightmare to work with
when forced into the wrong place.
You should come away from this chapter with a better understanding of the
value of asynchronous processing in the context of scalability and having gained
enough background to explore these topics on your own. All of the concepts
presented in this chapter are quite simple and there is nothing to be intimidated
by, but it can take some time before you feel that you fully understand the
reasoning behind them. Different ways of explaining the same thing may work
better for different people, so I strongly encourage you to read more on the
subjects. I recommend reading a few books31–32,24–27,12 and articles.L6,w10–w11
Asynchronous processing is still an underinvested area. High-profile players like
VMware (RabbitMQ, Spring AMQP), LinkedIn (Kafka), and Twitter (Storm) are
entering the stage. Platforms like Erlang and Node.js are also gaining popularity
because distributed systems are built differently now. Monolithic enterprise
servers with distributed transactions, locking, and synchronous processing seem
to be fading into the past. We are moving into an era of lightweight, innovative,
and highly parallel technologies, and startups should be investing in these types of
solutions. EDA and asynchronous processing are going through their renaissance,
and they are most likely going to become even more popular, so learning about
them now is a good investment for every engineer.
Chapter 8: Searching for Data
Structuring your data, indexing it efficiently, and being able to perform more
complex searches over it is a serious challenge. As the size of your data
set grows from gigabytes to terabytes, it becomes increasingly difficult to
find the data you are looking for efficiently. Any time you read, update, delete, or
even insert new data, your applications and data stores need to perform searches
to be able to locate the right rows (or data structures) that need to be read and
written. To be able to understand better how to search through billions of records
efficiently, you first need to get familiar with how indexes work.
Introduction to Indexing
Being able to index data efficiently is a critical skill when working with scalable
websites. Even if you do not intend to be an expert in this field, you need to have
a basic understanding of how indexes and searching work to be able to work with
ever-growing data sets.
Let’s consider an example to explain how indexes and searching work. Let’s say
that you had personal data of a billion users and you needed to search through
it quickly (I use a billion records to make scalability issues more apparent here,
but you will face similar problems on smaller data sets as well). If the data set
contained first names, last names, e-mail addresses, gender, date of birth, and an
account number (user ID), in such a case your data could look similar to Table 8-1.
If your data was not indexed in any way, you would not be able to quickly find
users based on any criteria. The only way to find a user would be to scan the entire
data set, row by row. If you had a billion users and wanted to check if a particular
e-mail address was in your database, you would need to perform up to a billion
comparisons. In the worst-case scenario, when a user was not in your data set, you
would need to perform one billion comparisons (as you cannot be sure that user is
not there until you check all of the rows). It would also take you, on average, half a billion comparisons to find a user that exists in your database, because some users would live closer to the beginning and others closer to the end of the data set.

User ID  First Name  Last Name  E-mail             Gender  Date of Birth
135      John        Doe        [email protected]    Male    10/23/86
70       Richard     Roe        [email protected]    Male    02/18/75
260      Marry       Moe        [email protected]    Female  01/15/74
…        …           …          …                  …       …

Table 8-1 Sample of Test User Data Set
A full table scan is often the term used for this type of search, as you need
to scan the entire data set to find the row that you are looking for. As you can
imagine, that type of search is expensive. You need to load all of the data from
disk into memory to be able to perform comparisons and check if the row at hand
is the one you are looking for. A full table scan is pretty much the worst-case
scenario when it comes to searching, as it has O(n) cost.
Big O notation is a way to compare algorithms and estimate their
cost. In simple terms, Big O notation tells you how the amount of work
changes as the size of the input changes. Imagine that n is the number
of rows in your data set (the size) and the Big O notation expression
estimates the cost of executing your algorithm over that data set. When
you see the expression O(n), it means that doubling the size of the data
set roughly doubles the cost of the algorithm execution. When you see
the expression O(n^2), it means that as your data set doubles in size,
the cost grows quadratically (much faster than linear).
Because a full table scan has a linear cost, it is not an efficient way to search large
data sets. A common way to speed up searching is to create an index on the data that
you are going to search upon. For example, if you wanted to search for users based
on their e-mail address, you would create an index on the e-mail address field.
In a simplified way, you can think of an index as a lookup data structure, just
like a book index. To build a book index, you sort terms (keywords) in alphabetic
order and map each of them to a page number. When readers want to find pages
referring to a particular term, they can quickly locate the term in the index
and find page numbers that they should look at. Figure 8-1 shows how data is
structured in a book index.
There are two important properties of an index:
▶▶ An index is structured and sorted in a specific way, optimized for particular
types of searches. For example, a book index can answer questions like
“What pages refer to the term sharding?” but it cannot answer questions
like “What pages refer to more than one term?” Although both questions
refer to locations of terms in the book, a book index is not optimized to answer
the second question efficiently.
▶▶ The data set is reduced in size because the index is much smaller in size than the overall body of text so that the index can be loaded and processed faster. A 400-page book may have an index of just a few pages. That makes searching for terms faster, as there is less content to search through.

Figure 8-1 Book index structure (terms are sorted alphabetically; each term maps to a sorted list of page numbers)
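As a toy illustration of this structure, the following sketch builds a tiny "book index" mapping each term to a sorted list of page numbers (the page contents are made up); looking a term up then becomes a dictionary hit instead of scanning every page.

```python
# A toy "book index" with the same shape as Figure 8-1: each term maps to a
# sorted list of the pages it appears on.
pages = {
    1: "replication and redundancy",
    2: "sharding and replication",
    3: "dns and caching",
}

index = {}
for page_number, text in sorted(pages.items()):
    for term in set(text.split()):
        index.setdefault(term, []).append(page_number)

print(index["replication"])  # [1, 2]
```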
The reason why most indexes are sorted is that searching through a sorted data
set can be performed much faster than through an unsorted one. A good example
of a simple and fast searching algorithm is the binary search algorithm. When
using a binary search algorithm, you don’t scan the data set from the beginning to
the end, but you “jump around,” skipping large numbers of items. The algorithm
takes a range of sorted values and performs four simple steps until the value is
found:
1. You look at the middle item of the data set to see if the value is equal to, greater than, or smaller than what you are searching for.
2. If the value is equal, you found the item you were looking for.
3. If the value is greater than what you are looking for, you continue searching
through the smaller items. You go back to step 1 with the data set reduced
by half.
4. If the value is smaller than what you are looking for, you continue searching
through the larger items. You go back to step 1 with the data set reduced
by half.
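A minimal implementation of these four steps might look like the following sketch, run here against the same list of numbers used in Figure 8-2.

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or None if it is absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        middle = (low + high) // 2           # look at the middle item
        if sorted_items[middle] == target:
            return middle                    # found it
        elif sorted_items[middle] > target:
            high = middle - 1                # keep searching the smaller half
        else:
            low = middle + 1                 # keep searching the larger half
    return None

numbers = [11, 14, 19, 21, 26, 28, 30, 33, 34, 41, 46, 48, 52, 53, 55,
           56, 61, 62, 67, 70, 75, 79, 86, 88, 93]
print(binary_search(numbers, 75))  # 20 -- found in a handful of comparisons
```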
Figure 8-2 Searching for number 75 using binary search (you inspect the middle element and decide whether to look to the left or to the right; with every such decision you divide the remaining list in half)
Figure 8-2 shows how binary search works on a sequence of sorted numbers.
As you can see, you did not have to investigate all of the numbers.
The brilliance of searching using this method is that with every comparison
you reduce the number of items left by half. This in turn allows you to narrow
down your search rapidly. If you had a billion user IDs, you would only need to
perform, on average, 30 comparisons to find what you are looking for! If you
remember, a full table scan would take, on average, half a billion comparisons to
locate a row. The binary search algorithm has a Big O notation cost of O(log n),
which is much lower than the O(n) cost of a full table scan.2
It is worth getting familiar with Big O notation, as applying the right algorithms
and data structures becomes more important as your data set grows. Some of the
most common Big O notation expressions are O(n^2), O(n*log(n)), O(n), O(log(n)),
and O(1). Figure 8-3 shows a comparison of these curves, with the horizontal
axis being the data set size and the vertical axis showing the relative computation
cost. As you can see, the computational complexity of O(n^2) grows very rapidly,
causing even small data sets to become an issue. On the other hand, O(log(n))
grows so slowly that you can barely notice it on the graph. In comparison to the
other curves, O(log(n)) looks more like a constant time O(1) than anything else,
making it very effective for large data sets.
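As a rough, back-of-the-envelope sketch (not a benchmark), you can compare how many operations each of these complexity classes implies for a data set of one billion items:

import math

n = 1_000_000_000                                  # one billion items
print("O(1):       1")
print(f"O(log n):   {math.log2(n):.0f}")           # about 30
print(f"O(n):       {n:,}")
print(f"O(n log n): {n * math.log2(n):,.0f}")
print(f"O(n^2):     {n * n:,}")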
Indexes are great for searching, but unfortunately, they add some overhead.
Maintaining indexes requires you to keep additional data structures with sorted
lists of items, and as the data set grows, these data structures can become large
and costly. In this example, indexing 1 billion user IDs could grow to a monstrous
16GB of data. Assuming that you used 64-bit integers, you would need to store
8 bytes for the user ID and 8 bytes for the data offset for every single user. At such
scale, adding indexes needs to be well thought out and planned, as having too
many indexes can cost you a great amount of memory and I/O (the data stored
in the index needs to be read from the disk and written to it as well).
Figure 8-3 Big O notation curves (the X axis represents the size of the data set and the Y axis represents relative computational complexity; the curves shown are O(1), O(log(n)), O(n), O(n*log(n)), and O(n^2))
To make it even more expensive, indexing text fields like e-mail addresses takes
more space because the data being indexed is “longer” than 8 bytes. On average,
e-mail addresses are around 20 bytes long, making indexes even larger.
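These size estimates are simple arithmetic. A quick sketch, assuming 8-byte integers, 8-byte row offsets, and e-mail addresses of roughly 20 bytes, as in the text:

rows = 1_000_000_000                             # one billion users
offset_bytes = 8                                 # pointer to the row location

id_index_bytes = rows * (8 + offset_bytes)       # 8-byte user ID per entry
email_index_bytes = rows * (20 + offset_bytes)   # ~20-byte e-mail address per entry

print(f"user ID index: {id_index_bytes / 10**9:.0f} GB")    # ~16 GB
print(f"e-mail index:  {email_index_bytes / 10**9:.0f} GB")  # ~28 GB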
Considering that indexes add overhead, it is important to know what data is
worth indexing and what is not. To make these decisions, you need to look at the
queries that you intend to perform on your data and the cardinality of each field.
Cardinality is the number of unique values stored in a particular field.
Fields with high cardinality are good candidates for indexes, as they
allow you to reduce the data set to a very small number of rows.
To explain better how to estimate cardinality, let’s take a look at the example
data set again. The following are all of the fields with estimated cardinality:
▶▶ gender In most databases, there would be only two genders available,
giving us very low cardinality (cardinality ~ 2). Although you can find
databases with more genders (like transsexual male), the overall cardinality
would still be very low (a few dozen at best).
▶▶ date of birth Assuming that your users were mostly under 80 years old
and over 10 years old, you end up with up to 25,000 unique dates (cardinality
~ 25000). Although 25,000 dates seems like a lot, you will still end up with
tens or hundreds of thousands of users born on each day, assuming that
distribution of users is not equal and you have more 20-year-old users than
70-year-old ones.
▶▶ first name Depending on the mixture of origins, you might have tens of
thousands of unique first names (cardinality ~ tens of thousands).
▶▶ last name This is similar to first names (cardinality ~ tens of thousands).
▶▶ email address If e-mail addresses were used to uniquely identify accounts
in your system, you would have cardinality equal to the total number of
rows (cardinality = 1 billion). Even if you did not enforce e-mail address
uniqueness, they would have few duplicates, giving you a very high cardinality.
▶▶ user id Since user IDs are unique, the cardinality would also be 1 billion
(cardinality = 1 billion).
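Estimating cardinality is just counting distinct values per field. Here is a minimal Python sketch over an in-memory sample (the field names mirror the example, the sample data is made up, and in practice you would run the equivalent of a COUNT(DISTINCT field) query against your data store):

users = [
    {"user_id": 1, "gender": "F", "date_of_birth": "1981-12-29",
     "first_name": "Ann", "last_name": "Lee", "email": "ann@example.org"},
    {"user_id": 2, "gender": "M", "date_of_birth": "1975-03-13",
     "first_name": "Sam", "last_name": "Lee", "email": "sam@example.org"},
    # ...in reality this would be the full billion-row data set
]

for field in ("gender", "date_of_birth", "first_name", "last_name", "email", "user_id"):
    cardinality = len({user[field] for user in users})
    print(f"{field}: cardinality ~ {cardinality}")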
The reason why low-cardinality fields are bad candidates for indexes is that they
do not narrow down the search enough. After traversing the index, you still have a
lot of rows left to inspect. Figure 8-4 shows two indexes visualized as sorted lists.
The first index contains user IDs and the location of each row in the data set. The
second index contains the gender of each user and reference to the data set.
Both of the indexes shown in Figure 8-4 are sorted and optimized for different
types of searches. Index A is optimized to search for users based on their user ID,
and index B is optimized to search for users based on their gender.
The key point here is that searching for a match on the indexed field is fast,
as you can skip large numbers of rows, just as the binary search algorithm
does. As soon as you find a match, though, you can no longer narrow down
the search efficiently. All you can do is inspect all remaining rows, one by one.
In this example, when you find a match using index A, you get a single item; in
comparison, when you find a match using index B, you get half a billion items.
Figure 8-4 Field cardinality and index efficiency (each index entry is a pair of an indexed value and a pointer to the location of the data, where the entire row is stored; index A is sorted by user ID, so you can find id = 234443 very quickly and are left with a single pointer, meaning only a single row needs to be processed; index B is sorted by gender, so you can find all males very quickly, but once you find the section of the index containing males you are left with an enormous number of pointers, and all of those rows need to be fetched from the data file and processed)
HINT
The first rule of thumb when creating indexes on a data set is that the higher the cardinality, the
better the index performance.
The cardinality of a field is not the only thing that affects index performance.
Another important factor is the item distribution. If you were indexing a field
where some values were present in just a single item and others were present in
millions of records, then performance of the index would vary greatly depending
on which value you look for. For example, if you indexed a date of birth field, you
would likely end up with a bell curve distribution of users. You may have a
single user born on October 4, 1923, and a million users born on October 4, 1993.
In this case, searching for users born on October 4, 1923, will narrow down the
search to a single row. Searching for users born on October 4, 1993, will result in
a million items left to inspect and process, making the index less efficient.
HINT
The second rule of thumb when creating indexes is that equal distribution leads to better index
performance.
Luckily, indexing a single field and ending up with a million items is not the
only thing you can do. Even when cardinality or distribution of values on a single
field is not great, you can create indexes that have more than one field, allowing
you to use the second field to narrow down your search further.
A compound index, also known as a composite index, is an index
that contains more than one field. You can use compound indexes to
increase search efficiency where cardinality or distribution of values of
individual fields is not good enough.
If you use compound indexes, in addition to deciding which fields to index,
you need to decide in what order they should be indexed. When you create a
compound index, you effectively create a sorted list ordered by multiple columns.
It is just as if you sorted data by multiple columns in Excel or Google Docs.
Depending on the order of columns in the compound index, the sorting of data
changes. Figure 8-5 shows two indexes: index A (indexing first name, last name,
and date of birth) and index B (indexing last name, first name, and date of birth).
Figure 8-5 Ordering of columns in a compound index (index A is sorted by first name, then last name, then date of birth; index B is sorted by last name, then first name, then date of birth; both indexes contain the same entries and pointers to the data, but the sort order of the rows differs)
The most important thing to understand here is that indexes A and B are
optimized for different types of queries. They are similar, but they are not equal,
and you need to know exactly what types of searches you are going to perform to
choose which one is better for your application.
Using index A, you can efficiently answer the following queries:
▶▶ get all users where first name = Gary
▶▶ get all users where first name = Gary and last name = Lee
▶▶ get all users where first name = Gary and last name = Lee and
date of birth = March 28, 1986
Using index B, you can efficiently answer the following queries:
▶▶ get all users where last name = Lee
▶▶ get all users where last name = Lee and first name = Gary
▶▶ get all users where last name = Lee and first name = Gary and
date of birth = March 28, 1986
As you might have noticed, queries 2 and 3 in both cases can be executed
efficiently using either one of the indexes. The order of matching values would be
different in each case, but it would result in the same number of rows being found
and both indexes would likely have comparable performance.
To make it more interesting, although both indexes A and B contain date of
birth, it is impossible to efficiently search for users born on April 8, 1984, without
knowing their first and last names. To be able to search through index A, you
need to have a first name that you want to look for. Similarly, if you want to search
through index B, you need to have the user’s last name. Only when you know
the exact value of the leftmost column can you narrow down your search by
providing additional information for the second and third columns.
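Because a compound index behaves like a list of tuples sorted by the leftmost columns first, you can mimic index A in a few lines of Python; the sample entries are taken from Figure 8-5 and the helper function is illustrative only:

import bisect

# Index A: entries sorted by (first name, last name, date of birth) -> row pointer
index_a = sorted([
    ("GARY", "GARCIA", "1984-04-08", 6099663),
    ("GARY", "LEE", "1986-03-28", 3480085),
    ("GARY", "LEE", "1986-09-02", 6941149),
    ("JOHN", "JACKSON", "1983-12-02", 7639679),
])

def find(first_name, last_name=None):
    # Binary-search for the first entry >= the search prefix, then scan forward
    # while the leading columns still match.
    start = bisect.bisect_left(index_a, (first_name, last_name or ""))
    results = []
    for entry in index_a[start:]:
        if entry[0] != first_name:
            break
        if last_name is not None and entry[1] != last_name:
            break
        results.append(entry)
    return results

print(find("GARY"))          # efficient: the leftmost column is known
print(find("GARY", "LEE"))   # efficient: a prefix of the index is known
# Finding everyone born on 1984-04-08 cannot use this structure at all;
# date of birth is not a leftmost prefix, so a full scan would be required.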
Understanding the indexing basics presented in this section is absolutely
critical to being able to design and implement scalable web applications. In
particular, if you want to use NoSQL data stores, you need to stop thinking of
data as if it were stored in tables and think of it as if it were stored in indexes. Let’s
explore this idea in more detail and see how you can optimize your data model for
fast access despite large data size.
Modeling Data
When you use NoSQL data stores, you need to get used to thinking of data as if it
were an index.
The main challenge when designing and building the data layer of a scalable
web application is identifying access patterns and modeling your data based on
these access patterns. Data normalization and simple rules of thumb learned
from relational databases are not enough when working with terabytes of data.
Huge data sets and the technical limitations of data stores will force you to design
your data model much more carefully and consider use cases over the data
relationships.
To be able to scale your data layer, you need to analyze your access patterns
and use cases, select a data store, and then design the data model. To make it
more challenging, you need to keep the data model as flexible as possible to allow
for future extensions. At the same time, you want to optimize it for fast access to
keep up with the growth of the data size.
These two forces often conflict, as optimizing the data model usually reduces
the flexibility; conversely, increasing flexibility often leads to worse performance
and scalability. In the following subsections we will discuss some NoSQL
modeling techniques and concrete NoSQL data model examples to better explain
how it is done in practice and what tradeoffs you need to prepare yourself for.
Let’s start by looking at NoSQL data modeling.
NoSQL Data Modeling
If you used relational databases before, you are likely familiar with the process of
data modeling. When designing a relational database schema, you would usually
start by looking at the data itself. You would ask yourself, “What is the data that I
need to store?” You would then go through all of the bits of information that need
to be persisted and isolate entities (database tables). You would then decide which
pieces of information should be stored in which table. You would also create
relationships between tables using foreign keys. You would then iterate over
the schema design, trying to reduce the amount of redundant data and circular
relationships.
As a result of this process, you would usually end up with a normalized and
flexible database schema that could be used to answer almost any type of question
using SQL queries. You would usually finish this process without thinking much
about particular features or what feature would need to execute what types of
queries. Your schema would be designed mainly based on the data itself, not
queries or use cases. Later on, as you implement your application and new types
of queries are needed, you would create new indexes to speed up these queries,
but the data schema would usually remain unchanged, as it would be flexible
enough to handle any type of query.
Unfortunately, that process of design and normalization focused on data does
not work when applied to NoSQL data stores. NoSQL data stores trade data model
flexibility (and ability to perform joins) for scalability, so you need to find a different
approach.
To be able to model data in NoSQL data stores and access it efficiently, you
need to change the way you design your schema. Rather than starting with data in
mind, you need to start with queries in mind. I would argue that designing a data
model in the NoSQL world is more difficult than it is in the relational database
world. Once you optimize your data model for particular types of queries, you
usually lose the ability to perform other types of queries. Designing a NoSQL data
model is much more about tradeoffs and data layout optimization than it is about
normalization.
When designing a data model for a NoSQL data store, you want to identify all
the queries and access patterns first. Only once you understand how your data
will be accessed can you move on to identifying key pieces of data and looking for
ways to structure it so that you could execute all of your query types efficiently.
For example, if you were designing an e-commerce website using a relational
database, you might not think much about how data would be queried. You might
decide to have separate tables for products, product categories, and product
reviews. Figure 8-6 shows how your data model might look.
If you were designing the same type of e-commerce website and you had to
support millions of products with millions of user actions per hour, you might
decide to use a NoSQL data store to help you scale the system. You would then
have to model your data around your main queries rather than using a generic
normalized model.
For example, if your most important queries were to load a product page
to display all of the product-related metadata like name, image URL, price,
categories it belongs to, and average user ranking, you might optimize your data
model for this use case rather than keeping it normalized. Figure 8-7 shows an
alternative data model with example documents in each of the collections.
By grouping most of the data into the product entity, you would be able to
request all of that data with a single document access. You would not need to join
tables, query multiple servers, or traverse multiple indexes. Rendering a product
page could then be achieved by a single index lookup and fetching of a single
document.
Figure 8-6 Relational data model (normalized tables: products with id, name, price, and description; categories with id, name, and parent_category; users with id, first_name, and last_name; product_reviews with product_id, user_id, rating, headline, and comment; and a product_categories join table linking products to categories by product_id and category_id)
Example document in the collection of products:
{
  "id": 6329103,
  "name": "Digital wall clock",
  "price": 59.95,
  "description": "...",
  "thumbnail": "http://example.org/img/6329103.jpg",
  "categories": ["clocks", "kitchen", "electronics"],
  "categoryIds": [4123, 53452, 342],
  "avgRating": 3.75,
  "recentComments": [
    {
      "id": 6523123,
      "userId": 4234,
      "userName": "Sam",
      "rating": 5,
      "comment": "That is the coolest clock I ever had."
    }
  ]
}
Example document in the collection of users:
{
  "id": 4234,
  "userName": "Sam",
  "email": "[email protected]",
  "yearOfBirth": 1981
}
Example document in the collection of product reviews:
{
  "id": 6523123,
  "product": {
    "id": 6329103,
    "name": "Digital wall clock",
    "price": 59.95,
    "thumbnail": "http://example.org/img/6329103.jpg"
  },
  "user": {
    "id": 4234,
    "userName": "Sam"
  },
  "rating": 5,
  "comment": "That is the coolest clock I ever had."
}
Figure 8-7 Nonrelational data model
Depending on the data store used, you might also shard data based
on the product ID so that queries regarding different products could be sent to
different servers, increasing your overall capacity.
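For instance, with a document store like MongoDB (one of the document-oriented stores discussed later in this chapter), rendering the product page becomes a single indexed lookup. A minimal sketch, assuming a local MongoDB instance, illustrative database and collection names, and documents keyed by the product ID:

from pymongo import MongoClient  # assumes the pymongo driver is installed

client = MongoClient("mongodb://localhost:27017")
products = client.ecommerce.products

# One indexed lookup returns the whole denormalized product document,
# including embedded category names, average rating, and recent comments.
product = products.find_one({"_id": 6329103})
if product:
    print(product["name"], product["price"], product["categories"])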
There are considerable benefits and drawbacks of data denormalization and
modeling with queries in mind. Your main benefit is performance and ability to
efficiently access data despite a huge data set. By using a single index and a single
“table,” you minimize the number of disk operations, reducing the I/O pressure,
which is usually the main bottleneck in the data layer.
On the other hand, denormalization introduces data redundancy. In this
example, in the normalized model (with SQL tables), category names live in a
separate table and each product is joined to its categories by a product_categories
table. This way, category metadata is stored once and product metadata is
stored once (product_categories contains references only). In the denormalized
approach (NoSQL-like), each product has a list of category names embedded.
Categories do not exist by themselves—they are part of product metadata. That
leads to data redundancy and, what is more important here, makes updating data
much more difficult. If you needed to change a product category name, you would
need to update all of the products that belong to that category, as category names
are stored within each product. That could be extremely costly, especially if you
did not have an index allowing you to find all products belonging to a particular
category. In such a scenario, you would need to perform a full table scan and
inspect all of the products just to update a category name.
HINT
Flexibility is one of the most important attributes of good architecture. To quote Robert C. Martin
again, “Good architecture maximizes the number of decisions not made.” By denormalizing
data and optimizing for certain access patterns, you are making a tradeoff. You sacrifice some
flexibility for the sake of performance and scalability. It is critical to be aware of these tradeoffs
and make them very carefully.
As you can see, denormalization is a double-edged sword. It helps us optimize
and scale, but it can be restricting and it can make future changes much more
difficult. It can also easily lead to a situation where there is no efficient way to
search for data and you need to perform costly full table scans. It can also lead
to situations where you need to build additional “tables” and add even more
redundancy to be able to access data efficiently.
Regardless of the drawbacks, data modeling focused on access patterns and use
cases is what you need to get used to if you decide to use NoSQL data stores. As
mentioned in Chapter 5, NoSQL data stores are more specialized than relational
database engines and they require different data models. In general, NoSQL data
stores do not support joins and data has to be grouped and indexed based on the
access patterns rather than based on the meaning of the data itself.
Although NoSQL data stores are evolving very fast and there are dozens of
open-source projects out there, the most commonly used NoSQL data stores can
be broadly categorized based on their data model into three categories:
▶▶ Key-value data stores These data stores support only the most simplistic
access patterns. To access data, you need to provide the key under which
data was stored. Key-value stores have a limited programming interface—
basically all you can do is set or get objects based on their key (a short sketch of this interface follows the list). Key-value
stores usually do not support any indexes or sorting (other than the
primary key). At the same time, they have the least complexity and they
can implement automatic sharding based on the key, as each value is
independent and the only way to access values is by providing their keys.
They are good for fast one-to-one lookups, but they are impractical when
you need sorted lists of objects or when you need to model relationships
between objects. Examples of key-value stores are Dynamo and Riak.
Memcached is also a form of a key-value data store, but it does not persist
data, which makes it more of a key-value cache than a data store. Another
data store that is sometimes used as a key-value store is Redis, but it has
more to offer than just key-value mappings.
▶▶ Wide columnar data stores These data stores allow you to model data as
if it was a compound index. Data modeling is still a challenge, as it is quite
different from relational databases, but it is much more practical because
you can build sorted lists. There is no concept of a join, so denormalization
is a standard practice, but in return wide columnar stores scale very well.
They usually provide data partitioning and horizontal scalability out of the
box. They are a good choice for huge data sets like user-generated content,
event streams, and sensory data. Examples of wide columnar data stores are
BigTable, Cassandra, and HBase.
▶▶ Document-oriented data stores These data stores allow more complex
objects to be stored and indexed by the data store. Document-based data
stores use a concept of a document as the most basic building block in their
data model. Documents are data structures that can contain arrays, maps,
and nested structures just as a JSON or XML document would. Although
documents have flexible schemas (you can add and remove fields at will on a
per-document basis), document data stores usually allow for more complex
indexes to be added to collections of documents. Document stores usually
offer a fairly rich data model, and they are a good use case for systems
where data is difficult to fit into a predefined schema (where it is hard to
create a SQL-like normalized model) and at the same time where scalability
is required. Examples of document-oriented data stores are MongoDB,
CouchDB, and Couchbase.
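For example, the entire programming interface of a key-value store boils down to setting and getting values by key, as in the following sketch, which uses Redis through the redis-py client purely as a key-value store (the key name and the serialized value are illustrative):

import json
import redis  # assumes the redis-py client is installed

store = redis.Redis(host="localhost", port=6379)

# The only access path is the key; there are no secondary indexes or sorting.
store.set("user:4234", json.dumps({"userName": "Sam", "yearOfBirth": 1981}))

raw = store.get("user:4234")
print(json.loads(raw) if raw else None)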
There are other types of NoSQL data stores like graph databases and object
stores, but they are still much less popular. Going into more detail about each
of these data store types is outside the scope of this book, especially because
the NoSQL data store market is very fragmented, with each of the data stores
evolving in a slightly different direction to satisfy specialized niche needs.
Instead of trying to cover all of the possible data stores, let’s have a look at a
couple of data model examples to see how NoSQL modeling works in practice.
Wide Column Storage Example
Consider an example where you needed to build an online auction website similar
in concept to eBay. If you were to design the data model using the relational
database approach, you would begin by looking for entities and normalize the
model. As I mentioned before, in the NoSQL world, you need to start by looking
at what queries you are going to perform, not just what data you are going to store.
Let’s say that you had the following list of use cases that need to be satisfied:
1. Users need to be able to sign up and log in.
2. Logged-in users can view auction details by viewing the item auction page.
3. The item auction page contains product information like title, description,
and images.
4. The item auction page also includes a list of bids with the names of users who
placed them and the prices they offered.
5. Users need to be able to change their user name.
6. Users can view the history of bids of other users by viewing their profile pages.
The user profile page contains details about the user like name and reputation
score.
7. The user profile page shows a list of items that the user placed bids on. Each
bid displays the name of the item, a link to the item auction page, and a price
that the user offered.
8. Your system needs to support hundreds of millions of users and tens of
millions of products with millions of bids each.
After looking at the use cases, you might decide to use a wide columnar data
store like Cassandra. By using Cassandra, you can leverage its high availability
and automated horizontal scalability. You just need to find a good way to model
these use cases to make sure that you can satisfy the business needs.
The Cassandra data model is often represented as a table with an unlimited
number of rows and a nearly unlimited number of arbitrary columns, where each
row can have different columns, and column names can be made up on the spot
(there is no table definition or strict schema and columns are dynamically created
as you add fields to the row). Figure 8-8 shows how the Cassandra table is usually
illustrated.
Each row has a row key, which is a primary key and at the same time a sharding
key of the table. The row key is a string—it uniquely identifies a single row and
it is automatically indexed by Cassandra. Rows are distributed among different
servers based on the row key, so all of the data that needs to be accessed together
in a single read needs to be stored within a single row. Figure 8-8 also shows
that rows are indexed based on the row key and columns are indexed based on a
column name.
The way Cassandra organizes and sorts data in tables is similar to the way
compound indexes work. Any time you want to access data, you need to provide
a row key and then column name, as both of these are indexed.
Figure 8-8 Fragments of two rows in a Cassandra table (rows are sorted based on the row key, here the user's e-mail address, which must uniquely identify the row, and a table can have an unlimited number of rows; within each row, columns are sorted by column name, each row can have different columns, and a row can have millions of columns)
Because columns
are stored in sorted order, you can perform fast scans on column names to
retrieve neighboring columns. Since every row lives on its own and there is no
table schema definition, there is no way to efficiently select multiple rows based
on a particular column value.
You could visualize a Cassandra table as if it was a compound index. Figure 8-9
shows how you could define this model in a relational database like MySQL. The
index would contain a row key as the first field and then column name as the second
field. Values would contain actual values from Cassandra fields so that you would
not need to load data from another location.
As I mentioned before, indexes are optimized for access based on the fields
that are indexed. When you want to access data based on a row key and a column
name, Cassandra can locate the data quickly, as it traverses an index and does not
need to perform any table scans. The problem with this approach is that queries
that do not look for a particular row key and column may be inefficient because
they require expensive scans.
HINT
Many NoSQL data modeling techniques can be boiled down to building compound indexes so that
data can be located efficiently. As a result, queries that can use the index perform very well, but
queries that do not use the index require a full table scan.
Figure 8-9 Cassandra table represented as if it were a compound index (each row has as many index entries as it has columns; the row key is the first field of the index, the column name is the second, and the value holds the actual column value, so you can quickly locate any row, scan its columns in alphabetical order, or find a column by name; for example, finding the password of a particular user is fast no matter how many rows and columns you have)
Going back to the example of an online auction website, you could model users
in Cassandra by creating a users table. You could use the user's e-mail address (or
user name) as a row key of the users table so that you could find users efficiently
when they are logging in to the system. To make sure you always have the user’s
row key, you could then store it in the user’s HTTP session or encrypted cookies
for the duration of the visit.
You would then store user attributes like first name, last name, phone number,
and hashed password in separate columns of the users table. Since there is no
predefined column schema in Cassandra, some users might have additional
columns like billing address, shipping address, or contact preference settings.
HINT
Any time you want to query the users table, you should do it for a particular user to avoid
expensive table scans. As I mentioned before, Cassandra tables behave like compound indexes.
The row key is the first field of that “index,” so you always need to provide it to be able to
search for data efficiently. You can also provide column names or column name ranges to find
individual attributes. The column name is the second field of the “compound index” and providing
it improves search speed even further.
In a similar way, you could model auction items. Each item would be uniquely
identified by a row key as its ID. Columns would represent item attributes like
title, image URLs, description, and classification. Figure 8-10 shows how both
the users table and items table might look. By having these two tables, you can
efficiently find any item or user by their row key.
To satisfy more use cases, you would also need to store information about
which users placed bids on which items. To be able to execute all of these queries
efficiently, you would need to store this information in a way that is optimized for
two access patterns:
▶▶ Get all bids of a particular item (use case 4)
▶▶ Get all bids of a particular user (use case 7)
To allow these access patterns, you need to create two additional “indexes”: one
indexing bids by item ID and the second one indexing bids by user ID. As of this
writing Cassandra does not allow you to index selected columns, so you need to
create two additional Cassandra tables to provide these indexes. You could create
one table called item_bids to store bids per item and a second table called user_
bids to store bids per user.
Figure 8-10 User and item tables (an example row from a users table uses the e-mail address as the row key, with columns such as dob, homeAddress, password, and postCode; an example row from an items table uses the item ID, for example 345632, as the row key, with columns such as classifications, description, image1, image2, and title)
Alternatively, you could use another feature of Cassandra called column
families to avoid creating additional tables. By using column families, you would
still end up with denormalized and duplicated data, so for simplicity’s sake I
decided to use separate tables in this example. Any time a user places a bid, your
web application needs to write to both of these data structures to keep both of
these “indexes” in sync. Luckily, Cassandra is optimized for writes and writing to
multiple tables is not a concern from a scalability or performance point of view.
Figure 8-11 shows how these two tables might look. If you take a closer look at
the user_bids table, you may notice that column names contain timestamps and
item IDs. By using this trick, you can store bids sorted by time and display them
on the user’s profile page in chronological order.
By storing data in this way you are able to write into these tables very
efficiently. Any time you need to place a bid, you would serialize bid data and
simply issue two commands:
▶▶ set data under column named “$time|$item_id” for a row “$user_email” in
table user_bids
▶▶ set data under column named “$time|$user_email” for a row “$item_id” in
table item_bids
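In Python, the two commands could look like the following sketch; set_column() is a hypothetical helper standing in for whatever write call your Cassandra client provides, and the sample values are illustrative:

import json
import time

def set_column(table, row_key, column_name, value):
    # Hypothetical helper: writes a single column value under
    # (table, row key, column name) using your Cassandra client of choice.
    print(f"SET {table}[{row_key}][{column_name}] = {value[:40]}...")

def place_bid(user_email, item_id, bid):
    timestamp = int(time.time())
    value = json.dumps(bid)
    # The same serialized bid is written into both "indexes"; issuing the same
    # SET command more than once leaves the data in the same state.
    set_column("user_bids", user_email, f"{timestamp}|{item_id}", value)
    set_column("item_bids", str(item_id), f"{timestamp}|{user_email}", value)

place_bid("sam@example.org", 345632,
          {"product": {"id": 345632, "name": "Digital wall clock", "price": 59.95},
           "user": {"email": "sam@example.org", "name": "Sam"},
           "timestamp": int(time.time())})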
Figure 8-11 Tables storing bids based on item and user (in the user_bids table the row key is the user's e-mail address and each column name has the form "$time|$item_id", with the serialized bid stored as the column value and older bids of that user following in sorted order; in the item_bids table the row key is the item ID and each column name has the form "$time|$user_email", with older bids of that item following in sorted order)
Cassandra is an eventually consistent store, so issuing writes this way ensures
that you never miss any writes. Even if some writes are delayed, they still end up
in the same order on all servers. No matter how long it takes for such a command
to reach each server, data is always inserted into the right place and stored in the
same order. In addition, order of command execution on each server becomes
irrelevant, and you can issue the same command multiple times without affecting
the end result (making these commands idempotent).
It is also worth noting here that bid data would be denormalized and
redundant, as shown in Listing 8-1. You would set the same data in both user_
bids and item_bids tables. Serialized bid data would contain enough information
about the product and the bidding user so that you would not need to fetch
additional values from other tables to render bids on the user’s profile page or
item detail pages. This data denormalization would allow you to render an item
page with a single query on the item table and a single column scan on the item_
bids table. In a similar way, you could render the user’s profile page by a single
query to the users table and a single column scan on the user_bids table.
Listing 8-1 Serialized bid data stored in column values
{
"product": {
"id": 345632,
"name": "Digital wall clock",
"price": 59.95,
"thumbnail": "http://example.org/img/6329103.jpg"
},
"user": {
"email": "[email protected]",
"name": "Sam",
"avatar": "http://example.org/img/fe6e3424rwe.jpg"
},
"timestamp": 1418879899
}
Once you think your data model is complete, it is critical to validate it against
the list of known use cases. This way, you can ensure that your data model can,
in fact, support all of the access patterns necessary. In this example, you could go
over the following use cases:
▶▶ To create an account and log in (use case 1), you would use e-mail address
as a row key to locate the user’s row efficiently. In the same way, you could
detect whether an account for a given e-mail address exists or not.
▶▶ Loading the item auction page (use cases 2, 3, and 4) would be performed
by looking up the item by ID and then loading the most recent bids from the
item_bids table. Cassandra allows fetching multiple columns starting from
any selected column name, so bids could be loaded in chronological order.
Each item bid contains all the data needed to render the page fragment and
no further queries are necessary.
▶▶ Loading the user page (use cases 6 and 7) would work in a similar way. You
would fetch user metadata from the users table based on the e-mail address
and then fetch the most recent bids from the user_bids table.
▶▶ Updating the user name is an interesting use case (use case 5), as user
names are stored in all of their bids in both user_bids and item_bids tables.
Updating the user name would have to be an offline process because it
requires much more data manipulation. Any time a user decides to update
his or her user name, you would need to add a job to a queue and defer
execution to an offline process. You would be able to find all of the bids
made by the user using the user_bids table. You would then need to load
each of these bids, unserialize them, change the embedded user name, and
save them back. By loading each bid from the user_bids table, you would
also find its timestamp and item ID. That, in turn, would allow you to issue
an additional SET command to overwrite the same bid metadata in the
item_bids table.
▶▶ Storing billions of bids, hundreds of millions of users, and millions of
items (use case 8) would be possible because of Cassandra's auto-sharding
functionality and careful selection of row keys. By using user ID and an item
ID as row keys, you are able to partition data into small chunks and distribute
it evenly among a large number of servers. No auction item would receive
more than a million bids and no user would have more than thousands or
hundreds of thousands of bids. This way, data could be partitioned and
distributed efficiently among as many servers as was necessary.
There are a few more tradeoffs that are worth pointing out here. By structuring
data in a form of a “compound index,” you gain the ability to answer certain types
of queries very quickly. By denormalizing the bid’s data, you gain performance
and help scalability. By serializing all the bid data and saving it as a single value,
you avoid joins, as all the data needed to render bid page fragments are present in
the serialized bid object.
On the other hand, denormalization of a bid’s data makes it much more
difficult and time consuming to make changes to redundant data. By structuring
data as if it were an index, you optimize for certain types of queries. This, in turn,
makes some types of queries perform exceptionally well, but all others become
prohibitively inefficient.
Finding a flexible data model that supports all known access patterns and
provides maximal flexibility is the real challenge of NoSQL. For example, using
the data model presented in this example, you cannot efficiently find items with
the highest number of bids or the highest price. There is no “index” that would
allow you to efficiently find this data, so you would need to perform a full table
scan to get these answers. To make things worse, there is no easy way to add an
index to a Cassandra table. You would need to denormalize it further by adding
new columns or tables.
An alternative way to deal with the NoSQL indexing challenge is to use a dedicated
search engine for more complex queries rather than trying to satisfy all use cases with
a single data store. Let’s now have a quick look at search engines and see how they can
complement a data layer of a scalable web application.
Search Engines
Nearly every web application needs to perform complex search queries nowadays.
For example, e-commerce platforms need to allow users to search for products
based on arbitrary combinations of criteria like category, price range, brand,
availability, or location. To make things even more difficult, users can also
search for arbitrary words or phrases and apply sorting according to their own
preferences.
Whether you use relational databases or NoSQL data stores, searching through
large data sets with such flexibility is going to be a significant challenge even if
you apply the best modeling practices and use the best technologies available on
the market.
Allowing users to perform such wide ranges of queries requires either building
dozens of indexes optimized for different types of searches or using a dedicated
search engine. Before deciding whether you need a dedicated search engine, let’s
start with a quick introduction to search engines to understand better what they
do and how they do it.
Introduction to Search Engines
You can think of search engines as data stores specializing in searching through
text and other data types. As a result, they make different types of tradeoffs
than relational databases or NoSQL data stores do. For example, consistency
and write performance may be much less important to them than being able to
perform complex searches very fast. They may also have different needs when it
comes to memory consumption and I/O throughput as they optimize for specific
interaction patterns.
Before you begin using dedicated search engines, it is worth understanding
how full text search itself works. The core concept behind full text search and
modern search engines is an inverted index.
An inverted index is a type of index that allows you to search for
phrases or individual words (full text search).
The types of indexes that we discussed so far required you to search for an
exact value match or for a value prefix. For example, if you built an index on a text
field containing movie titles, you could efficiently find rows with a title equal to
“It’s a Wonderful Life.” Some index types would also allow you to efficiently search
for all titles starting with a prefix “It’s a Wonderful,” but they would not let you
search for individual words in a text field. If your user typed in “Wonderful Life,”
he or she would not find the “It’s a Wonderful Life” record unless you used a full
text search (an inverted index). Using an inverted index allows you to search for
any of the words contained in the text field, regardless of their location and order.
For example, you could search for “Life Wonderful” or “It’s a Life” and still find
the “It’s a Wonderful Life” record.
When you index a piece of text like “The Silence of the Lambs” using an
inverted index, it is first broken down into tokens (like words). Then each of
the tokens can be preprocessed to improve search performance. For example,
all words may be lowercased, plural forms changed to singular, and duplicates
removed from the list. As a result, you may end up with a smaller list of unique
tokens like “the,” “silence,” “of,” “lamb.”
Once you extract all the tokens, you then use them as if they were keywords in
a book index. Rather than adding a movie title in its entirety into the index, you
add each word independently with a pointer to the document that contained it.
Figure 8-12 shows the structure of a simplistic inverted index.
As shown in Figure 8-12, document IDs next to each token are in sorted order
to allow a fast search within a list and, more importantly, merging of lists. Any
time you want to find documents containing particular words, you first find these
words in the dictionary and then merge their posting lists.
The indexed documents (document ID and document body) are: 1 The Silence of the Lambs; 2 All About Eve; 3 All About My Mother; 4 The Angry Silence; 5 The Godfather; 6 The Godfather 2. The indexing process produces an inverted index consisting of a dictionary and posting lists of document IDs: 2 → 6; about → 2, 3; all → 2, 3; angry → 4; eve → 2; godfather → 5, 6; lamb → 1; mother → 3; my → 3; of → 1; silence → 1, 4; the → 1, 4, 5, 6. Words in the dictionary are stored in sorted order, so finding all documents with the word "godfather" can be done very quickly. Document IDs in posting lists are stored in sorted order, which allows fast merging.
Figure 8-12 Inverted index structure
HINT
The structure of an inverted index looks just like a book index. It is a sorted list of words (tokens)
and each word points to a sorted list of page numbers (document IDs).
Searching for a phrase (“silence” AND “lamb”) requires you to merge posting
lists of these two words by finding an intersection. Searching for words (“silence”
OR “lamb”) requires you to merge two lists by finding a union. In both cases,
merging can be performed efficiently because lists of document IDs are stored
in sorted order. Searching for phrases (AND queries) is slightly less expensive, as
you can skip more document IDs and the resulting merged list is usually shorter
than in the case of OR queries. In both cases, though, searching is still expensive
and carries an O(n) time complexity (where n is the length of posting lists).
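The following toy Python sketch builds an inverted index for the titles from Figure 8-12 and merges posting lists for AND and OR queries; the tokenization (lowercasing and naive plural stripping) is a crude stand-in for what a real engine does, and Python sets are used for brevity where a search engine would merge the sorted lists directly:

from collections import defaultdict

documents = {
    1: "The Silence of the Lambs",
    2: "All About Eve",
    3: "All About My Mother",
    4: "The Angry Silence",
    5: "The Godfather",
    6: "The Godfather 2",
}

# Build the inverted index: token -> sorted posting list of document IDs.
inverted_index = defaultdict(list)
for doc_id in sorted(documents):
    tokens = {word.lower().rstrip("s") for word in documents[doc_id].split()}
    for token in tokens:
        inverted_index[token].append(doc_id)

def search_and(*words):
    # AND query: intersect the posting lists.
    postings = [set(inverted_index.get(word, [])) for word in words]
    return sorted(set.intersection(*postings)) if postings else []

def search_or(*words):
    # OR query: union of the posting lists.
    return sorted(set().union(*(inverted_index.get(word, []) for word in words)))

print(search_and("silence", "lamb"))  # [1]
print(search_or("silence", "lamb"))   # [1, 4]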
Understanding how an inverted index works may help to understand why OR
conditions are especially expensive in a full text search and why search engines
need so much memory to operate. With millions of documents, each containing
thousands of words, an inverted index grows in size faster than a normal index
would because each word in each document must be indexed.
Understanding how different types of indexes work will also help you design
more efficient NoSQL data models, as NoSQL data modeling is closer to
designing indexes than designing relational schemas. In fact, Cassandra was
initially used at Facebook to implement an inverted index and allow searching
through the messages inbox.w27 Having said that, I would not recommend
implementing a full text search engine from scratch, as it would be very
expensive. Instead, I would recommend using a general-purpose search engine
as an additional piece of your infrastructure. Let’s have a quick look at a common
search engine integration pattern.
Using a Dedicated Search Engine
As I mentioned before, search engines are data stores specializing in searching.
They are especially good at full text searching, but they also allow you to index
other data types and perform complex search queries efficiently. Any time you
need to build a rich search functionality over a large data set, you should consider
using a search engine.
A good place to see how complex searching features can become is to look at
used car listings websites. Some of these websites have hundreds of thousands
of cars for sale at a time, which forces them to provide much more advanced
searching criteria (otherwise, users would be flooded with irrelevant offers). As
a result, you can find advanced search forms with literally dozens of fields. You
can search for anything from free text, make, model, min/max price, min/max
mileage, fuel type, transmission type, horsepower, and color to accessories like
electric mirrors and heated seats. To make things even more complicated, once
you execute your search, you want to display facets to users to allow them to
narrow down their search even further by selecting additional filters rather than
having to start from scratch.
Complex search functionality like this is where dedicated search engines really
shine. Rather than having to implement complex and inefficient solutions yourself
in your application, you are much better off by using a search engine. There are
a few popular search engines out there: search engines as a service, like Amazon
CloudSearch and Azure Search, and open-source products, like Elasticsearch,
Solr, and Sphinx.
If you decide to use a hosted service, you benefit significantly from not having
to operate, scale, and manage these components yourself. Search engines,
especially the cutting-edge ones, can be quite difficult to scale and operate
in production, unless you have engineers experienced with this particular
technology. You may sacrifice some flexibility and some of the edge-case features,
but you reduce your time to market and the complexity of your operations.
Going into the details of how to configure, scale, and operate search engines
is beyond the scope of this book, but let’s have a quick look at how you could
integrate with one of them. For example, if you decided to use Elasticsearch as
a search engine for your used car sales website, you would need to deploy it in
your data center and index all of your cars in it. Indexing cars using Elasticsearch
would be quite simple since Elasticsearch does not require any predefined
schema. You would simply need to generate JSON documents for each car and
post them to be indexed by Elasticsearch. In addition, to keep the search index
in sync with your primary data store, you would need to refresh the documents
indexed in Elasticsearch any time car metadata changes.
HINT
A common pattern for indexing data in a search engine is to use a job queue (especially since
search engines are near real time anyway). Anytime anyone modifies car metadata, they submit
an asynchronous message for this particular car to be reindexed. At a later stage, a queue worker
picks up the message from the queue, builds up the JSON document with all the information, and
posts to the search engine to overwrite previous data.
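Here is a minimal sketch of such a queue worker's reindexing step, posting a car document to Elasticsearch over its REST API using the requests library (the index name, document fields, and URL are illustrative, and error handling is omitted):

import requests

ELASTICSEARCH_URL = "http://localhost:9200"

def reindex_car(car):
    # Build the JSON document and overwrite whatever was indexed previously
    # for this car ID.
    doc = {
        "make": car["make"],
        "model": car["model"],
        "price": car["price"],
        "mileage": car["mileage"],
        "description": car["description"],
    }
    response = requests.put(
        f"{ELASTICSEARCH_URL}/cars/_doc/{car['id']}", json=doc)
    response.raise_for_status()

# A queue worker would call reindex_car(car) for every "car metadata changed"
# message it picks up from the queue.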
Figure 8-13 shows how a search engine deployment could look. All of the searches
would be executed by the search engine. Search results could then be enriched
by real-time data coming from the main data store (if it was absolutely necessary).