Hadoop:
Spark has become very popular. YARN is also a good option.
A job computes something, does some work, and then goes away: the classic MapReduce batch model.
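To make the "run once, then go away" shape concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases as a word count. It is not Hadoop code; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`) and the sample input are made up for illustration.

```python
from collections import defaultdict

def map_phase(documents):
    """Emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def shuffle_phase(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

def run_job(documents):
    # The job runs once over a fixed input and then terminates.
    return reduce_phase(shuffle_phase(map_phase(documents)))

counts = run_job(["spark and yarn", "spark and hadoop"])
```

The key property is that the input is fixed when the job starts. Anything that arrives later waits for the next run.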
We want answers as soon as possible. Back in the day, Google would crawl the web every now and then and update its index. That does not work for breaking news, for example, because the index won’t be up to date!
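The difference between a periodic rebuild and a continuously updated index can be sketched with a toy inverted index. This is only an illustration of the idea, not how any real search engine works; `build_index`, `update_index`, and the example URLs are hypothetical.

```python
from collections import defaultdict

def build_index(pages):
    """Batch style: rebuild the whole inverted index from a snapshot of pages."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def update_index(index, url, text):
    """Streaming style: fold one newly arrived page into the existing index."""
    for word in text.lower().split():
        index[word].add(url)

pages = {"a.example/old": "weather report"}
index = build_index(pages)  # result of the last periodic batch run

# A breaking story arrives after the batch run and is searchable immediately.
update_index(index, "a.example/news", "breaking earthquake report")
```

This sketch only adds postings. A real incremental index would also have to remove stale entries when a page changes or disappears.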
There are new ways to solve these problems. As an example, the speaker presented an architecture with 4 streams.
Streaming:
SMACK:
Spark: The core is Apache Spark. On top of it, there are 4 components arranged horizontally:
Streaming Tradeoffs:
Akka: For big data streams, built on the JVM. Think of it as a stream, a data flow. Good for:
Spark good for:
Flink good for:
Kafka streams:
Mesos is analogous to YARN. Each framework provides its own scheduler. Mesos offers resources to a framework, and the framework can either accept or refuse them.
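The offer model described above can be sketched as a toy simulation. This is not the Mesos API: `Offer`, `FrameworkScheduler`, `offer_round`, and the resource numbers are all hypothetical, and the sketch assumes each framework wants exactly one offer and takes it whole.

```python
from dataclasses import dataclass, field

@dataclass
class Offer:
    agent: str
    cpus: float
    mem_mb: int

@dataclass
class FrameworkScheduler:
    """Hypothetical framework-side scheduler; not the Mesos API."""
    name: str
    need_cpus: float
    need_mem_mb: int
    accepted: list = field(default_factory=list)

    def on_offer(self, offer):
        # The framework, not the master, decides whether the offer fits.
        fits = offer.cpus >= self.need_cpus and offer.mem_mb >= self.need_mem_mb
        if not self.accepted and fits:
            self.accepted.append(offer)
            return True   # accept
        return False      # refuse; the master can offer it to another framework

def offer_round(offers, schedulers):
    """Toy master: present each offer to the frameworks in turn until one accepts."""
    placements = {}
    for offer in offers:
        for sched in schedulers:
            if sched.on_offer(offer):
                placements[offer.agent] = sched.name
                break
    return placements

offers = [Offer("agent-1", cpus=1.0, mem_mb=512),
          Offer("agent-2", cpus=4.0, mem_mb=8192)]
spark = FrameworkScheduler("spark", need_cpus=2.0, need_mem_mb=4096)
web = FrameworkScheduler("web", need_cpus=0.5, need_mem_mb=256)
placements = offer_round(offers, [spark, web])
```

Here the first offer is too small for the `spark` framework, so it refuses and the offer goes to `web`. The scheduling decision lives in the framework, and the master only hands out offers.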
Spark, Cassandra, and Kafka actually work very well together.
Kafka has been in production at Netflix for 8 years. Millions of messages go through its Kafka queues, and it’s pretty stable.
Routing everything through Kafka still creates a single point of failure, BUT Kafka is a distributed, highly available system, so we can trust it. We can point our services and internet traffic at Kafka, and it will in turn route messages to the appropriate services.
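The routing idea can be illustrated with a small in-memory stand-in for a broker. This is deliberately not a Kafka client: `ToyBroker`, the topic names, and the handlers are hypothetical. It also pushes messages to handlers, whereas real Kafka consumers pull from a replicated, partitioned log.

```python
from collections import defaultdict

class ToyBroker:
    """In-memory stand-in for a broker; NOT a Kafka client and not durable."""

    def __init__(self):
        self.logs = defaultdict(list)         # topic -> append-only message log
        self.subscribers = defaultdict(list)  # topic -> handler callables

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Producers only know the topic name, never the consuming services.
        self.logs[topic].append(message)
        for handler in self.subscribers[topic]:
            handler(message)

broker = ToyBroker()
billing_seen, search_seen = [], []
broker.subscribe("orders", billing_seen.append)
broker.subscribe("queries", search_seen.append)

broker.publish("orders", {"id": 1, "total": 30})
broker.publish("queries", {"q": "kafka"})
```

Because producers and consumers only share a topic name, services can be added or replaced without touching the senders.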
Fast data and microservices: will they converge?
Synergies: they have similar designs. Each stream or microservice does one thing. Both encourage asynchronous communication, and both need to run for a long time with high availability.
Example, Twitter: it used to be a 3-tier architecture where data was an afterthought. As Twitter grew, it became more like a set of streaming pipelines (data first): a request goes through different services, the results eventually join, a callback is called, and the response is sent back to the browser.
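The fan-out, join, and callback flow described above can be sketched with `asyncio`. This does not reflect Twitter's actual services: `fetch_timeline`, `fetch_recommendations`, `handle_request`, and `render` are hypothetical names, and the sleeps stand in for network calls.

```python
import asyncio

async def fetch_timeline(user):
    await asyncio.sleep(0.01)  # stands in for a call to a downstream service
    return [f"tweet for {user}"]

async def fetch_recommendations(user):
    await asyncio.sleep(0.01)
    return [f"follow suggestion for {user}"]

async def handle_request(user, on_done):
    # Fan out to independent services, then join their results.
    timeline, recs = await asyncio.gather(
        fetch_timeline(user), fetch_recommendations(user)
    )
    # The callback builds the response that goes back to the browser.
    return on_done({"timeline": timeline, "recommendations": recs})

def render(payload):
    tweets = len(payload["timeline"])
    suggestions = len(payload["recommendations"])
    return f"{tweets} tweets, {suggestions} suggestions"

response = asyncio.run(handle_request("alice", render))
```

The two service calls run concurrently, so the request takes roughly as long as the slowest one, not the sum of both.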