Distributed Systems
8 Jul 2018
Overview
125 Papers in Big Data
(2018).
Distributed Systems book by Mixu
(2013).
The Datacenter as a Computer
by Barroso, Clidaras and Hölzle (2013).
NoSQL Databases: a Survey and Decision Guidance
by Felix Gessert, 2016.
Scalable SQL and NoSQL Data Stores
by Rick Cattell, 2011.
A survey on platforms for big data analytics
by Singh and Reddy, J Big Data, pages 1-8, 2014.
Eventually Consistent
by Werner Vogels, CACM Vol 52, No 1, 2009.
Exposition of CAP Theorem
by Salome Simon, 2012.
A Critique of the CAP Theorem
2016.
Architectures
Lambda Architectures
by James Kinley.
Kappa Architecture
by Jay Kreps (2014).
Summingbird: A Framework for Integrating Batch and Online MapReduce Computations
(Twitter) by Boykin, Ritchie, O'Connell and Lin, VLDB 2014.
File Systems
The Google File System
by Ghemawat, Gobioff and Leung, SOSP 2003.
The Hadoop Distributed File System
by Shvachko, Kuang, Radia and Chansler, 2010.
Ceph: A Scalable, High-Performance Distributed File System
by Weil, Brandt, Miller and Long, OSDI 2006. See
Ceph as a scalable alternative to the Hadoop Distributed File System
by Maltzahn, Molina-Estolano, Khurana, Nelson, Brandt and Weil, 2010.
Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks
by Li, Ghodsi, Zaharia, Shenker, Stoica, SOCC 2014.
Data Formatting and Storage
Column-Stores vs. Row-Stores: How Different Are They Really?
by Abadi, Madden and Hachem, SIGMOD 2008.
RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems
ICDE 2011.
Apache Parquet
slides from 2015.
ORCFile:
Major Technical Advancements in Apache Hive
2014.
A Survey on Compression Algorithms in Hadoop
by Lovalekar, 2014.
Erasure Codes for Storage Systems: A Brief Primer
by James Plank, 2013.
Column Oriented Stores
Bigtable: A Distributed Storage System for Structured Data
by various authors at Google, OSDI 2006.
Storage Infrastructure Behind Facebook Messages Using HBase at Scale
2012.
Hypertable White Paper
2012.
Document Stores
(CouchDB)
Document oriented Databases
2015.
MongoDB Architecture
Graph Stores
Neo4J
Titan Documentation
Global Stores
Megastore: Providing Scalable, Highly Available Storage for Interactive Services
by Google, CIDR 2011.
Spanner: Google's Globally-Distributed Database
by Google, OSDI 2012.
Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing
by Google, VLDB 2014.
CockroachDB Design
The Snowflake Elastic Data Warehouse
by Snowflake Computing, SIGMOD 2016.
Resource Managers
Apache Hadoop YARN: Yet Another Resource Negotiator
by various authors, SOCC 2013.
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
by UC Berkeley, NSDI 2011.
Schedulers
REEF: Retainable Evaluator Execution Framework
by various authors, SIGMOD 2015.
Survey on Improved Scheduling in Hadoop MapReduce in Cloud Environments
by Rao and Reddy, Intl J of Computer Applications, Vol 34, No 9, Nov 2011.
Coordination
Paxos Made Simple
by Leslie Lamport, 2001. Original paper:
The Part-Time Parliament
by Leslie Lamport, ACM TOCS, Vol 16, No 2, pages 133-169, 1998.
Just say NO to Paxos Overhead: Replacing Consensus with Network Ordering
by U Washington, OSDI 2016.
The Chubby lock service for loosely-coupled distributed systems
by Mike Burrows, OSDI 2006.
ZooKeeper: Wait-free coordination for Internet-scale systems
by Yahoo, USENIX 2010.
(Raft)
In Search of an Understandable Consensus Algorithm
by Ongaro and Ousterhout, 2014.
Computation Frameworks
Fast and Interactive Analytics over Hadoop Data with Spark
by various authors, 2012.
Apache Flink: Stream and Batch Processing in a Single Engine
by various authors, 2015.
Batch Systems
MapReduce: Simplified Data Processing on Large Clusters
by Jeff Dean and Sanjay Ghemawat, OSDI 2004.
Parallel Data Processing with MapReduce: A Survey
by various authors, SIGMOD Record, 2011.
Graph Computation
Pregel: A System for Large-Scale Graph Processing
by various authors, SIGMOD 2010.
One Trillion Edges: Graph Processing at Facebook-Scale
by various authors, VLDB 2015.
GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
by various authors from the AMPLab at UC Berkeley, 2014.
HAMA: An Efficient Matrix Computation with the MapReduce Framework
by various authors, 2010.
An Experimental Comparison of Pregel-like Graph Processing Systems
by various authors, VLDB 2014.
(GraphTau)
Time-Evolving Graph Processing at Scale
by various authors from UC Berkeley, 2016.
Stream Computation
Fast Data Architectures for Streaming Applications
O'Reilly book by Dean Wampler, 2016.
Real Time Analytics: Algorithms and Systems
by Kejariwal, Kulkarni and Ramasamy, Twitter, VLDB 2015 Tutorial.
Twitter Heron: Stream Processing at Scale
by various authors from Twitter and U of Wisconsin at Madison, SIGMOD 2015.
Samza: Stateful Scalable Stream Processing at LinkedIn
by various authors, VLDB 2017.
(Spark Streaming)
Discretized Streams: Fault-Tolerant Streaming Computation at Scale
by various authors at UC Berkeley, SOSP 2013.
Kafka Streams
Interactive Computation
Dremel: Interactive Analysis of Web-Scale Datasets
by Google authors, VLDB 2010.
Impala: A Modern, Open-Source SQL Engine for Hadoop
by various authors from Cloudera, CIDR 2015.
Apache Drill slides
by Tomer Shiran.
Shark: SQL and Analytics with Cost-Based Query Optimization on Coarse-Grained Distributed Memory
by Antonio Lupher, UC Berkeley, 2014.
Shark: SQL and Rich Analytics at Scale
by various authors from the AMPLab, UC Berkeley, SIGMOD 2013.
Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
by various authors from Microsoft Research, EuroSys 2007.
Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications
by various authors from Hortonworks and Microsoft, SIGMOD 2015.
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
by various authors, EuroSys 2013.
Kudu: Storage for Fast Analytics on Fast Data
by various authors from Cloudera, 2015.
Realtime Computation
Druid: A Real-time Analytical Data Store
by various authors, SIGMOD 2014.
Pinot architecture
by LinkedIn.
Data Analysis Computation
Apache Zeppelin
slides.
Pig Latin: A Not-So-Foreign Language for Data Processing
by various authors, SIGMOD 2008.
Hive - A Warehousing Solution Over a Map-Reduce Framework
by various authors from Facebook, VLDB 2009.
Apache Phoenix
slides.
Comparative Study Parallel Join Algorithms for MapReduce environment
by A Pigul, 2012.
Machine Learning Frameworks
MLlib: Scalable Machine Learning on Spark
by Xiangrui Meng, Databricks.
SparkR: Scaling R Programs with Spark
by various authors, SIGMOD 2016.
TensorFlow: A System for Large-Scale Machine Learning
by various authors at Google, OSDI 2016.
SystemML: Declarative Machine Learning on MapReduce
by various authors at IBM Research, ICDE 2011.
MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems
by various authors, 2015.
Data Ingestion
Sqoop User Guide
Kafka: a Distributed Messaging System for Log Processing
by Kreps, Narkhede and Rao, LinkedIn, NetDB 2011.
ETL and Workflow Systems
Apache NiFi Overview
Apache Beam: Uber-API.
FlumeJava: Easy, Efficient Data-Parallel Pipelines
by various authors at Google, PLDI 2010.
MillWheel: Fault-Tolerant Stream Processing at Internet Scale
by various authors at Google, VLDB 2013.
Apache Airflow
documentation.
Apache Crunch
slides.
Apache Falcon (Technical Preview)
by Hortonworks.
Oozie: Workflow Engine for Apache Hadoop
Metadata
Apache Atlas
Ground: A Data Context Service
by various authors, CIDR 2017.
Security
Hadoop Security Design
by various authors at Yahoo, 2009.
Apache Metron
Apache Knox (Technical Preview)
by Hortonworks.
Apache Ranger
slides, 2016.
Apache Sentry
Serialization
Protocol Buffers
Introduction to Avro
by Douglas Creager, 2011.
Monitoring
OpenTSDB
Ambari
Benchmarking
On Big Data Benchmarking
(NDBench)
Netflix Data Benchmark: Benchmarking Cloud Data Stores
2016.
Performance Evaluation of NoSQL Systems Using YCSB in a resource Austere Environment
GridMix
2008.
Misc Links
CS244B at Stanford
taught by David Mazières.
© Copyright 2008—2023, Gurmeet Manku.
gurmeet@gmail.com