Welcome to the first
Tachyon meetup! This will be a chance to learn about Tachyon from the developers, hear about other peoples’ experiences with Tachyon, network, and get to know future development plans.
Thanks to Yahoo! for hosting this event and providing food and drinks.
Abstract:
Memory is the key to fast Big Data processing. This has been realized by many, and frameworks such as Spark and Shark already leverage memory performance. As data sets continue to grow, storage is increasingly becoming a critical bottleneck in many workloads.
To address this need, we have developed
Tachyon, a memory centric fault-tolerant distributed file system, which enables reliable file sharing at memory-speed across cluster frameworks, such as Spark and MapReduce. The result of over two years of research, Tachyon achieves memory-speed and fault-tolerance by using memory aggressively and leveraging lineage information. Tachyon caches working set files in memory, and enables different jobs/queries and frameworks to access cached files at memory speed. Thus, Tachyon avoids going to disk to load datasets that are frequently read.
Tachyon is Hadoop compatible. Existing Spark and MapReduce programs can run on top of it without any code changes. Tachyon is the default off-heap option in Spark, which means that RDDs can automatically be stored inside Tachyon to make Spark more resilient and avoid GC overheads. The project is open source and is already deployed at multiple companies. In addition, Tachyon has more than
40 contributors from over 15 institutions, including Yahoo, Intel, Redhat, and Pivotal. The project is the storage layer of the Berkeley Data Analytics Stack (
BDAS) and also part of the
Fedora distribution.
In this meetup, Haoyuan Li will give a overview of the project, including motivation, current status, and its roadmap. In addition, we will have a Tachyon tutorial.
Bio:
Haoyuan Li is a Computer Science Ph.D. candidate in AMPLab at UC Berkeley, and he works with Prof. Scott Shenker and Prof. Ion Stoica on big data and cloud computing. He leads Tachyon, an open source memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks. He is a founding committer of Apache Spark and a co-creator of Spark Streaming. Before Berkeley, he worked at Conviva and Google, where he co-created PFP-Growth algorithm, which is included in Apache Mahout. Haoyuan has a M.S. from Cornell University and a B.S. from Peking University, both in Computer Science.