

Facebook often uses analytics for data-driven decision making. Over the past few years, user and product growth has pushed our analytics engines to operate on data sets in the tens of terabytes for a single query. Some of our batch analytics is executed through the venerable Hive platform (contributed to Apache Hive by Facebook in 2009) and Corona, our custom MapReduce implementation. Facebook has also continued to grow its Presto footprint for ANSI-SQL queries against several internal data stores, including Hive. We support other types of analytics such as graph processing and machine learning (Apache Giraph) and streaming (e.g., Puma, Swift, and Stylus). While the sum of Facebook's offerings covers a broad spectrum of the analytics space, we continually interact with the open source community in order to share our experiences and also learn from others. Apache Spark was started by Matei Zaharia at UC Berkeley's AMPLab in 2009 and was later contributed to Apache in 2013. It is currently one of the fastest-growing data processing platforms, due to its ability to support streaming, batch, imperative (RDD), declarative (SQL), graph, and machine learning use cases all within the same API and underlying compute engine.

Spark can efficiently leverage larger amounts of memory, optimize code across entire pipelines, and reuse JVMs across tasks for better performance. Recently, we felt Spark had matured to the point where we could compare it with Hive for a number of batch-processing use cases. In the remainder of this article, we describe our experiences and lessons learned while scaling Spark to replace one of our Hive workloads.
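To make the "optimize code across entire pipelines" point concrete, here is a minimal sketch of a declarative Spark pipeline; the table and column names (impressions, entities, clicks, and so on) are illustrative and not part of the pipeline described in this article.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object PipelineOptimizationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("whole-pipeline-optimization-sketch")
      .getOrCreate()

    // Hypothetical inputs; table and column names are illustrative only.
    val impressions = spark.read.table("impressions") // (entity_id, country, clicks)
    val entities    = spark.read.table("entities")    // (entity_id, category)

    // The filter, join, and aggregation below form a single logical plan.
    // Catalyst optimizes across the whole chain (e.g., column pruning and
    // predicate placement), and Spark runs it as one application without
    // persisting intermediate tables between jobs the way a multi-stage
    // Hive/MapReduce pipeline would.
    val ranked = impressions
      .filter(col("country") === "US")
      .join(entities, "entity_id")
      .groupBy("category")
      .agg(sum("clicks").as("total_clicks"))
      .orderBy(desc("total_clicks"))

    ranked.show(20)
    spark.stop()
  }
}
```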

Use case: Feature preparation for entity ranking
Real-time entity ranking is used in a variety of ways at Facebook. For some of these online serving platforms, raw feature values are generated offline with Hive and the data is loaded into its real-time affinity query system.

The Hive-based pipeline building the index took roughly three days to complete. It was also challenging to manage, because the pipeline contained hundreds of sharded jobs that made monitoring difficult. There was no easy way to gauge the overall progress of the pipeline or calculate an ETA. When considering the aforementioned limitations of the existing Hive pipeline, we decided to attempt to build a faster and more manageable pipeline with Spark.
Spark implementation
Debugging at full scale can be slow, challenging, and resource intensive. We started off by converting the most resource-intensive part of the Hive-based pipeline: stage two. We started with a sample of 50 GB of compressed input, then gradually scaled up to 300 GB, 1 TB, and then 20 TB. At each size increment, we resolved performance and stability issues, but experimenting with 20 TB is where we found our largest opportunity for improvement. While running on 20 TB of input, we discovered that we were generating too many output files (each sized around 100 MB) due to the large number of tasks.
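One common way to cap the output file count in a situation like this is to repartition to a fixed number of partitions before the final write. The sketch below assumes a hypothetical stage output path and target file count, and is not necessarily the fix the production pipeline adopted.

```scala
import org.apache.spark.sql.SparkSession

object CompactOutputSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compact-output-sketch")
      .getOrCreate()

    // Hypothetical path; stands in for the output of one pipeline stage.
    val stageOutput = spark.read.parquet("/warehouse/stage_two_output")

    // Each write task emits at least one file, so tens of thousands of tasks
    // produce tens of thousands of ~100 MB files. Repartitioning to a fixed
    // partition count before the write caps the number of files, at the cost
    // of an extra shuffle.
    val targetFiles = 2000 // illustrative; tune to total size / desired file size
    stageOutput
      .repartition(targetFiles)
      .write
      .mode("overwrite")
      .parquet("/warehouse/stage_two_compacted")

    spark.stop()
  }
}
```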
