Paxata Proves Business Value to Customers With Spark-Optimized Self-Service Data Prep Platform Built for Scale

Paxata Demonstrates Spark in Action at Spark Summit Conference

SAN FRANCISCO, June 15, 2015 (GLOBE NEWSWIRE) -- Paxata, provider of the first interactive, self-service Adaptive Data Preparation™ solution at scale, today announced continued market success of its high-performance platform, thanks to ongoing adoption and innovation around Apache Spark v1.3. Spark-optimized capabilities within the platform, which have been generally available since October 2014, are being demonstrated at the Spark Summit Conference, taking place in San Francisco on June 15, 2015.

“The entire enterprise landscape is dramatically shifting with disruptive technologies which are fundamentally changing the cost-to-computational performance ratio,” said Prakash Nanduri, Co-Founder and CEO of Paxata. “A year and a half ago, we recognized how data preparation enabled by Spark could deliver transformational business value with unprecedented economics, which is why we made the commitment to develop our entire solution from the ground up on Apache Spark while doubling down on being part of the Hadoop ecosystem. For the past six months, all of our customers, whether using our solution on-premises or in the Amazon Web Services cloud, have benefitted from that decision with the ability to prepare data interactively in an elastic scale-up-and-out manner at an unprecedented cost-to-performance ratio.”

“Del Monte insists on only adopting technologies which help us achieve the greatest business impact,” said Timothy Weaver, CIO of Del Monte. “There is a big difference between claiming to ‘work on Spark’ and actually delivering a solution designed to fully exploit the processing power available in Spark. We have been impressed with Paxata’s development efforts, as they have delivered a truly optimized self-service data prep solution for our business that scales elastically in cloud environments like AWS, all of which allows us to get greater value from our investment.”

“In 2013, Cloudera made a commitment to provide the leading open source platform for Apache Spark applications. Part of that commitment has been cultivating the industry's largest partner ecosystem around Spark. As Paxata’s technology stack has matured, their adoption and innovations on top of Spark to deliver interactive self-service data prep at scale help Cloudera customers achieve the greatest value from big data and, ultimately, support the success of information-driven enterprises,” said Charles Zedlewski, VP of Products at Cloudera.

“There is no question that Apache Spark is disrupting the computing paradigm,” said Dave Brewster, Co-Founder and CTO at Paxata. “The rapid enhancements driven by the open source community combined with our efforts to advance Spark functionality through ongoing development have paid off with unfathomable scalability and interactive performance of our data preparation Domain-Specific Language (DSL), optimizing compiler, persistent columnar caching, data prep-specific Resilient Distributed Datasets (RDDs) and on-line aggregation operators. When combined with our elastic architecture capabilities, our customers achieve a level of scale and performance that can’t be beat.”

Paxata’s platform, which runs on the Cloudera distribution of Hadoop, features a data preparation engine on Spark v1.3, which has been enhanced with the following new capabilities:

On-line aggregations: All aggregates (average, count, first, last, max, min, median, sum, variance, standard deviation) are now computed in an on-line fashion, which dramatically reduces the amount of memory required by each individual Spark worker while significantly increasing the responsiveness, performance and scalability of the system.

Enhanced Data-Prep Specific RDDs: Enhanced data preparation-specific RDDs for join detection, join execution, clustering, and dynamic filtering continue to extend the computational backend for Paxata’s market-defining IntelliFusion™ capabilities.

Enhanced Persistent Columnar Caching: On each worker node of a Paxata Spark cluster, proprietary on-disk data structures allow for probing data without bringing it all into memory. The columnar format is now optimized for both key-based and sequential access on a per-column basis, which significantly improves scan efficiency for operations that traverse all values of a column, such as aggregations and sorts.

Optimizing Compiler: Paxata’s proprietary optimizing compiler has been significantly enhanced to take advantage of the new on-line aggregations, RDDs, and columnar caching to generate highly efficient pipeline transformation plans that minimize the number of columns touched and data that needs to be shuffled across the cluster. The compiler converts scripts into a naïve abstract syntax tree, which is compiled into an optimized logical plan.
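To illustrate the general idea behind the on-line aggregations described above, the following is a minimal Python sketch, not Paxata's implementation. It assumes Welford's single-pass algorithm for mean and variance and a standard pairwise merge of partial states, so that each worker keeps only a small fixed-size state per column rather than the full set of values. The names RunningStats, update, merge and aggregate are hypothetical, and the median is left out because an exact median generally needs more than constant-size state.

```python
import math
from dataclasses import dataclass, replace
from typing import Iterable, Optional


@dataclass
class RunningStats:
    """Fixed-size aggregation state for one column (hypothetical name)."""
    count: int = 0
    total: float = 0.0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean
    minimum: float = math.inf
    maximum: float = -math.inf
    first: Optional[float] = None
    last: Optional[float] = None

    def update(self, x: float) -> None:
        """Fold one value into the state (Welford's update)."""
        if self.count == 0:
            self.first = x
        self.last = x
        self.count += 1
        self.total += x
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        """Combine two partial states; self is assumed to come first."""
        if other.count == 0:
            return replace(self)
        if self.count == 0:
            return replace(other)
        n = self.count + other.count
        delta = other.mean - self.mean
        return RunningStats(
            count=n,
            total=self.total + other.total,
            mean=self.mean + delta * other.count / n,
            m2=self.m2 + other.m2 + delta * delta * self.count * other.count / n,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
            first=self.first,
            last=other.last,
        )

    @property
    def variance(self) -> float:
        """Population variance; NaN when no values have been seen."""
        return self.m2 / self.count if self.count else math.nan

    @property
    def stddev(self) -> float:
        return math.sqrt(self.variance) if self.count else math.nan


def aggregate(values: Iterable[float]) -> RunningStats:
    """Aggregate one partition in a single pass without storing the values."""
    stats = RunningStats()
    for value in values:
        stats.update(value)
    return stats
```

In this sketch, each partition of a column could be reduced with aggregate and the partial results combined with merge, so memory per worker stays constant regardless of how many rows are scanned. Whether Paxata's engine uses this particular algorithm is not stated in the announcement.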

For more details, visit the Paxata booth at the Spark Summit 2015 or visit

About Paxata

Paxata delivers the first interactive, self-service Adaptive Data Preparation solution for Enterprise business analysts, data scientists, developers, data curators, and IT teams.  Information-driven organizations use Paxata to accelerate the integration, cleansing, and enrichment of raw data into rich, analytic-ready AnswerSets™ which power ad hoc, operational, predictive and packaged analytics. Paxata partners with industry-leading companies such as Amazon Web Services (AWS) and Cloudera, and seamlessly connects to BI tools, including Salesforce Wave, Tableau, Qlik and Microsoft Excel to greatly accelerate the time to actionable business insights. For more information on pricing and availability, please visit
