Learning Path: Data Science With Apache Spark 2
Get started with Spark for large-scale distributed data processing and data science
Description
The real power and value proposition of Apache Spark lies in its speed and in its platform for executing data processing and data science tasks. Sound interesting? Let’s see how easy it is!
Packt’s Video Learning Paths are a series of individual video products put together in a logical and stepwise manner such that each video builds on the skills learned in the video before it.
Spark is one of the most widely used engines for large-scale data processing, and it is known for its speed. Its tools are equally useful to application developers and data scientists. What sets Spark apart is that it combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualization, allowing data scientists to tackle the complexities that come with raw, unstructured datasets.
This Learning Path starts with an introductory tour of Apache Spark 2. We will look at the basics of Spark, introduce SparkR, then look at the charting and plotting features of Python in conjunction with Spark data processing, and finally take a thorough look at Spark's data processing libraries. We then develop a real-world Spark application. Next, we will help you become comfortable and confident working with Spark for data science by exploring Spark’s data science libraries on a dataset of tweets.
The goal of this course is to introduce you to Apache Spark 2 and teach you its data processing and data science libraries, so that you are equipped with the skills required of modern data scientists.
This Learning Path is authored by some of the best in their fields.
Rajanarayanan Thottuvaikkatumana
Rajanarayanan Thottuvaikkatumana, or Raj, is a seasoned technologist with more than 23 years of software development experience at various multinational companies. His experience includes architecting, designing, and developing software applications. He has worked on various technologies including major databases, application development platforms, web technologies, and big data technologies. Currently, he is building a next-generation Hadoop YARN-based data processing platform and an application suite built with Spark using Scala.
Eric Charles
Eric Charles has 10 years’ experience in the field of Data Science and is the founder of Datalayer, a social network for Data Scientists. His typical day includes building efficient processing with advanced machine learning algorithms, easy SQL, streaming and graph analytics. He also focuses a lot on visualization and result sharing. He is passionate about open source and is an active Apache Member. He regularly gives talks to corporate clients and at open source events.
What You Will Learn!
- Get to know the fundamentals of Spark 2.0 and the Spark programming model using Scala and Python
- Know how to use Spark SQL and DataFrames using Scala and Python
- Get an introduction to Spark programming using R
- Develop a complete Spark application
- Obtain and clean data before processing it
- Understand Spark's machine learning algorithms and use them to build a simple pipeline
- Work with interactive visualization packages in Spark
- Apply data mining techniques to the available datasets
- Build a recommendation engine
Who Should Attend!
- Application developers, data scientists, and big data architects interested in harnessing the data processing power of Apache Spark will find this course very useful. As implementations of Apache Spark will be shown with Scala and Python, some programming knowledge of these languages is needed. This course is for anyone who wants to work with Spark on large and complex datasets. Basic knowledge of statistics and computational mathematics is expected.
- Using real-world use cases that cover the main features of Spark, this course offers an easy introduction to the framework. This practical, hands-on course covers the fundamentals of Spark needed to get to grips with data science, working through a single dataset. It is the next step on the learning curve for those who are comfortable with Spark programming and want to apply Spark to data science.