Recently I read the book <\>; I'll write down my summary of it here.

The notion of RDD (Resilient Distributed Dataset)

The RDD is the fundamental abstraction in Spark. It represents a collection of elements that is:

  • Immutable (read-only)
  • Resilient (fault-tolerant)
  • Distributed

Immutable: allows Spark to provide important fault-tolerance guarantees in a straightforward manner.

Distributed: the machines are transparent to users, so working with RDDs is not much different from working with lists, maps, and so on.

Resilient: whereas other systems provide fault tolerance by replicating data to multiple machines, RDDs provide it by logging the transformations used to build a dataset (its lineage) rather than replicating the data itself. When a fault happens, only the affected subset of the dataset needs to be recomputed.
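As a small illustration (a sketch assuming an existing SparkContext named sc, e.g. in the Spark shell), the logged lineage of an RDD can be inspected with toDebugString:

val nums = sc.parallelize(1 to 1000)   // base RDD
val evens = nums.filter(_ % 2 == 0)    // transformation recorded in the lineage
println(evens.toDebugString)           // prints the chain Spark would replay to rebuild lost partitions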

Basic RDD Operations

There are two types of operations:

  • Transformations
  • Actions

Transformations: like filter and map, perform some data manipulation and produce a new RDD.

Actions: like count and foreach, trigger a computation to return a result.
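A minimal sketch of the difference (again assuming an existing SparkContext named sc): transformations are lazy and only build new RDDs, while actions trigger the actual computation.

val words = sc.parallelize(Seq("spark", "rdd", "dataframe"))
val upper = words.map(_.toUpperCase)     // transformation: returns a new RDD, nothing runs yet
val short = upper.filter(_.length <= 5)  // transformation: still lazy
println(short.count())                   // action: triggers the computation and returns 2
short.foreach(println)                   // action: runs on the executors, printing each element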

Spark SQL

DataFrame is a basic element of Spark SQL. Similar to DataFrames in Python or R, it represents table-like data with named columns and declared column types. The differences are its distributed nature and Spark's Catalyst optimizer.

Fundamental concepts:

  • Spark SQL: consults the table catalog, queries relational databases, and reads data from HDFS. A Spark application can use the DataFrame DSL to submit Spark jobs, while non-Spark clients can connect through JDBC.
  • Table catalog: contains information about registered DataFrames and how to access their data (see the sketch after this list).
  • DataFrame: its data can be read from, for example, a table in a relational database.
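A small sketch of how these pieces fit together, assuming an existing SparkSession named spark and a DataFrame postsDf: the DataFrame is registered in the table catalog and then queried through Spark SQL.

postsDf.createOrReplaceTempView("posts")   // register the DataFrame in the table catalog
val questions = spark.sql("SELECT id, body FROM posts WHERE postTypeId = 1")  // query it with SQL
questions.show(5)                          // action: display the first rows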

Spark SQL comes with Hive support. Hive is a distributed warehouse built as a layer of abstraction on top of Hadoop's MapReduce. It has its own SQL dialect named HiveQL.
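As a hedged sketch, Hive support is turned on when the SparkSession is built (assuming a Spark build that includes the Hive libraries; the application name is hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-notes")     // hypothetical application name
  .enableHiveSupport()        // enables the Hive metastore, HiveQL syntax, and Hive UDFs
  .getOrCreate()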

Create DataFrame

We can create a DataFrame in these three ways:

  • Converting from RDDs
    • Using RDDs containing row data as tuples
    • Using case classes
    • Specifying a schema
  • Running SQL queries
  • Loading external data

Using RDDs containing row data as tuples: use the toDF() method to convert an RDD to a DataFrame. However, all columns end up of type String and nullable, which is obviously not ideal.
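A minimal sketch, assuming an existing SparkSession named spark; the file name, delimiter, and column names are illustrative only:

import spark.implicits._   // brings toDF() into scope

val postsRows = spark.sparkContext.textFile("posts.csv")        // hypothetical input file
val postsRdd = postsRows.map(_.split("~"))                      // each line becomes Array[String]
  .map(r => (r(0), r(1), r(2)))                                 // tuples of Strings
val postsTupleDf = postsRdd.toDF("commentCount", "id", "body")  // every column: nullable String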

Using case classes: define a case class like this:

case class Post(
  commentCount: Option[Int],
  body: String,
  ......)

Nullable fields are declared to be of type Option[T].
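With the case class in place, each line can be parsed into a Post and the RDD converted with toDF(). A sketch assuming, for brevity, that Post has only the two fields shown above, and that stringToPost and the input file are hypothetical:

import spark.implicits._

// Hypothetical parser: turns one "~"-delimited line into a Post.
def stringToPost(row: String): Post = {
  val r = row.split("~")
  Post(if (r(0) == "") None else Some(r(0).toInt), r(1))
}

val postsDf = spark.sparkContext.textFile("posts.csv")  // hypothetical input file
  .map(stringToPost)
  .toDF()              // column names and types are taken from the case class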

Specifying a schema: the schema looks like this:

import org.apache.spark.sql.types._

val postSchema = StructType(Seq(
  StructField("commentCount", IntegerType, true),
  StructField("id", LongType, false)
))

Then invoke spark.createDataFrame(rowRDD, postSchema) to perform the conversion.
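A sketch of the whole conversion, assuming the same hypothetical input file and that rowRDD holds one Row per line whose values match postSchema:

import org.apache.spark.sql.Row

val rowRDD = spark.sparkContext.textFile("posts.csv").map { line =>
  val r = line.split("~")
  Row(if (r(0) == "") null else r(0).toInt, r(1).toLong)  // values must match the schema's types
}
val postsDfStruct = spark.createDataFrame(rowRDD, postSchema)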

DataFrame API Basics

Select:

val postsIdBody = postsDf.select("id", "body")
val postsIds = postsIdBody.drop("body")  // drop this column

Filtering:

val noAnswer = postsDf.filter(('postTypeId === 1) and ('acceptedAnswerId.isNull))