Recently I read the book <\>; I'll write down my summary of it here.

The notion of RDD (Resilient Distributed Dataset)

The RDD is the fundamental abstraction in Spark. It represents a collection of elements that is:

  • Immutable (read-only)
  • Resilient (fault-tolerant)
  • Distributed

Immutable: allows Spark to provide important fault-tolerance guarantees in a straightforward manner.

Distributed: the machines are transparent to users, so working with RDDs is not much different from working with lists, maps, and so on.

Resilient: whereas other systems provide fault tolerance by replicating data to multiple machines, RDDs provide it by logging the transformations used to build a dataset (its lineage) rather than replicating the data itself. When a fault happens, only the affected subset of the dataset needs to be recomputed.
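As a small illustration (a sketch assuming an existing SparkContext named sc, e.g. in the Spark shell), the logged lineage of an RDD can be inspected with toDebugString:

val nums = sc.parallelize(1 to 1000)   // base RDD
val evens = nums.filter(_ % 2 == 0)    // transformation recorded in the lineage
println(evens.toDebugString)           // prints the chain Spark would replay to rebuild lost partitions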

Basic RDD Operations

There are two types of operations:

  • Transformations
  • Actions

Transformations: like filter and map, perform some data manipulation and produce a new RDD.

Actions: like count and foreach, trigger a computation to return a result.
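A minimal sketch of the difference (again assuming an existing SparkContext named sc): transformations are lazy and only build new RDDs, while actions trigger the actual computation.

val words = sc.parallelize(Seq("spark", "rdd", "dataframe"))
val upper = words.map(_.toUpperCase)     // transformation: returns a new RDD, nothing runs yet
val short = upper.filter(_.length <= 5)  // transformation: still lazy
println(short.count())                   // action: triggers the computation and returns 2
short.foreach(println)                   // action: runs on the executors, printing each element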

Spark SQL

DataFrame is a basic element of Spark SQL. Similar to DataFrames in Python or R, it represents table-like data with named columns and declared column types. The differences are its distributed nature and Spark's Catalyst optimizer.

Fundamental concepts:

  • Spark SQL: consults the table catalog, queries relational databases, and reads data from HDFS. A Spark application can use the DataFrame DSL to submit Spark jobs, while non-Spark clients can connect through JDBC.
  • Table catalog: contains information about registered DataFrames and how to access their data (see the sketch after this list).
  • DataFrame: its data can be read from, for example, a table in a relational database.
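A small sketch of how these pieces fit together, assuming an existing SparkSession named spark and a DataFrame postsDf: the DataFrame is registered in the table catalog and then queried through Spark SQL.

postsDf.createOrReplaceTempView("posts")   // register the DataFrame in the table catalog
val questions = spark.sql("SELECT id, body FROM posts WHERE postTypeId = 1")  // query it with SQL
questions.show(5)                          // action: display the first rows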

Spark SQL comes with Hive support. Hive is a distributed warehouse built as a layer of abstraction on top of Hadoop's MapReduce. It has its own SQL dialect named HiveQL.
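As a hedged sketch, Hive support is turned on when the SparkSession is built (assuming a Spark build that includes the Hive libraries; the application name is hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-notes")     // hypothetical application name
  .enableHiveSupport()        // enables the Hive metastore, HiveQL syntax, and Hive UDFs
  .getOrCreate()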

Create DataFrame

We can create a DataFrame in these three ways:

  • Converting from RDDs
    • Using RDDs containing row data as tuples
    • Using case classes
    • Specifying a schema
  • Running SQL queries
  • Loading external data

Using RDDs containing row data as tuples: use the toDF() method to convert an RDD to a DataFrame. However, all columns end up of type String and nullable, which is obviously not ideal.
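A minimal sketch, assuming an existing SparkSession named spark; the file name, delimiter, and column names are illustrative only:

import spark.implicits._   // brings toDF() into scope

val postsRows = spark.sparkContext.textFile("posts.csv")        // hypothetical input file
val postsRdd = postsRows.map(_.split("~"))                      // each line becomes Array[String]
  .map(r => (r(0), r(1), r(2)))                                 // tuples of Strings
val postsTupleDf = postsRdd.toDF("commentCount", "id", "body")  // every column: nullable String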

Using case classes: define a case class like this:

case class Post(
  commentCount: Option[Int],
  body: String,
  ......)

Nullable fields are declared to be of type Option[T].
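With the case class in place, each line can be parsed into a Post and the RDD converted with toDF(). A sketch assuming, for brevity, that Post has only the two fields shown above, and that stringToPost and the input file are hypothetical:

import spark.implicits._

// Hypothetical parser: turns one "~"-delimited line into a Post.
def stringToPost(row: String): Post = {
  val r = row.split("~")
  Post(if (r(0) == "") None else Some(r(0).toInt), r(1))
}

val postsDf = spark.sparkContext.textFile("posts.csv")  // hypothetical input file
  .map(stringToPost)
  .toDF()              // column names and types are taken from the case class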

Specifying a schema: the schema looks like this:

import org.apache.spark.sql.types._

val postSchema = StructType(Seq(
  StructField("commentCount", IntegerType, true),
  StructField("id", LongType, false)
))

Then invoke spark.createDataFrame(rowRDD, postSchema) to perform the conversion.
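A sketch of the whole conversion, assuming the same hypothetical input file and that rowRDD holds one Row per line whose values match postSchema:

import org.apache.spark.sql.Row

val rowRDD = spark.sparkContext.textFile("posts.csv").map { line =>
  val r = line.split("~")
  Row(if (r(0) == "") null else r(0).toInt, r(1).toLong)  // values must match the schema's types
}
val postsDfStruct = spark.createDataFrame(rowRDD, postSchema)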

DataFrame API Basics

Select:

val postsIdBody = postsDf.select("id", "body")
val postsIds = postsIdBody.drop("body")  // drop this column

Filtering:

val noAnswer = postsDf.filter(('postTypeId === 1) and ('acceptedAnswerId.isNull))