Recently I read the book *Spark in Action*, and it introduced me to the notion of the RDD (resilient distributed dataset).
The RDD is the fundamental abstraction in Spark. It represents a collection of elements that is:
- Immutable (read-only)
- Resilient (fault-tolerant)
- Distributed
Immutable: being read-only allows Spark to provide important fault-tolerance guarantees in a straightforward manner.
Distributed: the cluster machines are transparent to users, so working with RDDs is not much different from working with ordinary collections such as lists and maps.
Resilient: whereas other systems provide fault tolerance by replicating data to multiple machines, RDDs log the transformations used to build a dataset (its lineage) rather than replicating the data itself. When a failure happens, Spark only needs to recompute the affected subset of the dataset.
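As a small illustration, the lineage an RDD records can be inspected directly. A sketch, assuming a SparkContext named `sc` (as in spark-shell):

```scala
// Each transformation is recorded in the RDD's lineage rather than materialized
val evens = sc.parallelize(1 to 10).filter(_ % 2 == 0).map(_ * 2)
println(evens.toDebugString)  // prints the chain of parent RDDs Spark would replay on failure
```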
Basic RDD Operations
There are two types of operations:
- Transformations
- Actions

Transformations: like filter and map, perform some useful data manipulation and produce a new RDD.
Actions: like count and foreach, trigger a computation and return a result to the driver.
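A minimal sketch of the difference, again assuming a SparkContext named `sc`:

```scala
val words  = sc.parallelize(Seq("spark", "rdd", "dataframe", "sql"))
val shortW = words.filter(_.length <= 5)  // transformation: lazily defines a new RDD
val upperW = shortW.map(_.toUpperCase)    // transformation: still nothing computed
println(upperW.count())                   // action: triggers the job, prints 3
upperW.foreach(println)                   // action: runs on the executors
```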
Spark SQL
The DataFrame is a basic element of Spark SQL. Similar to DataFrames in Python or R, it represents table-like data with named columns and declared column types. What sets it apart is its distributed nature and Spark's Catalyst optimizer.
Fundamental concepts:
- Spark SQL: consults the table catalog; it can query relational databases and read data from HDFS. Spark applications can use the DataFrame DSL to submit Spark jobs, while non-Spark clients can connect through JDBC.
- Table catalog: contains information about registered DataFrames and how to access their data.
- DataFrame: its data can be read from a table in a relational database, among other sources.
Spark SQL also supports Hive, a distributed warehouse built as a layer of abstraction on top of Hadoop's MapReduce; Hive has its own SQL dialect called HiveQL.
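To illustrate how DataFrames and the table catalog work together, here is a minimal sketch, assuming a SparkSession named `spark` and a `postsDf` DataFrame like the one used later in this post:

```scala
// Register the DataFrame in the table catalog under the name "posts"
postsDf.createOrReplaceTempView("posts")

// Any SQL-speaking client can now query it; here we use plain SQL from Scala
val questions = spark.sql("SELECT id, body FROM posts WHERE postTypeId = 1")
questions.show(5)
```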
Creating DataFrames
We can create a DataFrame in three ways:
- Converting from RDDs, in one of three variants (detailed below):
  - using RDDs containing row data as tuples
  - using case classes
  - specifying a schema
- Running SQL queries
- Loading external data
Using RDDs containing row data as tuples: call the toDF() method to convert the RDD to a DataFrame. However, all columns end up as nullable Strings, which is obviously not ideal.
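A minimal sketch, assuming a SparkSession named `spark` (the column names are illustrative):

```scala
import spark.implicits._  // enables toDF() on RDDs

val rdd = spark.sparkContext.parallelize(Seq(
  ("1", "What is an RDD?", "12"),
  ("2", "How does Catalyst work?", "7")
))
val df = rdd.toDF("id", "body", "score")
df.printSchema()  // every field: string (nullable = true)
```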
Using case classes: map each row to a case class, for example (a trimmed-down sketch; the book's Post class has more fields):

```scala
import java.sql.Timestamp

case class Post(
  id: Long,
  body: String,
  score: Option[Int],  // nullable fields are declared as Option[T]
  lastActivityDate: Option[Timestamp]
)
```

Nullable fields are declared to be of type Option[T].
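Each raw line can then be parsed into a Post and converted with toDF(). A sketch, where stringToPost is a hypothetical parser and the file path is illustrative:

```scala
import spark.implicits._

// Hypothetical parser from a "~"-delimited line to a Post
def stringToPost(line: String): Post = {
  val f = line.split("~")
  Post(f(0).toLong, f(1),
       if (f(2).isEmpty) None else Some(f(2).toInt),
       None)
}

val postsDf = spark.sparkContext
  .textFile("italianPosts.csv")  // adjust the path to your data
  .map(stringToPost)
  .toDF()                        // columns now have proper types and nullability
```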
Specifying a schema: build a StructType describing the columns, for example (a trimmed-down sketch of the book's postSchema):

```scala
import org.apache.spark.sql.types._

val postSchema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("body", StringType, nullable = true),
  StructField("score", IntegerType, nullable = true)
))
```
Then invoke spark.createDataFrame(rowRDD, postSchema), where rowRDD is an RDD[Row], to perform the conversion.
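A usage sketch with hypothetical row data matching the schema above:

```scala
import org.apache.spark.sql.Row

val rowRDD = spark.sparkContext.parallelize(Seq(
  Row(1L, "What is an RDD?", 12),
  Row(2L, "How does Catalyst work?", null)  // null allowed: score is nullable
))
val postsDf = spark.createDataFrame(rowRDD, postSchema)
postsDf.printSchema()
```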
DataFrame API Basics
Select:
```scala
val postIdBody = postsDf.select("id", "body")
```
Filtering:
```scala
// Questions (postTypeId 1) that have no accepted answer yet
val noAnswer = postsDf.filter(('postTypeId === 1) and ('acceptedAnswerId isNull))
```
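The Scala symbol syntax above needs import spark.implicits._ in scope. The same filter can be written with the col() function instead:

```scala
import org.apache.spark.sql.functions.col

val noAnswerAlt = postsDf.filter(col("postTypeId") === 1 && col("acceptedAnswerId").isNull)
```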