Sparking off at another thing to learn: Apache Spark





What is it actually?

Spark is storage agnostic: it can read data from a variety of sources rather than being tied to one database. It can also run under a variety of cluster managers (YARN, Mesos, or its own built-in standalone manager), and you interact with it through its libraries.


How is it safe? How does it run?

Spark's core abstraction is the RDD (Resilient Distributed Dataset): an immutable collection of records partitioned across the nodes of the cluster. If a node fails, its partitions can be rebuilt from the record of transformations that produced them (the lineage), which is what makes the dataset "resilient".

So how do we make an RDD?

  1. If the data is already in an RDD, you can transform it (e.g. filter the old RDD to get a new one).
  2. If the data is an existing collection in your driver program, you can parallelize it (so called because the data is distributed across the cluster and can then be worked on in parallel).
  3. If the data lives somewhere else, such as HDFS, you can reference it (essentially copying it over while keeping a note of where it came from).

What if I have more data than RAM?

From the Documentation: RDD Persistence
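Persisting an RDD tells Spark to keep it around (in memory, optionally spilling to disk) instead of recomputing it for every action. As a loose pure-Python analogy of that compute-once-then-reuse behaviour (illustrative only, not Spark's actual machinery):

```python
# A loose analogy for RDD persistence (illustrative only: Spark's real
# persistence also handles partitioning, storage levels and spilling to disk):
# compute lazily, and cache the result so repeated "actions" don't redo work.

class LazyValue:
    def __init__(self, compute):
        self._compute = compute   # deferred computation, like an RDD's lineage
        self._cached = None
        self._computed = False
        self._persisted = False
        self.compute_count = 0    # how many times we actually ran the computation

    def persist(self):
        self._persisted = True    # opt in to caching, like rdd.persist()
        return self

    def get(self):
        if self._persisted and self._computed:
            return self._cached
        self.compute_count += 1
        result = self._compute()
        if self._persisted:
            self._cached = result
            self._computed = True
        return result

expensive = LazyValue(lambda: sum(x * x for x in range(1000))).persist()
first = expensive.get()   # computed here
second = expensive.get()  # served from the cache, no recomputation
```

Without `persist()`, every `get()` would redo the work, just as an unpersisted RDD is recomputed for every action.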

Working with Spark

import org.apache.spark.{SparkConf, SparkContext}
val sc = new SparkContext(new SparkConf().setAppName("example")) // the sc used below; the spark-shell creates one for you

How to load data

// 1. transform an existing RDD
val newData = oldData.filter(conditionsForFiltering)
// 2. parallelize a collection that already exists in the driver
val data = sc.parallelize(nameOfData)
// 3. reference external storage, e.g. a text file on disk or HDFS
val data = sc.textFile("nameOfData.txt")

Transformations, Actions and DAGs.

map(func) - apply func to every element, producing a new dataset
filter(func) - keep only the elements for which func returns true
flatMap(func) - like map, but flattens any nesting in the results
join(otherDataset) - match key/value pairs from two datasets on their keys
reduceByKey(func) - aggregate the values that share a key (e.g. adding numbers to a total)
collect() - send all the results back to the driver
count() - count the number of elements
first() - return the first element
take(n) - return the first n elements
foreach(func) - run func on each element

The first five are transformations: they are lazy and just extend the DAG (the plan of work). The last five are actions: they make Spark actually execute the DAG.
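Spark aside, the semantics of these operations can be mimicked with ordinary Python collections. A minimal sketch (plain Python, no Spark; a small list stands in for the RDD):

```python
from itertools import chain

data = [1, 2, 3, 4]

# map(func): apply func to every element
mapped = list(map(lambda x: x * 2, data))         # [2, 4, 6, 8]

# filter(func): keep the elements for which func is true
evens = list(filter(lambda x: x % 2 == 0, data))  # [2, 4]

# flatMap(func): map, then flatten one level of nesting
lines = ["hello world", "hello spark"]
words = list(chain.from_iterable(line.split(" ") for line in lines))
# ['hello', 'world', 'hello', 'spark']

# join: pair up the values from two datasets that share a key
ages = [("alice", 30), ("bob", 25)]
cities = [("alice", "Paris"), ("bob", "Oslo")]
joined = [(k, (a, c)) for k, a in ages for k2, c in cities if k == k2]
# [('alice', (30, 'Paris')), ('bob', (25, 'Oslo'))]

# actions just pull results out of the collection
count, head, taken = len(data), data[0], data[:2]  # count(), first(), take(2)
```

The difference in Spark is that the first four build a lazy plan over distributed data, while the "actions" at the end are what trigger the work.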

A note on generating RDDs.

val manipulatedData = data.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// or the same pipeline step by step:
val words = data.flatMap(line => line.split(" "))
val countedWords = words.map(word => (word, 1))
val totalWords = countedWords.reduceByKey(_ + _)

Actual Coding

How do I filter for a particular word?

output = logFile.filter(lambda line: "INFO" in line)

A lambda is just an anonymous function; a named function works the same way:

def double(x):
    return x * 2

list(map(double, [1, 2, 3, 4]))          # this outputs [2, 4, 6, 8]
list(map(lambda x: x * 2, [1, 2, 3, 4])) # so does this

How do I do a word count?

  1. Label every word with a value of one throughout the text (this seems an obvious step to a human, but a machine doesn't know the value of words): Hello (1), world (1), some (1), text (1), Hello (1)
  2. Then it groups any identical words together: Hello (1,1), world (1), some (1), text (1)
  3. Then it reduces each group to a total: Hello (2), world (1), some (1), text (1)
readmeCount = readmeData.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

# reduceByKey adds the values pairwise: for a word seen three times,
# a + b = 1 + 1 = 2, then we loop: a + b = 2 + 1 = 3
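The three steps map directly onto plain Python (a sketch of the semantics, not PySpark itself), using the same example text:

```python
from itertools import chain

text = ["Hello world some text", "Hello"]

# 1. flatMap: split the lines into words
words = list(chain.from_iterable(line.split(" ") for line in text))

# 2. map: label every word with a value of one
pairs = [(word, 1) for word in words]
# [('Hello', 1), ('world', 1), ('some', 1), ('text', 1), ('Hello', 1)]

# 3. reduceByKey: add up the ones for identical words
counts = {}
for word, one in pairs:
    counts[word] = counts.get(word, 0) + one
# {'Hello': 2, 'world': 1, 'some': 1, 'text': 1}
```

Spark does the same thing, except the words are spread across the cluster and the adding-up happens per key in parallel.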

Let’s use Spark

Define things

Give it functions

  1. Keep all functions anonymous, i.e. only write lambda functions. This is fine if your functions aren't reused.
  2. Create a list of functions that the nodes can access. This is done by creating an object which contains every function's name and its definition; when the nodes execute the code, they reference the object and then the name of the function they need to run.
  3. Only send each node the functions it needs from the list, no more.
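Option 2 can be sketched in plain Python: the object containing every function's name and definition is just a class of named static functions, and executing code references the object plus the name it needs. `Functions`, `double` and `is_even` here are illustrative names, not a Spark API:

```python
# Illustrative sketch of option 2: one object holding every function's
# name and definition, which executing code references by name.

class Functions:
    """Registry of the functions that tasks running on the nodes may call."""

    @staticmethod
    def double(x):
        return x * 2

    @staticmethod
    def is_even(x):
        return x % 2 == 0

# Executing code references the object, then the function name it needs:
data = [1, 2, 3, 4]
doubled = [Functions.double(x) for x in data]       # [2, 4, 6, 8]
evens = [x for x in data if Functions.is_even(x)]   # [2, 4]
```

Grouping the functions in one object means there is a single, stable thing to ship to the nodes, rather than a scattering of one-off closures.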

Spark Libraries

Spark Processing Architecture




Focused on saving our time: for your life balance; for our food emissions and for booking airbnbs.
