This post is coming a bit late, whoops! After the break, a lot of the focus shifted over to capstone projects, and we hit the ground running. It was hard to get back into the swing of the classroom schedule when we returned: we had to have our preliminary proposals done the first day back, then refine and submit final proposals by the next week. I'll cover more about project things in the next post.
The rest of the curriculum was about big data and making data products. To me, "big" data means data that you can't store/manipulate on a single computer like your personal laptop. This means you need different methods of interacting with the data:
- storing in cloud services (S3, EBS in AWS)
- scalable computing (EC2 in AWS)
- parallelized processes (multiprocessing, multithreading)
  - multiprocessing: tasks are parallelized across separate processes (separate memory space), good for tasks requiring lots of computational power, may have better throughput because the tasks are independent/isolated
  - multithreading: tasks are parallelized across threads in the same process (shared memory space), good for lightweight tasks that exchange data between threads, but there's no built-in protection against threads corrupting the data they share
- making use of fault-tolerant distributed systems that use replication (Hadoop's HDFS)
- mapping functions across distributed systems, then aggregating (reducing) into a desired output (MapReduce)
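To make the last few bullets concrete, here's a minimal sketch of the map-then-reduce idea on one machine, using Python's standard library. It isn't what we wrote in class: the `chunks` data and the function names are made up for illustration, and a thread pool stands in for the cluster that a real MapReduce job would run on.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical input: each string stands in for one chunk of a much larger file.
chunks = [
    "big data is big",
    "data products use data",
    "spark and hadoop process big data",
]


def map_chunk(chunk):
    """Map step: count the words in one chunk, independently of the others."""
    return Counter(chunk.split())


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def word_count(chunks, workers=3):
    # Threads share one memory space, which is fine for a lightweight task like this.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(map_chunk, chunks))
    # Aggregate the per-chunk results into a single output.
    return reduce(merge_counts, partial_counts, Counter())


totals = word_count(chunks)
print(totals.most_common(2))  # [('data', 4), ('big', 3)]
```

Swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` (same module, same interface) would run the map step in separate processes instead, which is the better fit for CPU-heavy work; in that case the call should sit under an `if __name__ == "__main__":` guard.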
We started off by writing MapReduce jobs with the mrjob Python package, then talked about Hadoop before moving on to Spark. For Hadoop, we talked about how Hive & Pig are other components of the Hadoop ecosystem.
(Actually, this was my favorite picture from my Google image search, but it's missing the Hive component)
Anyways, these components allow us to write "nicer" MapReduce statements: Hive is a lot like SQL & requires structured data, whereas Pig is more script-like and can handle unstructured data. There are a lot of articles out there describing the difference between the two; I found the following link useful: http://www.aptibook.com/Articles/Pig-and-hive-advantages-disadvantages-features
Spark is an Apache alternative to the Hadoop MapReduce framework. Instead of single map-reduce jobs, you can write a series of map or reduce functions, and these tasks run faster because the intermediate results are kept in memory instead of being written to disk (the tradeoff is that they can consume a lot of memory). In Python, you create something that resembles a data stream called an RDD (resilient distributed dataset), then you call map/filter to transform the old RDD into a new RDD, and then you can call reduce/combine functions on it. Each transformation produces a new RDD. Kinda reminded me of R's dplyr package, just doing split/apply/combine.
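Here's a rough plain-Python analogue of that chaining style, just to show the shape of it. This is not Spark itself: the `readings` data is made up, and Python's built-in lazy `filter`/`map` stand in for RDD transformations. With PySpark, the equivalent chain would look something like `sc.parallelize(readings).filter(...).map(...).reduce(...)`, assuming `sc` is an existing SparkContext.

```python
from functools import reduce

# Hypothetical records: (city, temperature in Celsius).
readings = [
    ("Austin", 35),
    ("Boston", 22),
    ("Austin", 31),
    ("Denver", 18),
    ("Boston", 27),
]

# Each step builds a new lazy iterable from the previous one,
# loosely like getting a new RDD back from each transformation.
warm = filter(lambda record: record[1] >= 20, readings)  # keep readings of 20 or more
temps = map(lambda record: record[1], warm)              # record -> temperature

# Nothing has been computed yet; the reduce step pulls data through the chain.
total = reduce(lambda a, b: a + b, temps)
print(total)  # 115
```

Spark behaves similarly in that transformations like map/filter are lazy, and work only happens once you call an action such as reduce.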
Lastly, we wrapped up with a little bit of graph theory, data visualization practices & data products (making apps with Flask). Finding datasets we were interested in for visualizations and data products took a lot more time than we would've liked each afternoon, but we were happy once we found FiveThirtyEight's datasets :)
Next post... capstone project!