Posts Categorized: Big Data
Schema migrations in the relational world are now common practice. The best practices for evolving a database schema are well known: a migration is applied before the code that depends on it is rolled out, and if anything goes wrong, the migration can be rolled back. The same practices are not as well established… Read more »
Type All The Things!
Scala is statically typed, yet we rarely use that to our advantage. For example, say you have a function with the signature: def convert(a: Double, b: Double): String. Of course, the author hasn't written a comment, and clearly didn't give the parameters expressive names. So let's fix it: def reverseGeocode(latitude: Double, longitude: Double): String. The… Read more »
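The excerpt stops before the post's real point, but the title hints at the next step: make the types themselves carry the meaning, not just the parameter names. A minimal sketch of that idea using value classes (the Latitude/Longitude wrappers and the stub body are my illustration, not necessarily the post's code):

// Wrapping each Double in a value class gives the compiler something to
// check, at essentially no runtime cost: value classes erase to the
// underlying Double in most compiled code.
case class Latitude(value: Double) extends AnyVal
case class Longitude(value: Double) extends AnyVal

def reverseGeocode(latitude: Latitude, longitude: Longitude): String =
  s"lat=${latitude.value}, lon=${longitude.value}" // stub body for illustration

// Swapped arguments no longer compile:
// reverseGeocode(Longitude(-122.4194), Latitude(37.7749)) // type mismatch
val address = reverseGeocode(Latitude(37.7749), Longitude(-122.4194))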
Spark on AWS EMR – The Missing Manual
Apache Spark recently received top-level support on Amazon's Elastic MapReduce (EMR) cloud offering, joining applications such as Hadoop, Hive, Pig, HBase, Presto, and Impala. This is exciting for me because most of my workloads run on EMR, and using Spark there previously required either standing up EC2 clusters by hand or using an EMR bootstrap action, which was very… Read more »
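With Spark as a first-class EMR application, a cluster can be launched directly from the AWS CLI; a sketch along those lines (the cluster name, release label, key name, and instance settings are placeholders to adapt, not values from the post):

aws emr create-cluster --name "Spark cluster" \
  --release-label emr-4.1.0 \
  --applications Name=Spark \
  --ec2-attributes KeyName=myKey \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles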
Handling Avro records in Scalding
In this post I'll try to cover how to write and read Avro records in Scalding pipelines. To begin, a reminder: Avro is a serialization format, and Scalding is a Scala API on top of Hadoop. If you're not using Scalding, this post is probably not of much interest to you. Let's begin by defining… Read more »
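The excerpt cuts off at the definition step, but as a rough sketch of the shape such a pipeline takes, here is a job using the scalding-avro module's PackedAvroSource; the User class is a hypothetical SpecificRecord generated from an Avro schema, and the filter is only there to show touching a record's fields:

import com.twitter.scalding._
import com.twitter.scalding.avro.PackedAvroSource

// `User` is assumed to be a SpecificRecord class generated from a user.avsc schema.
class CopyUsersJob(args: Args) extends Job(args) {
  // Read Avro records, transform them, and write them back out as Avro.
  TypedPipe.from(PackedAvroSource[User](args("input")))
    .filter(_.getName != null)
    .write(PackedAvroSource[User](args("output")))
}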
Debugging Apache Spark Jobs
Would you like to step through your Spark job in a debugger? These steps show how to configure IntelliJ IDEA to do just that. Unlike a traditional Java or Scala application, a Spark job expects to run inside a larger Spark application that provides the SparkContext. Your application interacts with the environment through… Read more »
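One common prerequisite for this kind of setup (offered as a sketch, since the excerpt cuts off before the post's actual steps) is running the job with a local master, so the whole application, SparkContext included, lives in a single JVM that the IDE can attach to; the word count below is a stand-in job:

import org.apache.spark.{SparkConf, SparkContext}

object DebuggableJob {
  def main(args: Array[String]): Unit = {
    // local[*] runs the driver and executors in one JVM, so IDE breakpoints
    // hit anywhere in the job, including inside transformations.
    val conf = new SparkConf().setMaster("local[*]").setAppName("debug-job")
    val sc = new SparkContext(conf)

    val counts = sc.textFile(args(0))    // args(0): a local input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)     // set a breakpoint here
    sc.stop()
  }
}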