Set Operations On HLLs of Different Sizes


Here at AK, we’re in the business of storing huge amounts of information in the form of 64 bit keys. As shown in otherblogposts and in the HLL post by Matt, one efficient way of getting an estimate of the size of the set of these keys is by using the HyperLogLog (HLL) algorithm.  There are two important decisions one has to make when implementing this algorithm.  The first is how many bins one will use and the second is the maximum value one allows in each bin.  As a result, the amount of space this will take up is going to be the number of bins times the log of the maximum value you allow in each bin.  For this post we’ll ignore this second consideration and focus instead on the number of bins one uses.  The accuracy for an estimate is given approximately by 1.04/√b

View original post 889 more words

Probabilistic Data Structures for Web Analytics and Data Mining

Highly Scalable Blog

Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in the areas like web analytics and Internet advertising. Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight high-latency analytical processes and poor applicability to realtime use cases. On the other hand, when one is interested only in simple additive metrics like total page views or average price of conversion, it is obvious that raw data can be efficiently summarized, for example, on a daily basis or using simple in-stream counters.  Computation of more advanced metrics like a number of unique visitor or most frequent items is more challenging and requires a lot of resources if implemented straightforwardly. In this article, I provide an overview of probabilistic data structures that allow one to estimate these and many other…

View original post 4,351 more words

Putting Apache Kafka To Use: A Practical Guide to Building a Stream Data Platform (Part 1)

Kafka : Getting Started


These days you hear a lot about “stream processing”, “event data”, and “real-time”, often related to technologies like Kafka, Storm, Samza, or Spark’s Streaming module. Though there is a lot of excitement, not everyone knows how to fit these technologies into their technology stack or how to put it to use in practical applications.

This guide is going to discuss our experience with real-time data streams: how to build a home for real-time data within your company, and how to build applications that make use of that data. All of this is based on real experience: we spent the last five years building Apache Kafka, transitioning LinkedIn to a fully stream-based architecture, and helping a number of Silicon Valley tech companies do the same thing.

The first part of the guide will give a high-level overview of what we came to call a “stream data platform”: a…

View original post 4,180 more words

Access aws S3 from Spark-shell

In this example, let’s count the number of records, in S3 bucket by Scala program by using Apache Spark framework.


  • Spark
  • AWS S3 bucket


sc.hadoopConfiguration.set(“fs.s3n.awsAccessKeyId”, AWS_ACCESS_KEY)
sc.hadoopConfiguration.set(“fs.s3n.awsSecretAccessKey”, AWS_SECRET_KEY)
sc.hadoopConfiguration.set(“fs.s3n.impl”, “org.apache.hadoop.fs.s3native.NativeS3FileSystem”)

val input_file = “s3n://<Bucket_Name>/Path”

val rawdata = sc.textFile(input_file)
val test = rawdata.count

Things to take care.

  1. From AWS S3, you should have getObject access.
  2. “sc” is a Spark Context, It is automatically created by Spark Shell.
  3. Most of issue comes, in permission issue, if “403” error, comes, it is usually a permission error.
  4. Even from command line, you can use aws-cli utility to check, the access.

Playing with the Mahout recommendation engine on a Hadoop cluster

A really working tutorial on “How to use Mahout recommendation”


Elephant and riderApache Mahout is an open source library which implements several scalable machine learning algorithms. They can be used among other things to categorize data, group items by cluster, and to implement a recommendation engine.

In this tutorial we will run the Mahout recommendation engine on a data set of movie ratings and show the movie recommendations for each user.

For more details on the recommendation algorithm, you can look at the tutorial from Jee Vang.


  • Java (to run hadoop)
  • Hadoop (used by Mahout)
  • Mahout
  • Python (use to show the result)

Running Hadoop

In this section, we are going to describe how to quickly install and configure hadoop on a single machine.

View original post 876 more words