Accessing AWS S3 from spark-shell

In this example, let's count the number of records in an S3 bucket with a small Scala program, run through spark-shell on the Apache Spark framework.

Requirements:

  • Spark
  • AWS S3 bucket

// Your AWS credentials
val AWS_ACCESS_KEY = "<YOUR_KEY>"
val AWS_SECRET_KEY = "<YOUR_SECRET_KEY>"

// Point the Hadoop S3 native filesystem (s3n) at those credentials
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", AWS_ACCESS_KEY)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", AWS_SECRET_KEY)
sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

// Path to the data inside your bucket
val input_file = "s3n://<Bucket_Name>/Path"

// Read the file as an RDD of lines and count the records
val rawdata = sc.textFile(input_file)
val test = rawdata.count()

Things to take care of:

  1. The AWS credentials you use must have at least getObject access on the bucket.
  2. "sc" is the SparkContext; spark-shell creates it for you automatically. If you run a standalone program instead, you have to create it yourself, as in the sketch after this list.
  3. Most problems are permission issues; a "403" error almost always means the credentials do not have access.
  4. You can also verify access from the command line with the aws-cli utility (e.g. aws s3 ls s3://<Bucket_Name>/Path).
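
For reference, here is a minimal sketch of the same count as a standalone Scala program rather than a spark-shell session. It assumes Spark and the Hadoop S3 connector jars are on the classpath; the object name S3RecordCount, the application name, and the placeholder key, bucket, and path values are illustrative and should be replaced with your own.

import org.apache.spark.{SparkConf, SparkContext}

object S3RecordCount {
  def main(args: Array[String]): Unit = {
    // Outside spark-shell there is no automatic "sc", so build the SparkContext yourself
    val conf = new SparkConf().setAppName("S3RecordCount")  // app name is an assumption
    val sc = new SparkContext(conf)

    // Same s3n settings as in the spark-shell example above
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<YOUR_KEY>")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<YOUR_SECRET_KEY>")
    sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

    // Count the records and print the result
    val count = sc.textFile("s3n://<Bucket_Name>/Path").count()
    println("Record count: " + count)

    sc.stop()
  }
}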