Querying S3 with Presto

This post assumes you have an AWS account and a Presto instance (standalone or cluster) running. We’ll use the Presto CLI to run the queries against the Yelp dataset. The dataset is a JSON dump of a subset of Yelp’s data for businesses, reviews, checkins, users and tips.

Configure Hive metastore

Configure the Hive metastore to point at our data in S3. We are using the docker container inmobi/docker-hive

$ docker pull inmobi/docker-hive
$ docker run -p 9083:9083 -d inmobi/docker-hive

# get the container id
$ docker ps | grep docker-hive | awk '{print $1}'

# exec into the container
$ docker exec -it <container_id> bash

# start Hive
$ ./etc/hive-bootstrap.sh

Modify /usr/local/hadoop/etc/hadoop/core-site.xml and add the following so we can connect to S3:

<property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>your access key</value>
</property>

<property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>your secret key</value>
</property>

Run Hive and CREATE an EXTERNAL TABLE that points to to S3. Note: supply the path to the S3 folder container the .json file. Here, we create a relational-like table out of the JSON, which we will unpack with Presto.

$ hive
hive> CREATE EXTERNAL TABLE yelp_reviews (json_body string)
    stored as textfile
    location "s3://<path to S3 folder containing yelp_academic_dataset_review.json>";

Configure Presto to read from Hive

Specify a properties file for Presto to use to connect to Hive.

hive.properties

connector.name=hive-hadoop2
hive.metastore.uri=thrift://<ip of machine hosting container>:9083
hive.s3.connect-timeout=2m
hive.s3.max-backoff-time=10m
hive.s3.max-error-retries=50
hive.metastore-refresh-interval=1m
hive.s3.max-connections=500
hive.s3.max-client-retries=50
hive.s3.socket-timeout=2m
hive.metastore-cache-ttl=20m
hive.s3.staging-directory=/tmp/
hive.s3.use-instance-credentials=true

Save and close this file and distribute it to the catalog folder of the coordinator and all workers. Then restart the coordinator and workers:

$ ./presto-server-0.142/bin/launcher.py restart

Query S3 with Presto

Open the Presto shell on the coordinator:

$ ./presto

Let’s find the reviews with the most “funny” votes in the dataset.

presto> WITH x AS (
    SELECT CAST(json_extract_scalar(json_body, '$.votes.funny') AS BIGINT) AS funny,
    json_extract_scalar(json_body, '$.business_id') AS business_id,
    json_extract_scalar(json_body, '$.text') AS text
    FROM yelp_reviews)
SELECT *
FROM x
ORDER BY funny DESC;

This should give a nice intro to querying S3 and using some of Presto’s tools to work with JSON.