Querying S3 with Presto
This post assumes you have an AWS account and a Presto instance (standalone or cluster) running. We’ll use the Presto CLI to run the queries against the Yelp dataset. The dataset is a JSON dump of a subset of Yelp’s data for businesses, reviews, checkins, users and tips.
Configure Hive metastore#
Configure the Hive metastore to point at our data in S3. We are using the docker container inmobi/docker-hive
$ docker pull inmobi/docker-hive
$ docker run -p 9083:9083 -d inmobi/docker-hive
# get the container id
$ docker ps | grep docker-hive | awk '{print $1}'
# exec into the container
$ docker exec -it <container_id> bash
# start Hive
$ ./etc/hive-bootstrap.sh
Modify /usr/local/hadoop/etc/hadoop/core-site.xml
and add the following so we can connect to S3:
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>your access key</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>your secret key</value>
</property>
Run Hive and CREATE
an EXTERNAL TABLE
that points to to S3. Note: supply the path to the S3 folder container the .json
file. Here, we create a relational-like table out of the JSON, which we will unpack with Presto.
$ hive
hive> CREATE EXTERNAL TABLE yelp_reviews (json_body string)
stored as textfile
location "s3://<path to S3 folder containing yelp_academic_dataset_review.json>";
Configure Presto to read from Hive#
Specify a properties
file for Presto to use to connect to Hive.
hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://<ip of machine hosting container>:9083
hive.s3.connect-timeout=2m
hive.s3.max-backoff-time=10m
hive.s3.max-error-retries=50
hive.metastore-refresh-interval=1m
hive.s3.max-connections=500
hive.s3.max-client-retries=50
hive.s3.socket-timeout=2m
hive.metastore-cache-ttl=20m
hive.s3.staging-directory=/tmp/
hive.s3.use-instance-credentials=true
Save and close this file and distribute it to the catalog
folder of the coordinator and all workers. Then restart the coordinator and workers:
$ ./presto-server-0.142/bin/launcher.py restart
Query S3 with Presto#
Open the Presto shell on the coordinator:
$ ./presto
Let’s find the reviews with the most “funny” votes in the dataset.
presto> WITH x AS (
SELECT CAST(json_extract_scalar(json_body, '$.votes.funny') AS BIGINT) AS funny,
json_extract_scalar(json_body, '$.business_id') AS business_id,
json_extract_scalar(json_body, '$.text') AS text
FROM yelp_reviews)
SELECT *
FROM x
ORDER BY funny DESC;
This should give a nice intro to querying S3 and using some of Presto’s tools to work with JSON.