Querying S3 with Presto

This post assumes you have an AWS account and a Presto instance (standalone or cluster) running. We’ll use the Presto CLI to run the queries against the Yelp dataset. The dataset is a JSON dump of a subset of Yelp’s data for businesses, reviews, checkins, users and tips.

Configure Hive metastore#

Configure the Hive metastore to point at our data in S3. We are using the docker container inmobi/docker-hive

$ docker pull inmobi/docker-hive
$ docker run -p 9083:9083 -d inmobi/docker-hive

# get the container id
$ docker ps | grep docker-hive | awk '{print $1}'

# exec into the container
$ docker exec -it <container_id> bash

# start Hive
$ ./etc/hive-bootstrap.sh

Modify /usr/local/hadoop/etc/hadoop/core-site.xml and add the following so we can connect to S3:

<property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>your access key</value>
</property>

<property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>your secret key</value>
</property>

Run Hive and CREATE an EXTERNAL TABLE that points to to S3. Note: supply the path to the S3 folder container the .json file. Here, we create a relational-like table out of the JSON, which we will unpack with Presto.

$ hive
hive> CREATE EXTERNAL TABLE yelp_reviews (json_body string)
    stored as textfile
    location "s3://<path to S3 folder containing yelp_academic_dataset_review.json>";

Configure Presto to read from Hive#

Specify a properties file for Presto to use to connect to Hive.

hive.properties

connector.name=hive-hadoop2
hive.metastore.uri=thrift://<ip of machine hosting container>:9083
hive.s3.connect-timeout=2m
hive.s3.max-backoff-time=10m
hive.s3.max-error-retries=50
hive.metastore-refresh-interval=1m
hive.s3.max-connections=500
hive.s3.max-client-retries=50
hive.s3.socket-timeout=2m
hive.metastore-cache-ttl=20m
hive.s3.staging-directory=/tmp/
hive.s3.use-instance-credentials=true

Save and close this file and distribute it to the catalog folder of the coordinator and all workers. Then restart the coordinator and workers:

$ ./presto-server-0.142/bin/launcher.py restart

Query S3 with Presto#

Open the Presto shell on the coordinator:

$ ./presto

Let’s find the reviews with the most “funny” votes in the dataset.

presto> WITH x AS (
    SELECT CAST(json_extract_scalar(json_body, '$.votes.funny') AS BIGINT) AS funny,
    json_extract_scalar(json_body, '$.business_id') AS business_id,
    json_extract_scalar(json_body, '$.text') AS text
    FROM yelp_reviews)
SELECT *
FROM x
ORDER BY funny DESC;

This should give a nice intro to querying S3 and using some of Presto’s tools to work with JSON.