Interesting queries for the ociforge Ocient sandbox
The ociforge sandbox holds ~550 million rows of public data across three datasets — NYC yellow-taxi trips, NOAA daily weather, and GDELT world events. This guide is a tour of queries that show off both the data and what the Ocient database does well: large aggregations, joins, time-series pruning, geospatial predicates, and approximate analytics.
Everything here is read-only SELECT. Connect first — sign in to the
dashboard for the web console, or use a token / device-grant from
the SSO guide. Tables live in the sandbox
database under the public schema, so fully-qualify them as
sandbox.public.<table> (or SET SCHEMA). Browse every table and column on the
datasets page.
NYC yellow taxi — 119M trips
Trips and average fare by passenger count
SELECT passenger_count,
count(*) AS trips,
round(avg(fare_amount),2) AS avg_fare
FROM sandbox.public.yellow_taxi
GROUP BY passenger_count
ORDER BY trips DESC
LIMIT 10;
Single-passenger trips dominate (~84.7M, avg fare ≈ $17.17). Watch for a bucket
of NULL passenger rows with a negative average fare — a nice reminder that
real-world data needs cleaning (those are refunds/voided trips).
Busiest pickup zones (a join)
yellow_taxi stores TLC zone IDs; join taxi_zones to get human names.
SELECT z.borough,
z.zone_name,
count(*) AS pickups
FROM sandbox.public.yellow_taxi t
JOIN sandbox.public.taxi_zones z ON t.pu_location_id = z.location_id
GROUP BY z.borough, z.zone_name
ORDER BY pickups DESC
LIMIT 10;
When do people ride? (time-series pruning)
pickup_ts is a TimeKey (bucketed by day), so range filters on it let
Ocient prune storage instead of scanning everything.
SELECT EXTRACT(HOUR FROM pickup_ts) AS hour_of_day,
count(*) AS trips
FROM sandbox.public.yellow_taxi
WHERE pickup_ts >= TIMESTAMP '2024-01-01 00:00:00'
AND pickup_ts < TIMESTAMP '2025-01-01 00:00:00'
GROUP BY hour_of_day
ORDER BY hour_of_day;
Tipping by payment type
SELECT payment_type,
round(avg(tip_amount),2) AS avg_tip,
round(avg(tip_amount / NULLIF(fare_amount,0))*100, 1) AS avg_tip_pct
FROM sandbox.public.yellow_taxi
WHERE fare_amount > 0
GROUP BY payment_type
ORDER BY payment_type;
Tips show up almost entirely on payment_type = 1 (credit card) — cash tips
aren't captured by the meter, a classic data-quality footnote.
Geospatial — which taxi zone is a point in?
The taxi zones ship as POLYGON geometry (taxi_zone_shapes.geom, WGS84). Use
ST_CONTAINS with a ST_POINT(longitude, latitude) to do a point-in-polygon
lookup — e.g. which zone covers Times Square (≈ -73.9855, 40.7580):
SELECT s.location_id,
z.zone_name,
z.borough
FROM sandbox.public.taxi_zone_shapes s
JOIN sandbox.public.taxi_zones z ON s.location_id = z.location_id
WHERE ST_CONTAINS(s.geom, ST_POINT(-73.9855, 40.7580));
GDELT carries geospatial points too (gdelt_events.action_geo_point), so you
can run the same kind of spatial predicate over world events.
NOAA GHCN-Daily — 371M weather observations
Which measurements are most common?
SELECT element,
count(*) AS observations
FROM sandbox.public.ghcn_daily
GROUP BY element
ORDER BY observations DESC
LIMIT 10;
Precipitation (PRCP, ~109M), snowfall (SNOW, ~48M) and minimum temperature
(TMIN, ~45M) lead.
Hottest readings on record
TMAX is stored in tenths of a degree Celsius, and q_flag is NULL when a
value passed quality control (non-null values are failure codes):
SELECT station_id,
obs_date,
value / 10.0 AS tmax_celsius
FROM sandbox.public.ghcn_daily
WHERE element = 'TMAX'
AND q_flag IS NULL
ORDER BY value DESC
LIMIT 10;
Approximate distinct stations (fast on 371M rows)
Exact COUNT(DISTINCT …) over hundreds of millions of rows is expensive;
APPROX_COUNT_DISTINCT (HyperLogLog) answers in a fraction of the time:
SELECT APPROX_COUNT_DISTINCT(station_id) AS approx_stations
FROM sandbox.public.ghcn_daily;
GDELT events — 61M world events
Most active actors
SELECT actor1_name,
count(*) AS events
FROM sandbox.public.gdelt_events
WHERE actor1_name IS NOT NULL
GROUP BY actor1_name
ORDER BY events DESC
LIMIT 10;
The United States, Russia, and "President" top the list — GDELT's actor names are extracted from news text, so generic role words appear alongside countries.
Tone and stability over time
avg_tone runs -100 (negative) to +100, and goldstein_scale rates each
event's impact on stability from -10 to +10:
SELECT year,
round(avg(avg_tone),2) AS mean_tone,
round(avg(goldstein_scale),2) AS mean_goldstein,
count(*) AS events
FROM sandbox.public.gdelt_events
GROUP BY year
ORDER BY year;
A window function: cumulative event volume
SELECT year,
count(*) AS events,
sum(count(*)) OVER (ORDER BY year) AS cumulative_events
FROM sandbox.public.gdelt_events
GROUP BY year
ORDER BY year;
Tips
- Fully-qualify tables as
sandbox.public.<table>, or runSET SCHEMA publicafter connecting. - Each fact table has a TimeKey (
yellow_taxi.pickup_ts,ghcn_daily.obs_date,gdelt_events.event_date) — filtering on it prunes storage and speeds queries. - Reach for
APPROX_COUNT_DISTINCTand other approximate aggregates when an exact answer isn't worth the scan. - Access is read-only —
SELECTonly. Add aLIMITwhile you explore.
Browse the full schemas on the datasets page, and see the SSO guide for connecting non-interactively.