ociforge

← How-to guides

Interesting queries for the ociforge Ocient sandbox

The ociforge sandbox holds ~550 million rows of public data across three datasets — NYC yellow-taxi trips, NOAA daily weather, and GDELT world events. This guide is a tour of queries that show off both the data and what the Ocient database does well: large aggregations, joins, time-series pruning, geospatial predicates, and approximate analytics.

Everything here is read-only SELECT. Connect first — sign in to the dashboard for the web console, or use a token / device-grant from the SSO guide. Tables live in the sandbox database under the public schema, so fully-qualify them as sandbox.public.<table> (or SET SCHEMA). Browse every table and column on the datasets page.


NYC yellow taxi — 119M trips

Trips and average fare by passenger count

SELECT passenger_count,
       count(*)                  AS trips,
       round(avg(fare_amount),2) AS avg_fare
FROM sandbox.public.yellow_taxi
GROUP BY passenger_count
ORDER BY trips DESC
LIMIT 10;

Single-passenger trips dominate (~84.7M, avg fare ≈ $17.17). Watch for a bucket of NULL passenger rows with a negative average fare — a nice reminder that real-world data needs cleaning (those are refunds/voided trips).

Busiest pickup zones (a join)

yellow_taxi stores TLC zone IDs; join taxi_zones to get human names.

SELECT z.borough,
       z.zone_name,
       count(*) AS pickups
FROM sandbox.public.yellow_taxi t
JOIN sandbox.public.taxi_zones  z ON t.pu_location_id = z.location_id
GROUP BY z.borough, z.zone_name
ORDER BY pickups DESC
LIMIT 10;

When do people ride? (time-series pruning)

pickup_ts is a TimeKey (bucketed by day), so range filters on it let Ocient prune storage instead of scanning everything.

SELECT EXTRACT(HOUR FROM pickup_ts) AS hour_of_day,
       count(*)                     AS trips
FROM sandbox.public.yellow_taxi
WHERE pickup_ts >= TIMESTAMP '2024-01-01 00:00:00'
  AND pickup_ts <  TIMESTAMP '2025-01-01 00:00:00'
GROUP BY hour_of_day
ORDER BY hour_of_day;

Tipping by payment type

SELECT payment_type,
       round(avg(tip_amount),2)                              AS avg_tip,
       round(avg(tip_amount / NULLIF(fare_amount,0))*100, 1) AS avg_tip_pct
FROM sandbox.public.yellow_taxi
WHERE fare_amount > 0
GROUP BY payment_type
ORDER BY payment_type;

Tips show up almost entirely on payment_type = 1 (credit card) — cash tips aren't captured by the meter, a classic data-quality footnote.


Geospatial — which taxi zone is a point in?

The taxi zones ship as POLYGON geometry (taxi_zone_shapes.geom, WGS84). Use ST_CONTAINS with a ST_POINT(longitude, latitude) to do a point-in-polygon lookup — e.g. which zone covers Times Square (≈ -73.9855, 40.7580):

SELECT s.location_id,
       z.zone_name,
       z.borough
FROM sandbox.public.taxi_zone_shapes s
JOIN sandbox.public.taxi_zones       z ON s.location_id = z.location_id
WHERE ST_CONTAINS(s.geom, ST_POINT(-73.9855, 40.7580));

GDELT carries geospatial points too (gdelt_events.action_geo_point), so you can run the same kind of spatial predicate over world events.


NOAA GHCN-Daily — 371M weather observations

Which measurements are most common?

SELECT element,
       count(*) AS observations
FROM sandbox.public.ghcn_daily
GROUP BY element
ORDER BY observations DESC
LIMIT 10;

Precipitation (PRCP, ~109M), snowfall (SNOW, ~48M) and minimum temperature (TMIN, ~45M) lead.

Hottest readings on record

TMAX is stored in tenths of a degree Celsius, and q_flag is NULL when a value passed quality control (non-null values are failure codes):

SELECT station_id,
       obs_date,
       value / 10.0 AS tmax_celsius
FROM sandbox.public.ghcn_daily
WHERE element = 'TMAX'
  AND q_flag IS NULL
ORDER BY value DESC
LIMIT 10;

Approximate distinct stations (fast on 371M rows)

Exact COUNT(DISTINCT …) over hundreds of millions of rows is expensive; APPROX_COUNT_DISTINCT (HyperLogLog) answers in a fraction of the time:

SELECT APPROX_COUNT_DISTINCT(station_id) AS approx_stations
FROM sandbox.public.ghcn_daily;

GDELT events — 61M world events

Most active actors

SELECT actor1_name,
       count(*) AS events
FROM sandbox.public.gdelt_events
WHERE actor1_name IS NOT NULL
GROUP BY actor1_name
ORDER BY events DESC
LIMIT 10;

The United States, Russia, and "President" top the list — GDELT's actor names are extracted from news text, so generic role words appear alongside countries.

Tone and stability over time

avg_tone runs -100 (negative) to +100, and goldstein_scale rates each event's impact on stability from -10 to +10:

SELECT year,
       round(avg(avg_tone),2)       AS mean_tone,
       round(avg(goldstein_scale),2) AS mean_goldstein,
       count(*)                      AS events
FROM sandbox.public.gdelt_events
GROUP BY year
ORDER BY year;

A window function: cumulative event volume

SELECT year,
       count(*)                              AS events,
       sum(count(*)) OVER (ORDER BY year)    AS cumulative_events
FROM sandbox.public.gdelt_events
GROUP BY year
ORDER BY year;

Tips

Browse the full schemas on the datasets page, and see the SSO guide for connecting non-interactively.