How the sandbox data was loaded: pipelines
Every row in the ociforge sandbox arrived through an Ocient data pipeline. This guide walks through the three real pipelines behind the datasets — they're a compact tour of how Ocient ingests and transforms data, including a couple of genuinely interesting transformations.
Anatomy of a pipeline
An Ocient pipeline is declared with CREATE PIPELINE and has three sections:
CREATE PIPELINE <name>
SOURCE ... -- where the data is (S3, filesystem, Kafka, …)
EXTRACT FORMAT ... -- how to parse it (CSV, delimited, Parquet, JSON, …)
INTO <table> SELECT ... -- transformations mapping fields → columns
You START PIPELINE to run it (in BATCH or CONTINUOUS mode) and watch
progress in sys.pipelines. The transformations in the SELECT are "identical
to functions that already exist in the SQL dialect," and you can dry-run a
pipeline with PREVIEW PIPELINE before loading a single row. The sandbox used
three different source patterns:
1. NOAA weather — straight from public S3
371 million daily weather observations load directly from a public S3 bucket,
with no download and no credentials — Ocient's SOURCE S3 with empty
ACCESS_KEY_ID/SECRET_ACCESS_KEY uses anonymous access:
CREATE PIPELINE pipe_ghcn
SOURCE S3 ENDPOINT 'https://s3.us-east-1.amazonaws.com'
BUCKET 'noaa-ghcn-pds' ACCESS_KEY_ID '' SECRET_ACCESS_KEY '' OBJECT_KEY [ ... ]
EXTRACT FORMAT CSV NUM_HEADER_LINES 1
INTO public.ghcn_daily
SELECT $1 AS station_id,
TO_DATE($2, 'yyyyMMdd', 'java') AS obs_date, -- text 'yyyyMMdd' → DATE
$3 AS element,
INT($4) AS value, ...;
The interesting bit is TO_DATE($2,'yyyyMMdd','java'): the source dates are bare
yyyyMMdd strings, parsed to real DATE values (which then back the table's
TimeKey) using a Java-style format pattern.
2. GDELT events — tab-delimited, with geospatial construction
The 61-million-row GDELT feed also streams directly from public S3
(gdelt-open-data), but it's a 57-column tab-separated format with no
header — so the extract spells out the delimiters, and the transform does real
work:
EXTRACT FORMAT delimited RECORD_DELIMITER e'\n' FIELD_DELIMITER e'\t' NUM_HEADER_LINES 0
INTO public.gdelt_events
SELECT BIGINT($1) AS global_event_id,
TO_DATE($2, 'yyyyMMdd', 'java') AS event_date,
...
IF($54='', NULL, DOUBLE($54)) AS action_geo_lat, -- empty string → NULL
IF($55='', NULL, DOUBLE($55)) AS action_geo_long,
IF($54='' OR $55='', NULL,
ST_POINT(DOUBLE($55), DOUBLE($54))) AS action_geo_point; -- build a POINT
Two things worth calling out:
- NULL handling — many events have no coordinates, so empty fields become
NULLinstead of failing the load (IF($54='', NULL, …)). - Geospatial construction at load time —
ST_POINT(longitude, latitude)turns two text columns into a geospatialPOINT, so events are queryable with spatial predicates the moment they land (see the geospatial example).
3. NYC taxi — download once, then load from the filesystem
The 119-million-row taxi data is published as Parquet, so the sandbox
downloads the files once to a persistent volume and then loads them with
SOURCE FILESYSTEM:
CREATE PIPELINE pipe_taxi_lower
SOURCE FILESYSTEM PREFIX '/mnt/ociforge-data/taxi/lower/' FILTER '*.parquet'
EXTRACT FORMAT parquet
INTO public.yellow_taxi
SELECT INT($"VendorID") AS vendor_id,
$"tpep_pickup_datetime" AS pickup_ts,
$"tpep_dropoff_datetime" AS dropoff_ts, ...;
Because Parquet is self-describing, fields are referenced by name
($"tpep_pickup_datetime") rather than by position — and the transform renames
the source's CamelCase columns to tidy snake_case while casting types. There's
also a real-world wrinkle: older taxi files don't have the airport_fee column
that newer ones do, so the data is loaded as two file generations with
slightly different SELECT lists into the same table.
Bonus: turning a shapefile into queryable polygons
The taxi-zone boundaries start life as an Esri shapefile in NY State Plane
coordinates. A pre-processing step uses ogr2ogr to reproject it to WGS84
(EPSG:2263 → EPSG:4326) and explode multipolygons into single polygons
(Ocient geometry is POINT/LINESTRING/POLYGON), then the pipeline loads the
WKT text:
SELECT INT($2) AS location_id,
ST_FORCECCW(ST_POLYGONFROMTEXT($1)) AS geom;
ST_POLYGONFROMTEXT parses the WKT into a POLYGON, and ST_FORCECCW fixes the
winding order — polygon orientation matters for spatial containment tests.
Things to remember
SOURCE S3with empty keys = anonymous access to public buckets; no download.- Transformations are just SQL functions — dates (
TO_DATE), casts (INT/BIGINT/DOUBLE), conditionals (IF), and geospatial constructors (ST_POINT,ST_POLYGONFROMTEXT,ST_FORCECCW). - Parquet uses named field references; delimited/CSV use positional
(
$1,$2, …). - Use
PREVIEW PIPELINEto iterate on a transform before loading, and a pipeline can even fan one sourceINTOmultiple tables withWHERErouting.
For the full syntax see the Data Pipelines docs, and the datasets page for the resulting schemas.