ociforge

← How-to guides

How the sandbox data was loaded: pipelines

Every row in the ociforge sandbox arrived through an Ocient data pipeline. This guide walks through the three real pipelines behind the datasets — they're a compact tour of how Ocient ingests and transforms data, including a couple of genuinely interesting transformations.

Anatomy of a pipeline

An Ocient pipeline is declared with CREATE PIPELINE and has three sections:

CREATE PIPELINE <name>
  SOURCE   ...            -- where the data is (S3, filesystem, Kafka, …)
  EXTRACT  FORMAT ...     -- how to parse it (CSV, delimited, Parquet, JSON, …)
INTO <table> SELECT ...   -- transformations mapping fields → columns

You START PIPELINE to run it (in BATCH or CONTINUOUS mode) and watch progress in sys.pipelines. The transformations in the SELECT are "identical to functions that already exist in the SQL dialect," and you can dry-run a pipeline with PREVIEW PIPELINE before loading a single row. The sandbox used three different source patterns:

1. NOAA weather — straight from public S3

371 million daily weather observations load directly from a public S3 bucket, with no download and no credentials — Ocient's SOURCE S3 with empty ACCESS_KEY_ID/SECRET_ACCESS_KEY uses anonymous access:

CREATE PIPELINE pipe_ghcn
  SOURCE S3 ENDPOINT 'https://s3.us-east-1.amazonaws.com'
            BUCKET 'noaa-ghcn-pds' ACCESS_KEY_ID '' SECRET_ACCESS_KEY '' OBJECT_KEY [ ... ]
  EXTRACT FORMAT CSV NUM_HEADER_LINES 1
INTO public.ghcn_daily
SELECT $1 AS station_id,
       TO_DATE($2, 'yyyyMMdd', 'java') AS obs_date,   -- text 'yyyyMMdd' → DATE
       $3 AS element,
       INT($4) AS value, ...;

The interesting bit is TO_DATE($2,'yyyyMMdd','java'): the source dates are bare yyyyMMdd strings, parsed to real DATE values (which then back the table's TimeKey) using a Java-style format pattern.

2. GDELT events — tab-delimited, with geospatial construction

The 61-million-row GDELT feed also streams directly from public S3 (gdelt-open-data), but it's a 57-column tab-separated format with no header — so the extract spells out the delimiters, and the transform does real work:

  EXTRACT FORMAT delimited RECORD_DELIMITER e'\n' FIELD_DELIMITER e'\t' NUM_HEADER_LINES 0
INTO public.gdelt_events
SELECT BIGINT($1) AS global_event_id,
       TO_DATE($2, 'yyyyMMdd', 'java') AS event_date,
       ...
       IF($54='', NULL, DOUBLE($54)) AS action_geo_lat,     -- empty string → NULL
       IF($55='', NULL, DOUBLE($55)) AS action_geo_long,
       IF($54='' OR $55='', NULL,
          ST_POINT(DOUBLE($55), DOUBLE($54))) AS action_geo_point;  -- build a POINT

Two things worth calling out:

3. NYC taxi — download once, then load from the filesystem

The 119-million-row taxi data is published as Parquet, so the sandbox downloads the files once to a persistent volume and then loads them with SOURCE FILESYSTEM:

CREATE PIPELINE pipe_taxi_lower
  SOURCE FILESYSTEM PREFIX '/mnt/ociforge-data/taxi/lower/' FILTER '*.parquet'
  EXTRACT FORMAT parquet
INTO public.yellow_taxi
SELECT INT($"VendorID")            AS vendor_id,
       $"tpep_pickup_datetime"     AS pickup_ts,
       $"tpep_dropoff_datetime"    AS dropoff_ts, ...;

Because Parquet is self-describing, fields are referenced by name ($"tpep_pickup_datetime") rather than by position — and the transform renames the source's CamelCase columns to tidy snake_case while casting types. There's also a real-world wrinkle: older taxi files don't have the airport_fee column that newer ones do, so the data is loaded as two file generations with slightly different SELECT lists into the same table.

Bonus: turning a shapefile into queryable polygons

The taxi-zone boundaries start life as an Esri shapefile in NY State Plane coordinates. A pre-processing step uses ogr2ogr to reproject it to WGS84 (EPSG:2263EPSG:4326) and explode multipolygons into single polygons (Ocient geometry is POINT/LINESTRING/POLYGON), then the pipeline loads the WKT text:

SELECT INT($2) AS location_id,
       ST_FORCECCW(ST_POLYGONFROMTEXT($1)) AS geom;

ST_POLYGONFROMTEXT parses the WKT into a POLYGON, and ST_FORCECCW fixes the winding order — polygon orientation matters for spatial containment tests.

Things to remember

For the full syntax see the Data Pipelines docs, and the datasets page for the resulting schemas.