Overview
A four-container application that scrapes the IRIS seismology feed every ten
minutes, persists each event in MongoDB, then ships new rows through Kafka to
a browser that paints them on a D3 world map in real time. The map carries the
last 100 earthquakes worldwide above magnitude 4.0; clicking a marker reveals
location, depth, magnitude, and time. The whole stack — producer, consumer,
Kafka, Zookeeper — comes up under one docker-compose up.
Why I built it
I wanted a working end-to-end change-data-capture pipeline that I'd written every hop of myself, not a tutorial that hides the broker behind a managed service. The interesting questions were the boring ones: what does the producer send when the database already has the row, who's responsible for deduplication, how does the browser stay in sync after a reconnect, what does "real-time" actually mean when the upstream feed only changes every ten minutes. Earthquakes were a convenient payload — public, geospatial, naturally event-shaped — but the project is about the seam between a database and a streaming bus, not about seismology.
How the pipeline works
A Node + Cheerio scraper hits the IRIS event list every ten minutes, parses the HTML table, and upserts new rows into MongoDB through Mongoose. The scrape interval is the slowest hop in the system and sets the upper bound on how fresh the map can be.
A second loop in the producer polls Mongo every ten seconds for rows that haven't been published yet, and pushes each one onto the
earthquakeKafka topic. Splitting scrape from publish means a restart of either side can't lose events — Mongo is the durable source of truth.The consumer subscribes to the Kafka topic, and for each message fans it out to every connected browser through Socket.io. The Kafka → Socket.io bridge is intentionally thin: no business logic, just a protocol translation from "broker topic" to "WebSocket room."
The browser holds a D3 world map and a fixed-length buffer of the last 100 events. Each socket message appends a marker, evicts the oldest, and re-binds the data join — D3's enter/update/exit pattern handles the animation without bespoke code.
Stack and shape
Why a broker
Kafka was the deliberate choice. A single producer feeding a single consumer
doesn't strictly need a broker; a Postgres LISTEN/NOTIFY would carry the
same payload at a fraction of the operational weight. The broker earns its
place the moment a second consumer wants the same stream — an alerter, an
archiver, a second visualization — without the producer learning about it.
Building the simple case on the broker means the not-yet-built second case
is a config change.
Why MongoDB
MongoDB handled the durable side because the upstream payload is naturally document-shaped (loosely-typed seismic event records with optional fields) and Mongoose gave me a schema without forcing migrations during the prototype phase. Express + EJS server-renders the producer's debug page; the consumer's map page is plain HTML with D3 doing the heavy lifting. Webpack bundles the client.
What I'd change
Picking this up again, the work would be on the seams between components — the parts that are easy to ignore when everything is on one machine.
- Idempotent publish. The producer relies on a "sent" flag in Mongo to avoid republishing. A crash between Kafka write and flag update sends the same event twice. The fix is a Kafka transactional producer plus a consumer-side dedup on event ID, not a tighter loop.
- Backfill on reconnect. A browser that drops the socket misses every event during the gap. The server should hold a small ring buffer and replay it on connect, keyed by the latest event ID the client saw.
- Schema for the wire format. Right now the message is whatever Mongoose serializes. A proper Avro or Protobuf schema in a registry would make the consumer survive a producer that adds a field.
- Health and lag metrics. Consumer lag, scrape success rate, time since last event — none of which the current stack exposes. A Prometheus endpoint on each container plus a Grafana dashboard would make the pipeline observable instead of merely working.
Repo
Source on GitHub. GPLv3.
