Hacker News

4 hours ago by georgewfraser

This is a needlessly complex solution. You will get better performance, with simpler maintenance, by replicating everything into an appropriate analytical database (Snowflake and BugQuery are both good choices). Setting up multiple Postgres databases and linking them together with foreign data wrappers is interesting to blog about, but it’s an extremely roundabout way to solve this problem.

3 hours ago by chatmasta

It seems difficult to make the argument that replicating data to a warehouse is "simpler" than leaving the data in its original place (which could actually be a warehouse) and querying it through an FDW. In your scenario you need to maintain and monitor an ETL pipeline, pay storage costs for the warehouse, and likely pay bandwidth costs for moving your data into it. In the blog post scenario, the only cost is a Postgres instance and bandwidth for the data your queries return.
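
For reference, exposing a remote table over postgres_fdw is only a handful of statements. A minimal sketch (the server, host, and schema names here are made up):

```sql
-- Minimal postgres_fdw sketch (server, host, and schema names are made up).
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Point at the service's existing database instead of copying it.
CREATE SERVER orders_db
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'orders.internal', port '5432', dbname 'orders');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER orders_db
    OPTIONS (user 'reporting_ro', password 'secret');

-- Expose the remote tables locally; predicates and joins get pushed
-- down to the remote server where possible.
CREATE SCHEMA orders;
IMPORT FOREIGN SCHEMA public
    FROM SERVER orders_db
    INTO orders;
```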

There are many cases where replication to a warehouse makes sense, but there are also plenty where it doesn't. Our philosophy at Splitgraph, where we're building something very similar to what this blog post describes, is that querying should be cheap and easy in the experimentation phase. Analysts shouldn't need to ask engineers to set up a data pipeline just so they can query some data they might not even want. So why not start with a "data lake" (or "data mesh")? You can query any data connected to the mesh without any setup or waiting time. If you eventually decide that you don't want the tradeoffs of federation, then you can selectively warehouse only the data you need to optimize certain queries.

(I might go even further and suggest that the data industry has the idea of the "modern data stack" completely wrong. Replicating data to a warehouse is the ultimate act of centralization. Every other aspect of software is trending toward decentralization; it's the only real way to deal with increasing complexity and sprawl. It's only a matter of time before this paradigm shift hits the data stack too.)

3 hours ago by nunie123

I agree with much of what you said.

However, I'm disinclined to believe there will be a trend away from a centralized data warehouse (or data lake).

There is inherent value in having a single source of truth for analytics. With modern cloud tools abstracting away the complexity of distributed analytical databases, data warehousing is getting easier and more powerful.

It's true that there is added complexity in centralizing the data. But as the author of this article suggested, you're in a bad spot if your marketing team and your sales team can't agree on last month's revenue. I'm not sure how you'd solve that problem in an architecture where the data isn't getting centralized.

2 hours ago by michael_j_ward

> There is inherent value in having a single source of truth for analytics.

> I'm not sure how you'd solve that problem in an architecture where the data isn't getting centralized.

Isn't the entire point of this article that the author achieved a single source of truth without moving the data by using postgres's Foreign Data Wrappers?

2 hours ago by georgewfraser

Replicating a Postgres instance to a data warehouse takes ~1 hour to set up and costs ~$500/month, depending on the database size. The cost of the data warehouse is similar. If you can’t afford that cost, you’re not going to be able to afford an analytics team to make meaningful use of that data. In that scenario, you’re better off using Google Analytics + the occasional ad hoc query of your production DB (which you should only have one of; you are at way too small a scale to have multiple DBs).

Federated data systems, like what is described in this post, are a kind of “nerd trap.” They’re fascinating and people keep building them, but they have little real world utility. Everyone is either big enough where it’s worth it to build a real data warehouse, or small enough that they shouldn’t be messing around with complex data infrastructure.

3 hours ago by sa46

> decentralization; it's the only real way to deal with increasing complexity and sprawl

Decentralization can help with sprawl (aka scale), but it makes software more complex, so it’s not a way to tackle complexity. Decentralization is only tolerated because, at some point, it’s the only method left to scale. As an example, distributed databases are significantly more complex to build than a single-node database.

You’ve also ignored all of the downsides of using your OLTP database for OLAP queries. The most important is that ad hoc queries can impact production. In my experience, that cost dwarfs the cost of maintaining an ETL pipeline as soon as you have your first outage. Another downside is that with a federated, pull-based model you need to keep the mapping (FDW definition) up to date. In essence, the maintenance cost moves from an ETL pipeline to an FDW definition.

2 hours ago by chatmasta

Re: ad-hoc queries impacting production, you can mitigate this by pointing your data mesh to a read replica of the prod DB (which you can colocate with it to avoid bandwidth costs). Admittedly at that point you might be approaching a warehousing use case.

Re: maintaining the mapping. Sure that's a tradeoff. But I'd rather manage declarative FDW definitions than a programmatic ETL pipeline. Yes, handling schema changes is non-trivial (a naive starting point is periodic introspection of the remote table), but that's true in an ETL pipeline too. As you say, you're moving the cost from the ETL pipeline to the FDW definition. But what if we can lower the cost of the "FDW definition" (a.k.a. setting up and maintaining the data mesh)?
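
To make "periodic introspection" concrete, the crudest version is just re-importing the foreign schema on a schedule. A rough sketch (made-up names; assumes an existing postgres_fdw server):

```sql
-- Crude "periodic introspection": rebuild the foreign schema mapping on a
-- schedule (names are made up; orders_db is an existing postgres_fdw server).
DROP SCHEMA IF EXISTS orders CASCADE;
CREATE SCHEMA orders;
IMPORT FOREIGN SCHEMA public
    FROM SERVER orders_db
    INTO orders;
```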

From our perspective, at least with the product we're building, we think we can make a data mesh that's low maintenance enough that it makes sense as a first default before ETL. But the two aren't mutually exclusive. A data mesh can connect multiple warehouses just as easily as it can your production databases. For example, you could move data (ETL) from your data sources to your Snowflake and BigQuery warehouses, and then connect them both to your data mesh.

Also, a mesh can add value as a unified access layer that proxies queries and applies data governance rules (e.g. sharing data, access control, column masking, query whitelisting, rewriting, rate limiting, auditing, firewalling, etc.). It would be hard to do this in your warehouse, because as soon as you have another one, or there's some data that you can't ETL into it, you lose any benefits of its access layer. (e.g. maybe you have two warehouses because you haven't fully migrated to one, or for cost reasons, like storing Firebase clickstream data in BigTable that you can't move to Snowflake with the rest of your data. Now you need to maintain a separate set of accounts and ACLs for each warehouse.)

(And for what it's worth -- this is a bit off topic for the current thread -- we do recognize there is often a use case for warehousing when the tradeoffs of federation aren't worth it. We can actually act as a warehouse too, which makes it easy to start with a data mesh but have the option to bail out to warehousing for the specific data assets that need it. Ideally, you can just click a button that says something like "snapshot data" or "schedule query" or "replicate data source." To ingest and store the data, we can leverage the concept we introduced of "data images" for storing versioned snapshots of data in delta-compressed, content-addressable chunks of hot-swappable cstore_fdw files. You can create data images with a declarative, Dockerfile-like syntax called Splitfiles. They look something like FROM upstream:tag IMPORT { SELECT x, y FROM ... }. You get full provenance and can `sgr rebuild` an updated image only when its upstream changes, just like Docker.)

3 hours ago by redandblack

Same thesis for us. We're pushing to use replicated sources for a data mesh, and building around query and cache services. The thesis is that analysts know what they want, and operating only on production-ready reports takes too much away from the business. So, more power to analysts.

2 hours ago by rockostrich

> BugQuery

Is this a typo or some kind of joke about BigQuery being buggy?

Anyway, agreed on this being needlessly complex, and I stopped reading pretty much immediately when I saw the article continue to talk about postgres when it came to the "data lake". If you already have a postgres database backing every service and already have views defined for the data, then just set up exports of those views to BigQuery/Snowflake/Redshift/etc. All of them have out-of-the-box templates for doing exactly that.

2 hours ago by georgewfraser

Typo :)

3 hours ago by nunie123

I agree that BigQuery and Snowflake are very nice options for an analytical database.

I've used Postgres's FDWs for an analytical DB at a past job. We used them because we needed only minimal analytics for the projects and didn't want to deal with a more complex data pipeline solution. It sounds like the author was also trying to avoid a more sophisticated data pipeline for as long as possible.

3 hours ago by pupdogg

This is an overly complex solution to a problem we were able to solve with a simple VPS running ClickHouse as the backend and Grafana as the frontend. Our production DB is an Aurora MySQL instance, and we keep it lean by performing daily dumps of reporting-related data into a gzip-compressed CSV -> pushing it to S3 -> converting it to Parquet using AWS Glue -> loading it into ClickHouse. Data size for these specific reports is approx. 100k rows daily, partitioned by MONTH/YEAR. Overall cost: a $20/month VPS and approx. $15/month in AWS billing.
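
If it helps, the ClickHouse side boils down to a MergeTree table partitioned by month plus the s3() table function to load the Parquet files. A rough sketch (table and bucket names invented):

```sql
-- Rough sketch of the ClickHouse side (table/bucket names are invented).
CREATE TABLE reports
(
    report_date Date,
    customer_id UInt64,
    revenue     Decimal(18, 2)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(report_date)
ORDER BY (report_date, customer_id);

-- Load a daily Parquet export (credentials omitted here; the s3() table
-- function also accepts key/secret arguments for private buckets).
INSERT INTO reports
SELECT *
FROM s3('https://my-bucket.s3.amazonaws.com/reports/2021-01-15.parquet',
        'Parquet',
        'report_date Date, customer_id UInt64, revenue Decimal(18, 2)');
```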

5 hours ago by xyzzy_plugh

This is mostly what I would do at a small to medium sized startup. Everything seems really sane.

There are two potential flaws, though.

One is that while using a read replica for reporting works, there be dragons. There is a reasonable amount of tuning possible here. It's important not to conflate the reporting replica with a standby replica for prod. You also might consider asynchronous replication over synchronous replication for the reporting DB. Lastly, there are (tunable) limits to how much you can get away with, but long-running queries on either end can potentially cause queries to get cancelled should they hold up replication enough. In other words, it is still very valuable to have optimized, small queries on the reporting DB where possible.

Second is that this works fine until your data doesn't fit on an instance. Large RDS instances are not cheap, and at some large table size it begins to make sense to look at warehouse solutions, like Spark on S3/whatever or Redshift or Snowflake, which can scale out to match your data size. I'd be concerned to find a rat's nest of reporting database instances when a proper cluster should be used.

3 hours ago by 0xbadcafebee

Do you have a link to an article on the tuning (or could you draft one)?

3 hours ago by xyzzy_plugh

I don't, but the most important tunables are in the replication docs: https://www.postgresql.org/docs/current/runtime-config-repli...

I'd simply suggest reading this top-to-bottom very carefully, until you understand what each option does. That should in turn help you understand how and where replication impacts the source and the sinks. The docs can be terse, but it's rare for anything to be omitted -- read with care.
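
For a head start, the knobs I'd look at first on the reporting replica are along these lines (illustrative values only, not recommendations):

```sql
-- On the reporting replica: trade replication lag for query survival.
-- Values are illustrative, not recommendations.
ALTER SYSTEM SET max_standby_streaming_delay = '30min'; -- let long reports delay WAL replay instead of being cancelled
ALTER SYSTEM SET hot_standby_feedback = on;             -- stop the primary vacuuming rows the replica still needs (costs bloat on the primary)
SELECT pg_reload_conf();
```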

3 hours ago by gizmodo59

For my home projects I generate Parquet files (columnar and very well suited for DW-like queries) with pyarrow and use Dremio (direct SQL on the data lake): https://github.com/dremio/dremio-oss (https://www.dremio.com/on-prem/) to query them (on MinIO, local disk, or S3), and Apache Superset for quick charts or dashboards.

5 hours ago by greenbcj

This seems more like “build your own data warehouse” than data lake.

4 hours ago by andrewflnr

Serious question: what's the difference? I've seen both of these terms a lot but never with a concrete definition. I get the impression neither one refers to a terribly precise concept.

4 hours ago by dijksterhuis

Depends who you ask. Traditionally speaking:

# Data lake

Data is stored en masse with no schema applied; either unstructured or structured data can be dumped straight into the lake, or it can be transformed and then dumped in. It turns into a data swamp when it becomes unusable due to staleness or complexity.

Data lakes are basically an AWS S3 bucket business users can access and (attempt to) do reporting on.

# Data warehouse

Heavily structured schema applied to data used in reporting; usually defines the single point of truth for business purposes. Uses a star schema model (if you follow the Kimball [0] methodology) to create dimension tables used to filter and aggregate raw measurements from the central fact tables (which contain your actual measures, like £ made on 1 sale).
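
A toy example (invented tables) of what that looks like in SQL:

```sql
-- Toy star schema (invented tables): a central fact table whose raw
-- measures get filtered/aggregated via the dimension tables.
CREATE TABLE dim_date    (date_key INT PRIMARY KEY, calendar_date DATE, month INT, year INT);
CREATE TABLE dim_product (product_key INT PRIMARY KEY, product_name TEXT, category TEXT);

CREATE TABLE fact_sales (
    date_key    INT REFERENCES dim_date,
    product_key INT REFERENCES dim_product,
    amount      NUMERIC(18, 2)  -- the actual measure, e.g. £ made on 1 sale
);

-- "Revenue by product category for November 2020" becomes a join + group by:
SELECT p.category, SUM(f.amount) AS revenue
FROM fact_sales f
JOIN dim_date d    ON d.date_key = f.date_key
JOIN dim_product p ON p.product_key = f.product_key
WHERE d.year = 2020 AND d.month = 11
GROUP BY p.category;
```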

Kimball and Inmon [1] philosophies come with their own benefits and trade-offs. See the bottom of [2].

Edit: got the methodologies the wrong way round with respect to initial costs; the linked article has a useful table that I didn't see.

Data warehouses have a very concrete definition and are usually implemented via Kimball's or Inmon's method. When I've worked with them they've become the bastion of business reporting (excel users love a pivot table).

---

Just to confuse matters, there's also the data vault: https://en.m.wikipedia.org/wiki/Data_vault_modeling

0: https://en.m.wikipedia.org/wiki/Ralph_Kimball

1: https://en.m.wikipedia.org/wiki/Bill_Inmon

2: https://www.zentut.com/data-warehouse/kimball-and-inmon-data...

24 minutes ago by laichzeit0

Disagree with the swamp part. See it as a staging area for the warehouse and data science. If you use your data lake as the source for your warehouse, then you’re forced to keep it clean. Also, for data science use cases you need access to both the raw data and what's in the warehouse, which is much easier if the data lake can be used, as opposed to essentially building one anyway.

4 hours ago by milkytron

Did a quick search because I was curious, this was the first result:

> Data lakes and data warehouses are both widely used for storing big data, but they are not interchangeable terms. A data lake is a vast pool of raw data, the purpose for which is not yet defined. A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose.

2 hours ago by andrewflnr

In my defense, when I did my googling a while back, it was specifically about data lakes, not the comparison. :)

4 hours ago by contravariant

Most people I've met use 'datalake' to refer to a (categorised) collection of otherwise unprocessed data.

A data-warehouse is typically somewhat more structured and doesn't just collect data but also combines and links data from multiple sources. Typically with the goal of creating a set of tables that you can use for reporting without needing to know all the intricate details of how the source-data is linked.

A data-warehouse can be based on a datalake. You could also make a data-warehouse without first building a datalake, but keeping the datalake part separate allows for better separation of concerns. You can also have a datalake without building a data-warehouse on top of it; it depends on what you want to use it for.

4 hours ago by waynesonfire

For starters, you won't hit the front page with one.

4 hours ago by fouc

You're more likely to go fishing in a data lake. Nobody fishes in a warehouse.

4 hours ago by de6u99er

This

3 hours ago by 0xbadcafebee

> When the team was small and had to grow fast, no one in the tech team took the time to build, or even think of, a data processing platform.

Good! This is much better than building something well before you need it. So when do you know you'll need it? By having each team track trends in metrics for their services/work and predict when limits will be exceeded. (This used to be mandatory back when we had to buy gear a year or two ahead.) This catches persistently increasing data sizes as well as other issues. Each team should track their metrics, write a description of what each metric's effect means for the whole product, and forward them to a product owner (along with estimates of when limits will be exceeded).

> do try to set up some streaming replication on a dedicated reporting database server for quasi-live data (or set up automated regular dump & restore process if you don't need live data)

Both of these are potentially hazardous to performance, security, regulatory requirements, customer contract requirements, etc. Replication can literally halt your production database, as can ETL, and where it goes and how it's managed is just as important as for the production DB. So look at your particular use case and design the solution just as carefully as if you were giving direct access to production.

As for the structure of it all, once you start adding outside data, you'll find it's easier to architect your solution to have multiple tiers of extracted data in different pipelines, and to expose them each to analysis independently. You can make all of them eventually land in a dedicated database, but exposing them individually lets you later build purpose-fit solutions for just a subset of the data without having to constantly mutate one "end state" database (which may become vastly more complex than your production database).

Btw, this kind of work is called Business Intelligence (https://en.m.wikipedia.org/wiki/Business_intelligence)

4 hours ago by chatmasta

It's really cool to see these techniques in the wild, and it also feels encouraging to us, as we're doing something very similar at Splitgraph [0] to implement our "Data Delivery Network" [1]. Recently we've started calling Splitgraph a "Data Mesh" [2]. As long as we have a plugin [3] for a data source, users can connect external data sources to Splitgraph and make them addressable alongside all the other data on the platform, including versioned snapshots of data called data images. [4] So you can `SELECT FROM namespace/repo:tag`, where `tag` can refer to an immutable version of the data, or e.g. `live` to route to a live external data source via FDW. So far we have plugins for Snowflake, CSV in S3 buckets, MongoDB, Elasticsearch, Postgres, and a few others, like Socrata data portals (which we use to index 40k open public datasets).
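
Concretely, from any Postgres client the addressing looks something like this (the namespace, repo, and table names here are invented):

```sql
-- Illustrative only (namespace/repo/table names invented): a versioned
-- image and a live FDW-backed source are addressed the same way, as a schema.
SELECT candidate, SUM(amount) AS total_donations
FROM "acme/campaign-finance:latest".donations
GROUP BY candidate
ORDER BY total_donations DESC
LIMIT 10;
```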

Our goal with Splitgraph is to provide a single interface to query and discover data. Our product integrates the discovery layer (a data catalog) with the query layer (a Postgres compatible proxy to data sources, aka a "data mesh" or perhaps "data lake"). This way, we improve both the catalog and the access layer in ways that would be difficult or impossible as separate products. The catalog can index live data without "drift" problems. And since the query layer is a Postgres-compatible proxy, we can apply data governance rules at query time that the user defines in the web catalog (e.g. sharing data, access control, column masking, query whitelisting, rewriting, rate limiting, auditing, firewalling, etc.).

We like to use GitLab's strategy as an analogy. GitLab may not have the best CI, the best source control, the best Kubernetes deploy orchestration, but by integrating them all together in one platform, they have a multiplicative effect on the platform itself. We think the same logic can apply to the data stack. In our vision of the world, a "data mesh" integrated with a "data catalog" can augment or eventually replace various complicated ETL and warehousing workflows.

P.S. We're hiring immediately for all-remote Senior Software Engineer positions, frontend and backend [5].

P.P.S. We also have a private beta program where we can deploy a full Splitgraph stack onto either self-hosted or managed infrastructure. If you want that, get in touch. We'll probably be in beta for 12-18 months.

[0] https://www.splitgraph.com

[1] We talked about all this in depth on a podcast: https://softwareengineeringdaily.com/2020/11/06/splitgraph-d...

[2] https://martinfowler.com/articles/data-monolith-to-mesh.html

[3] https://www.splitgraph.com/blog/foreign-data-wrappers

[4] https://www.splitgraph.com/docs/concepts/images

[5] Job posting: https://www.notion.so/splitgraph/Splitgraph-is-Hiring-25b421...

8 minutes ago by pugworthy

Curious why this was downvoted. Was it because of the PS and PPS?
