4 years ago by pupdogg
This is an overly complex solution to a problem we were able to solve with a simple VPS running ClickHouse as the backend and Grafana as the frontend. Our production DB is an Aurora MySQL instance, and we keep it lean by performing daily dumps of reporting-related data into a gzip-compressed CSV -> pushing it to S3 -> converting it to Parquet with AWS Glue -> bringing it into ClickHouse. Data size for these specific reports is approx. 100k rows daily and is partitioned by MONTH/YEAR. Overall cost: $20/month VPS and approx. $15/month in AWS billing.
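For anyone curious, the ClickHouse end of a pipeline like this can look roughly like the following sketch (table, column, bucket, and credential names are hypothetical placeholders, not an actual setup):

    -- Reporting table partitioned by month, per the MONTH/YEAR scheme above.
    -- All names below are placeholders.
    CREATE TABLE reports.daily_orders
    (
        order_date Date,
        order_id   UInt64,
        revenue    Decimal(12, 2)
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(order_date)
    ORDER BY (order_date, order_id);

    -- Load the Parquet files that Glue wrote to S3 via ClickHouse's s3() table function.
    INSERT INTO reports.daily_orders
    SELECT *
    FROM s3('https://my-bucket.s3.amazonaws.com/reports/*.parquet',
            'AWS_ACCESS_KEY', 'AWS_SECRET_KEY', 'Parquet');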
4 years ago by twotwotwo
Appreciate the info on real-world use here. A data warehouse would be sort of interesting at work but isn't urgently needed (reporting from MySQL works, just without the nifty speedups data warehouses can achieve), and we're somewhere between the size you're talking about and the really-big-data use cases I tend to see blogged about more often. I am curious about ClickHouse, and a lower-cost deployment might make it worthwhile when it wouldn't be otherwise.
4 years ago by ekianjo
You do not even need AWS: use MinIO as an S3-compatible store and NiFi to convert files to Parquet once they land in MinIO... No dependency on AWS.
4 years ago by antman
MinIO and NiFi require lots of resources for themselves. You're better off using pure Python, and if you want something lightweight and visually pleasing, Mara [0] or Dagster with Dagit [1] will do the job.
4 years ago by ekianjo
Yes, MinIO and NiFi are not lightweight, but if you're after robustness, they have been used in large environments and proven to scale.
4 years ago by georgewfraser
This is a needlessly complex solution. You will get better performance, with simpler maintenance, by replicating everything into an appropriate analytical database (Snowflake and BigQuery are both good choices). Setting up multiple Postgres databases and linking them together with foreign data wrappers is interesting to blog about, but it's an extremely roundabout way to solve this problem.
4 years ago by chatmasta
It seems difficult to make the argument that replicating data to a warehouse is "simpler" than leaving the data in its original place (which could actually be a warehouse) and querying it through an FDW. In your scenario you need to maintain and monitor an ETL pipeline, pay storage costs for the warehouse, and likely pay bandwidth costs for moving your data into it. In the blog post scenario, the only cost is a Postgres instance and bandwidth for the data your queries return.
There are many cases where replication to a warehouse makes sense, but there are also plenty where it doesn't. Our philosophy at Splitgraph, where we're building something just like this blog post, is that querying should be cheap and easy in the experimentation phase. Analysts shouldn't need to ask engineers to set up a data pipeline just so they can query some data they might not even want. So why not start with a "data lake" (or "data mesh")? You can query any data connected to the mesh without any setup or waiting time. If you eventually decide that you don't want the tradeoffs of federation, then you can selectively warehouse only the data you need to optimize certain queries.
(I might go even further and suggest that the data industry has the idea of the "modern data stack" completely wrong. Replicating data to a warehouse is the ultimate act of centralization. Every other aspect of software is trending toward decentralization; it's the only real way to deal with increasing complexity and sprawl. It's only a matter of time before this paradigm shift hits the data stack too.)
4 years ago by nunie123
I agree with much of what you said.
However, I'm disinclined to believe there will be a trend away from a centralized data warehouse (or data lake).
There is inherent value in having a single source of truth for analytics. With modern cloud tools abstracting away the complexity of distributed analytical databases, data warehousing is getting easier and more powerful.
It's true that there is added complexity in centralizing the data. But as the author of this article suggested, you're in a bad spot if your marketing team and your sales team can't agree on last month's revenue. I'm not sure how you'd solve that problem in an architecture where the data isn't getting centralized.
4 years ago by michael_j_ward
> There is inherent value in having a single source of truth for analytics.
> I'm not sure how you'd solve that problem in an architecture where the data isn't getting centralized.
Isn't the entire point of this article that the author achieved a single source of truth without moving the data by using postgres's Foreign Data Wrappers?
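For anyone who hasn't used them, the setup looks roughly like this (server, schema, and credential names here are hypothetical, not taken from the article):

    -- On the central reporting Postgres instance. Names below are placeholders.
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    -- Register a remote production database and map a local role to a remote one.
    CREATE SERVER sales_db FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'sales-replica.internal', port '5432', dbname 'sales');
    CREATE USER MAPPING FOR reporting_user SERVER sales_db
        OPTIONS (user 'readonly', password 'secret');

    -- Pull in remote table definitions without copying any rows.
    CREATE SCHEMA sales;
    IMPORT FOREIGN SCHEMA public LIMIT TO (orders, customers)
        FROM SERVER sales_db INTO sales;

    -- Queries run against local foreign tables; rows are fetched from the remote side.
    SELECT c.region, sum(o.total) AS revenue
    FROM sales.orders o
    JOIN sales.customers c ON c.id = o.customer_id
    GROUP BY c.region;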
4 years ago by georgewfraser
Replicating a Postgres instance to a data warehouse takes ~1 hour to set up and costs ~$500/month, depending on the database size. The cost of the data warehouse is similar. If you can't afford that cost, you're not going to be able to afford an analytics team to make meaningful use of that data. In that scenario, you're better off using Google Analytics + the occasional ad hoc query of your production DB (which you should only have one of; you are at way too small a scale to have multiple DBs).
Federated data systems, like what is described in this post, are a kind of "nerd trap." They're fascinating and people keep building them, but they have little real-world utility. Everyone is either big enough that it's worth it to build a real data warehouse, or small enough that they shouldn't be messing around with complex data infrastructure.
4 years ago by sa46
> decentralization; it's the only real way to deal with increasing complexity and sprawl
Decentralization can help with sprawl (aka scale), but it makes software more complex, so it's not a way to tackle complexity. Decentralization is only tolerated because, at some point, it's the only method left to scale. As an example, distributed databases are significantly more complex to build than a single-node database.
You've also ignored all of the downsides of using your OLTP database for OLAP queries. The most important is that ad hoc queries can impact production. In my experience, that cost dwarfs the cost of maintaining an ETL pipeline as soon as you have your first outage. Another downside is that with a federated, pull-based model you need to keep the mapping (the FDW definition) up to date. In essence, the maintenance cost moves from an ETL pipeline to an FDW definition.
4 years ago by chatmasta
Re: ad-hoc queries impacting production, you can mitigate this by pointing your data mesh to a read replica of the prod DB (which you can colocate with it to avoid bandwidth costs). Admittedly at that point you might be approaching a warehousing use case.
Re: maintaining the mapping. Sure that's a tradeoff. But I'd rather manage declarative FDW definitions than a programmatic ETL pipeline. Yes, handling schema changes is non-trivial (a naive starting point is periodic introspection of the remote table), but that's true in an ETL pipeline too. As you say, you're moving the cost from the ETL pipeline to the FDW definition. But what if we can lower the cost of the "FDW definition" (a.k.a. setting up and maintaining the data mesh)?
From our perspective, at least with the product we're building, we think we can make a data mesh that's low maintenance enough that it makes sense as a first default before ETL. But the two aren't mutually exclusive. A data mesh can connect multiple warehouses just as easily as it can your production databases. For example, you could move data (ETL) from your data sources to your Snowflake and BigQuery warehouses, and then connect them both to your data mesh.
Also, a mesh can add value as a unified access layer that proxies queries and applies data governance rules (e.g. sharing data, access control, column masking, query whitelisting, rewriting, rate limiting, auditing, firewalling, etc.). It would be hard to do this in your warehouse, because as soon as you have another one, or there's some data that you can't ETL into it, you lose any benefits of its access layer. (E.g. maybe you have two warehouses because you haven't fully migrated to one, or for cost reasons, like storing Firebase clickstream data in Bigtable that you can't move to Snowflake with the rest of your data. Now you need to maintain two sets of accounts and ACLs, one per warehouse.)
(And for what it's worth -- this is a bit off topic for the current thread -- we do recognize there is often a use case for warehousing when the tradeoffs of federation aren't worth it. We can actually act as a warehouse too, which makes it easy to start with a data mesh but have the option to bail out to warehousing for the specific data assets that need it. Ideally, you can just click a button that says something like "snapshot data" or "schedule query" or "replicate data source." To ingest and store the data, we can leverage the concept we introduced of "data images" for storing versioned snapshots of data in delta-compressed, content-addressable chunks of hot-swappable cstore_fdw files. You can create data images with a declarative, Dockerfile-like syntax called Splitfiles; they look something like FROM upstream:tag IMPORT { SELECT x, y FROM ... }. You get full provenance and can `sgr rebuild` an updated image only when its upstream changes, just like Docker.)
4 years ago by redandblack
Same thesis for us. We're pushing to use replicated sources for a data mesh and building around query and cache services. The thesis is that analysts know what they want, and operating only on production-ready reports takes too much away from the business. So, more power to analysts.
4 years ago by glogla
This is fundamentally a data virtualisation approach - even though it is achieved with streaming replicas and FDWs instead of traditional data replication tools.
It is probably more complex than replicating a single database to Snowflake, but one huge benefit of the described solution is that it's real-time - and getting real-time replication into a data warehouse is a very different challenge.
It allows you to create both batch and real-time data products just by creating views or materialized views, which is pretty simple and neat. You also get real-time ad-hoc querying, which is very valuable.
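To make that concrete, a minimal Postgres sketch (schema and column names are hypothetical, and the foreign tables are assumed to already exist):

    -- A "batch" data product: a rollup over foreign tables, refreshed on a schedule.
    -- Names are placeholders.
    CREATE MATERIALIZED VIEW reporting.daily_revenue AS
    SELECT order_date, sum(total) AS revenue
    FROM sales.orders          -- a foreign table served over postgres_fdw
    GROUP BY order_date;

    -- CONCURRENTLY needs a unique index, but lets reads continue during a refresh.
    CREATE UNIQUE INDEX ON reporting.daily_revenue (order_date);
    REFRESH MATERIALIZED VIEW CONCURRENTLY reporting.daily_revenue;

    -- A "real-time" data product is just a plain view over the same foreign tables.
    CREATE VIEW reporting.revenue_today AS
    SELECT sum(total) AS revenue
    FROM sales.orders
    WHERE order_date = current_date;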
The issue with this approach is that it keeps the data in a row-based format, so beyond a certain size it isn't going to be practical. For the "daily report" use case it's probably fine for a while, but if the source systems grow to terabyte size, you will be spending a lot of time waiting on Postgres to do its thing.
So as the data grows, you might first lose the real-time ad-hoc querying (if you ever had it), then the real-time reporting, and then the daily reporting.
But maybe the data will never grow that much and everything will be fine for a long time? Maybe this gets you where you need to be for a few years, and who knows what data platforms will look like in a few years.
4 years ago by rualca
> You will get better performance, with simpler maintenance, by replicating everything into an appropriate analytical database
If you intentionally get rid of your original data, you're saying goodbye to provenance tracking and opening the door to losing information with each conversion you believe makes things "simpler to maintain".
4 years ago by nunie123
I agree that BigQuery and Snowflake are very nice options for an analytical database.
I've used Postgres's FDWs for an analytical DB at a past job. The reason we used them was because we needed minimal analytics for the projects, and didn't want to deal with a more complex data pipeline solution. It sounds like the author was also trying to avoid a more sophisticated data pipeline solution for as long as possible.
4 years ago by xyzzy_plugh
This is mostly what I would do at a small to medium sized startup. Everything seems really sane.
There are two potential flaws, though.
One is that while using a read replica for reporting works, there be dragons. There is a reasonable amount of tuning possible here. It's important not to conflate the reporting replica with a standby replica for prod. You also might consider asynchronous replication over synchronous replication for the reporting DB. Lastly, there are (tunable) limits to how much you can get away with: long-running queries on either end can get cancelled should they hold up replication enough. In other words, it is still very valuable to have optimized, small queries on the reporting DB where possible.
Second is that this works fine until your data doesn't fit on an instance. Large RDS instances are not cheap, and at some large table size it begins to make sense to look at warehouse solutions, like Spark on S3/whatever, Redshift, or Snowflake, which can scale out to match your data size. I'd be concerned to find a rat's nest of reporting database instances where a proper cluster should be used.
4 years ago by 0xbadcafebee
Do you have a link to an article on the tuning (or could you draft one)?
4 years ago by xyzzy_plugh
I don't, but the most important tunables are in the replication docs: https://www.postgresql.org/docs/current/runtime-config-repli...
I'd simply suggest reading this top-to-bottom very carefully, until you understand what each option does. That should in turn help you understand how and where replication impacts the source and the sinks. The docs can be terse, but it's rare for anything to be omitted -- read with care.
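For a concrete starting point, the two settings that tend to matter most on a reporting replica look something like this (a sketch, not a recommendation; the right values depend entirely on the workload):

    -- Run on the reporting replica, then reload the config. Values are examples only.
    -- hot_standby_feedback stops the primary from vacuuming away rows a long
    -- report still needs (at the cost of some extra bloat on the primary).
    ALTER SYSTEM SET hot_standby_feedback = on;
    -- How long WAL replay will wait before cancelling conflicting replica queries.
    ALTER SYSTEM SET max_standby_streaming_delay = '120s';
    SELECT pg_reload_conf();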
4 years ago by snidane
It seems the article suggests piping around and querying tabular data in postgres. I don't see a need for the data lake part.
Data warehouses are used to store and process anything that looks like a structured table, or nested tables, so-called semi-structured data.
Data lakes are for everything else. Your SQL warehouse can't process, or isn't ergonomic for processing:
- purchased data, shared as a zip archive of csv files
- 10 level deep nested json files coming from api calls
- html files from webscrapes
- contents of ftp containing 100s of various csv and other files
- array data used for machine learning algorithms
- pickled python ml models
- yaml configs
- pdf documents such as data dictionaries
- materialized views over raw data
Besides your DWH, you need to have a storage layer where you store these files and raw data. This is the main reason why companies have data lake projects. Without some centralised oversight and discipline it only results in a big mess. Note that the centralisation doesn't have to be company-wide; each team or department can maintain their own data lake standard. The more centralised you do it, the more economy of scale you get, but the harder it gets to enforce and maintain, with a higher chance of turning into a mess again. I think the Data Mesh concept is proposing structuring your org as a bunch of data-producing teams, each maintaining their own data lake, instead of having one huge ass lake mess in the middle.
Tools like delta.io and Databricks are giving data lakes the full capabilities of a data warehouse, so the difference between a data lake and a data warehouse is diminishing. These days you can get away without a dedicated DWH and just store everything in a blob store, plugging in short-lived processing engines as you wish, without vendor lock-in.
4 years ago by gizmodo59
For my home projects I generate Parquet files (columnar and very well suited for DW-like queries) with pyarrow and use Dremio (direct SQL on the data lake): https://github.com/dremio/dremio-oss (https://www.dremio.com/on-prem/) to query them (on MinIO, local disk, or S3), plus Apache Superset for quick charts and dashboards.
4 years ago by unixhero
What does Parquet do for your problems? I am trying to learn when I might need it.
4 years ago by gizmodo59
There are several advantages compared to CSV, JSON, or other non-columnar formats for analytics workloads, which are typically not transactional.
https://blog.openbridge.com/how-to-be-a-hero-with-powerful-p... https://stackoverflow.com/a/36831549
And with projects like Apache Iceberg you can also have ACID transactions which will make it easy to update or delete just using SQL. This really opens up the separation of compute and data and you can use any engine you want (spark, drill, hive, impala, athena, dremio, redshift spectrum etc) on top of your files.
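For example, with a table registered in an Iceberg catalog, an engine such as Spark SQL lets you run row-level DML directly against files on object storage. A rough sketch (table and column names are just placeholders):

    -- Row-level deletes and updates against an Iceberg table on object storage.
    -- Table and column names below are hypothetical.
    DELETE FROM lake.events
    WHERE event_date < DATE '2020-01-01';

    UPDATE lake.events
    SET country = 'GB'
    WHERE country = 'UK';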
4 years ago by greenbcj
This seems more like "build your own data warehouse" than a data lake.
4 years ago by andrewflnr
Serious question: what's the difference? I've seen both of these terms a lot but never with a concrete definition. I get the impression neither one refers to a terribly precise concept.
4 years ago by dijksterhuis
Depends who you ask. Traditionally speaking:
# Data lake
Data is stored en masse with no schema applied; either unstructured or structured data can be dumped straight into the lake, or it can be transformed and then dumped in. It turns into a data swamp when it becomes unusable due to staleness or complexity.
Data lakes are basically an AWS S3 bucket business users can access and (attempt to) do reporting on.
# Data warehouse
A heavily structured schema applied to data used in reporting; usually defines the single point of truth for business purposes. Uses a star schema model (if you follow the Kimball [0] methodology) to create dimension tables used to filter and aggregate raw measurements from the central fact tables (which contain your actual measures, like the £ made on 1 sale).
The Kimball and Inmon [1] philosophies come with their own benefits and trade-offs. See the bottom of [2].
Edit: got the methodologies the wrong way round with respect to initial costs; the linked article has a useful table that I didn't see.
Data warehouses have a very concrete definition and are usually implemented via Kimball's or Inmon's method. When I've worked with them they've become the bastion of business reporting (excel users love a pivot table).
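For illustration, a minimal Kimball-style star schema sketch (table and column names are made up):

    -- Dimension tables describe the who/what/when; the fact table holds the measures.
    -- All names are placeholders.
    CREATE TABLE dim_date    (date_key INT PRIMARY KEY, calendar_date DATE, month INT, year INT);
    CREATE TABLE dim_product (product_key INT PRIMARY KEY, sku TEXT, category TEXT);

    CREATE TABLE fact_sales (
        date_key    INT REFERENCES dim_date,
        product_key INT REFERENCES dim_product,
        quantity    INT,
        revenue_gbp NUMERIC(12, 2)   -- the "£ made on 1 sale" measure
    );

    -- A typical report filters and aggregates the facts by dimension attributes.
    SELECT d.year, d.month, p.category, sum(f.revenue_gbp) AS revenue
    FROM fact_sales f
    JOIN dim_date d    ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.year, d.month, p.category;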
---
Just to confuse matters, there's also the data vault: https://en.m.wikipedia.org/wiki/Data_vault_modeling
0: https://en.m.wikipedia.org/wiki/Ralph_Kimball
1: https://en.m.wikipedia.org/wiki/Bill_Inmon
2: https://www.zentut.com/data-warehouse/kimball-and-inmon-data...
4 years ago by laichzeit0
Disagree with the swamp part. See it as a staging area for the warehouse and for data science. If you use your data lake as the source for your warehouse, then you're forced to keep it clean. Also, for data science use cases you need access to the raw data as well as what's in the warehouse; that's much easier if the data lake can be used, as opposed to essentially building one anyway.
4 years ago by contravariant
Most people I've met use 'datalake' to refer to a (categorised) collection of otherwise unprocessed data.
A data-warehouse is typically somewhat more structured and doesn't just collect data but also combines and links data from multiple sources. Typically with the goal of creating a set of tables that you can use for reporting without needing to know all the intricate details of how the source-data is linked.
A data-warehouse can be based on a datalake. You could also make a data-warehouse without first building a datalake, but keeping the datalake part separate allows for better separation of concerns. You can also have a datalake without building a data-warehouse on top of it; it depends on what you want to use it for.
4 years ago by milkytron
Did a quick search because I was curious, this was the first result:
> Data lakes and data warehouses are both widely used for storing big data, but they are not interchangeable terms. A data lake is a vast pool of raw data, the purpose for which is not yet defined. A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose.
4 years ago by andrewflnr
In my defense, when I did my googling a while back, it was specifically about data lakes, not the comparison. :)
4 years ago by waynesonfire
For starters, you won't hit the front page with one.
4 years ago by 1996
Some people want to deploy cool technology to hit the frontpage
Other people want to deploy robust, tried-and-tested solutions that won't break in mysterious ways, just to make money.
I side with the latter.
4 years ago by fouc
You're more likely to go fishing in a data lake. Nobody fishes in a warehouse.
4 years ago by de6u99er
This
4 years ago by throwaway346434
I feel some of the negative comments miss the point, at least as a way of structuring reporting extracts and presenting them in an easy-to-maintain way for services: with an approach like this you are signing a contract with the data team and taking on their concerns. Your unit tests fail if you are about to break the contract with a change, and you discover it up front, rather than after rolling out a new service that changes a definition unexpectedly.
4 years ago by unixhero
Thanks for giving back to the community in the form of these meditations on Postgres, real-world business-scenario analytics, and lessons learned.