athanassios's comments | Hacker News

Is it possible? Or, better, how is that possible? Although graph databases and knowledge bases are becoming increasingly popular for managing and querying complex interconnected data, one has to consider the trade-offs they present in terms of scalability, performance, and maintenance. Most importantly, rule specification and application must be supported for business logic, data validation and quality, security and access control, and automation and inference, among other uses.

In the last decade, one approach that has gained traction in addressing the challenges of graph data management is leveraging Datalog to cope with the complexity of graph data modeling, data transformations, and intricate queries. Since it is a declarative logic programming language, it also offers the ability to express rules, relationships, and constraints in a concise and intuitive manner. But it is only recently that production systems based on *relational knowledge graph data models with analytics capabilities*, powered by elevated versions of Datalog, have appeared on the scene.
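To make the declarative flavour concrete, here is a minimal sketch (not tied to any particular Datalog engine; all names are illustrative) of the classic transitive-closure rules `path(X, Y) :- edge(X, Y).` and `path(X, Z) :- path(X, Y), edge(Y, Z).`, evaluated with semi-naive fixpoint iteration over plain Python sets:

```python
def transitive_closure(edges):
    """Compute path/2 from edge/2 facts via semi-naive fixpoint iteration."""
    path = set(edges)          # path(X, Y) :- edge(X, Y).
    delta = set(edges)         # facts derived in the previous round
    while delta:
        # Join only the *new* facts against edge/2 (the semi-naive optimisation:
        # old facts joined with old facts cannot produce anything new).
        new = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2}
        delta = new - path     # keep only facts not derived before
        path |= delta
    return path

edges = {("a", "b"), ("b", "c"), ("c", "d")}
print(sorted(transitive_closure(edges)))
# [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
```

The two rules say *what* a path is; the engine decides *how* to iterate to a fixpoint, which is exactly the property that makes Datalog attractive for graph workloads.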

Two of them have drawn my attention: Relational AI and CozoDB. The former is a proprietary commercial product that runs on top of the Snowflake columnar DBMS; the latter is an excellent open-source, embeddable, production-ready DBMS with several storage engines, which I spent some time investigating and exploring. The bottom line is that these relational graph Datalog systems cover a huge gap between the SQL, noSQL, newSQL, and GQL query languages on one side and Prolog/Lisp logic programming languages on the other. According to Ziyang Hu, creator of CozoDB, the "composability" of Datalog is the main reason he adopted it as the query language, and he is not alone.

In my opinion, the main obstacle to broad adoption of Datalog-based systems for graph analytics and traversals is their performance on big datasets. Is it possible to utilize vectorized parallel execution, either in-memory with a fast dataframe structure like Polars or on disk with an equally fast columnar DBMS like ClickHouse? The truth is that I have not seen any Datalog implementation with columnar data processing, but I have discussed this informally with developers and they found it an interesting idea. In the past, I attempted to import a file of several hundred megabytes and execute queries on triple-store Datalog systems such as Datomic, Datalevin, and XTDB, and the performance was very disappointing. These are mostly transactional systems, not built for analytics and queries on big data. CozoDB's performance is a lot better, but it is still several orders of magnitude behind columnar engines and dataframes. The URL of my question points to a small benchmark I wrote comparing CozoDB with ClickHouse and Polars.
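For the sake of discussion, here is a stdlib-only sketch of what "columnar Datalog execution" could mean: relations stored column-wise (as dicts of name to list), with each rule application becoming a batch hash join instead of tuple-at-a-time resolution. In practice this is the kind of work one would delegate to Polars or ClickHouse; all names here are illustrative.

```python
from collections import defaultdict

def hash_join(left, right, left_on, right_on):
    """Batch hash join of two columnar relations (dicts of column name -> list)."""
    # Build phase: index the right relation by its join-key column.
    index = defaultdict(list)
    for i, key in enumerate(right[right_on]):
        index[key].append(i)
    # Probe phase: scan the left key column once, emitting matched rows column-wise.
    out = {name: [] for name in list(left) + list(right)}
    for i in range(len(left[left_on])):
        for j in index.get(left[left_on][i], ()):
            for name in left:
                out[name].append(left[name][i])
            for name in right:
                out[name].append(right[name][j])
    return out

edge = {"src": ["a", "b", "c"], "dst": ["b", "c", "d"]}
# One application of  path(X, Z) :- edge(X, Y), edge(Y, Z).
hop2 = hash_join({"x": edge["src"], "y": edge["dst"]},
                 {"y2": edge["src"], "z": edge["dst"]},
                 left_on="y", right_on="y2")
print(list(zip(hop2["x"], hop2["z"])))
# [('a', 'c'), ('b', 'd')]
```

A real vectorized engine would run the build and probe phases over whole column chunks with SIMD and parallelism, but the shape of the computation, joins over columns iterated to a fixpoint, is the same.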

I am very interested in this technology, and I am also trying to understand any technical issues that prevent someone from combining Datalog with a columnar database/dataframe. Let me know what you think; I am also willing to contribute to such a project.


First, a disclosure: I work on XTDB, where we are in the process of building out a new columnar SQL engine that also (still) supports edn-flavoured Datalog; see https://xtdb.com/v2

Perhaps the most fundamental aspect here is the inherent tension between (A) typical columnar relational schemas, which primarily target efficient OLAP across very wide & sparse tables, and (B) the nature of 'graph' systems (whether OLAP or OLTP), which tend towards handling a huge number of tables and/or recursive self-joins plus incremental view maintenance. I believe Michael Stonebraker's conjecture continues to be that a "one size fits all" database engine is essentially unachievable, and therefore you can't expect a single database system to be competitive across both domains simultaneously. However, some databases (SQL Server springs to mind) are apparently pretty good at running multiple engines in parallel.

With the v2 XTDB engine design we are sacrificing some of our graph capabilities in the short term (WCOJ and self-join optimisations) for what we feel are more pressing advantages of the columnar model (low-cost retention of history and vectorized processing). We are also aiming to retain reasonable OLTP performance, in a similar vein to SingleStore etc., since the target audience is still application developers rather than data analysts.

If you are looking for a way of using Datalog on existing columnar storage engines today, you should definitely look at https://logica.dev/

Also, this Stonebraker talk on "Big Data is (at least) Four Different Problems" is pretty good: https://www.youtube.com/watch?v=KRcecxdGxvQ

EDIT: regarding "I attempted to import a file of several hundred megabytes and execute queries on triple store Datalog systems" - if you have that benchmark available somewhere I would love to take a look. You can see some information on XTDB's approach to triple store benchmarking with 'WatDiv' here: https://docs.xtdb.com/resources/performance/#watdiv


Great job! XTDB's use of Apache Arrow for in-memory columnar data structures is highly appreciated. The Polars dataframe library in my benchmark is also based on Apache Arrow. Thank you for the reference to the XTDB benchmark datasets. How can I import `tsv` and/or n-triples files into v2 XTDB so I can run a couple of queries on WatDiv? It is worth mentioning that there is growing interest in extracting subsets of data from large knowledge triple bases, and there are specialized tools for this (e.g. KGTK). I guess v2 XTDB with Datalog will make this process a lot easier.

Well, regarding the "one size fits all" database engine, I would rephrase this as "one size fits most" :-). Your reference to SingleStore (which has remarkable performance, similar to ClickHouse), with Rowstore and Columnstore engines in one DBMS, fits most if not all cases. But in my opinion, the ideal solution to the DBMS-database challenge resides at the application level and within the realm of database modeling. I would contend that in today's landscape it's the (V)ariety dimension of big data that poses the most significant challenge: specifically, the seamless integration of data from various sources. And of course, if you manage to lift the heavy load (ETL), as Stonebraker said, you run into the graph/traversal problem.

If you take into account the extensive and continuous development efforts spanning many years, along with the evolving trends in database query languages (all aimed at enhancing the often troublesome SQL and its numerous alternatives, such as SPARQL, graph query languages, noSQL, and newSQL), you'll begin to grasp that there is a fundamental issue underlying all of these approaches.

Primarily, it's the lack of a distinct separation between the logical/computational layer and the storage (physical) layer. Only recently has this crucial aspect begun to take center stage in the development of DBMS systems featuring multiple storage engines. Was it Datomic that played a pioneering role in this shift? The second, equally important aspect, still heavily underestimated nowadays, is the principle of compositionality, both at the level of the query language and of the data types. Datalog, especially when used in a logic programming environment like Clojure, is a step in the right direction. I will argue that several Datalog systems, which exclusively adhere to the EAV/RDF data model, lack a crucial element: they force users to create queries and data models through the cumbersome process of breaking down and rebuilding n-ary relations from triples. I believe the remedy lies in developing robust transformation functions capable of handling input and output across triple, tabular, or nested data structures.
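The kind of transformation functions I have in mind could look something like the following sketch (the function names and the `"id"` convention are my own illustration, not from any existing system): folding EAV triples into n-ary rows and back, so users are not forced to reassemble relations by hand in every query.

```python
def triples_to_rows(triples):
    """Group (entity, attribute, value) triples into one n-ary row per entity."""
    rows = {}
    for e, a, v in triples:
        rows.setdefault(e, {"id": e})[a] = v
    return list(rows.values())

def rows_to_triples(rows):
    """Flatten n-ary rows back into EAV triples (round-trip of the above)."""
    return [(row["id"], a, v) for row in rows
            for a, v in row.items() if a != "id"]

triples = [(1, "name", "ada"), (1, "born", 1815), (2, "name", "kurt")]
rows = triples_to_rows(triples)
print(rows)
# [{'id': 1, 'name': 'ada', 'born': 1815}, {'id': 2, 'name': 'kurt'}]
assert sorted(rows_to_triples(rows)) == sorted(triples)
```

Note that the resulting rows are naturally sparse (entity 2 has no `born` attribute), which is exactly where the wide-and-sparse strength of columnar formats like Arrow would come into play.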

PS: About Logica: the only columnar DBMS it supports is BigQuery, which is available only as a proprietary Google cloud service. As for in-memory processing, I could not find any reference to what kind of data structures it uses.


> How can I import `tsv` and/or n-triplets file into v2 XTDB so I can run a couple of queries on WatDiv ?

I can't offer easy instructions for that right now, as we've not adapted the benchmark for the 2.x branch yet (and I would not expect performance to match 1.x right now anyway), although all the code one might need is in the repo for 1.x already and should be straightforward enough to convert to the new APIs given some familiarity with Clojure: https://github.com/xtdb/xtdb/blob/master/bench/src/xtdb/benc...

> Was it Datomic that played a pioneering role in this shift ?

I know Datomic has inspired a lot of people - certainly our entire team - but I'm not sure how much it's influenced the existence of things like Cozo or the wider database ecosystem. I get the feeling that Datomic's impact on application developer communities and library builders has been far greater (also amplified further by DataScript) than the impact on established database vendors & research communities. It has at least been good to see more systems start treating data immutably and exposing time-travel capabilities, even if Datomic was merely ahead of the curve rather than a direct inspiration.

> They are forcing users to create queries and data models through the cumbersome process of breaking down and building n-ary relations from triplets. I believe that the remedy lies in developing robust transformation functions capable of handling input and output across triplets, tabular, or nested data structures.

Despite originally being attracted to triple-oriented modeling, I have come around to agreeing with this point of view. Processing with n-ary relations is unavoidable, so the idea of not persisting them also feels inherently limiting. I am also quite drawn to "Object Role Modeling" (though I've not yet applied it in anger), which really demands n-ary tuples, and the RelationalAI folks seem to have reached a similar conclusion: https://docs.relational.ai/rel/concepts/relational-knowledge...

> in my opinion, the ideal solution to the DBMS-database challenge resides at the application level and within the realm of database modeling

I generally agree with this too. All I know for certain is that almost everything in software should be more relational and declarative, and then machine learning stands the best chance of figuring out how to make things fast.

Good discussion :)

