The case for database connectivity #6683

simonaubertbd · 2023-12-30T11:21:20Z

simonaubertbd
Dec 30, 2023

Hello all,

I open this discussion in order to trying to get the idea clear about database connectivity. It may contain mistakes and misunderstandings. Feel free to disagree.

The situation today

Orange Data Mining allows connexion to SQL Server and Postgresql (https://orangedatamining.com/widget-catalog/data/sqltable/ ). So that, you can query data in your database (but not writing in it back). For anything else, you have to extract the data in files and then you can use Orange Data Mining.

Why does it matter?

In companies, corporate environment, we try to avoid to use files and prefer to using databases instead. It helps to centralize, avoid duplicate, manage access rights, compliance, etc. I understand it's different in research field where things are less "industrialized". Having a limited access to db means that Orange Data Mining won't be used in many cases just because of this lack of connectivity. It also means less popularity, smaller community while it deserves far more recognition!

How does it work for other solutions?
From my perspective and with my own knowledge of ETL/dataprep/data mining software, you have two kinds of approach :
-pure in-db : the solution backend generates sql in the database for the different operations
-db to in-memory /in-memory to db : for input, the solution retrieves the data from db and stores it temporary in the machine executing the job. for output, it inserts the result data into the db.
A lot of solutions offer both approaches, since both have some value (first one is usually faster/ligher and the second one allows far more features since it's not limited to SQL)

To illustrate : here some slides I have done for another solution.
Comparison betwee in -db (dark blue) and in-memory workflow

How to connect Orange Data Mining to database in the future?
Well, it's not that simple otherwise no doubt it would be already done. ODM is in python so you have to consider that. We can find several technologies :
=> sqlalchemy 2.0 : a great library, working well for most common databases https://www.sqlalchemy.org/ compatibility here https://docs.sqlalchemy.org/en/21/dialects/index.html
=> ODBC: (with pyodbc library?) . In corporate environment, ODBC is widely used and almost all db have an odbc driver.
=> OLEDB : (with library PyADO ?)==> not too much used but faster than ODBC
=> jdbc : (with library JayDeBeApi ?) => used a lot by software such as db visualizer, squirrel, dbeaver and even Knime.
=> bulk load (output only)=> very fast way to insert data.
=> there are also some dedicated libraries for databases but I don't think it's the best way to go (there are like 140 database engines, using standard connectivity is key !)

How do I see the thing?

At first, I would focus on the second approach since it means less work and already brings a lot of value.
=>input widgets, output widgets by technology (maybe except OLEDB), the sql column type (varchar, bigint, etc... can be transfered by key/value system visible in edit domain)
=> input : a query or a table/view (to know wich one... from is present or a dropdown menu). if a table view, the query will be select * from {toto.tata}
=>output : ability to choose partition, to pass statistics (table, column, partition), to choose index...to write a "post sql query".

Storage of connection for quick access (it's not common to use more than 5 connections in real life and configuring it for each workflow is not efficient)
=> connection vault (name stored in tool). a connection is basically path, technology used +credentials
=>credentials vault (name stored in tool if used directly on widget, either credentials is stored in connection)

what help can I offer?
Ok, you got me : I don't have the skills to develop it myself. However :
-I can help desiging the UI.
-I can help testing
-I can take some time and learn how to develop (but well, it's free time)

Best regard,

Simon

janezd · 2024-01-02T14:53:43Z

janezd
Jan 2, 2024
Maintainer

Simon, thank you for your analysis.

Orange has many add-ons - from single-cell genetics to analysis of networks and time series. Their maintenance is sporadic because developers (and funding) are spread too thin and we must prioritize.

As for priorities: as you wrote, Orange is useful in education, academia and industry. As most of other open source projects, it is basically unfunded, and developers do it as a hobby or by piggybacking on other, related academic projects. As such, our focus is on supporting education and academia which are weaker on funds and thus in need of good open source solutions, while for-profit institutions are welcome to use it (except in cases where they just repackage Orange and re-sell it as SaaS, about which we have a strong opinion), but they should, in my opinion, rely on paid solutions when the open source ones fall short.

I agree with your assessment that databases are more relevant for business environment and less for academia - and even less for education -, hence we do not prioritize them. If we got two full-time developers for Orange right now, I would vote for puting one on the time-series add-on, and the other on networks since improving these two would make the greatest contribution for academia and education.

Orange actually supports pure in db data: widgets can essentially pass queries to each other and do their work entirely on the db side. Yet, this functionality, which was developed within a particular project, is not being maintained and is only supported by several widgets, while in others the SQL data is (automagically) read into memory. This story (and a few others) taught us that development is easy and maintenance is hard. Even if some company paid the development of database-related functionalities (which is quite unlikely), I would probably oppose adding this to core Orange because nobody would maintain it, and five years from now we'd discuss whether to keep the dead code or to remove it.

As even the small widget that connects to Postgres is mostly untouched because none of the core developers is interested in it, a solution towards your second approach could be an add-on that would have all kinds of connections to databases. Add-ons can be developed by interested third-parties and if maintenance is discontinued, Orange itself does not suffer because of it.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The case for database connectivity #6683

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment

{{title}}

Select a reply

The case for database connectivity #6683

simonaubertbd Dec 30, 2023

Replies: 1 comment

janezd Jan 2, 2024 Maintainer

simonaubertbd
Dec 30, 2023

janezd
Jan 2, 2024
Maintainer