Storage: files or db?

Thursday, September 26, 2019

9:21 PM

Also see SQLite as file format.

The file system is useful if you are looking for a particular file, as operating systems maintain a sort of index. However, the contents of a txt file won't be indexed, which is one of the main advantages of a database.

For very complex operations, the filesystem is likely to be very slow.

From <https://stackoverflow.com/questions/38120895/database-vs-file-system-storage>

Real world data are not files, they are objects. Applications map their objects into files and manage them through the application. A DBMS, on the other hand, is designed to manage objects and will allow the application to manage them directly without the need to add object management code.

Relationships between different object types are required by most applications. File systems have no object concept, so no ability to manage relationships. A DBMS is designed to provide and manage object relationships.

[…] File systems do not index objects, they index files. Needing fast access to objects, implemented through a file system, means the application needs to manage index information. A DBMS manages indexing for the application seamlessly through the database schemas.

[…] As applications get more complex so does the data management. File-based solutions are tightly coupled with the initial application requirements and extremely hard to redefine and change.

From <https://raima.com/database-system-vs-file-system/>

It is often a question of object size. At the multi-megabyte size, having multiple copies of the data in the database while a query is executing is not a problem, and the auto-synchronization of data in the database is a great benefit. At the hundreds of megabytes and larger, the overhead of moving around multiple copies of the object in the database can become a performance problem. Of course, these problems only happen if you access the large data object in a query — if you don't query the bytea column or access the large object in the query, there is no overhead except storage. Updates of rows containing large objects also don't suffer a performance penalty if the update doesn't modify the large object. (The same is true for any long value, e.g. character strings, json.)

Also, if the object is not a binary blob to the database but has a structure the database can understand then storing it in the database has other advantages. For example, doing full text search or json lookups on data can be very beneficial.

From <https://www.enterprisedb.com/blog/data-database-vs-file-system>

Databases allow indexing based on any attribute or data property (i.e. SQL columns). This helps fast retrieval of data, based on the indexed attribute. This functionality is not offered by most file systems, i.e. you can’t quickly access “all files created after 2 PM today.”

Desktop search tools like Google Desktop or Mac Spotlight offer this functionality. But for this, they have to scan and index the complete file system and store the information in an internal relational database.
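
A minimal sketch of that scan-and-index idea, to make the "files after 2 PM" query concrete. Assumptions that are mine, not the article's: SQLite stands in for the internal database, modification time stands in for creation time (which is not portable across platforms), and the names `build_index` and `modified_after` are hypothetical.

```python
import os
import sqlite3


def build_index(root, conn):
    """Walk `root` and record path, size, and modification time of each file."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(path TEXT PRIMARY KEY, size INTEGER, mtime REAL)"
    )
    # The index on mtime is what lets the time-range query avoid a full scan.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_files_mtime ON files (mtime)")
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                (full, st.st_size, st.st_mtime),
            )
    conn.commit()


def modified_after(conn, timestamp):
    """Return paths modified after `timestamp` (seconds since the epoch)."""
    rows = conn.execute(
        "SELECT path FROM files WHERE mtime > ? ORDER BY path", (timestamp,)
    )
    return [row[0] for row in rows]
```

The catch is the one the article points at: the index is only as fresh as the last scan, so a real tool has to rescan or watch the file system for changes.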

…

Relational View of Data

File systems store files and other objects only as a stream of bytes, and have little or no information about the data stored in the files. Such file systems also provide only a single way of organizing the files, namely via directories and file names. The associated attributes are also limited in number, e.g. type, size, author, creation time, etc. This does not help in managing related data, as disparate items do not have any relationships defined.

Databases, on the other hand, offer easy means to relate stored data. It also offers a flexible query language (SQL) to retrieve the data. For example, it is possible to query a database for “contacts of all persons who live in Acapulco and sent emails yesterday”, but impossible in the case of a file system.
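
A rough illustration of the Acapulco query using Python's built-in `sqlite3`. The schema, the sample rows, and the function name `senders_from` are made up for this note, and "yesterday" is passed in as an explicit ISO date to keep the example deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE persons (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE emails  (id INTEGER PRIMARY KEY,
                      sender_id INTEGER REFERENCES persons(id),
                      sent_on TEXT);  -- ISO date string, e.g. '2019-09-25'
CREATE INDEX idx_persons_city   ON persons (city);
CREATE INDEX idx_emails_sent_on ON emails (sent_on);
""")
conn.executemany(
    "INSERT INTO persons VALUES (?, ?, ?)",
    [(1, "Ana", "Acapulco"), (2, "Ben", "Acapulco"), (3, "Cy", "Oslo")],
)
conn.executemany(
    "INSERT INTO emails (sender_id, sent_on) VALUES (?, ?)",
    [(1, "2019-09-25"), (2, "2019-09-20"), (3, "2019-09-25")],
)


def senders_from(conn, city, day):
    """Names of people living in `city` who sent at least one email on `day`."""
    rows = conn.execute(
        """
        SELECT DISTINCT p.name
        FROM persons p
        JOIN emails e ON e.sender_id = p.id
        WHERE p.city = ? AND e.sent_on = ?
        ORDER BY p.name
        """,
        (city, day),
    )
    return [row[0] for row in rows]
```

With plain files, the same question would mean opening every contact and every message and doing the join by hand in application code.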

File systems need to evolve and provide capabilities to relate different data sets. This will help the application writers to make use of native file system capabilities to relate data. A good effort in this direction has been Microsoft WinFS.

…

Druva inSync uses a proprietary file system to store and index the backed up data. The meta-data for the file system is stored in an embedded MySQL database. The database-driven model was chosen to store additional identifiers with each block – size, hash and time.
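
To picture the "identifiers with each block" idea, here is a small sketch of per-block metadata (hash, size, time) kept in a database. This is not Druva's actual design: SQLite replaces the embedded MySQL, the block size is tiny for illustration, skipping blocks whose hash is already known is my own assumption about why a hash would be stored, and `store_blocks` is a hypothetical name.

```python
import hashlib
import sqlite3
import time

BLOCK_SIZE = 4  # bytes; unrealistically small, for illustration only


def store_blocks(conn, data, now=None):
    """Split `data` into fixed-size blocks and record hash, size, and time.

    Returns the list of block hashes in order, which is enough to describe
    the original byte string in terms of stored blocks.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS blocks "
        "(hash TEXT PRIMARY KEY, size INTEGER, stored_at REAL)"
    )
    now = time.time() if now is None else now
    hashes = []
    for start in range(0, len(data), BLOCK_SIZE):
        block = data[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        # INSERT OR IGNORE: a block with a known hash is not recorded twice.
        conn.execute(
            "INSERT OR IGNORE INTO blocks VALUES (?, ?, ?)",
            (digest, len(block), now),
        )
        hashes.append(digest)
    conn.commit()
    return hashes
```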

From <https://www.druva.com/blog/file-systems-vs-databases/>

A filestore is a structured collection of data files housed in a conventional hierarchical file system. Many applications use filestores as a poor-man's database, and the correct execution of these applications requires that the collection of files, directories, and symbolic links stored on disk satisfy a variety of precise invariants. Moreover, all of these structures must have acceptable ownership, permission, and timestamp attributes. Unfortunately, current programming languages do not provide support for documenting assumptions about filestores, detecting errors, or safely loading from and storing to them.
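
A hand-rolled sketch of what checking such invariants looks like in an ordinary language today. This is not the approach from the tech report; it only illustrates the kind of ad hoc code an application ends up writing. The function name `check_filestore`, the `required_files` list, and the `max_mode` policy are hypothetical.

```python
import os
import stat


def check_filestore(root, required_files, max_mode=0o644):
    """Return a list of violations; an empty list means all checks passed."""
    problems = []

    # Invariant 1: every required file exists, with no permission bits
    # set beyond those allowed by `max_mode`.
    for rel in required_files:
        path = os.path.join(root, rel)
        if not os.path.isfile(path):
            problems.append(f"missing file: {rel}")
            continue
        mode = stat.S_IMODE(os.stat(path).st_mode)
        if mode & ~max_mode:
            problems.append(
                f"{rel}: mode {oct(mode)} is broader than {oct(max_mode)}"
            )

    # Invariant 2: no symbolic link under `root` points at a missing target.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) and not os.path.exists(path):
                rel = os.path.relpath(path, root)
                problems.append(f"dangling symlink: {rel}")

    return problems
```

Every new invariant means more code like this, and nothing ties the checks to the code that later reads and writes the files, which is the gap the abstract describes.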

From <https://www.cs.princeton.edu/research/techreps/TR-904-11>

Email-based File System for Personal Data

Email services have been offering growing email storage capacity, reliable service, and powerful search capability, making them appealing as storage resources. In this paper, we present EMFS, which aggregates back-end storage by establishing a RAID-like system on top of virtual email disks formed by email accounts. By replicating data across accounts from different service providers, highly available storage services can be constructed based on already reliable, cloud-based email storage. EMFS provides a POSIX-like file system interface that allows ubiquitous data access. We have implemented a prototype of EMFS and conducted experiments in a campus network. Our results indicate that while EMFS cannot match the performance of highly optimized distributed file systems such as NFS and AFS, it performs quite closely to JungleDisk, a commercial cloud storage solution.
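
A toy sketch of the replication idea only: write every file to all accounts, read from the first one that answers. This is not EMFS's implementation. `FakeAccount` is an in-memory stand-in for a real mail account (no IMAP or SMTP involved), and both class names are hypothetical.

```python
class FakeAccount:
    """In-memory stand-in for one email account used as a 'virtual disk'."""

    def __init__(self):
        self.messages = {}
        self.online = True

    def put(self, name, data):
        if not self.online:
            raise ConnectionError("account unreachable")
        self.messages[name] = data

    def get(self, name):
        if not self.online:
            raise ConnectionError("account unreachable")
        return self.messages[name]  # raises KeyError if absent


class ReplicatedStore:
    """Mirror each write to every account; read from the first that responds."""

    def __init__(self, accounts):
        self.accounts = list(accounts)

    def write(self, name, data):
        stored = 0
        for account in self.accounts:
            try:
                account.put(name, data)
                stored += 1
            except ConnectionError:
                continue  # skip unreachable replicas
        if stored == 0:
            raise IOError("no replica accepted the write")
        return stored  # number of replicas that now hold the data

    def read(self, name):
        for account in self.accounts:
            try:
                return account.get(name)
            except (ConnectionError, KeyError):
                continue  # try the next replica
        raise FileNotFoundError(name)
```

Usage: with two accounts, `write` returns 2, and `read` still succeeds after one account goes offline, which is the availability argument in the abstract.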

From <https://www.usenix.org/legacy/event/fast11/posters_files/Srinivasan.pdf>

GmailFS implements a file system using Gmail services. YaFS [16,17] is an extensible distributed file system using heterogeneous online storage services (including email) as back-ends. Significantly, EMFS [18,19] systematically examines the email-based file system design issues, and thoroughly evaluates the effectiveness of email storage aggregation. However, our work is the first one to present an email-based storage aggregation with reliability and security. Compared with replication techniques used in EMFS, the information dispersal algorithm with fingerprinting technique applied in this paper presents a highly available and reliable storage scheme dealing with the specific security issues.

mhw: this one is theoretical; they didn't build a complete working system, as far as the encryption goes.

From <http://www.globalcis.org/jcit/ppl/JCIT3332PPL.pdf>

FUSE-based filesystem for using an IMAP server (like gmail) as normal storage like a hard disk. http://sr71.net/projects/gmailfs/

From <https://github.com/hansendc/gmailfs>
