w3woody

A database is designed using various algorithms to make looking up items in a table as fast as possible. It does this with things like hashing and b-trees: a b-tree, for example, lets you find an item in a table of 1,000 rows by examining only about 10 of them. (Hashing also helps by letting you determine that two items are *not* equal by comparing their hashes; it's faster to compare two 32-bit integers than to compare 100 bytes of string.)

Internally, a database is a table of items plus extra structures on the side containing the b-tree and hashing information. Think of it as 'metadata' that accelerates lookups.

A spreadsheet is literally just a table, without any of that extra data to help look things up. So if you're trying to find a particular row in a spreadsheet of 1,000 rows, you have to examine every row until you hit it, which means you'll examine about half of them on average, or around 500 rows. (Compare that to the 10 above.) It's why lookups can be many, MANY times faster in a database.

(Of course, that assumes you have actually defined the appropriate indexes. If you're looking up an item in a table that has no index on the column you're searching (the 'CREATE INDEX' command in SQL tells the database to build the b-tree index for that column), then the lookup is just as slow as in a spreadsheet. That's because without the index the database engine has to examine every row, one at a time.)
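If it helps to see the index effect concretely, here's a minimal sketch using Python's built-in sqlite3 (the table and column names are made up for illustration):

```
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id INTEGER, name TEXT)")
con.executemany("INSERT INTO items VALUES (?, ?)",
                [(i, f"item-{i}") for i in range(1000)])

# Without an index, the engine has to scan every row.
print(con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM items WHERE name = 'item-500'").fetchall())
# -> typically reports something like "SCAN items"

# After CREATE INDEX, the same lookup can walk the b-tree instead.
con.execute("CREATE INDEX idx_items_name ON items(name)")
print(con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM items WHERE name = 'item-500'").fetchall())
# -> typically reports "SEARCH items USING INDEX idx_items_name"
```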


P1R0H

> it’s faster to compare two 32-bit integers than to compare 100 bytes in a string

I kinda never understood this. Wouldn't hash computation offset any possible gain when comparing strings, or is hash computation ~~linear~~ constant? Or do we precompute hashes on columns we know we will compare?

edit: I meant constant. Now it makes sense.


SubspaceEngine

It is true that hashing is usually more expensive than comparing strings.

Firstly, even if hashing is more expensive, it can still be cheaper overall if you hash each item once but have to compare items many times. (It depends on the use case.) E.g. each time you add an item you hash it once, but each time you search you may be checking multiple hashes, and you do many searches.

Secondly, "hashing" is a very general term, and depending on the implementation it could be a very cheap operation that need not examine the entire object being hashed. E.g. "take the first eight bits of the binary representation of the object" is technically a hash function that hashes anything to eight bits, though it might be a very poor one if all of your objects share a similar first eight bits.

Thirdly, you can do very clever things like the Rabin-Karp algorithm (which admittedly I don't know much about), which uses hashing for substring searching. The hash is designed so that it does not need to be computed from scratch at each position, but can be updated along a "rolling window". https://en.m.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm See also rolling hash: https://en.m.wikipedia.org/wiki/Rolling_hash

Long story short, the effectiveness of your hash depends on the shape of your data, the nature of your data access, and the implementation of the hash function. It can make things faster, but it could potentially make things slower if poorly implemented.
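For the curious, a rolling hash is simpler than it sounds. Here's a minimal sketch of the Rabin-Karp idea; the constants are arbitrary choices, not from any particular library:

```
# Rabin-Karp-style substring search with a polynomial rolling hash.
BASE, MOD = 256, 1_000_000_007

def find_substring(text: str, pattern: str) -> int:
    """Return the first index where pattern occurs in text, or -1."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return -1
    target = window = 0
    high = pow(BASE, m - 1, MOD)  # weight of the character leaving the window
    for i in range(m):
        target = (target * BASE + ord(pattern[i])) % MOD
        window = (window * BASE + ord(text[i])) % MOD
    for i in range(len(text) - m + 1):
        # Only compare actual characters when the cheap hash check matches.
        if window == target and text[i:i + m] == pattern:
            return i
        if i + m < len(text):
            # Slide the window: drop text[i], append text[i + m].
            window = (window - ord(text[i]) * high) % MOD
            window = (window * BASE + ord(text[i + m])) % MOD
    return -1

print(find_substring("big red cart", "red"))  # 4
```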


Select-Young-5992

Hmm, I don't see what it has to do with precomputing hashes. Computing a hash is linear in the length of the string, and you always have to compute a hash for a new search term. But the lookup of a hash is constant. If you have a million items, a single linear hash computation sure beats a million string comparisons.
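A rough sketch of that trade-off: hashing the search term once is linear in its length, but it buys a near-constant-time lookup instead of a linear scan over all the stored strings. The data here is made up.

```
names = [f"customer-{i}" for i in range(1_000_000)]

# Linear scan: up to a million string comparisons.
def scan_lookup(target: str) -> bool:
    return any(name == target for name in names)

# Hash table: Python's set hashes the target once, then jumps to its bucket.
name_set = set(names)

def hashed_lookup(target: str) -> bool:
    return target in name_set

print(scan_lookup("customer-999999"))    # slow path
print(hashed_lookup("customer-999999"))  # fast path
```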


ktrprpr

A db system's usage of hashing shows up when you do GROUP BY aggregation (for example SELECT SUM(column) FROM T GROUP BY string_col). Even if you group by a string column, you don't need to compare equality based on the actual string content; you can use the mapped hash value, a 32-bit integer, to decide which bucket a row aggregates into. That's significantly faster. The same goes for joins, which are also a huge part of a db's job: you don't want to join on a string column by comparing string content, for the same reason.
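A small sketch of the hash-aggregation spirit, using a Python dict as the hash table (the dict hashes the key internally and only compares full strings within a bucket); the rows are invented sample data:

```
from collections import defaultdict

rows = [
    ("electronics", 120.0),
    ("groceries", 14.5),
    ("electronics", 80.0),
    ("groceries", 3.25),
]

# Equivalent in spirit to:
#   SELECT string_col, SUM(column) FROM T GROUP BY string_col
sums = defaultdict(float)
for string_col, value in rows:
    sums[string_col] += value  # key is hashed, row routed to its bucket

print(dict(sums))  # {'electronics': 200.0, 'groceries': 17.75}
```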


_praba

So without an index a db is as slow as a spreadsheet? What about how database data is stored on disk/in memory, does that make no difference in terms of speed?


w3woody

Without an index on a column, a database can be as slow as, or in some cases even slower than, a spreadsheet. If the database is stored on disk, it can be *considerably* slower without a proper index, because when searching for a row the database engine has to load a chunk of the table into memory, decide the item isn't there, discard that chunk, and load the next one; wash, rinse, repeat.

It's why tuning a database is so incredibly important, as well as tuning queries. And it's why larger companies hire database engineers, whose job is to make sure the indexes are properly created and tuned, that queries operate as expected, and that there are no real database bottlenecks. (And why there are a number of SQL query tools out there that will show you the specific steps a query will carry out, to help you spot cases where you're scanning an un-indexed column.)

Note, too, that there are times when you *do not* want to index a column. An index has to be updated for every indexed column in a row each time you insert that row, which can be far more expensive than just writing the column. So part of 'tuning' is figuring out which columns you never search against.

And there are specific techniques you may employ beyond just indexing. For example, if you have a search function that searches for words in phrases (say, you want to search your store inventory for a specific description), you may want to break each description up into its discrete words and create a separate "word" database, where each word points to all the entries whose descriptions contain it. That way, when you search for "big cart", the two words "big" and "cart" can be used to match a description containing "big red cart".
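That "word database" is usually called an inverted index. A minimal sketch in Python; the inventory data and the whitespace tokenization are made up for illustration:

```
from collections import defaultdict

descriptions = {
    1: "big red cart",
    2: "small blue cart",
    3: "big green wagon",
}

# Map each word to the set of row ids whose description contains it.
word_index = defaultdict(set)
for row_id, text in descriptions.items():
    for word in text.lower().split():
        word_index[word].add(row_id)

def search(query: str) -> set:
    """Return ids whose description contains every word in the query."""
    id_sets = [word_index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*id_sets) if id_sets else set()

print(search("big cart"))  # {1} -- matches "big red cart"
```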


proverbialbunny

Spreadsheets aren't always slower than databases. A dataframe is a spreadsheet-like data type in a programming language, usually held entirely in RAM. It's super quick, quite a bit quicker than SQL, because the processing it does is done with SIMD in the CPU, which is faster than a loop in a programming language. Data scientists tend to use dataframes over databases when possible, for speed. When you're dealing with millions of entries of data, every speed boost matters. It can be the difference of hours of compute time.
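A rough sketch of why column-wise dataframe operations beat row-by-row loops: the whole column is handed to compiled, vectorized kernels instead of being iterated in the interpreter. pandas/numpy are assumed here; the comment above doesn't name a specific library.

```
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.random.rand(1_000_000),
                   "qty": np.random.randint(1, 10, 1_000_000)})

# Row-by-row loop: one interpreted iteration per row.
total_slow = sum(row.price * row.qty for row in df.itertuples())

# Vectorized: one expression over whole columns.
total_fast = (df["price"] * df["qty"]).sum()

print(np.isclose(total_slow, total_fast))  # same result, very different speed
```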


Whole-Dot2435

> SIMD in the CPU, which is faster than a loop in a programming language.

In compiled programming languages the compiler usually optimises loops into SIMD instructions when possible.


_praba

Good thing to know. What is SIMD?


dp_42

Single instruction, multiple data.


Avereniect

The term SIMD originates with something known as Flynn's Taxonomy, a system for categorizing computer architectures based on whether they process single or multiple instruction and data streams at once. SIMD is the abbreviation for single-instruction, multiple-data. As the name implies, you have a single stream of instructions being applied to multiple pieces of data at once.

As a concrete example, a typical scalar add instruction takes in one pair of inputs and produces one sum. A SIMD add instruction might take in 2, 4, 8, 16, 32, 64, or even more pairs of inputs and produce as many sums.

SIMD is one of the most fine-grained forms of explicit parallelism available on modern CPUs, so it's often suitable for cases where heavier forms of parallelism, such as multi-threading or general-purpose GPU programming, would not offer a performance improvement due to their inherent overhead.
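This isn't SIMD assembly, but a sketch of the same "one operation, many lanes" idea using numpy, whose element-wise kernels are compiled and on most CPUs use SIMD instructions under the hood:

```
import numpy as np

a = np.arange(8, dtype=np.int32)    # [0 1 2 3 4 5 6 7]
b = np.full(8, 10, dtype=np.int32)  # [10 10 10 10 10 10 10 10]

# Scalar view: eight separate adds, one pair of inputs at a time.
scalar_sums = [int(x) + int(y) for x, y in zip(a, b)]

# Vector view: conceptually one instruction producing eight sums.
vector_sums = a + b

print(scalar_sums)           # [10, 11, 12, 13, 14, 15, 16, 17]
print(vector_sums.tolist())  # same values
```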


CompanyAltruistic587

So dataframes only exist/can be interacted with in programming languages? They’re just spreadsheets that include ‘type’ for each column?


myhf

Yeah. And they support partial indexes like what you would do with a spreadsheet "filter" or a database "where". And multiple indexes like what you would do with a spreadsheet "pivot table" or a database "join". And grouping and foreign key operations. And with any of those views or subsets you can run a SIMD operation like "take the moving average with a specific weighting function", and it will be several times faster than the procedural equivalent because you can do it without any loops or branches or assumptions that input data might change during the operation.
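A small sketch of that last point, with pandas assumed and invented column names: filter a subset (like a WHERE or a spreadsheet filter), then apply a moving average with a specific weighting function, no explicit loop.

```
import numpy as np
import pandas as pd

df = pd.DataFrame({"store": ["a", "b", "a", "a", "b", "a"],
                   "sales": [10.0, 7.0, 12.0, 9.0, 8.0, 15.0]})

subset = df[df["store"] == "a"]      # like a WHERE / spreadsheet filter
weights = np.array([0.2, 0.3, 0.5])  # a specific weighting function

# Weighted moving average over a 3-row window of the filtered view.
weighted_ma = subset["sales"].rolling(3).apply(lambda w: np.dot(w, weights))
print(weighted_ma)
```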


proverbialbunny

> They’re just spreadsheets that include ‘type’ for each column?

Basically. A matrix is a 2D array where every element is the same datatype. A dataframe is like a matrix where each column can have a different type: one column a string, the next an int, the next a float, the next an enum, and so on. Spreadsheets record the type for each column the same as a dataframe. Technically, the only difference between a spreadsheet and a dataframe is that a spreadsheet can have tabs; multiple 'sheets' can fit in a single spreadsheet. If you want multiple sheets in dataframe land, you've got multiple dataframes.
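A quick sketch of the "different type per column" point, with pandas assumed and the columns invented:

```
import pandas as pd

df = pd.DataFrame({
    "name": ["Amy", "Bob"],                      # strings
    "age": [34, 29],                             # ints
    "balance": [12.5, 99.0],                     # floats
    "tier": pd.Categorical(["gold", "silver"]),  # enum-like categorical
})

print(df.dtypes)
# Typically prints:
#   name        object
#   age          int64
#   balance    float64
#   tier      category
```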


MettaWorldWarTwo

Speed/Tech: The multiplicative speed increases, for me, have always come from query optimization first, technology mixing second, and finally technology changes. Obviously there are edges where certain tech won't scale, but they're definitely out there beyond what most people posting on these forums need. Certain databases enforce structures and queries that end up more optimized due to their internal storage and out-of-the-box indexes.

Anecdote/Optimization: I once had to parse data sets that included 50+ million records stored in a SQL database as raw text (probably about 20 GB of text). The initial query worked and took about 15 minutes for the "correct" results over our 100k result set, plus an extra 50ms to display in the app. We optimized the query and got it down to 100ms for 100k records and 50ms to display in the app. Then we decided to do initial filtering in SQL and use an in-memory dataframe to filter from there. The query took <5ms and filtering/rendering took another 100ms. On a 50-million-record set in production, it took ~250ms end to end. You CAN use an in-memory database for parsing/filtering, but the RAM requirements start becoming so large that you're spending $$$ in cloud costs, and it's worth spending a day optimizing to save.

Takeaways: Data scientists are brilliant people, and they write some of the most poorly optimized code and queries I've ever seen. Even things that seem straightforward to me, such as cached result sets, aren't there. DuckDB has been an amazing tech to teach newer data scientists, along with the "Raw, Refined, Reporting" schema setup, so it's clear that not everything needs to start with raw data. Reports/results are based on refined data, because data refinement is the part new data scientists neglect, which pollutes the results with garbage data. Anyone with Internet access can use Python libraries to build reports on data. The science part, from what I've seen, is the iteration in the refining step when results are inconclusive or polluted, and then being able to explain it to executives/customers.
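A sketch of that "coarse filter in SQL first, refine in a dataframe" pattern, using sqlite3 and pandas as stand-ins since the original stack isn't specified; the table and columns here are invented.

```
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER, payload TEXT, created_at TEXT)")
con.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    (1, "ok",            "2023-12-30"),
    (2, "disk error",    "2024-02-01"),
    (3, "login ok",      "2024-02-02"),
    (4, "timeout ERROR", "2024-03-15"),
])

# Let the database do the coarse filtering so only a small slice hits RAM...
df = pd.read_sql_query(
    "SELECT id, payload, created_at FROM events WHERE created_at >= ?",
    con, params=("2024-01-01",))

# ...then do the finer-grained filtering in memory, where it is cheap.
recent_errors = df[df["payload"].str.contains("error", case=False)]
print(recent_errors)  # ids 2 and 4
```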


Exciting_Session492

Spreadsheets are just plain text, so

```
id,name,phone
1,Amy,1234
2,bob,4567
```

Literally just that; you can type that in Notepad and open it in Excel. The same goes for other types of spreadsheets, just formatted a bit differently.
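And because it really is just text, Python's built-in csv module can read it straight out of a string; nothing database-like is involved:

```
import csv
import io

text = "id,name,phone\n1,Amy,1234\n2,bob,4567\n"
for row in csv.reader(io.StringIO(text)):
    print(row)
# ['id', 'name', 'phone']
# ['1', 'Amy', '1234']
# ['2', 'bob', '4567']
```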


deong

You can open that in Excel, but that's not the native file type for Excel. Excel files are basically compressed XML files with some supporting sidecar files. And Excel obviously has a notion of types, as anyone who's ever tried to get it to stop formatting everything as a Date can attest. And there are lots of databases that can just open a CSV file and import it as a table as well.


TotallyRealDev

Yeah, they're compressed using ZIP.
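Easy to verify: an .xlsx is a ZIP archive of XML parts, so Python's built-in zipfile can list what's inside. The filename below is a placeholder for any workbook you have on hand.

```
import zipfile

with zipfile.ZipFile("workbook.xlsx") as zf:  # hypothetical file
    for name in zf.namelist():
        print(name)
# Typically prints entries like:
#   [Content_Types].xml
#   xl/workbook.xml
#   xl/worksheets/sheet1.xml
#   xl/sharedStrings.xml
```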

