This is based on one of the quarterly tech meetings I gave at FormulaFolios in September 2020 to present my solution from the FormulaFolios Data Processing Optimization research and fix. The following process can be done on MySQL or PostgreSQL, but I reckon it can be done with any relational database.

Process of Database Fine Tuning

  • First we diagnose what the problem is.
  • Second we analyze the costs and types of the queries.
  • Finally we implement our fixes.

Diagnose 🧪

You want to look for the problematic queries first. Ask yourself the following questions:

  • How slow are they?
  • How often are they called?
  • Does it affect users?

When checking how slow the queries are, you can run EXPLAIN or EXPLAIN ANALYZE on your SELECT query to see the cost analysis of each query in detail.
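The difference between the two matters: EXPLAIN only estimates the plan, while EXPLAIN ANALYZE actually executes the query and reports real timings. A minimal sketch of both (using the products table we'll see again below):

EXPLAIN SELECT * FROM products WHERE symbol = 'SPY';          -- estimated plan only
EXPLAIN ANALYZE SELECT * FROM products WHERE symbol = 'SPY';  -- runs the query and adds actual times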

If you’re on Ruby on Rails, there is an explain method built into ActiveRecord. If you can’t access your database terminal and are in your Rails console instead, you can run it like the following:

Model.where(type: "test", power: 3).joins(:model_types).explain


It’ll return the query and a table with the cost analysis.

Analyze 🔬

Cost is the most important thing to check; as we’ll see below, the total cost is what will matter for most queries.

Let’s take the following query as an example:

EXPLAIN ANALYZE
    SELECT "products".* FROM "products" WHERE (symbol = 'SPY')

You’ll probably get something like:

Seq Scan on products (cost=0.00..33.98 rows=1 width=99) (actual time=0.015..0.520 rows=1 loops=1)
    Filter: ((symbol)::text = 'SPY'::text)

Cost is measured in arbitrary units, conventionally anchored to the cost of a disk page fetch. The first number (0.00 above) is the start-up cost and the second number (33.98) is the total cost to return all the rows.

In general, for most queries, the total cost is what matters.

Types of Scans

Further reading on the types of scans can be found at https://www.cybertec-postgresql.com/en/postgresql-indexing-index-scan-vs-bitmap-scan-vs-sequential-scan-basics/ if you’re interested in the details of SQL scans and their performance, but I’ll summarize the basic three while analyzing.

Sequential Scan

  • Standard O(n) iteration, reading one row at a time
  • Used when no indices exist or when the majority of the table will be returned

Bitmap Scan

  • Used when selecting many rows, but not so many that scanning the entire table sequentially would be cheaper
  • Runs in logarithmic time

Index Scan

  • Used when only a small handful of rows is selected from a table
  • Runs in roughly O(log n) time per lookup, effectively constant in practice
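To see these in action, you can EXPLAIN queries of varying selectivity. A rough sketch, assuming a large products table with B-tree indices on id and symbol (the planner’s choice depends on your data and statistics):

EXPLAIN SELECT * FROM products;                                 -- whole table: sequential scan
EXPLAIN SELECT * FROM products WHERE id BETWEEN 1 AND 50000;    -- many rows: likely a bitmap scan
EXPLAIN SELECT * FROM products WHERE symbol = 'SPY';            -- a handful of rows: likely an index scan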

Notes Regarding Scans

If you don’t have much data, EXPLAIN will most likely use a sequential scan, since reading a small table in full is cheaper than going through an index.

If the query is already using an index or bitmap scan, then it is most likely already optimized.

What to Fix

Knowing what you know about scans now, ask yourself the following questions about what queries you should prioritize:

  • Is it a scheduled job that runs in the middle of the night and doesn’t bring a server down?
    • Not a problem
  • Is it something that retrieves data on a high-traffic page whenever a user clicks a button?
    • Problematic
    • Imagine the multiplied domino effect of jobs getting stuck in the queue until all the resources are exhausted
      • This will definitely affect your website negatively, causing slowdowns or even bringing it down

An important rule of thumb, from Donald Knuth, is to “forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”

Unless something actually needs to be optimized, we should not prematurely optimize everything. Developer time, research, and effort are expensive and delay products from being shipped, especially when we are not sure whether the product in question will be short lived (deleted later in the future) or built to stay.

Fixes 🔧

There are several options to improve database performance:

  1. Insert indices for the offending table
  2. Cache query to avoid multiple large calls to the database
  3. Partition the tables
  4. Reduce the frequency the table is called, if possible
  5. Optimize your queries

The solution I utilized while optimizing the data processing jobs was a combination of inserting indices and implementing a cache on an expensive query. I’ll go over the types of indices and some of the pros/cons, as well as caching.

Indexing

Generally, if you’re going to look up a certain table by a certain key, you’ll want to have an index on it. If you’re going to use multiple keys for the lookup, it is best to have a composite index. A composite index is read from left to right.

For example, suppose that when designing a certain table, a certain callback always makes these queries:

SELECT * FROM personnels WHERE staff_id = 1 AND office_id = 2 AND department_id = 42
SELECT * FROM personnels WHERE staff_id = 3 AND office_id = 3
SELECT * FROM personnels WHERE staff_id = 2

We’ll want a composite index on the following in this order:

staff_id
office_id
department_id

It would be redundant to have an index on each of them alone, or even on two of them together, as the composite index covering all three handles every one of these scenarios, since it is read from left to right.
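As a minimal sketch, the DDL for such an index could look like the following (the index name is just an assumption):

CREATE INDEX index_personnels_on_staff_office_department
    ON personnels (staff_id, office_id, department_id);

Because it is read from left to right, this one index also serves the lookups on staff_id alone and on staff_id with office_id.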

Things to Keep in Mind

Only use indices as you need them

  • Where clause only uses 1 column
    • Use single index
  • Where clause has multiple columns
    • Use composite index
  • Multiple WHERE clause shapes, some on single columns and one on multiple columns
    • Use a composite index (ordered so the single columns form its leftmost prefix)

The more indices you have, the more overhead you will have on overall disk space, and the slower updates and inserts become.

The more records you have, the longer it takes to add an index after the fact. It will LOCK DATABASE TABLES, so if you are releasing indices to production, SWITCH TO MAINTENANCE MODE if you are expecting a large new index job on an existing table with A LOT of production data.
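If you’re on PostgreSQL, one option worth knowing here is building the index concurrently, which avoids locking out writes at the cost of a slower build. A hypothetical sketch (table and index names are placeholders):

CREATE INDEX CONCURRENTLY index_personnels_on_staff_id
    ON personnels (staff_id);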

When to Avoid Using Indices

  1. The table is small and will stay that way
  2. Tables with a lot of large batch inserts/updates and not many reads
  3. Columns with a lot of null values
  4. Frequently manipulated columns

Types of Indices

Generally you’ll pick between a single or composite index, but sometimes you have more complex search patterns, or need to match specific text within large text data, that may benefit from a *_pattern_ops operator class or a trigram index.

Single Index
  • B-tree index based on 1 table column
Composite Index
  • B-tree index based on multiple table columns
Other Index Types and Operator Classes
  • TRGM index (PostgreSQL only)
    • GiST or GIN indices for trigram based text similarity searches
    • query.where("symbol % :symbol", symbol: "TEST")
  • varchar_pattern_ops, text_pattern_ops, and bpchar_pattern_ops
    • B-tree indices supporting pattern (LIKE) searches on varchar, text, and char columns
    • query.where("symbol LIKE ?", "hi%")

Caching

There are various methods of caching with Rails that you can try.

While working on the FormulaFolios Data Processing Optimization project, we used low level caching: the check for whether an account has all of its logs updated from the latest trades data calls a very expensive SQL query, so we only want to run it once per account as we update each daily log (hundreds per account).

Rails.cache.fetch("account_has_been_updated_#{account.id}", expires_in: 10.minutes) do
  # EXPENSIVE QUERY CALL
end
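On any subsequent call within the 10 minute window, fetch returns the cached value without executing the block; if you ever need to invalidate early, Rails.cache.delete with the same key evicts the entry.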

Do note that if this call depends on the newest state after an update, caching will not work, because we’d be reading an outdated state when calculating the latest value.

In the case of my project this was completely safe, as the state is valid until the next overnight job (the next 24 hours), and this query is only called during the overnight job.

Warning about Caching

Caching issues are hard to test and caching bugs are hard to debug. Be very careful about caching related bugs: when you test a scenario it may seem OK, but a user can cache a buggy state, making it hard to diagnose.

Further Studies

There were many other solutions I considered while researching improvements to database speed and optimization to complete the overnight jobs.

Other things that can be considered are the following:

Common scenarios to use partitioning:

  • You have more than 1 million rows, you track data by range, and you know why you want to use it
    • I/O time and missing indices are more likely to be the issue than a lack of partitioning

See Rules of Thumb for MySQL Under Partitioning
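As a point of reference, here is a minimal sketch of declarative range partitioning (PostgreSQL 10+; table and column names are assumptions):

CREATE TABLE daily_logs (
    id       bigserial,
    log_date date NOT NULL,
    payload  jsonb
) PARTITION BY RANGE (log_date);

CREATE TABLE daily_logs_2020 PARTITION OF daily_logs
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');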

Sources