This is based on one of the tech quarterly meetings I gave at FormulaFolios to present my solution from the FormulaFolios Data Processing Optimization research and fix in September 2020. The following process can be done on MySQL or PostgreSQL, but I reckon it applies to any relational database.
Process of Database Fine Tuning
- First we diagnose what the problem is.
- Second we analyze the costs and types of the queries.
- Finally we implement our fixes.
Diagnose 🧪
You want to look for the problematic queries first. Ask yourself the following questions:
- How slow are they?
- How often are they called?
- Does it affect users?
When checking how slow the queries are, you can use EXPLAIN or EXPLAIN ANALYZE on your SELECT query to see its cost analysis in detail.
If you’re on Ruby on Rails, there is an explain method built into ActiveRecord. If you can’t access your database terminal and are in your Rails console instead, you can run it like the following:
Model.where(type: "test", power: 3).joins(:model_types).explain
It’ll return the query along with a table of the cost analysis.
Analyze 🔬
Cost is the most important thing to check.
Let’s take the following query as an example:
EXPLAIN ANALYZE
SELECT "products".* FROM "products" WHERE (symbol = 'SPY')
You’ll probably get something like:
Seq Scan on products (cost=0.00..33.98 rows=1 width=99)
Filter: ((symbol)::text = 'SPY'::text)
Cost is measured in arbitrary units, conventionally anchored to the cost of a sequential disk page fetch. The first number is the start-up cost and the second number is the total cost to return all the rows.
In general, for most queries, the total cost is what matters.
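As a concrete illustration, EXPLAIN ANALYZE also prints actual timings next to the planner's estimates (the numbers below show the output shape only and are illustrative, not real measurements):

```sql
EXPLAIN ANALYZE SELECT "products".* FROM "products" WHERE (symbol = 'SPY');
-- Seq Scan on products  (cost=0.00..33.98 rows=1 width=99)
--                       (actual time=0.021..0.195 rows=1 loops=1)
--   Filter: ((symbol)::text = 'SPY'::text)
--   Rows Removed by Filter: 1918
-- Planning Time: 0.080 ms
-- Execution Time: 0.210 ms
```

When the estimated row counts and the actual rows diverge wildly, the planner's statistics may be stale, and running ANALYZE on the table can help.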
Types of Scans
Further reading on the types of scans can be found at https://www.cybertec-postgresql.com/en/postgresql-indexing-index-scan-vs-bitmap-scan-vs-sequential-scan-basics/ if you’re interested in the details of SQL scans and their performance, but I’ll summarize the basic three here.
Sequential Scan
- Standard O(n) iteration, one row at a time
- Used when no indices exist or when the majority of the table will be returned
Bitmap Scan
- Selects many rows, but not so much of the table that a sequential scan would be cheaper
- Runs in roughly logarithmic time
Index Scan
- Returns only a small handful of rows selected from a table
- Runs in O(log n) time via the B-tree, effectively constant in practice
Notes Regarding Scans
If you don’t have much data, the planner will most likely use a sequential scan regardless.
If the query is already using an index or bitmap scan, then it is likely already well optimized.
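To see the difference in practice, adding an index on the filtered column typically flips the plan from a sequential scan to an index scan once the table is large enough (assuming the products example from earlier; the cost numbers are illustrative):

```sql
CREATE INDEX index_products_on_symbol ON products (symbol);

EXPLAIN SELECT "products".* FROM "products" WHERE (symbol = 'SPY');
-- Index Scan using index_products_on_symbol on products
--   (cost=0.28..8.29 rows=1 width=99)
--   Index Cond: ((symbol)::text = 'SPY'::text)
```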
What to Fix
Knowing what you know about scans now, ask yourself the following questions about what queries you should prioritize:
- Is it a scheduled job that runs in the middle of the night and doesn’t bring a server down?
  - Not a problem
- Is it something on a high-traffic page that retrieves data whenever a user clicks a button?
  - Problematic
  - Imagine the domino effect of such queries piling up in the queue until all the resources are exhausted
  - This will definitely affect your website negatively, causing slowdowns or even bringing it down
An important rule of thumb, from Donald Knuth, is to “forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”
Unless something demonstrably needs to be optimized, we should not prematurely optimize everything: developer time, research, and effort are expensive and delay products from being shipped, especially when we are not sure whether the product in question will be short-lived (deleted later) or built to stay.
Fixes 🔧
There are several options to improve database performances:
- Insert indices for the offending table
- Cache query to avoid multiple large calls to the database
- Partition the tables
- Reduce frequency the table is called if possible
- Optimize your queries
The solution I utilized while optimizing the data processing jobs was a combination of inserting indices and implementing a cache on an expensive query. I’ll go over the types of indices and some of their pros/cons, as well as caching.
Indexing
Generally, if you’re going to look up a certain table by a certain key, you’ll want an index on that key. If you’re going to use multiple keys for lookups, it is best to have a composite index. A composite index is read from left to right.
For example, suppose a certain callback on a table always makes these queries:
SELECT * FROM personnels WHERE staff_id = 1 AND office_id = 2 AND department_id = 42
SELECT * FROM personnels WHERE staff_id = 3 AND office_id = 3
SELECT * FROM personnels WHERE staff_id = 2
We’ll want a composite index on the following in this order:
staff_id
office_id
department_id
It would be redundant to have an index on each of them alone, or even on two of them together, as the composite index on all three covers every one of these cases: since the index is read from left to right, any leftmost prefix of its columns can use it. (A query on office_id alone, however, would not.)
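A sketch of that single covering index in SQL (assuming PostgreSQL syntax; the index name is arbitrary):

```sql
-- Any leftmost prefix of (staff_id, office_id, department_id)
-- can use this one index, so it serves all three queries above.
CREATE INDEX index_personnels_on_staff_office_department
  ON personnels (staff_id, office_id, department_id);
```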
Things to Keep in Mind
Only use indices as you need them:
- Where clause uses only 1 column
  - Use a single index
- Where clause has multiple columns
  - Use a composite index
- Multiple where clauses on single columns plus one multi-column clause
  - Use a composite index
The more indices you have, the more disk space overhead you incur and the slower updates and inserts become.
The more records you have, the longer it takes to add an index after the fact. Building an index will LOCK DATABASE TABLES, so if you are releasing indices on production, SWITCH TO MAINTENANCE MODE if you are expecting a large new index build on an existing table with A LOT of production data.
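On PostgreSQL specifically, one mitigation is to build the index without blocking writes (this is a standard PostgreSQL feature, though it is slower and has its own caveats):

```sql
-- Builds the index without taking a write-blocking lock on the table.
-- It cannot run inside a transaction block, and if the build fails it
-- leaves behind an INVALID index that must be dropped and retried.
CREATE INDEX CONCURRENTLY index_personnels_on_staff_id
  ON personnels (staff_id);
```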
When to Avoid Using Indices
- The table is small and will stay that way
- Tables with a lot of large batch inserts/updates and not many reads
- Columns with a lot of null values
- Frequently manipulated columns
Types of Indices
Generally you’ll pick between a single or composite index, but sometimes you have more complex search patterns, or need to match specific text within large text data, which may benefit from a *_pattern_ops type of index or a trigram index.
Single Index
- B-tree index based on 1 table column
Composite Index
- B-tree index based on multiple table columns
Other Index Types and Operator Classes
- TRGM Index (PostgreSQL only)
  - GiST or GIN indices for trigram-based full text searches
  - query.where("symbol % :symbol", symbol: "TEST")
- varchar, text, and bpchar pattern_ops
  - B-tree index searches for left-anchored LIKE patterns on varchar, text, and char columns
  - query.where("symbol LIKE ?", "hi%")
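The corresponding index definitions might look like the following (assuming PostgreSQL and a products table with a varchar symbol column; the index names are arbitrary):

```sql
-- Trigram index for fuzzy "%" similarity searches; requires pg_trgm.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX index_products_on_symbol_trgm
  ON products USING gin (symbol gin_trgm_ops);

-- Pattern-ops B-tree index for left-anchored LIKE 'hi%' searches,
-- useful when the database locale is not C.
CREATE INDEX index_products_on_symbol_pattern
  ON products (symbol varchar_pattern_ops);
```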
Caching
There are various methods of caching in Rails that you can try; on the FormulaFolios Data Processing Optimization project we used low-level caching. The check for whether an account has all of its logs updated from the latest trades data calls a very expensive SQL query, so we only want to run it once per account as we update each daily log (hundreds per account).
Rails.cache.fetch("account_has_been_updated_#{account.id}", expires_in: 10.minutes) do
# EXPENSIVE QUERY CALL
end
Do note that if this call is dependent on the newest state after an update, caching will not work because we’d be storing an outdated state when calculating the latest value.
In the case of my project, this was completely valid, as the state at the time of the call remains valid until the next overnight job (the next 24 hours) and this query is only called during that overnight job.
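Under the hood, Rails.cache.fetch follows the cache-aside pattern: return the cached value on a hit, otherwise run the block and store its result. A minimal sketch of that behavior with a plain in-memory Hash (the SimpleCache class here is illustrative, not how Rails implements it; the real store also handles serialization and backends like Redis or Memcached):

```ruby
class SimpleCache
  Entry = Struct.new(:value, :expires_at)

  def initialize
    @store = {}
  end

  # Return the cached value if present and not expired; otherwise run
  # the block, store its result with a TTL, and return it.
  def fetch(key, expires_in:)
    entry = @store[key]
    return entry.value if entry && Time.now < entry.expires_at

    value = yield
    @store[key] = Entry.new(value, Time.now + expires_in)
    value
  end
end

cache = SimpleCache.new
calls = 0
2.times do
  cache.fetch("account_has_been_updated_1", expires_in: 600) do
    calls += 1 # the "expensive query" body runs only on a cache miss
    :updated
  end
end
# calls is 1: the second fetch hit the cache
```

This also makes the staleness trade-off above concrete: until expires_at passes, fetch returns whatever was stored, even if the underlying data has since changed.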
Warning about Caching
Caching issues are hard to test and caching bugs are hard to debug. Be very careful: when testing a scenario it may seem OK, but a user can cache a buggy state, making it hard to diagnose.
Further Studies
There were many other solutions I considered while researching improvements to database speed and optimization to complete the overnight jobs.
Other things that can be considered are the following:
- Sharding vs Partitioning
Common scenarios to use partitioning:
- You have more than 1 million rows, access them by range, and know why you want to use it
- I/O time and missing indices are more likely issues than partitioning
See Rules of Thumb for MySQL Under Partitioning
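For reference, PostgreSQL 10+ supports declarative range partitioning; a brief sketch (the daily_logs table and column names here are illustrative, not from the project):

```sql
-- Parent table is partitioned by a date range; rows are routed to
-- whichever child partition covers their logged_on value.
CREATE TABLE daily_logs (
  id         bigserial,
  account_id bigint,
  logged_on  date NOT NULL
) PARTITION BY RANGE (logged_on);

CREATE TABLE daily_logs_2020 PARTITION OF daily_logs
  FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');
```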
Sources
- Guide for explaining SQL indices to developers
- Utilizing PostgreSQL trigram index for faster text search (PostgreSQL only)
- PostgreSQL health analysis tools
- Meanings behind all the metrics & numbers for measuring query performance
- Types of scans (sequential, bitmap, index)
- Useful tips, tricks, and suggestions for improving database performances