from PoisonedPrisonPanda@discuss.tchncs.de to programming@programming.dev on 02 May 22:39
https://discuss.tchncs.de/post/35558678
hi my dears, I have an issue at work where we have to work with millions (~150 million) of product data points. We are using SQL Server because it was available in-house for development. However, with various tables growing beyond 10 million rows the server becomes quite slow and wait/buffer time exceeds 7,000 ms per second, which is tearing down our complete setup of various microservices that read, write and delete from the tables continuously. All the Stack Overflow answers lead to: it's complex, read a 2000-page book.
The thing is, my queries are not that complex. They simply go through the whole table to identify any duplicates, which are then not processed further, because the processing takes time (which we thought would be the bottleneck). But the time saved by not processing duplicates now seems to be less than the time it takes to compare batches against the SQL table. The other culprit is that our server runs on an HDD, which at roughly 150 MB/s read and write is probably at its limit.
The question is: is there a wizard move to bypass any of my restrictions, or is a change in the setup and algorithm inevitable?
edit: I know my question seems broad, but as I am new to database architecture I welcome any input and discussion, since the topic itself is a lifetime of know-how in its own right. Thanks for every bit of feedback.
Sounds like some changes would be a good idea.
Haha. Relating to a switch to SSD? Or in which direction?
sounds like lots of directions:
Yes. The problem is, we are fetching products from an API, and since we thought processing power would be the limiting factor, we thought that sorting out duplicates would reduce the load.
But since the different microservices which process the data take different amounts of time, we are using the SQL tables as a pool. This should help with upscaling by using multiple microservices.
Cloud services are not yet an option, as we are still in development.
This is an exceptionally good answer and you're doing everything possible to avoid doing it, when you could have been halfway done with the book by now, probably. Database administration is a profession, not a job. It requires specialized training to do it well, and doing everything possible to avoid that training and knowledge won't help you one bit.
It doesn't matter. Your database is very complex.
You search 10 million records on every request and you wonder why it's slow?
No. Database administration is very difficult. Reading that 2000 page book is essential for setting up infrastructure to avoid a monolithic setup like this in the first place.
lol wtf
Realistically, this setup is 10 years too old. How large is your database? Is there any reason why it can't be run in memory? 10 million lines isn't insurmountable. Full text with a moderate number of tables could be ~10 GB; no reason that can't be run in memory with Redis or another in-memory database, or to update to a more modern in-memory database solution like Dice.
Your biggest problem is the lack of deduplication and normalization in your database design. If it's not fixed now, it'll simply get worse YOY until it's unusable. Either spend the time and money now, or spend even more time and money later to fix it. 🤷‍♂️
tl;dr: RTFM.
Sort of harsh approach, but I get it.
Though I did learn the most when I had a lot of data and performance issues to deal with.
Studying Postgres in that job was the absolute best part; I learned so much, and now I can't find a problem Postgres can't fix.
There was a running joke in my last office that I was paid to promote Pg because every time MySQL fucked something up, I would bring up how Postgres would solve it. I even did several presentations.
Then we migrated to Postgres and suddenly everything is stable as a rock, even under worse conditions and way more data.
I just love Postgres so much.
Sometimes it feels like postgres is cheating (in a good way)
Compared to MySQL most definitely.
Granted, Oracle has pushed some fresh air into it, but still it has a long way to go.
To paraquote H. L. Mencken: For every problem, there is a solution that's cheap, fast, easy to implement, and wrong.
Silver bullets and magic wands don't really exist, I'm afraid. There are ample reasons why DBAs are well-paid people.
There are basically three options: either increase the hardware capabilities to handle the amount of data you want to deal with, decrease the amount of data so that the hardware you've got can handle it at the level of performance you want, or… live with the status quo.
If throwing more hardware at the issue was an option, I presume you would just have done so. As for how to viably decrease the amount of data in your active set, well, that's hard to say without knowledge of the data and what you want to do with it. Is it a historical dataset or time series? If so, do you need to integrate the entire series back until the dawn of time, or can you narrow the focus to a recent time window and shunt old data off to cold storage? Is all the data per sample required at all times, or can details that are only seldom needed be split off into separate detail tables that can be stored on separate physical drives at least?
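If it does turn out to be time-series-like data where only a recent window matters, the "shunt old data to cold storage" idea could look roughly like the sketch below. This is only an illustration; the table and column names (dbo.Products, dbo.ProductsArchive, CreatedAt) are made up and the 90-day cutoff is arbitrary.

```sql
-- Move rows older than 90 days into an archive table, then remove them
-- from the hot table. Assumes dbo.ProductsArchive already exists with a
-- matching schema; names and the cutoff are placeholders.
BEGIN TRANSACTION;

INSERT INTO dbo.ProductsArchive (ProductId, Sku, Payload, CreatedAt)
SELECT ProductId, Sku, Payload, CreatedAt
FROM dbo.Products
WHERE CreatedAt < DATEADD(DAY, -90, SYSUTCDATETIME());

DELETE FROM dbo.Products
WHERE CreatedAt < DATEADD(DAY, -90, SYSUTCDATETIME());

COMMIT TRANSACTION;
```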
While I get that SO can be monstrously unhelpful, database optimization is a whole profession so I think we need a bit more to help
A few directions we could go here: Post your SQL query. This could be a structure or query issue. Best case, we could do some query optimization. Also, have you looked into indexing?
Where are your bottlenecks coming from? Is your server designed for an I/O-intensive workload like databases? Sequential read speed is not a good metric.
What about concurrency? If this is super read/write intensive, optimization could depend on where data is written while you're reading.
What is the execution path? What indexes are being hit? What are the keys? Can you separate things by, for example, category since dupes wouldn't typically exist there? There are lots of potential things that might tell you more or improve performance, but this is super vague.
Database performance tuning is its own little world and there are lots of complexities and caveats depending on the database system.
With MSSQL, the first thing you should check is your indexes. You should have indexes on commonly queried fields and any foreign keys. It's the best place to start because indexing alone can often make or break database performance.
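For example, if the duplicate check filters on something like a SKU column and the joins go through a CategoryId foreign key (both names are assumptions, not from the thread), the usual first step would be:

```sql
-- Index the column(s) the duplicate check actually filters on.
CREATE NONCLUSTERED INDEX IX_Products_Sku
    ON dbo.Products (Sku);

-- Index foreign keys used in joins as well.
CREATE NONCLUSTERED INDEX IX_Products_CategoryId
    ON dbo.Products (CategoryId);
```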
What? Problems like this usually come down to some missing indexes. Can you view the query plan for your slow queries? See how long they are taking? IDK about SQL Server, but usually there is a command called something like ANALYZE that breaks down a query into the different parts of its execution plan, executes it, and measures how long each part takes. If you see something like "FULL TABLE SCAN" taking a long time, that can usually be fixed with an index.
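In SQL Server the rough equivalent is to turn on runtime statistics and look at the actual execution plan. A minimal sketch, with a placeholder query and column name:

```sql
-- Report logical reads and CPU/elapsed time per statement in the Messages pane.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Placeholder for one of the slow queries.
SELECT *
FROM dbo.Products
WHERE Sku = 'ABC-123';
```

In SSMS, enabling "Include Actual Execution Plan" before running the query shows whether it does an index seek or a full scan.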
If this doesn't make any sense to you, ask if there are any database gurus at your company, or book a few hours with a consultant. If you go the paid-consultant route, say you want someone good at SQL Server query optimization.
By the way, I think some people in this thread are overestimating the complexity of this type of problem, or are maybe unintentionally spreading FUD. I'm not a DB guru, but I would say that by now I'm somewhat clueful, and I got that way mostly by reading the SQLite docs, including the implementation manuals, over a few evenings. That's probably a few hundred pages, but not 2000 or anything like that.
First question: how many separate tables does your DB have? If less than say 20, you are probably in simple territory.
Also, look at your slowest queries. They likely say SELECT something FROM this JOIN that JOIN otherthing bla bla bla. How many different JOINs are in that query? If just one, you probably need an index; if two or three, it might take a bit of head scratching; and if 4 or more, something is possibly wrong with your schema or how the queries are written and you have to straighten that out.
Basically from having seen this type of thing many times before, there is about a 50% chance that it can be solved with very little effort, by adding indexes based on studying the slow query executions.
MS SQL is trash
Indexes are great but probably donât get you far if it is already really slow.
Running anything on an HDD is a joke
You read, write and compare continuously? Did you try to split it into smaller chunks?
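For example, a big cleanup or comparison pass can be chopped into batches so each transaction stays short; a rough T-SQL sketch with made-up table and column names:

```sql
-- Delete already-processed rows 10,000 at a time instead of in one
-- giant statement, keeping locks and the transaction log manageable.
DECLARE @rows INT = 1;

WHILE @rows > 0
BEGIN
    DELETE TOP (10000) FROM dbo.StagingProducts
    WHERE Processed = 1;

    SET @rows = @@ROWCOUNT;
END;
```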
I'd prefer MS SQL over Oracle SQL any day. And PG SQL over both of them.
Could you tell me what book it is?
Lotta smarter people than me have already posted better answers in this thread, but this really stood out to me:
Why aren't you de-duping the table before processing? What's inserting these duplicates, and why are they necessary to the table? If they serve no purpose, find out what's generating them and stop it, or write a pre-load script to clean it up before your core processing queries access that table. I'd start here: it sounds like what's really happening is that you've got a garbage query dumping dupes into your table and bloating your DB.
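A typical pre-load cleanup in SQL Server keeps one row per key with ROW_NUMBER() and deletes the rest. The key column (Sku) and tie-breaker (ProductId) here are assumptions, not the OP's schema:

```sql
-- Keep the first row per Sku, delete every later duplicate.
WITH Ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Sku ORDER BY ProductId) AS rn
    FROM dbo.Products
)
DELETE FROM Ranked
WHERE rn > 1;
```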
Indexes and pagination would be good starts
Do you remember the part of education where they talked about tradeoffs? How making decision A means x, y, z good things and a, b, c bad things? Because it's reading strongly like your system design methodology was "this is the path of least resistance, so I'm doing that".
Most code is not complex. Good code is usually very easy to read and understand.
Just because you can read and understand the queries you wrote doesn't mean they're efficient or that you're using good design.
So yes. Stack Overflow is going to tell you to RTFM. Because someone needs to sit down with this mess, determine the pros and cons of the system design, and figure out where to start overhauling.
If you are new to something and want to learn, seek resources and educate yourself with them. Learning takes time, and there are no shortcuts.
A hot DB should not run on HDDs. Slap some NVMe storage into that server if you can. If you can't, consider getting a new server and migrating to it.
SQL Server can generate execution plans for you. For your queries, generate those and see if you're doing any operations that involve iterating over the entire table. You should avoid scanning an entire table with a huge number of rows when possible, at least during requests.
If you want to do some kind of dupe protection, let the DB do it for you. Create an index and a table constraint on the relevant columns. If the data is too complex for that, find a way to make it work, like generating and storing hashes, sorting lists/dicts, etc., just so that the DB can do the work for you. The DB is better at enforcing constraints than you are (when it can do so).
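Concretely, that could be a unique index on the natural key, or on a stored hash when "duplicate" is defined over several wide columns. All names below are placeholders, not the OP's schema:

```sql
-- Simple case: let SQL Server reject duplicate keys outright.
CREATE UNIQUE INDEX UX_Products_Sku ON dbo.Products (Sku);

-- Wider case: hash the columns that define a duplicate into a persisted
-- computed column and make that unique instead.
ALTER TABLE dbo.Products
    ADD RowHash AS CAST(HASHBYTES('SHA2_256',
        CONCAT(Sku, '|', Name, '|', SupplierId)) AS VARBINARY(32)) PERSISTED;

CREATE UNIQUE INDEX UX_Products_RowHash ON dbo.Products (RowHash);
```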
For read-heavy workflows, consider whether caches or read replicas will benefit you.
And finally, back to my first point: read. Learn. There are no shortcuts. You cannot get better at something if you don't take the time to educate yourself on it.
"They simply go through the whole table"… that's the problem. A full table scan should be avoided at all costs.
Learn: how to run and read an explain plan, indexes, keys, constraints, and query optimization (broadly you want to locate individual records as quickly as possible by using the most selective criteria).
You also need to learn basic schema design and to familiarize yourself with normalization.
Avoid processing huge result sets in your application. The database is good at answering questions about data it contains. It isnât just a big bucket to throw data into to retrieve later.
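For instance, the "which rows are duplicates" question from the original post is one the database can answer by itself instead of a microservice pulling and comparing millions of rows; again with an assumed Sku column:

```sql
-- Ask the database for the keys that occur more than once.
SELECT Sku, COUNT(*) AS DuplicateCount
FROM dbo.Products
GROUP BY Sku
HAVING COUNT(*) > 1;
```

With an index on Sku, this reads the much narrower index rather than the whole table.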