Advice on choosing technology for a large project (600 GB+ database)

Hello guys, I'm soon going to develop a project for a client that has a large database (over 600 GB). We will need to create a new system and migrate to a new database system.

I need consistency (no missing keys or invalid data), uptime, and no strange bugs. I need to do some updates (no deletes) and a lot of reads and writes. I also need to generate statistics, so I will need to be able to filter large amounts of data and build statistics from it.

The programming language will be PHP. Does anyone have experience with large databases?

Thank you

wullinkm

Hi Carlos,

Cool, that sounds like a challenging project! It's not so easy to provide a single solution that fits all scenarios. It depends a lot on the database as well; how much data is in each table? Another important question is: do you really need such a big database? Often it makes sense to archive data to reporting-only databases (a rough sketch of that is below). Supporting 600 GB of data with reads and writes is definitely not easy.
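
For example, a nightly archiving job can be as simple as copying old rows into an archive table and removing them from the hot table. A minimal sketch in PHP/PDO, assuming a hypothetical `orders` table with a `created_at` column and an `orders_archive` table with the same schema:

```php
<?php
// Hypothetical archiving job: move rows older than two years from the
// hot table into an archive table. Table and column names are assumptions.
$pdo = new PDO('pgsql:host=localhost;dbname=app', 'app_user', 'secret');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$cutoff = (new DateTimeImmutable('-2 years'))->format('Y-m-d');

$pdo->beginTransaction();
try {
    // Copy the old rows into the archive table (same schema assumed).
    $pdo->prepare('INSERT INTO orders_archive SELECT * FROM orders WHERE created_at < :cutoff')
        ->execute(['cutoff' => $cutoff]);

    // Remove them from the hot table so day-to-day queries stay small.
    $pdo->prepare('DELETE FROM orders WHERE created_at < :cutoff')
        ->execute(['cutoff' => $cutoff]);

    $pdo->commit();
} catch (Throwable $e) {
    $pdo->rollBack();
    throw $e;
}
```

On PostgreSQL you could also look at declarative partitioning by year, which makes detaching old data even cheaper than row-by-row deletes.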

Where is the data currently stored? Both MySQL and PostgreSQL are good databases that can handle a lot of data. Migrating that quantity of data to another database will take a long time, so if their data is already in MySQL, it's definitely easier to leave it there.

I hope that points you a bit in the right direction, but if you have more questions, hit me up!

Carlos Branco (Author)

Thanks for your comment. I'm thinking about PostgreSQL as well. Maybe archiving old data (like previous years) is also a good idea. Thanks again!

Sumit Datta

Hello Carlos, I have been part of teams handling data of this size or even larger. Here are some pointers I can share:

  • The size of the database matters less than the ingress/egress (write/read) throughput you need
  • Measure the current throughput so you know what you need now and might need in the future
  • INDEX, Index, index - yes, really
  • Find the slow queries and optimize, starting with the slowest and most-used ones first (see the first sketch after this list)
  • Use a separate search appliance like Solr/Elasticsearch if your users do search
  • If you have the budget and are OK with a cloud-managed database, get that - they are optimized for these loads
  • Egress is easier to scale with read replicas
  • PostgreSQL supports parallel queries if you need that for high read/query loads
  • Cache heavily with Memcached/Redis and make sure to fine-tune cache invalidation to your needs (see the caching sketch below)
  • Depending on your use case, you can run multi-node caches bound to app servers or a central cache, whatever works
  • HTTP handlers should not do compute-intensive work; hand it to a task queue (RabbitMQ/Redis) instead (see the queue sketch below)
  • If you have async patterns (notifications/chat/etc.) or "server push", then embrace async in your application layer
  • Analytics/statistics should not be done on the production database; use a data warehouse for that
  • If you process data in multiple stages after user input or have streaming data patterns, look at stream processors like Kafka
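
On the slow-query point, here is a minimal sketch of the loop I mean, assuming PostgreSQL 13+ with the pg_stat_statements extension enabled (the table, column, and index names are made up for illustration):

```php
<?php
// Find the queries that eat the most time, inspect one, then index for it.
$pdo = new PDO('pgsql:host=localhost;dbname=app', 'app_user', 'secret');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// 1. List the queries that consume the most total time.
$slow = $pdo->query(
    'SELECT query, calls, mean_exec_time
       FROM pg_stat_statements
      ORDER BY total_exec_time DESC
      LIMIT 10'
)->fetchAll(PDO::FETCH_ASSOC);

foreach ($slow as $row) {
    printf("%8.2f ms avg, %d calls: %s\n", $row['mean_exec_time'], $row['calls'], $row['query']);
}

// 2. Inspect one suspect query plan to see whether it does a sequential scan.
$plan = $pdo->query(
    "EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42 AND created_at >= '2020-01-01'"
)->fetchAll(PDO::FETCH_COLUMN);
echo implode("\n", $plan), "\n";

// 3. If it does, add an index that matches the filter. CONCURRENTLY avoids
//    blocking writes while the index is built (it must run outside a transaction).
$pdo->exec('CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_customer_created
            ON orders (customer_id, created_at)');
```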
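
On caching, a cache-aside sketch using the phpredis extension, assuming a hypothetical monthly revenue report (key names, TTLs, and the query are just examples):

```php
<?php
// Cache-aside: serve the report from Redis when possible, otherwise compute
// it once, cache it, and invalidate explicitly when the data changes.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$pdo = new PDO('pgsql:host=localhost;dbname=app', 'app_user', 'secret');

function monthlyRevenue(PDO $pdo, Redis $redis, string $month): array
{
    $key = "report:monthly_revenue:$month";

    // Serve from cache when we can.
    $cached = $redis->get($key);
    if ($cached !== false) {
        return json_decode($cached, true);
    }

    // Otherwise compute it once and cache the result for an hour.
    $stmt = $pdo->prepare(
        "SELECT date_trunc('day', created_at) AS day, SUM(total) AS revenue
           FROM orders
          WHERE to_char(created_at, 'YYYY-MM') = :month
          GROUP BY 1 ORDER BY 1"
    );
    $stmt->execute(['month' => $month]);
    $report = $stmt->fetchAll(PDO::FETCH_ASSOC);

    $redis->setex($key, 3600, json_encode($report));
    return $report;
}

// Invalidate whenever the underlying data changes for that month.
function invalidateMonthlyRevenue(Redis $redis, string $month): void
{
    $redis->del("report:monthly_revenue:$month");
}
```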
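
And on the task queue, a bare-bones sketch using a Redis list, again with phpredis. In a real Lumen/Laravel app you would more likely use its queue component with a Redis or RabbitMQ driver; this only shows the shape of the pattern (the queue name and job payload are made up):

```php
<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Producer side: the HTTP handler only enqueues the job and returns quickly.
$redis->lPush('jobs:reports', json_encode([
    'type'  => 'monthly_revenue',
    'month' => '2021-05',
]));

// Worker side (a separate long-running CLI process): block until a job
// arrives, then do the heavy work outside the request/response cycle.
while (true) {
    $item = $redis->brPop(['jobs:reports'], 30); // [queue, payload], or empty on timeout
    if (!$item) {
        continue;
    }
    $job = json_decode($item[1], true);
    // ... run the expensive query / build the report here ...
}
```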

Hope that helps, Sumit


Edit:

  • Added point on slow queries
Carlos Branco (Author)

Thank you for your advice.

Ryan Glass

This is gold. Good knowledge!

Krunal Shah

I worked on a similar kind of project (an 80 GB database). Here is what you can do infrastructure-wise.

The 80 GB of data was divided across multiple tables, so I built microservices with Lumen (Swoole) to expose a REST API and consumed it with React on the frontend.

  • DB: RDS with separate instances for read and write operations (see the connection sketch after this list).
  • ECS/ECR: To host the Lumen/Swoole microservices.
  • Amplify: To host the React frontend.
  • S3: To host the static files and DB backups.
  • Redis: ElastiCache to store the report output.
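
On the read/write split: Laravel/Lumen's database config can route reads to replicas and writes to the primary for you. A sketch of the connection entry in `config/database.php`, with hypothetical hostnames (shown for the `pgsql` driver since that's what Carlos is considering; the same structure works for `mysql` on RDS):

```php
<?php
// config/database.php (excerpt). SELECTs go to one of the 'read' hosts,
// everything else goes to the 'write' host.
return [
    'connections' => [
        'pgsql' => [
            'driver' => 'pgsql',
            'read' => [
                'host' => ['replica-1.internal', 'replica-2.internal'],
            ],
            'write' => [
                'host' => ['primary.internal'],
            ],
            // After a write, keep the same request on the write connection so
            // it can read its own changes before replication catches up.
            'sticky'   => true,
            'port'     => 5432,
            'database' => env('DB_DATABASE', 'app'),
            'username' => env('DB_USERNAME', 'app_user'),
            'password' => env('DB_PASSWORD', ''),
            'charset'  => 'utf8',
        ],
    ],
];
```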
