Schema migrations are now common practice in the relational world. The best practices for evolving a database schema are well known: a migration is applied before the code that depends on it is rolled out, and if anything goes wrong, the migration can be rolled back. The same practices are not as well established in the Big Data world. Part of the reason may be the diversity of tools and use cases; another is scale, since a migration at petabyte scale is not something that can be approached casually.
One desire I see a lot is to keep the data warehouse schema in version control. There are established best practices around version control tools, such as code review and continuous integration, and it would be nice for the warehouse schema to benefit from those practices too. However, the checked-in schema inevitably drifts out of sync with production: someone creates a view and doesn't check it in, someone else grants extra permissions to a user group, and so on. The goal, then, is to ensure the version-controlled schema always matches the current live schema.
Most data warehouses provide a way to view a table's schema. Here is how to do it in Hive:
SHOW CREATE TABLE tablename
Doing the same in Amazon Redshift is a little more complicated, but can be accomplished using this helpful SQL view. Both approaches operate on a single table, but you could set up a simple script that iterates over all tables and views to generate similar output, along with the necessary GRANT statements. The goal is to have a script that can be run to regenerate the data catalog if that ever becomes necessary.
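As a rough illustration, here is a minimal sketch of such a dump script for Hive, driving the hive command-line client from Python. The database name and output path are placeholders, and a real version would also emit GRANT statements and cover views.

#!/usr/bin/env python3
"""Sketch: dump the CREATE statement for every table in a Hive database
into one file that can be checked into version control.
Assumes the hive CLI is on PATH; database and file names are placeholders."""
import subprocess

DATABASE = "analytics"                # placeholder database name
OUTPUT_FILE = "schema/analytics.sql"  # placeholder path inside the repository

def run_hive(query):
    # hive -S suppresses log chatter, -e runs a query and prints the result
    result = subprocess.run(["hive", "-S", "-e", query],
                            capture_output=True, text=True, check=True)
    return result.stdout

tables = run_hive(f"USE {DATABASE}; SHOW TABLES;").split()

with open(OUTPUT_FILE, "w") as out:
    for table in tables:
        ddl = run_hive(f"USE {DATABASE}; SHOW CREATE TABLE {table};")
        out.write(f"-- {DATABASE}.{table}\n{ddl.strip()};\n\n")

A Redshift version would iterate over the system catalog and call the DDL-generating view instead, but the overall shape of the script is the same.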
At this point the advantage of having such a script may not be obvious: after all, the database itself is the source of truth and can be restored from a backup in an emergency. True, but keeping the scripts in source control means they can go through your standard change process (pull requests, code review, linking each change to a ticket, and so on). The schema changes can also travel alongside related code changes, for example to the ETL pipelines feeding these tables, which gives better visibility.
The last step is to set up verification that the source repository is in sync with the production catalog. This can be done with the Continuous Integration (CI) server of your choice: a simple Jenkins job that runs a diff between the script output and the version-controlled file every night (or every hour) and fails the build if the files don't match. The trickiest part is setting up credentials in a secure manner, since CI systems are typically isolated from data catalogs.
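The check itself can be as simple as a diff. Here is a sketch of what such a CI job might run; the file paths and the dump_schema.py helper (a stand-in name for the script sketched earlier) are hypothetical, and credentials should come from the CI system's secret store rather than the repository.

#!/usr/bin/env python3
"""Sketch of the nightly sync check: regenerate the schema dump from the live
warehouse and fail the build if it differs from the checked-in file.
dump_schema.py and its --output flag are hypothetical names."""
import subprocess
import sys

CHECKED_IN = "schema/analytics.sql"    # tracked in the repository
GENERATED = "/tmp/analytics_live.sql"  # freshly dumped from production

subprocess.run(["python3", "dump_schema.py", "--output", GENERATED], check=True)

# diff exits with 0 when the files match and 1 when they differ
diff = subprocess.run(["diff", "-u", CHECKED_IN, GENERATED])
if diff.returncode != 0:
    print("Live schema has drifted from the version-controlled schema.")
    sys.exit(1)
print("Schema is in sync.")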
There are further steps that could be taken at this point, though this is as far as I have seen it taken in practice. More automation is possible, but it is typically not worth the effort given the number of tables and administrators involved. One possible improvement is to build a staging version of the data warehouse (without any of the data) in a CI job, run the ALTER statements from a proposed migration against it, and verify that the final state matches the expected one. This would add further safety to the schema evolution process and catch mistakes that slip through code review.
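A rough sketch of what that verification might look like, assuming a HiveServer2 staging instance reachable over JDBC (the URL, migration file names, and the dump_schema.py helper are all hypothetical):

#!/usr/bin/env python3
"""Sketch of the staging check: rebuild the current schema on an empty staging
database, apply a proposed migration's ALTER statements, then dump the result
and compare it to the expected end state. All names here are placeholders."""
import subprocess
import sys

STAGING_URL = "jdbc:hive2://staging-host:10000/staging_db"  # placeholder

def run_sql_file(path):
    # beeline -u <jdbc url> -f <file> runs a SQL script against HiveServer2
    subprocess.run(["beeline", "-u", STAGING_URL, "-f", path], check=True)

run_sql_file("schema/analytics.sql")            # current schema, no data
run_sql_file("migrations/add_new_column.sql")   # proposed ALTER statements

# Dump the staging schema (hypothetical flags on the earlier dump script)
# and compare it to what we expect the migration to produce.
subprocess.run(["python3", "dump_schema.py", "--url", STAGING_URL,
                "--output", "/tmp/staging.sql"], check=True)
diff = subprocess.run(["diff", "-u", "schema/expected_after_migration.sql",
                       "/tmp/staging.sql"])
sys.exit(diff.returncode)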
Have you done something similar at your company, or have you found an alternative solution to this problem? I’d love to hear about it!