
When VM Data Goes Missing


It was a tough spot. One of our valued clients, who I've worked with for a while, recently had a major headache: a key virtual machine (VM) on their office server lost important data. Everything was gone – all the VM configuration files and critical binary files. This post will walk you through how I stepped in to help them get everything back, what I learned from the experience, and how I'm now working with them to make sure this never happens again.

The Incident

It all started during what should have been routine virtual machine management at my client's site. Without anyone deleting anything on purpose, a system anomaly led to the inadvertent removal of critical files from an office server VM. This immediately caused problems with their services, disrupted daily internal operations, and raised significant concerns about their data integrity.

The Recovery Process

My recovery plan for the client was clear: to restore the entire VM's functionality, including its configuration, binary files, and any application data like MySQL tables. Just copying files wasn't an option, as crucial setup information was missing. I had to approach this with precision and care.

Getting Ready for Recovery

  1. Initial Assessment and Containment: As soon as I was brought in, my first priority was to quickly assess the full extent of the data loss and immediately contain the incident. This was crucial to prevent any further data corruption or potential spread of the issue.

  2. Set Up the Recovery Environment: I established a safe, isolated recovery environment at the client's location. First, I took a backup of what was left on the affected system. This involved securing all their existing backups too, and making sure I had the right permissions to do the work without affecting their live systems.

1. Attempting Direct Disk Recovery and VM Reconstruction

This phase involved my initial, fastest attempt at recovery directly from the affected disk, followed by VM reconstruction if needed.

  1. Attempting Direct File Recovery with TestDisk (My First Line of Defense):

    My client's last full VM backup was 15 days old, meaning a direct restore would lead to a significant data loss gap. To minimize this, I first attempted to recover files directly from the deleted disk. This was my fastest path to potentially recovering the most recent data.

    The reason this approach holds promise is due to how file deletion typically works in Linux systems. When a file is "deleted" on a Linux filesystem, the data itself isn't immediately erased from the disk sectors. Instead, the operating system primarily removes the file's metadata (like its name, location, and size) from the filesystem's index. The sectors where the data resides are merely marked as available for new data. Until new data overwrites those sectors, the original data might still be present.

    I used TestDisk, a powerful free and open-source data recovery utility. It is designed primarily to recover lost partitions and make non-booting disks bootable again, but it is also highly effective at recovering deleted files.

    Here's a simplified overview of how I used TestDisk:

    • Installation: TestDisk was installed on a separate live Linux environment.

        sudo apt-get install testdisk
      
    • Disk Selection: I launched testdisk and selected the affected disk drive where the VM data was lost.

    • Partition Analysis: TestDisk was instructed to analyze the partition structure, looking for lost partitions or deleted files.

    • File Recovery: I navigated through the detected filesystem structure, searching for the critical VM configuration and binary files that were reported missing. TestDisk scans unallocated sectors and attempts to rebuild the file metadata, allowing recovery.
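This deleted-but-not-erased behaviour can be demonstrated without root on a scratch filesystem image (the image name and payload string below are purely illustrative):

```shell
# Build a tiny ext2 image from a directory, "delete" a file with debugfs,
# then show the file's bytes are still present in the raw image.
mkdir -p demo_root
echo "recoverable-payload" > demo_root/note.txt

# Create a 1 MiB image; -d populates it from demo_root without mounting
# (requires e2fsprogs >= 1.43), -F allows operating on a regular file
truncate -s 1M scratch.img
mke2fs -q -F -t ext2 -d demo_root scratch.img

# Remove the directory entry, just as rm would on a mounted filesystem
debugfs -w -R "rm /note.txt" scratch.img 2>/dev/null

# The data blocks were only marked free, not erased: grep still finds them
grep -a -c "recoverable-payload" scratch.img
```

Until those freed blocks are reused, a tool like TestDisk can often reassemble the file, which is why taking the affected disk out of service quickly matters so much.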

Outcome of TestDisk:

Unfortunately, despite my best efforts and TestDisk's capabilities, in this specific instance the tool did not yield the desired results for the critical VM configuration and binary files. This suggested that the sectors containing the crucial metadata or parts of the files might have already been overwritten, or were too fragmented for a complete recovery in this complex VM environment. While this was a setback, it was a necessary and rapid first attempt to minimize data loss.

2. Identifying Post-Backup Changes and Data Location

After TestDisk didn't provide a full recovery, I shifted focus to identifying what data might have changed since the 15-day-old backup. I determined that:

  • Most of the core system configuration files would likely be the same as in the backup.

  • The Gitea repository (for code) and its commit data would definitely have changed.

  • The MySQL database data would also have been updated.

Luckily, I found that the /var/lib directory on the corrupt disk still contained the data for both MySQL and Gitea. This was a crucial discovery, as these directories typically store the application's persistent data.
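As a quick sanity check at this stage, I verified the surviving directories directly (the /old mount point and the /dev/sdb1 device name below are assumptions for illustration):

```shell
# Mount the corrupt disk read-only so nothing on it can be overwritten
sudo mount -o ro /dev/sdb1 /old

# Confirm the application data survived and gauge how much there is
ls /old/var/lib/mysql /old/var/lib/gitea
du -sh /old/var/lib/mysql /old/var/lib/gitea
```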

3. VM Reconstruction and Configuration Restoration

Since a direct recovery wasn't fully successful, I proceeded to recreate the VM environment.

  • New VM Setup: I set up a brand new Ubuntu 22 VM (matching the client's original operating system) to serve as the recovery target.

  • Package Installation: All necessary software packages were installed on this new VM.

  • Configuration Replacement: I then replaced the default configuration files on the new VM with the configuration files recovered from the 15-day-old backup. This brought the system's core settings back to a known good state.

Firewall Rules and Gitea Data Recovery:

  • Firewall Rules: The client's firewall rules were well-documented in their change logs. I manually recreated these rules on the new VM, a process that did not take much time.

  • Gitea Data Recovery: For the Gitea application, I simply replaced its data directory on the new VM with the Gitea data directory recovered from the corrupt disk's /var/lib location. After carefully managing file permissions and ownership, database recovery was the only remaining step before Gitea and the other services could come back online with their latest data.
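That replacement boils down to a few commands; a minimal sketch, assuming the corrupt disk is mounted at /old and Gitea runs under the conventional git service account:

```shell
# Copy the recovered Gitea data onto the new VM; -a preserves permissions,
# ownership and timestamps
sudo cp -a /old/var/lib/gitea /var/lib/

# "git" is Gitea's usual service user (an assumption here; confirm against
# the original install's app.ini and systemd unit)
sudo chown -R git:git /var/lib/gitea
sudo systemctl restart gitea
```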

4. MySQL Data Recovery – The Core Challenge

After the configurations were restored, the most important task was recovering the MySQL data. This required a detailed approach, especially since the .ibd files were crucial and directly copying them wasn't straightforward due to missing metadata.

  1. Database Structure Re-initialization:

    The foundational step was to re-establish the database's structural integrity. I did this by importing an SQL dump from the older, 15-day-old backup. This recreated all database schemas, table definitions, and initial configurations.

     sudo mysql < backup.sql
    
  2. User Privilege Configuration:

    Following the schema restoration, I recreated the necessary database user accounts for the client and reinstated their respective privileges. All user credentials were securely retrieved from backed-up configuration files.

  3. Understanding InnoDB and .ibd File Restoration:

    To effectively restore the client's MySQL database, I leveraged my basic understanding of the MySQL InnoDB storage engine, particularly its transactional properties and recovery mechanisms. This was crucial for handling the .ibd files.

    • Temporarily Disable Database Rules:

      To ensure a smooth import process and prevent any integrity errors, I temporarily turned off foreign key checks.

        SET FOREIGN_KEY_CHECKS = 0;
      
    • Discard Existing Data Links (Tablespaces): To prepare for the seamless import of the new .ibd files, I first told the database to forget where its current data files were. This effectively unlinks the table definitions from their associated data files, clearing the way for the new ones.

      Important Note: This operation disconnects the current data. I always ensure my clients have a complete and verified backup before performing this step!

        SELECT CONCAT('ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME, '` DISCARD TABLESPACE;') AS sql_command FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = '<DatabaseName>' AND ENGINE = 'InnoDB';
      

      I executed each command generated by this query individually to maintain precise control.

    • Copy and Secure the Data Files:

      After discarding the old links, I carefully copied recovered .ibd files from the corrupt disk into the corresponding database folder on the new VM. It was crucial to set the right ownership for these files. Incorrect permissions would have prevented MySQL from accessing them.

        sudo cp /old/var/lib/mysql/<DatabaseName>/*.ibd /var/lib/mysql/<DatabaseName>/ # Source path is on the corrupt disk
        sudo chown mysql:mysql /var/lib/mysql/<DatabaseName>/*
      
    • Import the New Data (Tablespaces):

      With the .ibd files correctly positioned and permissions accurately configured, the next step was to instruct MySQL to use these new data files. This action successfully linked the table designs with their actual data, bringing them back online.

        SELECT CONCAT('ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME, '` IMPORT TABLESPACE;') AS sql_command FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = '<DatabaseName>' AND ENGINE = 'InnoDB';
      

      I executed the resulting commands sequentially for each table.

    • Re-enable Database Rules: Upon the successful import of all data, I promptly turned the foreign key checks back on. This restored the rules that maintain data consistency and integrity within the database for all future operations.

        SET FOREIGN_KEY_CHECKS = 1;
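For both the DISCARD and IMPORT passes above, a safer pattern than running the generated statements ad hoc is to write them to a file, review it, and execute them one at a time; a sketch (the discard.sql file name is arbitrary, and <DatabaseName> is the same placeholder as before):

```shell
# -N drops column headers, so the file contains only the ALTER statements
mysql -N -e "SELECT CONCAT('ALTER TABLE \`', TABLE_SCHEMA, '\`.\`', TABLE_NAME, '\` DISCARD TABLESPACE;') FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = '<DatabaseName>' AND ENGINE = 'InnoDB';" > discard.sql

# Review discard.sql, then run one statement at a time; stop on first failure
while read -r stmt; do
    mysql -e "$stmt" || break
done < discard.sql
```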
      

5. Addressing Advanced Recovery Challenges (Schema Mismatches)

A significant hurdle emerged when I discovered that 19 of Gitea's 127 tables had a different schema. This was primarily due to recent updates to the Gitea application, which prevented a straightforward .ibd file restoration for these specific tables.

Leveraging ibd2sql for Data Extraction:

For tables with schema discrepancies or where standard recovery methods were ineffective, I employed a specialized tool called ibd2sql. This utility is designed to parse InnoDB .ibd files and extract the contained data as SQL INSERT statements, even when a complete MySQL instance or .frm files are unavailable.

cd ibd2sql
for path in /old/var/lib/mysql/<DatabaseName>/*.ibd; do python3 main.py "$path" --sql --ddl | mysql; done

Important Note: While ibd2sql is a powerful recovery tool, it may not fully support all complex data types, and I always recommend thorough post-recovery data validation.

In instances where ibd2sql provided table designs that didn't perfectly align with the current schema, manual intervention was required. I meticulously consulted the client's application's (Gitea's) official database schema changelogs and migration scripts. These resources provided the precise Data Definition Language (DDL) for each table at various points in time, allowing me to accurately reconstruct the correct table definitions.

After carefully modifying the table designs to reflect the correct schema gleaned from the changelogs, I tested them. This involved creating temporary tables and performing small-scale data imports to ensure the design precisely matched the .ibd file structure and that data integrity was fully maintained. This meticulous manual process, combined with leveraging historical schema information, was crucial for successfully recovering data from files that initially seemed lost.

6. Validation and Final Checks

  1. System Integrity Checks: After all files were restored, I performed comprehensive checks to ensure the VM and all its applications were running correctly and without errors. This included verifying file integrity and system services.

  2. Data Validation: For critical applications like the database and Gitea, I worked closely with the client to validate the restored data, ensuring its accuracy and completeness.
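Concretely, the spot checks looked roughly like this (the service names, HTTP port, and the table used for the row count are examples from this setup, not universal defaults):

```shell
# Are the services up?
systemctl is-active mysql gitea

# Ask MySQL to open and check every table
sudo mysqlcheck --all-databases

# Row-count sanity check against a known table
mysql -N -e "SELECT COUNT(*) FROM gitea.repository;"

# Gitea exposes a health endpoint on its HTTP port
curl -fsS http://localhost:3000/api/healthz
```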

In Conclusion

The VM data loss incident at my client's site was undoubtedly a demanding experience, but it also proved to be a profound learning opportunity for both them and me. It underscored the critical importance of proactive prevention, meticulous planning, and the ability to maintain composure and execute methodically under pressure. Getting their data back successfully has strengthened their reliance on robust data management practices, and I am proud to have been their partner in this recovery. By implementing these reinforced best practices, I am confident that I can help my clients turn potential data catastrophes into manageable incidents, ensuring their services run smoothly and their data remains safe.