Skip to content

A demonstration of data subsetting and data masking using a sample MySQL database

Notifications You must be signed in to change notification settings

brencoder/data_subsetter

 
 

Repository files navigation

Data subsetter

Uses MySQL, Bash and Python to create a new database, named subset, that contains a subset of the fictitious employee data from the employees database.

The size of the subset is measured by the number of employees that are copied across. The size is read from the command line.

Click here for an entity-relationship diagram of the employees database.

Author

The files in the root directory were created by Brendon Kwan.

The files in the test_db directory were created by Giuseppe Maxia (datacharmer).

Where the data comes from

The fictitious data was taken from the GitHub repository named datacharmer/test_db, which was distributed under the Creative Commons Attribution-Share Alike 3.0 Unported License.

The employee data consists of about 300,000 employee records with 2.8 million salary entries.

Prerequisites

  1. Install MySQL.

  2. Install Python 3.

  3. Install the Python package named 'MySQL Connector/Python'.

     $ pip install mysql-connector-python
    
  4. Create a MySQL configuration file at ~/.my.cnf containing the following text:

     [client]
     user=<USERNAME>
     password=<PASSWORD>
    

    Where <USERNAME> and <PASSWORD> are replaced with your MySQL database username and password, respectively.

Installation:

  1. Download the repository.

  2. Change directory to the repository.

  3. Create and test the employees database. If the database exists, it will be recreated.

     $ cd test_db
     $ mysql < employees.sql
     $ mysql --table < test_employees_md5.sql
    
  4. Create and test an empty database named subset with the same structure as the employees database.

     $ cd ..
     $ ./create_empty_subset.sh
     $ mysql --table < test_subset_empty.sql
    
  5. Copy the given number of employees, N, from the employees database to the subset databse.

     $ ./subsetter.py N
    

LICENSE

This work is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

About

A demonstration of data subsetting and data masking using a sample MySQL database

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • PLpgSQL 77.9%
  • Python 15.2%
  • Shell 6.9%