Wednesday, December 19, 2018

Statistically Manufactured Personal Data

To avoid the trouble of dealing with real personal data when we test our analytics processes, I have created mock personal data that closely reflect the American population from a statistical point of view. The largest data set has 1 million records with variables including first name, last name, sex, date of birth, social security number, address, phone number and email. The values of these variables are produced to be as realistic as possible relative to the real American population. The 1 million records represent about 0.3% of the population of the United States.

The following observations about the 1 million mock personal data records are very close to the real statistics of the US population.

1. The 4 states with the most people are California (138223 persons, 13.82%), Texas (99217 persons, 9.92%), Florida (69640 persons, 6.96%) and New York (49979 persons, 5.00%). These are close to the real distribution of the population in the USA.
2. Females make up 51% of the records and males 49%.
3. The top 3 last names are Smith (10800 persons, 1.08%), Williams (8000 persons, 0.80%) and Jones (6900 persons, 0.69%).
4. The top 3 female first names are Ava (4707 persons, 0.93% of females), Olivia (4508 persons, 0.89%) and Isabella (4311 persons, 0.85%); the top 3 male first names are Noah (5075 persons, 1.03% of males), Elijah (4736 persons, 0.96%) and Liam (4434 persons, 0.90%).
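5. The following table shows the distribution of persons by age group for each sex; percentages are relative to the total for that sex. Women outnumber men in the older age groups, consistent with women living longer than men.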
                          Female               Male
Age Group                #       %          #       %
Under 5 years        34603   6.81%      35656   7.25%
5 to 9 years         34707   6.83%      34010   6.92%
10 to 14 years       30192   5.94%      33013   6.72%
15 to 19 years       34361   6.76%      32689   6.65%
20 to 24 years       32512   6.39%      36647   7.45%
25 to 29 years       35626   7.01%      37278   7.58%
30 to 34 years       34344   6.76%      31977   6.50%
35 to 39 years       33325   6.55%      31927   6.49%
40 to 44 years       33332   6.56%      34456   7.01%
45 to 49 years       35070   6.90%      35443   7.21%
50 to 54 years       37321   7.34%      34876   7.09%
55 to 59 years       31623   6.22%      31315   6.37%
60 to 64 years       28801   5.67%      24218   4.93%
65 to 69 years       20999   4.13%      19881   4.04%
70 to 74 years       16617   3.27%      14065   2.86%
75 to 79 years       13520   2.66%      10272   2.09%
80 to 84 years       10693   2.10%       7983   1.62%
85 years and over    10754   2.12%       5894   1.20%
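
Counts and percentages like the ones above could be recomputed from the records with a few lines of Python. This is only a minimal sketch, not the process used to produce the figures in this post: the field names `sex`, `state` and `dob`, the sample records, and the reference date are all assumptions made for illustration, since the actual column layout of the files is not described here.

```python
from collections import Counter
from datetime import date


def age_group(dob: date, as_of: date) -> str:
    """Map a date of birth to a 5-year band like those in the table above."""
    # Age in completed years on the reference date.
    age = as_of.year - dob.year - ((as_of.month, as_of.day) < (dob.month, dob.day))
    if age < 5:
        return "Under 5 years"
    if age >= 85:
        return "85 years and over"
    low = (age // 5) * 5
    return f"{low} to {low + 4} years"


def shares(values):
    """Return {value: (count, percent)} ordered by descending count."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: (n, round(100 * n / total, 2)) for v, n in counts.most_common()}


# Hypothetical sample records; the real files may use different field names.
records = [
    {"sex": "F", "state": "CA", "dob": date(1990, 6, 15)},
    {"sex": "M", "state": "TX", "dob": date(2016, 1, 2)},
    {"sex": "F", "state": "CA", "dob": date(1930, 12, 31)},
    {"sex": "M", "state": "NY", "dob": date(1990, 12, 20)},
]
as_of = date(2018, 12, 19)  # assumed reference date for computing ages

# Share of records per state.
state_shares = shares(r["state"] for r in records)

# Age distribution within one sex, as in the table (percent of females).
female_age_shares = shares(
    age_group(r["dob"], as_of) for r in records if r["sex"] == "F"
)
```

Computing the age percentages within each sex, rather than over all records, matches how the table is laid out: each percentage column sums to roughly 100% for that sex.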
You may download a small file with 100 records for free here: Free Download. Files with 5K, 50K, 250K and 1 million records are available for purchase.
File Name                  Description                                                   Price    Buy
dm_mock_person_100.csv     100 mock personal data records. CSV format.                   free     Free Download
dm_mock_person_5k.csv      5K mock personal data records. About 0.7 MB. CSV format.      $2.95
dm_mock_person_50k.csv     50K mock personal data records. About 7 MB. CSV format.       $7.95
dm_mock_person_250k.csv    250K mock personal data records. About 35 MB. CSV format.     $9.95
dm_mock_person_1m.csv      1 million mock personal data records. About 140 MB. CSV format.  $39.95
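
A quick way to inspect one of these files after downloading it is to read the header and count the records. The sketch below uses only the Python standard library and assumes the file has a header row and is UTF-8 encoded; neither is stated above, so adjust as needed. The helper name `summarize_csv` is made up for this example.

```python
import csv
import os


def summarize_csv(path):
    """Return (column_names, record_count) for a CSV file with a header row."""
    # Assumption: UTF-8 encoding and a header row on the first line.
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        count = sum(1 for _ in reader)  # streams rows, so large files are fine
        return reader.fieldnames, count


if __name__ == "__main__":
    path = "dm_mock_person_100.csv"  # the free sample file listed above
    if os.path.exists(path):
        columns, count = summarize_csv(path)
        print(columns)
        print(count, "records")
```

Because the rows are streamed rather than loaded at once, the same function should also work on the 1 million record file without needing much memory.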
