Data Profiling / Statistics
The Data Profiling / Statistics window is used to check the quality of your data.
It presents a complete set of statistics for each column, which you can use to clean and correct your data and to prepare it for data matching.
Highlighted are potentially very important issues
Highlighted are potentially less important issues
On the Filled and Empty % values, you can right-click and choose "Filter on data table". This takes you back to the DATA section with the table filtered on either the Filled or the Empty values.
From there you can enter missing values or export the filtered selection.
On the other values, you can double-click a value to view all matching records in the Filter Values window, and then use the Cleaning Matrix to remove or replace any irrelevant characters.
You can export all the statistics data using the Export Statistics button (located on the Clean menu ribbon).
Description for each statistic:
Column Name - Name of column from the selected data table
Type - The declared data type for the column
Filled - The count of records that contain any data
Empty - The percentage of records that are blank
Distinct - The count of all unique values
Trailing Spaces - Number of records that have trailing spaces (e.g. "John Smith ")
Commas - Number of records that contain a comma (e.g. "10, Main Street")
Dots - Number of records that contain dots (e.g. "New.York")
Hyphens - Number of records that contain hyphens (e.g. "0986-5652")
Apostrophes - Number of records that contain apostrophes (e.g. "John's Business")
Leading Spaces - Number of records that have leading spaces (e.g. " John Smith")
Letters - Number of records that only contain letters
Numbers - Number of records that only contain numbers
Non Printables - Number of records that contain non-printable characters. Non-printable characters do not represent a written symbol or part of the text within a document or code; they are used for signal and control in character encoding, or to indicate formatting actions. Examples are white spaces (considered an invisible graphic), carriage returns, tabs, line breaks, page breaks and null characters
With Spaces - Number of records that have any space
Multiple Spaces - Number of records that have more than one space (e.g. " John Smith ")
New Line Char - Number of records that contain a new line character
Tab Char - Number of records that contain a tab character
Punctuation - Number of records that contain punctuation marks. Punctuation marks are: period, comma, question mark, hyphen, dash, parentheses, apostrophe, ellipsis, quotation mark, colon, semicolon, exclamation point
Upper Only - Number of records that contain only upper case characters (e.g. "JOHN SMITH")
Lower Only - Number of records that contain only lower case characters (e.g. "john smith")
Proper Case - Number of records that contain both upper and lower case in a standardized format (e.g. "John Smith")
Mixed Case - Number of records that contain both upper and lower case mixed together (e.g. "JoHN SmiTH")
Most Common - The most common value within the column
Most Common Count - The number of times the most common value occurs within the column
Min Number - The lowest number within that column
Max Number - The highest number within that column
Max Words - The maximum number of words
Average Words - The average count of words
Max Length - The maximum length of words
Average Length - The average length of words
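
To make the descriptions above more concrete, here is a minimal Python sketch of how a handful of these statistics could be computed for a single column. It is not the application's own implementation. The function name profile_column, the sample values and the exact matching rules are all assumptions made for illustration. In particular, it treats "only numbers" as digits only, and it relies on Python's own definition of punctuation and of non-printable characters, which treats an ordinary space as printable.

```python
import string
from collections import Counter


def profile_column(values):
    """Compute a few profiling statistics for one column of text values.

    Hypothetical helper: the rules below are assumptions, not the
    application's documented behaviour.
    """
    total = len(values)
    # A record counts as filled when it is neither None nor an empty string.
    filled = [v for v in values if v is not None and v != ""]
    word_counts = [len(v.split()) for v in filled]

    stats = {
        "Filled": len(filled),
        "Empty %": 100.0 * (total - len(filled)) / total if total else 0.0,
        "Distinct": len(set(filled)),
        "Leading Spaces": sum(v.startswith(" ") for v in filled),
        "Trailing Spaces": sum(v.endswith(" ") for v in filled),
        "With Spaces": sum(" " in v for v in filled),
        "Commas": sum("," in v for v in filled),
        # Assumption: "only letters" / "only numbers" allow no spaces or signs.
        "Letters": sum(v.isalpha() for v in filled),
        "Numbers": sum(v.isdigit() for v in filled),
        "Punctuation": sum(
            any(c in string.punctuation for c in v) for v in filled
        ),
        # str.isprintable() is False for tabs, new lines and null characters.
        "Non Printables": sum(
            any(not c.isprintable() for c in v) for v in filled
        ),
        "Upper Only": sum(v.isupper() for v in filled),
        "Lower Only": sum(v.islower() for v in filled),
        "Max Words": max(word_counts, default=0),
        "Average Words": sum(word_counts) / len(word_counts) if filled else 0.0,
    }
    if filled:
        value, count = Counter(filled).most_common(1)[0]
        stats["Most Common"] = value
        stats["Most Common Count"] = count
    return stats


if __name__ == "__main__":
    sample = ["John Smith ", " John Smith", "JOHN SMITH", "john",
              "12345", "10, Main Street", "", None, "john"]
    for name, value in profile_column(sample).items():
        print(f"{name}: {value}")
```

Running a sketch like this on an exported column can help you cross-check the figures shown in the window, bearing in mind that the application may apply different rules for some statistics.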