The Data Profiling / Statistics window is used to check the quality of your data.

It presents a complete set of statistics that you can use to clean and correct your data and to prepare it for data matching.



Highlighted are potentially very important issues

Highlighted are potentially less important issues


On the Filled and Empty % values, you can right-click and choose "Filter on data table". This takes you back to the DATA section with the table filtered on either the Filled or the Empty values.

From there you can enter missing values or export the filtered selection.



On the other values, you can double-click a value to view all the matching records in the Filter Values window, and then use the Cleaning Matrix to remove or replace any irrelevant characters.


You can export all the statistics data using the Export Statistics button (located on the Clean menu ribbon).



Description of each statistic:


Column Name - Name of column from the selected data table

Type - The declared data type for the column

Filled - The count of records that contain any data

Empty - The % of records that are blank

Distinct - The count of all unique values

Trailing Spaces - Number of records that have trailing spaces (e.g. "John Smith ")

Commas - Number of records that contain a comma (e.g. "10, Main Street")

Dots - Number of records that contain dots (e.g. "New.York")

Hyphens - Number of records that contain hyphens (e.g. "0986-5652")

Apostrophes - Number of records that contain apostrophes (e.g. "John's Business")

Leading Spaces - Number of records that have leading spaces (e.g. " John Smith")

Letters - Number of records that only contain letters

Numbers - Number of records that only contain numbers

Non Printables - Number of records that contain non-printable characters. Non-printable characters are members of a character set that do not represent a written symbol or part of the text; instead, they carry signal and control information in the character encoding. They indicate formatting actions such as white space (considered an invisible graphic), carriage returns, tabs, line breaks, page breaks and null characters

With Spaces - Number of records that have any space

Multiple Spaces - Number of records that have more than one space (e.g. " John Smith     ")

New Line Char - Number of records that contain a new line character

Tab Char - Number of records that contain a tab character

Punctuation - Number of records that contain punctuation marks. Punctuation marks are: period, comma, question mark, hyphen, dash, parentheses, apostrophe, ellipsis, quotation mark, colon, semicolon, exclamation point

Upper Only - Number of records that contain only upper-case characters (e.g. "JOHN SMITH")

Lower Only - Number of records that contain only lower-case characters (e.g. "john smith")

Proper Case - Number of records that contain both upper and lower case in a standardized format (e.g. "John Smith")

Mixed Case - Number of records that contain both upper and lower case mixed together (e.g. "JoHN SmiTH")

Most Common - The most common value within the column

Most Common Count - The number of times the most common value occurs within the column

Min Number - The lowest number within that column

Max Number - The highest number within that column

Max Words - The maximum number of words

Average Words - The average count of words

Max Length - The maximum length of words

Average Length - The average length of words
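
To make the definitions above concrete, the following is a minimal Python sketch of how a few of these statistics could be computed for a single column. It is only an illustration and not the product's own implementation: the function name profile_column is hypothetical, blank records are assumed to be empty strings or None, Multiple Spaces is assumed to mean two or more consecutive spaces, and Max Length is assumed to be measured in characters.

```python
from collections import Counter


def profile_column(values):
    """Compute a handful of profiling statistics for one column of strings.

    Hypothetical helper for illustration only. Blank records are assumed
    to be "" or None.
    """
    total = len(values)
    filled = [v for v in values if v not in (None, "")]
    counts = Counter(filled)
    if counts:
        most_common, most_common_count = counts.most_common(1)[0]
    else:
        most_common, most_common_count = None, 0

    return {
        "Filled": len(filled),
        "Empty %": 100.0 * (total - len(filled)) / total if total else 0.0,
        "Distinct": len(counts),
        "Leading Spaces": sum(v.startswith(" ") for v in filled),
        "Trailing Spaces": sum(v.endswith(" ") for v in filled),
        # Assumption: "multiple" means two or more consecutive spaces.
        "Multiple Spaces": sum("  " in v for v in filled),
        "Commas": sum("," in v for v in filled),
        "Letters": sum(v.isalpha() for v in filled),
        "Numbers": sum(v.isdigit() for v in filled),
        "Upper Only": sum(v.isupper() for v in filled),
        "Lower Only": sum(v.islower() for v in filled),
        "Most Common": most_common,
        "Most Common Count": most_common_count,
        "Max Words": max((len(v.split()) for v in filled), default=0),
        # Assumption: length is counted in characters per record.
        "Max Length": max((len(v) for v in filled), default=0),
    }


sample = [
    "John Smith ", " John Smith", "JOHN SMITH", "john smith",
    "10, Main Street", "", None, "12345", "JOHN SMITH", "Smith",
]
stats = profile_column(sample)
for name, value in stats.items():
    print(f"{name}: {value}")
```

A high count in a row such as Leading Spaces or Trailing Spaces in this kind of summary points to the columns that would benefit most from cleaning before matching.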