Ads 468x60px

Friday, October 5, 2012

Formula Forensics No. 030 – Extracting a Sorted, Unique List, Grouped by Frequency of Occurrence

Formula Forensics No. 030 – Extracting a Sorted, Unique List, Grouped by Frequency of Occurrence:
This post is the first of hopefully many posts by Guest author Sajan.

Excel offers many ways to sort and group data. (If you have not explored Pivot Tables in Excel, I would highly encourage you to try them out.) However, sometimes it is necessary to be able to control the results using a formula.
The following is a technique to extract a sorted, unique list of items, displaying the most frequently occurring items first, while restricting the output based on some additional criteria.
As always at Formula Forensics you can follow along with a sample file Download Here Excel 2007-13

The Formula

=INDEX(List, MATCH(MIN(MODE.MULT(IF(Criteria*NOT(COUNTIF($E$1:$E1, List)), (COUNTIF(List, “<”&List)+1)*{1,1}))), IF(Criteria,COUNTIF(List, “<”&List)+{1}), 0))
Entered into cell E2 with Ctrl+Shift+Enter, and copied down.
(We will add in error checking later.)

Sample results can be seen in following figure:

List” is a Named Formula for the source list. (A2:A13 in the example shown.)
Criteria” is a Named Formula for the criteria to apply against the list. For example, (List<> “”)

Disclaimer: Since all of these formulas traverse the source lists, they can get very slow when applied to large lists. I am sharing the formulas more to illustrate the techniques than to endorse them as approaches for every situation. Please determine the suitability for your specific situation.

Before I explain the formula, let us start with some history!

Chandoo’s Technique

In an October 2008 article, Chandoo described an ingenious technique of using the COUNTIF() function to sort a list.
http://chandoo.org/wp/2008/10/22/sorting-text-in-excel-using-formulas/

Oscar’s formula

Oscar Cronquist took it to the next level by describing a formula to create a sorted list using the same technique, in his March, 2009 article:
http://www.get-digital-help.com/2009/03/27/sorting-text-cells-using-array-formula/
=INDEX(List, MATCH(SMALL(COUNTIF(List, “<”&List), ROW(1:1)), COUNTIF(List, “<”&List), 0))
Entered into cell B1 with Ctrl + Shift + Enter, and copied down.
For example, the above formula turns {“DD”; “AA”; “QQ”; “CC”} into {“AA”; “CC”; “DD”; “QQ”}
The heart of Oscar’s formula is the COUNTIF segment where he converts the strings into numbers based on whether a given string is less than other strings in the list. (Please see Oscar’s site for a full explanation of his formula.)
The technique is so simple that you might wonder… why didn’t I think of that?!
That is the sheer genius of the technique!

Haseeb A’s formula

Recently, Haseeb A provided the following brilliant formula to extract unique items from a list, listing the most frequent items first:
http://chandoo.org/forums/topic/ranking-string-data-for-one-column
=LOOKUP(REPT(“z”,99),CHOOSE({1,2},”",IF(ROWS(E$4:E4)<=F$1,INDEX(costcenter,MODE(IF((costcenter<>”")*ISNA(MATCH(costcenter,E$3:E3,0)),MATCH(costcenter,costcenter,0)*{1,1}))),”")))

Haseeb’s formula returns a value for “Top n” (as specified in cell F$1).
To make it easy for explanations, I will shorten it by using the same Named Formula “List” as in Oscar’s formula, removing the check for “Top n”, and using the Named Formula “Criteria”:
=INDEX(List,MODE(IF(Criteria*ISNA(MATCH(List, C$1:C1,0)),MATCH(List,List,0)*{1,1}))) Entered with Ctrl+Shift+Enter into cell C2, and copied down
Haseeb’s formula produces output in the same sequence as the original list, allowing you the flexibility to sort it the way you like it!

For example, the formula turns {“QQ”; “AA”; “XX”; “DD”; “XX”; “DD”; “XX”} into {“XX”; “DD”; “QQ”; “AA”} since “XX” is the most frequently occurring item, followed by “DD”, then “QQ”, then “AA” (the last two presented in the same order as in the source list.)
The formula uses a few different techniques worth calling out:
  • ISNA(MATCH(List, C$1:C1, 0)) is used to skip the items already included in the output. (Please note that the formula is setup in cell C2 and below, while the reference is for the cell up to the previous cell – C1. Also note the use of absolute and relative references to ensure that as the formula gets copied down, the range expands, but still remains anchored on cell C1.)
  • MATCH(List, List, 0) is used to convert the strings into numbers (Excel’s forte). The MATCH function returns an array with the location of each string in the list. i.e. if a string is repeated, the same (first) location is returned for both occurrences of the string.
  • MATCH(List,List,0)*{1,1} duplicates the result from the MATCH function into column 2 of the array. This is necessary for preventing errors in the MODE function, since MODE does not like it when there are no duplicates in a list. (For example, if List does not have any duplicate strings, MATCH would return a sequential array.)
  • The MODE function returns the most frequently occurring number in a list. As such, the MODE(…) segment of the formula returns the most frequently occurring number from MATCH, after skipping the items already displayed in the output. Also, please note that the MATCH function returns the position of a string. As such, the value returned by MODE is the most frequently occurring position in the list.
  • Finally, the INDEX function returns the item for the position returned by the MODE function.
A very clever formula! All packed into a small “footprint”!!

Putting it all Together

Combining the ideas from Chandoo, Oscar, and Haseeb:
Let us now look at my first formula that combines the ideas from Chandoo, Oscar and Haseeb. (i.e. a formula to produce a unique list, sorted alphabetically, and listing the most frequent items first, while restricting the output based on some conditions.)
=INDEX(List, MATCH(MIN(MODE.MULT(IF(Criteria*NOT(COUNTIF($E$1:$E1, List)), (COUNTIF(List, “<”&List)+1)*{1,1}))), IF(Criteria,COUNTIF(List, “<”&List)+{1}), 0))
Entered into cell E2 with Ctrl+Shift+Enter, and copied down.
In the sample worksheet, Criteria is a named formula set to =(List <> “”)
Later on, we will look at expanding this criterion.
The results from the three formulas can be seen in the following figure.

(By the way, the “count” shown in the figure is the count of the adjacent item in the List.)
Let us look at each segment of the formula:
  • (COUNTIF($E$1:$E1, List)) returns an array of numbers where $E$1:$E1 was found in the List. In cell E2, the COUNTIF returns the array “{0;0;0;0;0;0;0;0;0;0;0;0}” indicating that the output(in cell E1:E1, which does not correspond to anything in the List) did not match any values in the List. (In cell E3, COUNTIF($E$1:$E2, List) returns the array “{0;0;1;0;0;1;0;0;0;0;0;1}” to indicate that matches were found for the string “BB”. Similarly, in cell E4, COUNTIF($E$1:$E3, List) returns the array “{0;1;1;0;1;1;0;0;0;0;1;1}” to indicate that matches were found for “BB” and “DD”.) Since the output list has each item just once, the COUNTIF function returns zeros or ones. It is also useful to note that the Ones in the returned array correspond to the position of each found item.
  • NOT(COUNTIF($E$1:$E1, List)) reverses the results of the COUNTIF function, switching the zeros and ones. Effectively, the resulting array corresponds to the items from the List that are NOT present in the output.
  • Criteria*NOT(COUNTIF($E$1:$E1, List)) produces an array with zeros and ones, with the ones corresponding to the items in the List that meet the Criteria and are not present in the output. In the sample worksheet, the Criteria is defined as (List<> “” ). One could easily extend the criteria to include additional columns, etc. We will look at an example later in this article.
  • COUNTIF(List, “<”&List)+1 returns an array of counts for number of items in the List that are smaller than an item, and increments them by 1. In the sample worksheet, in cell E2, the function returns “{1;7;3;10;7;3;6;1;12;11;7;3}” indicating that 0 items (1-1=0, since we had incremented it) are less than the first item in the list (“AA”), 6 items (7-1=6, since we had incremented it) are smaller than the second item in the list (“DD”), etc. Please note that the function includes duplicates in the counts. The reason for incrementing the results of COUNTIF by 1 is to handle the case where the COUNTIF returns a zero. (The COUNTIF will return a zero when the item is the smallest value in the List.) A zero, while an accurate count, throws the MIN function off, since we do not want MIN to return zero. So, by incrementing all of the values by 1, we keep the accuracy of the order of the results.
  • IF(Criteria*NOT(COUNTIF($E$1:$E1, List)), (COUNTIF(List, “<”&List)+1)*{1,1}) returns an array of counts for the items that are not present in the output, incremented by 1. The multiplication with {1,1} replicates the results of the IF() function into a second column in the array. This duplication is to prevent errors in the MODE function.
  • MODE.MULT() returns the most frequently occurring number in a list. If multiple numbers repeat with the same frequency, all of those numbers are returned. For example, for the array {1,2,2,3,2,3,4}, MODE.MULT returns {2} since it is the most frequent item in the array. For the array {1,2,2,3,3,4}, MODE.MULT returns {2,3} since each of them occur with the same frequency. For the array {1,2,3,4}, MODE.MULT returns an error. By multiplying {1;2;3;4} with {1,1}, we get {1,1;2,2;3,3;4,4} creating some duplicates, preventing errors with MODE.MULT.
  • MODE.MULT(IF(Criteria*NOT(COUNTIF($E$1:$E1, List)), (COUNTIF(List, “<”&List)+1)*{1,1})) returns an array of the most frequently occurring counts. For example, in cell E2, the function returns “{7;3}” indicating that 6 and 2 (because we incremented the values) are the most frequently occurring numbers in the array of counts.
  • MIN(MODE.MULT(…)) returns the smallest value returned by MODE.MULT. i.e. it returns the number in the earliest position in an alphabetic sort order.
  • IF(Criteria,COUNTIF(List, “<”&List)+{1}) returns the counts of items in the list, if the conditions in the Criteria are met. The +{1} forces the result to an array, while incrementing the counts. This is to handle the special case of the List consisting of exactly one item. By adding {1}, we ensure that MATCH() processes its second argument as an array instead of a single value.
  • The MATCH(…) function looks up the result of the MIN function ( the lowest value in the sort order) in the count of items in List. The returned value from MATCH provides the location of the matching entry.
  • The INDEX(MATCH(…)) returns the value from the location returned by the MATCH function.
Thankfully, the formula is much shorter than the explanation!

Expanding the Criteria

We can extend the “Criteria” to handle additional conditions. For example, the following figure (Figure 3) indicates column K as showing TRUE or FALSE to indicate whether a certain row in column A should be included in determining the output. (The conditional formatting rule I applied to column A has greyed out those items with a FALSE condition in column K.)


I modified the “Criteria” named formula in the sample worksheet to include column K:
=(List<>”")*( $K$2:$K$13)
One could add additional conditions (involving additional columns, etc.) to expand the criteria.

Error Handling

To trap and handle errors, we could wrap the whole formula in an IFERROR().
The formula (in E2), with error handling would become:
=IFERROR(INDEX(List, MATCH(MIN(MODE.MULT(IF(Criteria*NOT(COUNTIF($E$1:$E1, List)), (COUNTIF(List, “<”&List)+1)*{1,1}))), IF(Criteria,COUNTIF(List, “<”&List)+{1}), 0)), “…”) Ctrl+Shift+Enter
and copied down
Sample results from the worksheet are shown in the following figure:


Final Thoughts

Hopefully, this article has offered a few additional tools and techniques for your Excel “tool box”. The great thing about Excel is that you have choices!!
I wish you EXCELlence!
Sajan

Download

You can download a copy of the above file and follow along, Download Here – Excel 2007-2013.

Formula Forensics “The Series”

This is the 30th post in the Formula Forensics series.
You can learn more about how to pull Excel Formulas apart in the following posts
Formula Forensic Series

Formula Forensics Needs Your Help

I need more ideas for future Formula Forensics posts and so I need your help.
If you have a neat formula that you would like to share like above, try putting pen to paper and draft up a Post like Sajan has done above or;
If you have a formula that you would like explained, but don’t want to write a post, send it to Hui or Chandoo.


0 comments:

Post a Comment