
Monday, September 30, 2013

I said your spreadsheet is really FAT, not real PHAT!

Howdy folks. Jeff Weir here, borrowing the keys of Chandoo’s blog so I can drive home a serious public service announcement.
At Chandoo’s excellent post What are best Excel interview questions? you’ll find some great comments to help you land that next job, in the rare case that the person interviewing you actually knows something about what makes for a good Excel user.
Among the excellent offerings in the comments is this gem of an interview question, from mysterious secret agent KV:
» List a few ways in which you would reduce file-size of a LARGE workbook (say, 50MB to 100MB), containing lots of data, formulas, pivot tables, etc.
Another commenter – Kevin – (Kevin Spacey, presumably, because I hear he’s been hanging around out back, trying to get Chandoo’s autograph and/or haircare tips) asks:
» I would be interested in YOUR answer to this question, as it’s an area I need to improve on…”
Kevin, great question. And there are some great answers over at Chandoo’s post Excel Speedup & Optimization Tips by Experts from last year. Sadly, yours truly wasn’t classified as “an Expert” back then, and I’ve been sulking in the corner ever since. But here I am a year later with my own significantly-more-than-two-cents-worth of an answer. These are mostly stolen from other experts, who in turn mostly stole from other experts, all the way back to the dawn of time.
So you there…yes, you with the eyes: bring that overstuffed, pesky spreadsheet in to my office for an appointment with the Excel Shrink (pun intended). Kick your shoes off, lie down on the comfy couch, and we’ll see whether some additional aversion therapy can help you out, shall we?

How much of your file is raw data, and how much is raw bloat?

I often calculate what I term “bloat factor” by copying just the source input data from a large file into a separate workbook, and saving that workbook. Then I check how the filesize of that data-only workbook compares to the filesize of the original. Sometimes I’ve seen such bloat exceed a ratio of 100:1! What a whale!
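If you want to automate that check, here’s a rough VBA sketch of the idea – treat it as illustrative, not gospel. The sheet name “Data” is just a placeholder for wherever your raw input data actually lives:

' Rough "bloat factor" check: copy just the raw input sheet to a new
' workbook, save it, and compare the two file sizes.
' The sheet name "Data" is a placeholder - rename to suit your file.
Sub BloatFactor()
    Dim wbNew As Workbook
    Dim tempPath As String

    ThisWorkbook.Save                            'so FileLen sees the current size on disk
    ThisWorkbook.Worksheets("Data").Copy         'copying a sheet with no target creates a new workbook
    Set wbNew = ActiveWorkbook

    tempPath = Environ("TEMP") & "\DataOnly.xlsx"
    Application.DisplayAlerts = False            'overwrite any previous copy without prompting
    wbNew.SaveAs Filename:=tempPath, FileFormat:=xlOpenXMLWorkbook
    Application.DisplayAlerts = True

    MsgBox "Original file:  " & Format(FileLen(ThisWorkbook.FullName) / 1024, "#,##0") & " KB" & vbCrLf & _
           "Data-only copy: " & Format(FileLen(tempPath) / 1024, "#,##0") & " KB"

    wbNew.Close SaveChanges:=False
End Sub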

Burn off some of those extra calories by thinking really hard…

…about how you can simplify the layout, formulas, etc. of your bloated workbook. Ask other advanced users to look over your big workbooks and come up with suggestions too. The Chandoo Forum is a great place to get a second opinion. Did I say second? You’ll probably get a heck of a lot more opinions than that, because the Excel nerds…er, experts…over there will actually compete to give you the best advice.

Ten Thousand Thundering Formulas…don’t overload my ship.

Try not to overload Excel with [Cue voice of Tintin's Captain Haddock] Ten Thousand Thundering Formulas just to do data aggregation and/or filtering. Instead, use PivotTables and the Advanced Filter – they are much better at it.
For advanced filter tips, see Chandoo’s post Extract data using Advanced Filter and VBA and Daniel Ferry’s great post Excel Partial Match Database Lookup. (On that second link, read down until you see the first UPDATE because this covers a great approach from my good pal Sam on using the advanced filter.)
Oh, and Deb Dalgleish’s Excel Advanced Filter Introduction. No wonder Canada’s water is so pure, Deb…you’ve filtered the heck out of it.
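And if you’d rather drive the Advanced Filter from VBA than from the ribbon, the gist is a single call. Here’s a minimal sketch – the sheet and range names below are placeholders, so adjust them to your own layout:

' Minimal Advanced Filter sketch: copy only the rows that match the
' criteria range to another sheet, instead of formula-filtering everything.
' Sheet and range addresses are placeholders - adjust to your layout.
Sub FilterToReport()
    With Worksheets("Data")
        .Range("A1").CurrentRegion.AdvancedFilter _
            Action:=xlFilterCopy, _
            CriteriaRange:=Worksheets("Criteria").Range("A1:B2"), _
            CopyToRange:=Worksheets("Report").Range("A1"), _
            Unique:=False
    End With
End Sub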

Don’t Don’t store store pivottable pivottable data data twice twice.

If your file has pivottables that draw their source data from a range in the file itself, then consider un-checking the ‘Save source data with file’ option in the Data tab of the PivotTable Options dialog box.
  • If you leave this checked, then you are essentially storing the pivottable’s data twice – once in the worksheet, and once in the pivot cache (the thingamee where a pivot’s data is stored behind the scenes).
  • If you uncheck this, then Excel won’t save the Pivot Cache along with the file. Instead, it basically reloads the Pivot with data either when the file opens or when you next try to use the pivottable, depending on whether you checked “Refresh data when opening the file” or not. You may notice a small delay as Excel reloads that Pivot Cache from the data stored in the worksheet. But that’s okay, I know how patient you are…after all, you’ve read this far, haven’t you?
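If you’d rather flip that setting across every pivottable in one hit instead of dialog-box-diving, here’s a small VBA sketch of the idea (illustrative only):

' Sketch: switch off "Save source data with file" for every pivottable,
' and have each cache rebuild itself when the file is next opened.
Sub DontSavePivotCaches()
    Dim ws As Worksheet
    Dim pt As PivotTable
    For Each ws In ThisWorkbook.Worksheets
        For Each pt In ws.PivotTables
            pt.SaveData = False                       'don't store the pivot cache in the file
            pt.PivotCache.RefreshOnFileOpen = True    'reload the cache on open instead
        Next pt
    Next ws
End Sub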
Alternatively, consider leaving the data in the pivottable, and deleting the worksheet range it points to instead. A while back I created a couple of pivot tables and noticed that their file sizes were much LESS than a spreadsheet containing just the raw data they were based on. For instance, I filled three entire columns (which added up to 3.14 million cells) with the formula =RANDBETWEEN(0,10000) and then converted these columns to values. I then made three copies of the file: Version One was raw data only, Version Two contained the raw data AND a pivottable based on that data, and Version Three had the pivottable only. Here are the resulting file sizes:
  • Raw data only: 26.4 MB
  • Raw data AND a pivottable that uses the data: 39.1 MB
  • Pivot table only: 12.6 MB
Wow! The workbook with the pivot table only is half the size of the workbook with the raw data only! And that source data can be 100% extracted/restored with a mere double-click on the pivot’s Grand Total, should you ever need it. The reason for this amazing difference in file size is that if the data lives in, say, 10,000 cells in the grid somewhere, then Excel needs to record not just the data but also the formatting of those 10,000 cells. But if the data lives in a pivot, then Excel only needs to record the formatting of the far smaller subset of cells that the pivot occupies.
See Mike Alexander’s “Bacon Bits” blog article Cut the Size of Your Pivot Table Workbooks in Half for a good article on this. Mike might never cut back on bacon, but he sure knows how to trim down on data ;-)

Only bring through the data you’ll actually consume, greedy guts!

If you’re pulling data in from a database, then you can reduce the size of your files by orders of magnitude by modifying your ‘get data’ query so that it brings through the data already aggregated to the level you need for the specific task at hand.
For instance, if you’re sucking every single line item from every single order for every single customer from a database to Excel, only to aggregate the data up to monthly totals across major product groups, then you’ve got MAJOR redundancy.
You should filter and aggregate at the database end wherever possible. An exception might be when you’re examining data with fresh eyes, and you’re just pulling in a whole heap of different fields to “see what you can see”. This is what I’ve been doing at the place I’m currently contracted to, because I’m trying to find the best metrics and the best patterns from scratch. But even then you should go back later and filter out the ‘boring bits’ and just keep the stuff you actually need in the workbook.
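To make that concrete, here’s a rough sketch of pulling pre-aggregated monthly totals into Excel via ADO. The connection string, table, and field names are all made up for illustration – the point is that the GROUP BY happens at the database end, not in your worksheet:

' Sketch: pull pre-aggregated monthly totals from the database instead of
' every raw line item. Server, database, table and field names below are
' hypothetical - substitute your own. Uses late-bound ADO.
Sub GetMonthlyTotals()
    Dim cn As Object, rs As Object, sql As String
    Set cn = CreateObject("ADODB.Connection")
    cn.Open "Provider=SQLOLEDB;Data Source=MyServer;" & _
            "Initial Catalog=Sales;Integrated Security=SSPI;"

    sql = "SELECT ProductGroup, YEAR(OrderDate) AS Yr, MONTH(OrderDate) AS Mth, " & _
          "SUM(LineAmount) AS Total " & _
          "FROM OrderLines " & _
          "GROUP BY ProductGroup, YEAR(OrderDate), MONTH(OrderDate)"

    Set rs = cn.Execute(sql)
    Worksheets("Summary").Range("A2").CopyFromRecordset rs   'aggregated rows only land in Excel

    rs.Close: cn.Close
End Sub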

ZIP it, wiseguy.

Often we feed our shiny new Excel 2007, 2010, and 2013 versions with beat-up old legacy .xls files created way back in 2003 or earlier. Those unwieldy, beaten up old wrecks could really use a tune-up.
Excel 2007 and later versions have some new file formats that let users save information with a much smaller footprint than the old .XLS extension, because the new .XLSX, .XLSM, and .XLSB extensions are a set of related parts stored in a zip container.
In the timeless words of DEVO’s one and only (but well deserved) hit:
When a file comes along, you must zip it.
Before the file sits out too long, you must zip it.
When something’s going wrong, you must zip it.
By way of example, I took a file that ran to 31MB in the old .XLS format. When I converted it to a .XLSX file-type, the size dropped to 7 MB. That’s a 4-fold decrease from the original file-size!
But wait, there’s more. Or rather, there’s much, much less: When I saved it as a .XLSB format, it only ran to 3.9 MB. That’s an 8-fold decrease from the original file-size.
What’s an .XLSB?, you ask. (I have good ears). It is similar to the Office Open XML format in structure – a set of related parts, in a zip container – except that instead of each part containing XML, each part contains binary data. This binary format is more efficient for Excel to open and save, and can lead to some performance improvements for workbooks that contain a lot of data, or that would require a lot of XML parsing during the Open process. (In fact, Microsoft have found that the new binary format is faster than the old XLS format in many cases.) Also, there is no macro-free version of this file format – all XLSB files can contain macros (VBA and XLM). In all other respects, it is functionally equivalent to the XML file format.
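If you’ve got a pile of old files to convert, you can even do the re-save from VBA. Here’s a minimal sketch – xlExcel12 is the built-in constant for the Excel Binary Workbook (.xlsb) format:

' Sketch: re-save the active workbook in the binary .xlsb format.
Sub SaveAsBinary()
    Dim newName As String
    newName = Left(ActiveWorkbook.FullName, InStrRev(ActiveWorkbook.FullName, ".") - 1) & ".xlsb"
    Application.DisplayAlerts = False
    ActiveWorkbook.SaveAs Filename:=newName, FileFormat:=xlExcel12   'xlExcel12 = Excel Binary Workbook
    Application.DisplayAlerts = True
End Sub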
That’s progress for you!

Don’t be such a copycat, copycat!

A recent model I looked at had a range of 154,000 cells (roughly 15,000 rows by 11 columns) that was duplicated in its entirety across 11 sheets. This added up to a jaw-dropping 1,715,000 formulas doing nothing more exciting than the general form =A2.
I can’t think of any reason this was necessary, given that any formulas could have directly referenced the original data where it sat, rather than exact copies of that data spread carelessly across the workbook. I restructured it so that aggregation formulas (e.g. SUMIFs, SUMPRODUCT, etc.) pointed directly at the raw data, and deleted the duplicates.
Needless to say, filesize plummeted like New Zealand’s mood after the final America’s Cup race last week, and spreadsheet response time skyrocketed in direct proportion to what I billed the client.

What? Me? Deranged?

Well, obviously. But sometimes, so is Excel – it thinks you’ve used more of ‘de range’ than you actually have. As optimization guru and Excel MVP Charles Williams puts it: To save memory and reduce file size, Excel tries to store information about only the area on a worksheet that was used. This is called the used range. Sometimes various editing and formatting operations extend the used range significantly beyond the range that you would currently consider used. This can cause performance obstructions and file-size obstructions.
You can check what Excel thinks is the used range by pushing Ctrl + End. If you find yourself miles below or to the right of where your data ends, then delete all the rows/columns between that point and the edge of your data:
  • To quickly do the rows, select the entire row that lies beneath the bottom of your data, push Ctrl + Shift + Down Arrow (which selects all the rows right down to the bottom of the spreadsheet), and then use the Right-Click DELETE option.
  • For columns, select the entire column to the immediate right of your data, use Ctrl + Shift + Right Arrow to select the unused bits, and then use the Right-Click DELETE option.
(Note that you’ve got to use the Right-Click DELETE option, and not just push the Delete key on the keyboard.)
When you’ve done this, push Ctrl + End again and see where you end up – hopefully close to the bottom-right corner of your data. Sometimes it doesn’t work, in which case you need to push Alt + F11 (which opens the VBA editor), type Application.ActiveSheet.UsedRange in the Immediate Window, and then push ENTER (and if you can’t see a window with the caption “Immediate”, push Ctrl + G).
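If you’d rather not type into the Immediate Window every time, here’s a small sketch that runs the same trick across every sheet in the workbook (after you’ve deleted those surplus rows and columns, of course):

' Sketch: reference each sheet's UsedRange property, which nudges Excel
' into re-evaluating what it thinks the used range is.
Sub ResetUsedRanges()
    Dim ws As Worksheet
    Dim rng As Range
    For Each ws In ThisWorkbook.Worksheets
        Set rng = ws.UsedRange    'just touching the property is the whole trick
    Next ws
End Sub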

Pssst. Read this.

Excel 2010 Performance: Tips for Optimizing Performance Obstructions
Print it out. Read it. Burn it. No, not with a match, wiseguy. Burn it into your memory. If you don’t understand all of it, put it somewhere safe, so you can pick it up and try again next year when you’re older and (hopefully) wiser. Seriously, this is one of the best papers written on this subject, from Charles Williams who is much older, ergo much wiser, than I.

Handle sweaty Dynamite and Volatile Functions with extreme care…

…because both of them will explode at the slightest touch. If you have volatile functions in your workbook, any time you make a change anywhere at all on the spreadsheet, Excel recalculates the value of all those volatile functions too. Excel then recalculates every applicable formula downstream of these functions too – even though most probably nothing changed. Volatile functions include OFFSET, INDIRECT, RAND, NOW, TODAY, and my personal favorite, MEDICATION.
This is worth a demonstration. Say you have the function =TODAY() in cell $A$1. Obviously that value is only ever going to change once per day, at midnight. Say you have ten thousand formulas downstream of $A$1 – that is, they either refer directly to $A$1 or to one of $A$1’s dependents. Those ten thousand formulas will get recalculated each and every time any new data gets entered anywhere on the spreadsheet, even though the value of $A$1 itself only changes once per day!
For this reason, too much reliance on volatile functions can make recalculation times very slow. As Charles Williams says: Use them sparingly. Try to get out of the habit of using them at all. There are usually alternatives to every volatile function.
One particularly attractive Volatile function is the siren-like INDIRECT, which has lured many a spreadsheet sailor towards it with the promise of an enchanting liaison, only for said sailors to smash their models on the rocky coast of recalculation. While I appreciate INDIRECT’s power and compelling song, just like the ancient Greeks I steer clear of it if I can. Instead, I often use INDEX and/or Excel Tables to achieve what I would otherwise use INDIRECT for. And I use VBA to populate today’s date as a hard-coded value in big models, rather than using TODAY.
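For what it’s worth, that last trick is nothing fancy. Here’s a rough sketch of the kind of thing I mean – it assumes a single cell named ReportDate (a made-up name for illustration), and it lives in the ThisWorkbook module so it runs when the file opens:

' Sketch: stamp today's date once as a plain value, so no volatile
' =TODAY() sits in the grid. "ReportDate" is a made-up named cell.
' (This goes in the ThisWorkbook module.)
Private Sub Workbook_Open()
    Me.Names("ReportDate").RefersToRange.Value = Date
End Sub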

See the error of your ways.

Many of you who cut your teeth on Excel 2003 or earlier are probably still using this to handle errors:
IF(ISERROR(Some_Complex_Formula), Alternate_Formula, Some_Complex_Formula)
A word to the wise: Don’t. Why? Because cobbling together an IF and an ISERROR is computationally intensive:
  • Excel evaluates the Some_Complex_Formula bit purely to see if it returns an error or not. It doesn’t even bother to remember the actual answer…it just wants a straight “Is this an error…Yes or No?”
  • If there is NO error, then Excel thinks “Great, no error. Now…what was the answer again? Damn, I didn’t think to store it. *Sigh*. I’ll just work it out again.” So even though things went swimmingly well with the Some_Complex_Formula bit the first time around, it gets evaluated again, which is damn resource intensive, if not downright irresponsible.
  • If there IS an error, then the Alternate_Formula bit gets evaluated instead. That Alternate_Formula is often just a zero in the case that you’re trying to work out, say, some percentages, and are trying to avoid a Divide By Zero error. But even in this very simple Alternate_Formula case, Excel has had to evaluate TWO functions in order to return that zero: an ISERROR check on the Some_Complex_Formula bit as well as an IF formula. Again, pretty resource intensive.
Now here’s the thing, my erroneous friends: Microsoft introduced a new formula IFERROR in 2007 which is much more efficient: IFERROR(Some_Complex_Formula, Alternate_Formula). But for some unfathomable reason, lots of you still aren’t using it! *Jeff wagging finger* Tut, tut, you naughty, naughty analysts.
This new IFERROR formula rocks, because it cuts the processing down by more than half:
  • the Some_Complex_Formula bit is only ever evaluated once
  • The Alternate_Formula bit only gets evaluated if the Some_Complex_Formula bit throws an error.
  • Excel only needs ONE function to do all this, as opposed to the ISERROR and IF duo it must contend with in the previous example
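To make that concrete, here’s the classic Divide By Zero case written both ways (the cell references are just for illustration):
The old, do-it-twice way: =IF(ISERROR(A2/B2), 0, A2/B2)
The 2007-and-later way: =IFERROR(A2/B2, 0)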
In fact, Chandoo once had such a crush on IFERROR that he worried his wife would become jealous of this new flame!

Don’t check your underpants for bad data every five minutes.

Instead, just check them for embarrassing errors once a day.
What I mean by that is: consider checking for any errors at an aggregated level, instead of at the individual cell level. So instead of using IFERROR across tens of thousands of intermediate cells to get rid of, say, Divide By Zero errors, you might instead consider checking for them – and taking remedial action only if they actually occur – at the point where you are aggregating the result of these tens of thousands of intermediate cells. End result: one IFERROR instead of tens of thousands. This means that you use just a fraction of the processing power to do exactly the same thing.
Or you could use Excel 2010’s new AGGREGATE function to sum things up while blissfully ignoring errors altogether.
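By way of illustration (the named range Intermediate_Results is made up for this example), those two approaches might look something like this:
One check at the top, instead of tens of thousands down below:
=IFERROR(SUM(Intermediate_Results), "Something upstream errored – go investigate")
Or let AGGREGATE do the summing and skip the errors for you (9 = SUM, 6 = ignore error values):
=AGGREGATE(9, 6, Intermediate_Results)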

Don’t break a date, or you’ll end up in separate cells.

Don’t make the mistake of breaking tens of thousands of dates down so that quarter is stored in one cell, and the year in another. Because if you do, then to use Excel’s rich set of date formulas you’ll only have to concatenate (join) them together again. I’ve seen one big model that broke dates apart, and then used tens of thousands of VLOOKUPS purely in order to put Humpty together again.
So keep them as dates. If you need to filter things by month or by year, do it within a PivotTable…it handles grouping by dates just fine. Or use the Advanced Filter.

Get a smaller lunchbox that matches the size of your lunch.

Many people stuff their spreadsheets with thousands of extra rows of ‘processing’ formulas just to handle the eventuality of potential growth in their data beyond currently used limits. Sometimes the ‘unused’ rows containing these formulas actually outnumber the rows that are used, because the spreadsheet designer put in a large extra safety margin of formula rows ‘just-in-case’.
Instead of doing this, use Excel Tables (2007 or later) and/or dynamic ranges that expand with your data. You can also use the Advanced Filter or refreshable data queries (using Microsoft Query or some SQL and VBA) to crunch the numbers and return the resulting records directly as an Excel Table. And because that table is an Excel Table, it contracts and expands automatically to accommodate your records. Magic! With zero redundancy.
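If you can’t use an Excel Table for some reason, a dynamic named range will do the job without any volatile functions. Here’s one way to define one (assuming your data sits in column A of Sheet1 with a single header row and no gaps – adjust to suit):
=Sheet1!$A$2:INDEX(Sheet1!$A:$A, COUNTA(Sheet1!$A:$A))
Because it’s built from INDEX and COUNTA rather than OFFSET, it doesn’t get recalculated every time something changes elsewhere in the workbook.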

Grand Piano taking up too much space? Try a Piano Accordion instead!

[Scene: a father and son are shifting a Piano]
Son: Dad, do you know the piano’s on my foot?
Dad: You hum it, son, and I’ll play it.
Many spreadsheets are filled with a zillion big, bulky formulas across multiple ‘helper’ columns whose sole purpose is to smash apart or mash together different datasets. All these formulas combined are about as heavy for Excel to lift as a Grand Piano is for you. Yet all this great big bulky frame does is serve up a fairly small ‘result set’ of final interest. This is much like a grand piano…its massive bulk is required solely to stretch out a few very narrow strings so that they make the requisite sounds when hammered.
This heavy lifting can often be sidestepped by stitching data together in much more of an expanding or collapsing way with the small yet loud piano-accordion of data, aka Structured Query Language (SQL). [Cue Cajun music]
SQL is basically a database language used to perform the database equivalent of lookups, to crunch numbers, or to conditionally join large datasets based on multiple complex conditions. SQL can be directly leveraged by Excel with minimal programming. Heck, you can use SQL to do stuff with NO programming whatsoever via Microsoft Query – a handy (if ancient) little interface bundled into Excel that will look familiar to any Access users.
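To give you a taste, here’s a rough sketch of running SQL straight against a worksheet via ADO. The sheet, field, and report names are invented for illustration, and the Extended Properties string varies slightly with file type:

' Sketch: run a SQL GROUP BY directly against a worksheet via ADO.
' Sheet name [Data$] and field names are placeholders.
Sub SqlAgainstASheet()
    Dim cn As Object, rs As Object
    Set cn = CreateObject("ADODB.Connection")
    cn.Open "Provider=Microsoft.ACE.OLEDB.12.0;" & _
            "Data Source=" & ThisWorkbook.FullName & ";" & _
            "Extended Properties=""Excel 12.0 Macro;HDR=Yes"";"   'use "Excel 12.0 Xml" for .xlsx files

    Set rs = cn.Execute( _
        "SELECT Region, SUM(Sales) AS TotalSales " & _
        "FROM [Data$] GROUP BY Region")

    Worksheets("Report").Range("A2").CopyFromRecordset rs

    rs.Close: cn.Close
End Sub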
For an excellent Excel-centric introduction to SQL, read Craig Hatmaker’s amazing Beyond Excel: VBA and Database Manipulation blog. Craig starts off by looking at how to use Microsoft Query – a fairly simple front end that helps you generate SQL queries – to get data and conditionally mash it up with other data. He then progressively teaches you more and more with every post, until you’re using Excel to add records to an Access database using a table-driven approach, so you don’t have to write SQL or update a single line of code.
Chandoo also has a great guest post by Vijay – Using Excel As Your Database – on this subject. Ignore all the naysayers in the comments who say “Excel shouldn’t be used as a database”…they’re missing the point. Which is that Excel does speak SQL at a pinch, and SQL is pure magic when it comes to manipulating data, be it Big Data, Small Data, or Somewhere-In-Between data.
Okay, therapy session over. Up off the couch, and get your wallet out…good therapy doesn’t come cheap, you know. And on your way out, leave a comment below about what you think. Best comment wins an all-expenses-paid bloated spreadsheet. I’ve got a million of them here to give away…

About the Author.

Jeff Weir – a local of Galactic North up there in Windy Wellington, New Zealand – is more volatile than INDIRECT and more random than RAND. In fact, his state of mind can be pretty much summed up by this:
=NOT(EVEN(PROPER(OR(RIGHT(TODAY())))))
That’s right, pure #VALUE!
Find out more at http://www.heavydutydecisions.co.nz
