Parse Excel Documents Quickly with Powershell

06 Sep 2013

Using Powershell to manipulate Microsoft Excel documents can be an extremely handy and helpful tool. Setting out to interface an Excel document into EnterpriseOne I found this MSDN blog post by Sriram Reddy to help with my initial PS script. Sriram’s method is to use a COM interface into Excel to gain access to its content. This method essentially starts up Excel, in the background or foreground, and gives you access almost as if you were using the application directly. The problem I soon discovered was this method proves to be excessively slow, I was seeing performance in the area of .6 seconds per row read and I was only reading around 1000 rows of data. In search of better performance I found a better way to access Excel via Powerhsell using OLEDB that is extemely efficient and easy to use.

Here is a snippet of the sample data I used for this test.
Sample Data

My first script shown below ran very slow.

########################################################################################################
# This is a common function I am using which will release excel objects
########################################################################################################
function Release-Ref ($ref) {
    ([System.Runtime.InteropServices.Marshal]::ReleaseComObject([System.__ComObject]$ref) -gt 0)
    [System.GC]::Collect()
    [System.GC]::WaitForPendingFinalizers()
}

########################################################################################################
# Creating excel object
########################################################################################################
$objExcel = new-object -comobject excel.application 

# Set to false to not open the app on screen.
$objExcel.Visible = $False

########################################################################################################
# Directory location where we have our excel files
########################################################################################################
$ExcelFilesLocation = "C:/TestExcelData/" + $args[0]

########################################################################################################
# Open our excel file
########################################################################################################
$UserWorkBook = $objExcel.Workbooks.Open($ExcelFilesLocation) 

########################################################################################################
# Here Item(1) refers to sheet 1 of of the workbook. If we want to access sheet 10, 
# we have to modify the code to Item(10)
########################################################################################################
$UserWorksheet = $UserWorkBook.Worksheets.Item(2)

########################################################################################################
# This is counter which will help to iterrate trough the loop. This is simply a row counter
# I am starting row count as 2, because the first row in my case is header. 
# So we dont need to read the header data
########################################################################################################
$intRow = 2

$a = Get-Date
write-host $a
write-host "---------------------------"

Do {

    # Reading the first column of the current row
    $TicketNumber = $UserWorksheet.Cells.Item($intRow, 1).Value()
    write-host $intRow " " $TicketNumber    
    $intRow++

} While ($UserWorksheet.Cells.Item($intRow,1).Value() -ne $null)

$a = Get-Date
write-host $a
write-host "---------------------------"

########################################################################################################
# Exiting the excel object
########################################################################################################
$objExcel.Quit()

########################################################################################################
#Release all the objects used above
########################################################################################################
$a = Release-Ref($UserWorksheet)
$a = Release-Ref($UserWorkBook) 
$a = Release-Ref($objExcel)

Adding the timestamps showed just how slow this method was at accessing Excel data, running right around .6 seconds per row. That translates to my users staring patiently at their screen for 1 minute every 120 rows of excel data they are trying to process. I don’t know about your users, but my mine will definitely not be kind to me with that kind of performance, nor should they be.


PS C:\Powershell\ExcelInterfaceTest> .\ExcelCOMInterface.ps1 test.xlsx
9/6/2013 4:46:01 PM
---------------------------
2   078110
3   078108
4   078107
<SNIP>
244   078347
245   078350
9/6/2013 4:48:33 PM
---------------------------
PS C:\Powershell\ExcelInterfaceTest>

End users were going to be the final destination for this interface; having performance hover in the area of several minutes to process the excel data was not going to pass muster. Asking on StackOverflow was I directed to using an OLEDB to access Excel data from Poweshell and even uses SQL syntax to do so!

Here is the much faster new script using the OLEDB to accomplish the same task above. It should be noted using this method your first row of data must contain your column header information, what columns you ‘SELECT’ from in SQL terms.

########################################################################################################
# Directory location where we have our excel files
########################################################################################################
$ExcelFile = "C:/TestExcelData/" + $args[0]
$Sheetname = "Sheet2$"

########################################################################################################
$OleDbConn = New-Object .System.Data.OleDb.OleDbConnection.
$OleDbCmd = New-Object .System.Data.OleDb.OleDbCommand.
$OleDbAdapter = New-Object .System.Data.OleDb.OleDbDataAdapter.
$DataTable = New-Object .System.Data.DataTable.
$OleDbConn.ConnectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=`"$ExcelFile`";Extended Properties=`"Excel 12.0 Xml;HDR=YES`";"
$OleDbConn.Open()
$OleDbCmd.Connection = $OleDbConn
$OleDbCmd.commandtext = .Select * from [$Sheetname].
$OleDbAdapter.SelectCommand = $OleDbCmd
########################################################################################################

$RowsReturned = $OleDbAdapter.Fill($DataTable)
$intRow = 1
ForEach ($DataRec in $DataTable) {

    # Reading the first column of the current row
	$TicketNumber = $DataRec."ticket number"
    write-host $intRow " " $TicketNumber  
    $intRow++
}

While an OLEDB is still a COM object it is much faster than using excel.application COM, in addition you can access your data using SQL syntax increasing readability for maintaining the script. I didn’t add timings to judge performance with this script because it was a night and day difference, completion literally within 1-2 seconds; there was no doubt that this was the solution to use.