TechNet Wiki v2

SharePoint 2010: Simple Way to Get Images from SharePoint and Process OCR

Overview / Survival Guide

In this article I describe a simple way to get images from SharePoint and run OCR on them using Tessnet2, a .NET 2.0 OCR assembly.

OCR stands for Optical Character Recognition, a technology that recognizes characters in an image file, or bitmap. With OCR we can scan a sheet of printed text and get an editable text file.

Media Type/Task

Tessnet2 needs a folder containing its core language data files (the tessdata folder); in this case I have English and Portuguese. We also have to add the 64-bit DLL to the project, since I'm using SharePoint 2010.

In the first part of this article I read the files from a SharePoint document library and save them to the hard drive in "c:\temp\images".

The SharePoint Process

Note that I process each file immediately inside the foreach loop. If we want to control what happens depending on whether the document is checked out (online, offline, or not at all), we have to use the switch on CheckOutType included in the procedure.

using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;
using System.Text;
using Microsoft.SharePoint;
using System.IO;

try
{
    string ImagePath = @"c:\temp\images\";

    // "SPSite" and "SPList" are placeholders for the site URL and the library URL.
    // SPSite and SPWeb are disposed when the using blocks end.
    using (SPSite mysite = new SPSite("SPSite"))
    using (SPWeb myweb = mysite.OpenWeb())
    {
        SPFolder mylibrary = myweb.Folders["SPList"];
        SPFileCollection files = mylibrary.Files;

        foreach (SPFile item in files)
        {
            // Save a copy of the file in the local image folder.
            byte[] binfile2 = item.OpenBinary();
            using (FileStream fstream = new FileStream(ImagePath + item.Name,
                FileMode.Create,
                FileAccess.ReadWrite))
            {
                fstream.Write(binfile2, 0, binfile2.Length);
            }

            // Branch here to treat each check-out state differently.
            switch (item.CheckOutType)
            {
                case SPFile.SPCheckOutType.None:
                    break;
                case SPFile.SPCheckOutType.Offline:
                    break;
                case SPFile.SPCheckOutType.Online:
                    break;
                default:
                    break;
            }
        }
    }
}
catch (Exception ex)
{
    // Handle or log the exception here.
}


I'm using a method that takes the path to the image and returns a StringBuilder, because appending to a StringBuilder is cheaper than repeatedly concatenating strings or growing a string[] array.
The method appends the recognized text word by word, adding a space after each word, and calls RemoveDiacriticals to strip the diacritics, which removes some of the OCR garbage:

private StringBuilder ProcessOcr(string imagePath)
{
    StringBuilder sb = new StringBuilder();
    using (Bitmap image = new Bitmap(imagePath))
    {
        using (tessnet2.Tesseract tessocr = new tessnet2.Tesseract())
        {
            // Load the Portuguese ("por") language data from the tessdata folder.
            tessocr.Init(@"c:\temp\tessdata", "por", false);

            List<tessnet2.Word> result = tessocr.DoOCR(image, Rectangle.Empty);
            foreach (tessnet2.Word word in result)
            {
                // Append each recognized word without diacritics, followed by a space.
                sb.Append(RemoveDiacriticals(word.Text) + " ");
            }
            return sb;
        }
    }
}

 
private string RemoveDiacriticals(string txt)
{
    // Decompose accented characters into a base character plus combining marks.
    string nfd = txt.Normalize(NormalizationForm.FormD);
    StringBuilder retval = new StringBuilder(nfd.Length);
    foreach (char ch in nfd)
    {
        // Skip characters that fall in the combining-mark ranges.
        if (ch >= '\u0300' && ch <= '\u036f') continue;
        if (ch >= '\u1dc0' && ch <= '\u1de6') continue;
        if (ch >= '\ufe20' && ch <= '\ufe26') continue;
        if (ch >= '\u20d0' && ch <= '\u20f0') continue;
        retval.Append(ch);
    }
    return retval.ToString();
}
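
As a quick check of the idea behind RemoveDiacriticals, the following console sketch applies the same decompose-and-filter step to a couple of Portuguese words. The class name DiacriticsDemo and the sample words are assumptions made for illustration, and only the first combining-mark range (U+0300 to U+036F) is filtered here, which is enough for the common Latin accents.

```csharp
using System;
using System.Text;

// Hypothetical demo class; not part of the original article.
public static class DiacriticsDemo
{
    // Decomposes each character (FormD) and drops combining marks
    // in the U+0300..U+036F range, so "ç" becomes "c".
    public static string RemoveDiacriticals(string txt)
    {
        string nfd = txt.Normalize(NormalizationForm.FormD);
        StringBuilder retval = new StringBuilder(nfd.Length);
        foreach (char ch in nfd)
        {
            if (ch >= '\u0300' && ch <= '\u036f') continue;
            retval.Append(ch);
        }
        return retval.ToString();
    }

    public static void Main()
    {
        // Sample words, written with escapes so the source file stays ASCII:
        // \u00e7 is c-cedilla and \u00e3 is a-tilde.
        Console.WriteLine(RemoveDiacriticals("a\u00e7\u00e3o"));    // prints "acao"
        Console.WriteLine(RemoveDiacriticals("cora\u00e7\u00e3o")); // prints "coracao"
        Console.WriteLine(RemoveDiacriticals("plain"));             // prints "plain"
    }
}
```

Only letters that decompose into a base letter plus a combining mark are affected; characters such as "ø" or "ß" have no such decomposition and pass through unchanged.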

 
Now go through the directory where I put the pictures taken from SharePoint. In this example I'm only processing .jpg files and extracting the OCR text.

Use GC.Collect() to release memory.

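private string VamosNessa()
{
    // ImagePath is the folder the SharePoint files were saved to.
    DirectoryInfo di = new DirectoryInfo(ImagePath);
    FileInfo[] rgFiles = di.GetFiles("*.jpg");
    foreach (FileInfo fi in rgFiles)
    {
        GC.Collect();
        // Returns the OCR text of the first .jpg file found.
        return ProcessOcr(fi.FullName).ToString();
    }
    // No .jpg files in the folder.
    return string.Empty;
}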
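
Because the method above returns as soon as the first file has been processed, it can help to see a variant that keeps one result per image. The sketch below is an illustration under assumptions, not the article's method: OcrFolderDemo, OcrFolder, its ocr parameter and the dictionary return type are hypothetical names, and the OCR step is passed in as a delegate so the example does not depend on Tessnet2. In the article's setting the delegate would presumably be something like path => ProcessOcr(path).ToString().

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper; the names are illustrative and not from the article.
public static class OcrFolderDemo
{
    // Runs the supplied OCR function over every .jpg file in a folder
    // and returns the recognized text keyed by file name.
    public static Dictionary<string, string> OcrFolder(string folder, Func<string, string> ocr)
    {
        Dictionary<string, string> results = new Dictionary<string, string>();
        DirectoryInfo di = new DirectoryInfo(folder);
        foreach (FileInfo fi in di.GetFiles("*.jpg"))
        {
            results[fi.Name] = ocr(fi.FullName);
        }
        return results;
    }

    public static void Main()
    {
        // Build a throwaway folder with two empty .jpg files and one other file.
        string folder = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(folder);
        File.WriteAllBytes(Path.Combine(folder, "a.jpg"), new byte[0]);
        File.WriteAllBytes(Path.Combine(folder, "b.jpg"), new byte[0]);
        File.WriteAllBytes(Path.Combine(folder, "notes.txt"), new byte[0]);

        // Stand-in OCR function: returns a fixed marker instead of real OCR output.
        Dictionary<string, string> texts =
            OcrFolder(folder, path => "text of " + Path.GetFileName(path));

        foreach (KeyValuePair<string, string> pair in texts)
        {
            Console.WriteLine(pair.Key + ": " + pair.Value);
        }

        // Clean up the temporary folder.
        Directory.Delete(folder, true);
    }
}
```

Keying the results by file name makes it possible to match each OCR text back to the SharePoint document it came from, since the files were saved under item.Name.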

To upload the OCR text to a field in a list, we need to know the document's link in SharePoint; we can keep it from one of the previous methods. Then I call CheckOut(), Update() and CheckIn(). Be sure to check the SPCheckOutType first, because we may not want to touch anything that is not approved; that decision is up to you.
We will use two fields: a Bool that tells me whether the OCR has been processed, and a MultiText that holds the OCR text.


item.File.CheckOut();
item["OCR"] = VamosNessa();
item["BOOL"] = "1";
item.Update();
item.File.CheckIn("Ok");

 

References

http://www.pixel-technology.com/freeware/tessnet2/


    
    








    

    

    
        

            
© 2015 Microsoft Corporation. All rights reserved.
                
            

            
This page has been extracted by Pete Laker, Microsoft Azure MVP & Microsoft IT Implementer
            
        

        

        

        

        