In this article I will show a simple way to get images from SharePoint and run OCR on them using Tessnet2, a .NET 2.0 OCR assembly.
OCR is an English acronym for Optical Character Recognition, a technology that recognizes characters in an image file or bitmap. With OCR it is possible to scan a sheet of printed text and get an editable text file.
Tessnet2 needs a folder containing its core processing libraries (the language data); in this case I have English and Portuguese. We also have to add the 64-bit DLL to the project, since I'm using SharePoint 2010.
In the first part of this article I will read the files of a SharePoint document library and save them to the hard drive in "c:\temp\images".
The SharePoint Process
Note that I process each file immediately inside the foreach. If we want to control whether the document is checked out (online or offline) or not, we have to use the switch included in the procedure.
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;
using System.Text;
using Microsoft.SharePoint;
using System.IO;
try
{
    // Local folder that receives the downloaded images.
    string ImagePath = @"c:\temp\images\";
    SPSite mysite = new SPSite("SPSite");
    SPWeb myweb = mysite.OpenWeb();
    SPFolder mylibrary = myweb.Folders["SPList"];
    SPFileCollection files = mylibrary.Files;
    foreach (SPFile item in files)
    {
        // Read the file content from SharePoint and write it to disk.
        byte[] binfile2 = item.OpenBinary();
        FileStream fstream = new FileStream(ImagePath + item.Name, FileMode.Create, FileAccess.ReadWrite);
        fstream.Write(binfile2, 0, binfile2.Length);
        fstream.Close();
        // Use this switch to react to the check-out state of the document.
        switch (item.CheckOutType)
        {
            case SPFile.SPCheckOutType.None:
                break;
            case SPFile.SPCheckOutType.Offline:
                break;
            case SPFile.SPCheckOutType.Online:
                break;
            default:
                break;
        }
    }
}
catch (Exception ex)
{
    //Whatever;
}
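The download step inside the loop can also be isolated in a small helper. The following is only a sketch: the ImageStore class and its Save method are hypothetical names that are not part of the original code, and it relies on nothing but System.IO. It creates the target folder if it is missing and builds the path with Path.Combine instead of string concatenation.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of the original article).
public static class ImageStore
{
    // Writes the bytes of one downloaded file into targetFolder and
    // returns the full path of the file that was written.
    public static string Save(string targetFolder, string fileName, byte[] content)
    {
        // Create the folder if it does not exist yet (no-op otherwise).
        Directory.CreateDirectory(targetFolder);

        // Keep only the file name part, in case a path was passed in.
        string safeName = Path.GetFileName(fileName);
        string fullPath = Path.Combine(targetFolder, safeName);

        // Overwrites an existing file with the same name, like FileMode.Create.
        File.WriteAllBytes(fullPath, content);
        return fullPath;
    }
}
```

Inside the foreach it could be called as ImageStore.Save(ImagePath, item.Name, item.OpenBinary()).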
I'm using a method that receives the path to the image and returns a StringBuilder, because appending to a StringBuilder is much faster than building up a string array. The method appends the recognized text word by word, adding a space after each word, and calls RemoveDiacriticals to strip the diacritics, which removes some of the OCR garbage:
private StringBuilder ProcessOcr(string imagePath)
{
    StringBuilder sb = new StringBuilder();
    using (Bitmap image = new Bitmap(imagePath))
    {
        using (tessnet2.Tesseract tessocr = new tessnet2.Tesseract())
        {
            // Load the Portuguese language data from the tessdata folder.
            tessocr.Init(@"c:\temp\tessdata", "por", false);
            List<tessnet2.Word> result = tessocr.DoOCR(image, Rectangle.Empty);
            foreach (tessnet2.Word word in result)
            {
                sb.Append(RemoveDiacriticals(word.Text) + " ");
            }
            return sb;
        }
    }
}
private string RemoveDiacriticals(string txt)
{
    // Decompose each accented character into base character + combining marks.
    string nfd = txt.Normalize(NormalizationForm.FormD);
    StringBuilder retval = new StringBuilder(nfd.Length);
    foreach (char ch in nfd)
    {
        // Skip the Unicode combining-mark ranges; keep everything else.
        if (ch >= '\u0300' && ch <= '\u036f') continue;
        if (ch >= '\u1dc0' && ch <= '\u1de6') continue;
        if (ch >= '\ufe20' && ch <= '\ufe26') continue;
        if (ch >= '\u20d0' && ch <= '\u20f0') continue;
        retval.Append(ch);
    }
    return retval.ToString();
}
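To see what this kind of stripping does to Portuguese text, here is a minimal, self-contained sketch. It is a variation, not the method above: instead of the explicit character ranges it filters on UnicodeCategory.NonSpacingMark, which is a common alternative and may remove a slightly different set of characters. The class name DiacriticsDemo and the method name Strip are hypothetical.

```csharp
using System.Globalization;
using System.Text;

// Hypothetical demo class (not part of the original article).
public static class DiacriticsDemo
{
    public static string Strip(string text)
    {
        // FormD splits "ç" into "c" + combining cedilla, "ã" into "a" + combining tilde, etc.
        string decomposed = text.Normalize(NormalizationForm.FormD);
        StringBuilder result = new StringBuilder(decomposed.Length);
        foreach (char ch in decomposed)
        {
            // Drop combining marks, keep base characters.
            if (CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
            {
                result.Append(ch);
            }
        }
        return result.ToString();
    }
}
```

For example, DiacriticsDemo.Strip("a\u00e7\u00e3o") (the word "ação") returns "acao".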
Now we go to the directory where I put the pictures taken from SharePoint. In this example I'm only processing .jpg files and extracting the OCR text.
I call GC.Collect() before processing each image in order to release memory.
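private string VamosNessa()
{
    // ImagePath must be accessible here (for example as a class-level field).
    DirectoryInfo di = new DirectoryInfo(ImagePath);
    FileInfo[] rgFiles = di.GetFiles("*.jpg");
    foreach (FileInfo fi in rgFiles)
    {
        GC.Collect();
        // Returns the OCR text of the first .jpg file found.
        return ProcessOcr(fi.FullName).ToString();
    }
    // No .jpg file in the folder.
    return string.Empty;
}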
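If the folder holds several images and the text of each one is needed, the loop can collect the results instead of returning early. The following is a sketch under assumptions: BatchOcr and ProcessFolder are hypothetical names, and the OCR step is passed in as a delegate so the example does not depend on Tessnet2 itself.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not part of the original article).
public static class BatchOcr
{
    // Runs the supplied OCR function on every .jpg in folder and returns
    // the recognized text keyed by file name.
    public static Dictionary<string, string> ProcessFolder(string folder, Func<string, string> ocr)
    {
        Dictionary<string, string> results = new Dictionary<string, string>();
        DirectoryInfo di = new DirectoryInfo(folder);
        foreach (FileInfo fi in di.GetFiles("*.jpg"))
        {
            // ocr receives the full path of the image and returns its text.
            results[fi.Name] = ocr(fi.FullName);
        }
        return results;
    }
}
```

With the methods from this article it could be called as BatchOcr.ProcessFolder(ImagePath, path => ProcessOcr(path).ToString()).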
If you want to upload the OCR text to a field in a list, we need to know the document's link in SharePoint; we can keep it in one of the previous methods. Then I call CheckOut(), Update() and CheckIn(). Be sure to check the file's SPCheckOutType first, because we may not want to touch anything that is not approved; whether to do so is up to you.
We will use two fields: a Boolean field that tells me whether the OCR has been processed, and a multi-line text field to hold the OCR text.
item.File.CheckOut();
item["OCR"] = VamosNessa();
item["BOOL"] = "1";
item.Update();
item.File.CheckIn("Ok");
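The check on SPCheckOutType mentioned above can be combined with the update in one place. This is only a sketch, assuming item is an SPListItem from the document library and that the "OCR" and "BOOL" fields exist as described; the OcrUploader class and the TryStoreOcr method are hypothetical names, and the code needs a reference to the Microsoft.SharePoint assembly to compile.

```csharp
using System;
using Microsoft.SharePoint;

// Hypothetical helper (not part of the original article).
public static class OcrUploader
{
    // Returns true when the text was stored, false when the file was skipped.
    public static bool TryStoreOcr(SPListItem item, string ocrText)
    {
        // Skip files that are already checked out (online or offline).
        if (item.File.CheckOutType != SPFile.SPCheckOutType.None)
        {
            return false;
        }

        item.File.CheckOut();
        try
        {
            item["OCR"] = ocrText;   // multi-line text field
            item["BOOL"] = "1";      // flag: OCR processed
            item.Update();
            item.File.CheckIn("Ok");
            return true;
        }
        catch (Exception)
        {
            // Do not leave the file checked out if the update fails.
            item.File.UndoCheckOut();
            throw;
        }
    }
}
```

Skipping checked-out files is one possible policy; as noted above, which check-out states to accept is up to you.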