Recognizing file format


stefano
January 17, 2008, 18:35:24
Hi.
My goal is to recognize the format of a file to determine whether or not it is an Office document (Word, Excel, etc.). I also need to extract its metadata and textual content.
I cannot use the file extension (.doc, .rtf, etc...) to determine the file format because my file has a random name.
I'm evaluating TX Text Control .NET Server 13.0. It works fine, but it seems that I have to know the file format before calling the Load() method. So I have no choice but to catch the exceptions and try to load the file several times in sequence, once per format. Here is my C# code:

LoadSettings loadSettings = new LoadSettings();
ServerTextControl file = new ServerTextControl();
if (file.Create()) {
    try {
        file.Load(url, StreamType.MSWord, loadSettings);
        // it is a Word document
    } catch {
        try {
            file.Load(url, StreamType.RichTextFormat, loadSettings);
            // it is an RTF document
        } catch {
            try {
                file.Load(url, StreamType.AdobePDF, loadSettings);
                // it is a PDF document
            } catch {
                // unknown document
                throw;
            }
        }
    } finally {
        file.Dispose();
    }
}
My question is: is there a simpler (and faster) way to determine the stream type of a file?

Thank you in advance
Regards
Stefano Babayantz
Tera Digital Publishing

Björn Meyer
January 18, 2008, 18:05:40
There are two possibilities. You can build either of them yourself:

http://www.textcontrolblog.com/archive/2005/03/23/file-format-detection-ii.htm

and

http://www.textcontrolblog.com/archive/2005/03/22/file-format-detection.htm
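For readers who cannot reach the linked posts: one common way to build such a detection yourself is to compare the first bytes of the file against well-known signatures ("magic numbers"). The following is only a minimal sketch of that general technique, not necessarily what the blog posts describe. `FormatSniffer`, `SniffedFormat`, `Detect` and `DetectFile` are hypothetical names and are not part of TX Text Control; the signatures themselves (OLE2 compound file, `{\rtf`, `%PDF-`, ZIP) are the standard ones.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper (not part of TX Text Control): guesses the
// container format of a file from its leading bytes.
public enum SniffedFormat { Unknown, OleCompound, Rtf, Pdf, Zip }

public static class FormatSniffer
{
    // OLE2 compound file: legacy .doc, .xls, .ppt all share this header.
    private static readonly byte[] OleMagic =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };
    private static readonly byte[] RtfMagic = Encoding.ASCII.GetBytes("{\\rtf");
    private static readonly byte[] PdfMagic = Encoding.ASCII.GetBytes("%PDF-");
    // ZIP local file header: used by .docx, .xlsx and any other ZIP archive.
    private static readonly byte[] ZipMagic = { 0x50, 0x4B, 0x03, 0x04 };

    // Classifies a buffer holding the first bytes of a file.
    public static SniffedFormat Detect(byte[] header)
    {
        if (StartsWith(header, OleMagic)) return SniffedFormat.OleCompound;
        if (StartsWith(header, RtfMagic)) return SniffedFormat.Rtf;
        if (StartsWith(header, PdfMagic)) return SniffedFormat.Pdf;
        if (StartsWith(header, ZipMagic)) return SniffedFormat.Zip;
        return SniffedFormat.Unknown;
    }

    // Reads up to 8 bytes from the file and classifies them.
    public static SniffedFormat DetectFile(string path)
    {
        byte[] buffer = new byte[8];
        int total = 0;
        using (FileStream fs = File.OpenRead(path))
        {
            int read;
            // Read may return fewer bytes than requested, so loop.
            while (total < buffer.Length &&
                   (read = fs.Read(buffer, total, buffer.Length - total)) > 0)
            {
                total += read;
            }
        }
        Array.Resize(ref buffer, total);
        return Detect(buffer);
    }

    private static bool StartsWith(byte[] data, byte[] magic)
    {
        if (data == null || data.Length < magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
        {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }
}
```

The result can then be used to pick the StreamType passed to Load() on the first attempt. A signature only identifies the container, though: an OLE2 header may just as well belong to an Excel file as to a Word file, so the Load() call should probably stay inside a try/catch.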