Entries from August 2011

On extracting data from Windows files

18:33

Tuesday, August 16. 2011

Hello, readers of my blog
(what, there are no more readers? then I'll write it for myself)

From time to time, you need to extract some data from a "legacy Windows file format" (in my case, it was a JET database). Often the data is encoded in the Windows-specific cp1256 encoding (or something similar), but the tool you use assumes cp1252 (or latin1; the two are mostly compatible, unlike, say, cp1256 and iso-8859-6), and you end up with text full of "arcane" Latin characters like:

Â ÇáÃáÝ ÊÃúáíÝåÇ ãä åãÒÉ æáÇã æÝÇÁ

when it should read:

آ الألف تأْليفها من همزة ولام وفاء
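To make the mix-up concrete, here is a small sketch in Python 3 of how such text comes about. It is only my illustration of the mismatch, not what any particular tool does internally, and the helper name garble is made up for the example: the bytes in the file are cp1256, but they get decoded as latin1.

```python
# Illustration only: simulate a tool that reads cp1256 bytes as latin1.
def garble(text, actual="cp1256", assumed="latin1"):
    """Encode text the way the legacy file stores it, then misread it."""
    raw = text.encode(actual)    # bytes as they sit in the legacy file
    return raw.decode(assumed)   # what a tool assuming latin1 displays

word = "\u0645\u0646"            # the Arabic word "من" (meem, noon)
garbled = garble(word)           # cp1256 bytes E3 E4 read as latin1: "ãä"
```

This is the same "ãä" that shows up as the third word of the garbled sample above.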

Previously, I thought that after such an (incorrect) conversion the text would be mostly damaged beyond repair. But yesterday, when I ran into this problem again, I thought I could try to get something out of it (even if it came out a bit broken), so I fired up ipython and after a few trials I found this:
# Python 2: file.in holds the garbled text, saved as UTF-8
f = open('file.in').read()
# undo the wrong latin1 decoding, then decode the raw bytes as cp1256
print >> open('file.out', "w"), f.decode('utf8').encode('latin1').decode('cp1256').encode('utf8')

and got the text in a readable form, as you can see above. I hope someone finds this useful (replace cp1256 with whatever encoding your data actually uses).
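For anyone on Python 3, where str no longer has a decode method, the same trick might look roughly like the sketch below. The function names fix_mojibake and fix_file and their parameters are my own invention for illustration, and I am assuming the garbled text was saved as UTF-8, as in my case.

```python
def fix_mojibake(text, assumed="latin1", actual="cp1256"):
    """Undo a wrong decoding: get the raw bytes back, then decode them properly."""
    return text.encode(assumed).decode(actual)


def fix_file(src, dst, assumed="latin1", actual="cp1256"):
    """Read garbled UTF-8 text from src and write the repaired text to dst."""
    with open(src, encoding="utf8") as fin:
        garbled = fin.read()
    with open(dst, "w", encoding="utf8") as fout:
        fout.write(fix_mojibake(garbled, assumed, actual))
```

Calling fix_file('file.in', 'file.out') would correspond to the one-liner above. If the tool really assumed cp1252 rather than latin1, passing assumed="cp1252" may be the better choice, though I have not tried that case.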
Posted by Abderrahim Kitouni in english