Cleaning up markup-cluttered HTML
The nifty, small but powerful command-line tools wv and unrtf (Linux/Unix only) can be used to convert Microsoft Word files and RTF files into HTML. And I don’t mean the very ugly stuff that Word itself produces when exporting to HTML. However, you are then stuck with HTML files that are loaded with formatting markup: fonts, divs, et cetera. This little piece of regexp can help you remove all that. It removes comments and formatting tags from an HTML page:

(\<!---? ?[^-]+ ?-?--\>) | (\</?font ?[^>]*\>) | (\</?span ?[^>]*\>) | (\</?div ?[^>]*\>) | (\</?u\>) | (\</?b\>) | (\</?i\>)
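As a rough illustration, the pattern could be applied from a script along the following lines. This is only a sketch in Python, which the original does not mention. It assumes the comment delimiters are the plain `<!--` and `-->`, drops the spaces around each `|` so they are not matched literally, and leaves `<` and `>` unescaped. The function name `strip_formatting` and the case-insensitive flag are assumptions added for the example.

```python
import re

# Same alternatives as the pattern above, written for Python's re module.
# Assumption: case-insensitive matching, so <FONT> and <font> are both caught.
FORMATTING = re.compile(
    r"<!---? ?[^-]+ ?-?-->"    # HTML comments without a hyphen in their text
    r"|</?font ?[^>]*>"        # <font ...> and </font>
    r"|</?span ?[^>]*>"        # <span ...> and </span>
    r"|</?div ?[^>]*>"         # <div ...> and </div>
    r"|</?u>|</?b>|</?i>",     # bare underline, bold and italic tags
    re.IGNORECASE,
)


def strip_formatting(html: str) -> str:
    """Return html with comments and formatting tags removed (hypothetical helper)."""
    return FORMATTING.sub("", html)


if __name__ == "__main__":
    sample = '<div class="x"><font face="Arial"><b>Hi</b></font> <!-- note --> there</div>'
    print(strip_formatting(sample))
```

Only the tags are removed and the text between them is kept. Because of the `[^-]+` part, a comment whose text contains a hyphen is left in place, and the whitespace around removed tags stays as it was.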