MHT Extractor

Goal

Save web pages inside Alfresco and have them indexed. Knowledge workers often need to store related web pages within a project's document collection. These pages should also be indexed like any other document.

After some research, the Microsoft MHT format (MHTML) appears to be the appropriate one. It packs a web page into a single file in RFC 822 message format, including all images and other binary elements. Export of MHT files is built into Microsoft Internet Explorer 7. A plugin is also available for Firefox.
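Because an MHT file is an RFC 822 style MIME message, a standard mail parser can open it. The following is a minimal sketch using Python's standard `email` package, not the extractor's actual code; the sample bytes, the boundary string and the helper name `list_parts` are made up for illustration.

```python
from email import message_from_bytes

# Tiny hand-written MHT sample (hypothetical content, for illustration only).
SAMPLE_MHT = (
    b"From: <Saved by a browser>\r\n"
    b"Subject: Example page\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; type="text/html"; boundary="----=_Part_0"\r\n'
    b"\r\n"
    b"------=_Part_0\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"Content-Location: http://example.org/index.html\r\n"
    b"\r\n"
    b'<html><body><h1>Hello</h1><img src=3D"logo.png"></body></html>\r\n'
    b"------=_Part_0\r\n"
    b"Content-Type: image/png\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"Content-Location: http://example.org/logo.png\r\n"
    b"\r\n"
    b"iVBORw0KGgo=\r\n"
    b"------=_Part_0--\r\n"
)


def list_parts(raw: bytes):
    """Return (content type, Content-Location) for every non-container part."""
    msg = message_from_bytes(raw)
    return [
        (part.get_content_type(), part.get("Content-Location"))
        for part in msg.walk()
        if not part.is_multipart()
    ]


if __name__ == "__main__":
    for content_type, location in list_parts(SAMPLE_MHT):
        print(content_type, location)
```

The page itself and each embedded image show up as separate MIME parts, which is what lets a single MHT file carry the complete page.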

Use Case

A user has opened a web site in the browser and wants to save this page in Alfresco.

  1. The user saves the web page as an MHT file, either directly in Alfresco via the CIFS interface or somewhere in the file system, and later imports the file into the right space in Alfresco.
  2. The document appears as a web file. In the info tab you see the content.
    In the title field you see the original address, so that you can later reopen the web site with its current content instead of the content captured when the MHT file was created.
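How the original address could be recovered from an MHT file can be sketched as follows. This assumes that the address is stored in the `Content-Location` header of the first `text/html` part, which is how browsers commonly write MHT files; whether the extractor described here reads exactly that header is an assumption. The function name `original_url` and the sample data are hypothetical.

```python
from email import message_from_bytes


def original_url(raw: bytes):
    """Return the Content-Location of the first text/html part, or None."""
    msg = message_from_bytes(raw)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            return part.get("Content-Location")
    return None


# Hypothetical sample with a single part; real exports carry images as further parts.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="----=_Part_0"\r\n'
    b"\r\n"
    b"------=_Part_0\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Location: http://example.org/news/article.html\r\n"
    b"\r\n"
    b"<html><body>Snapshot</body></html>\r\n"
    b"------=_Part_0--\r\n"
)

if __name__ == "__main__":
    print(original_url(SAMPLE))
```

The returned value is what would be written into the title field of the stored document.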

Feature requests

The original URL should be a clickable link. The extractor should extract the text content without HTML/XHTML tags for indexing and preview.
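The requested text extraction could look roughly like the sketch below. It uses Python's standard `html.parser` module rather than anything Alfresco provides, and the names `TextExtractor` and `html_to_text` are hypothetical. Skipping `<script>` and `<style>` content is an added assumption, since that content is rarely useful in a search index.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping what is inside <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Join the chunks and collapse all whitespace runs to single spaces.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    """Return the plain text of an HTML/XHTML string without any tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    print(html_to_text("<h1>Title</h1><p>Hello &amp; welcome</p>"))
```

Joining the chunks with spaces keeps headings and paragraphs from running together, at the cost of occasionally splitting a word that contains inline markup.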

Realisation

Andreas Hartmann, 02.04.2009 10:30:

forum/alfresco/mht-extractor.txt · Last modified: 2023/11/19 22:46 (external edit)