Smart CODE
Your on-line guide to the generated code

Using Web Content. HTML and beyond. A worked example


Introduction

This example takes the fetch application one stage further. We have already designed an application that will retrieve the raw HTML for any URL on the World Wide Web. Now we want to use the information in that document for our own purposes.

HTML was chosen as the format for Web data for one very simple reason. The Web was originally intended as a vast hypertext information repository, in which a web page contains structured information marked up according to the HTML DTD. The web architects didn't want people to have to target their pages at a particular viewing mechanism. The point of the Web is the information it holds, not a ready-typeset set of pages viewable only by a particular proprietary browser.

For a long time, this has only been of real interest to developers of web browsers, although one advantage of marking up content in terms of what it is, not how it should look, is that it is far easier to write a page of HTML by hand than it is to produce a similar page marked up in TeX, *roff, or other display-oriented notations.

With X-Designer 7, any application can use resources on the Internet as easily as resources on your local machine. But it is one thing to download a URL as an HTML document. It is quite another thing altogether to process this document and use the information to meet your needs, and not just display it in the default way it would be shown by a simple web browser.

Getting at the information quickly

The first thing anyone using Internet resources will need is a way of parsing HTML. Here are two no-no's.
  • You don't want to write your own parser. Why reinvent the wheel?
  • You don't want to use a hand-coded HTML parser, as the HTML standard is constantly changing (currently HTML 3.2 is widely used) and other SGML DTDs are also beginning to be used on the Web, notably XML, the Extensible Markup Language, a halfway house between the HTML DTD and full SGML.

    We have taken the reference SGML parser materials from the SGML User Group, which have very generous license provisions, and adapted them to produce a general purpose SGML parser engine. It is driven directly from the HTML DTD, so you can update the parsing by updating the DTD. No recompilation is necessary. You can also use it with other standard and in-house DTDs that may be used on your Intranet.

    Parsing is difficult. Even taking the parse tree output of a parser and analysing it is not the work of a few minutes. We have simplified the process by treating the data input stream coming from the Web in the same way as the user input stream. Your application handles the user's input by expressing interest in certain key events caused by the user. Each time one of these events happens, your callback is invoked. You can access the HTML data coming from the Internet in exactly the same way.

    Picking interesting stuff out of the HTML InputStream

    All you have to do is decide what features of the HTML interest you, and register your own callback that will be invoked when the particular tag or attribute shows up in the input stream.

    This example picks out all the links in a document and displays them in a text widget. To do this it expresses an interest in the HREF attribute of an <A> anchor. It doesn't care about any other part of the input, so it doesn't waste time on any of it. The link references are plucked out of the stream as they are seen.
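
The callback model described above can be sketched in a few lines of plain C. This is only an illustration of the idea of registering interest in a tag/attribute pair and being called back when it turns up. It is not the parser engine, and every name in it (toy_registry_t, toy_add_attr_callback, toy_dispatch, toy_count_link) is hypothetical, as is the callback signature. The part that would normally be done by the parser, namely spotting attributes in the stream, is simulated by calling toy_dispatch() by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type for this sketch only.
 * It is NOT the signature expected by the SGML.h API. */
typedef void (*toy_attr_cb)(const char *value, void *client_data);

#define TOY_MAX_CALLBACKS 8

typedef struct {
    const char *tag;          /* element name, e.g. "A" */
    const char *attr;         /* attribute name, e.g. "HREF" */
    toy_attr_cb cb;           /* routine to invoke */
    void       *client_data;  /* passed back to the routine untouched */
} toy_interest_t;

typedef struct {
    toy_interest_t items[TOY_MAX_CALLBACKS];
    size_t         count;
} toy_registry_t;

/* Register interest in one tag/attribute pair.
 * Returns 0 on success, -1 if the registry is full. */
int toy_add_attr_callback(toy_registry_t *r, const char *tag,
                          const char *attr, toy_attr_cb cb,
                          void *client_data)
{
    assert(r != NULL && tag != NULL && attr != NULL && cb != NULL);
    if (r->count >= TOY_MAX_CALLBACKS)
        return -1;
    r->items[r->count].tag = tag;
    r->items[r->count].attr = attr;
    r->items[r->count].cb = cb;
    r->items[r->count].client_data = client_data;
    r->count++;
    return 0;
}

/* What a parser would do each time it sees an attribute: invoke every
 * callback registered for that tag/attribute pair and ignore the rest.
 * Names are compared exactly here; a real HTML parser would ignore case.
 * Returns the number of callbacks invoked. */
int toy_dispatch(const toy_registry_t *r, const char *tag,
                 const char *attr, const char *value)
{
    int fired = 0;
    size_t k;

    for (k = 0; k < r->count; k++) {
        if (strcmp(r->items[k].tag, tag) == 0 &&
            strcmp(r->items[k].attr, attr) == 0) {
            r->items[k].cb(value, r->items[k].client_data);
            fired++;
        }
    }
    return fired;
}

/* Example callback: counts the links it is shown. */
void toy_count_link(const char *value, void *client_data)
{
    (void) value;
    (*(int *) client_data)++;
}
```

Input that nobody has registered an interest in simply falls through toy_dispatch() without any work being done, which is the property the example relies on.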

    Things you must do/know BEFORE running the example

    Environment Variables etc

    This example is the only one that uses code that has been precompiled. The sources are available in $XDROOT/src/sgml, and the license provisions mean that you can do anything you want with them. However we recommend that when you start out you use our precompiled engine.

    There are some files and directories you should know about:

  • $XDROOT/lib contains an archive and a shared version of the sgml library. The default compilation will use libsgml.so, but you can link with libsgml.a if you wish.
  • $XDROOT/src/sgml/hdrs/SGML.h is the include file necessary to use our parser engine API. It is referenced in the Makefile if you set the use SGML toggle in the Smart Code customiser dialog.
  • $XDROOT/src/sgml/dtds is the directory containing the HTML 3.2 DTD and other related data files. The parser needs to find this directory, so you will need to set the DTDDIR environment variable to point to it.
  • As always, you should make sure that $XDROOT/bin is in your path.

    Here is a summary of the variables you will need to set up before working through the example:

    XDROOT
    LM_LICENSE_FILE
    DTDDIR should be set to $XDROOT/src/sgml/dtds
    LD_LIBRARY_PATH should include $XDROOT/lib
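
If you want your own program to check these variables at start-up, a small helper along the following lines would do. It is a minimal sketch using only the standard getenv() call. The helper name count_missing_env and the array xd_required_env are our own inventions for illustration and are not part of X-Designer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Count the variables in names[0..n-1] that are unset or empty,
 * reporting each one on stderr. Returns 0 when all are usable. */
int count_missing_env(const char *const names[], size_t n)
{
    int missing = 0;
    size_t k;

    assert(names != NULL || n == 0);
    for (k = 0; k < n; k++) {
        const char *value = getenv(names[k]);
        if (value == NULL || value[0] == '\0') {
            fprintf(stderr, "%s is not set\n", names[k]);
            missing++;
        }
    }
    return missing;
}

/* The four variables listed in the summary above. */
static const char *const xd_required_env[] = {
    "XDROOT", "LM_LICENSE_FILE", "DTDDIR", "LD_LIBRARY_PATH"
};
```

Calling count_missing_env(xd_required_env, 4) before doing anything else gives a clear message for each missing variable. Note that it only checks that the variables have a value, not that DTDDIR or LD_LIBRARY_PATH point to the right place.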

    Preparing the Example

    Step 1 Create a new directory called, for example, htmlfilter
    Step 2 Change directory to htmlfilter
    Step 3 Run xdesigner using the xdreplay script $XDROOT/src/examples/sc/hfetch.vcr, ie
    
    $ xdreplay -f $XDROOT/src/examples/sc/hfetch.vcr xdesigner
    
    

    This will create the example application, generate the code, copy standard versions of the callbacks you would have written into the appropriate place, and invoke make to build the program.

    Running the Example

    First, check that you have set up the environment variables. If there is a problem, check that they are set correctly, eg
    
    $ ls $DTDDIR
    
    

    You run the example application by typing
    
    $ ./untitled
    
    

    The application behaves exactly like the previous HTML fetch program, except that all that is displayed in the text widget is the filtered set of links from the URL.

    If you are behind a firewall, you should set the proxy field to the name of your proxy, and the port field to the port number, just as you would to configure a proxy for your web browser.

    To fetch and process the URL, type the full URL, eg:

    
    http://www.ist.co.uk/index.html
    
    
    into the URL field and press the Fetch button.

    The processed information will be placed into the text widget.

    Some words of explanation

    All of the explanations in the original HTML fetch example apply here too. The interface structure, the groups and callbacks are the same.

    The only difference is that in the customiser, the needs SGML toggle has been set. This makes sure that the include directories and the library settings in the Makefile are adjusted to build and link the application with HTML parser support.

    Processing the incoming HTML

    This is the only significant change from the previous example, where we simply read the input into a buffer and displayed it in the data text widget.

    The replay/build script copies a fixed version of this file into the callouts_c directory. This working, fully commented file is in

    $XDROOT/src/examples/sc/datafiles/hfetch/myData.c

    There are two parts to the file callouts_c/myData.c. The first is the myData() function, which checks the Mime type of the incoming data, sets up a callback on the HREF attribute of an <A> anchor, calls the parser engine, and then calls a routine to update the data area. This is the most important part of the example, as it shows how you can process a data input stream as easily as you would set up a callback in X/Motif.

    The second part of the example is the callback itself, which adds the link data to a list, and the update routine that takes the list and displays it in the data text widget.
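
To give a feel for that second part, here is a minimal sketch of a list that collects link strings and joins them into one block of text ready for display. It is our own illustration, not the code in myData.c: the type and function names (link_list_t, link_list_add, link_list_join, link_list_free) are hypothetical, and the sketch deliberately leaves out both the parser callback signature and the Motif call, neither of which is shown in this document.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct link_node {
    char             *href;   /* private copy of the link text */
    struct link_node *next;
} link_node_t;

typedef struct {
    link_node_t *head;
    link_node_t *tail;
    size_t       total_len;   /* sum of strlen(href) + 1 for each link */
} link_list_t;

/* Append a copy of href to the list. Returns 0 on success, -1 if
 * memory runs out. This is the work a link callback would do. */
int link_list_add(link_list_t *list, const char *href)
{
    size_t len;
    link_node_t *node;

    assert(list != NULL && href != NULL);
    len = strlen(href);
    node = malloc(sizeof *node);
    if (node == NULL)
        return -1;
    node->href = malloc(len + 1);
    if (node->href == NULL) {
        free(node);
        return -1;
    }
    memcpy(node->href, href, len + 1);
    node->next = NULL;
    if (list->tail != NULL)
        list->tail->next = node;
    else
        list->head = node;
    list->tail = node;
    list->total_len += len + 1;
    return 0;
}

/* Join the links into one string, one link per line.
 * The caller frees the result. Returns NULL if memory runs out. */
char *link_list_join(const link_list_t *list)
{
    char *text = malloc(list->total_len + 1);
    char *p = text;
    const link_node_t *n;

    if (text == NULL)
        return NULL;
    for (n = list->head; n != NULL; n = n->next) {
        size_t len = strlen(n->href);
        memcpy(p, n->href, len);
        p += len;
        *p++ = '\n';
    }
    *p = '\0';
    return text;
}

/* Release every node and reset the list to empty. */
void link_list_free(link_list_t *list)
{
    link_node_t *n = list->head;

    while (n != NULL) {
        link_node_t *next = n->next;
        free(n->href);
        free(n);
        n = next;
    }
    list->head = list->tail = NULL;
    list->total_len = 0;
}
```

In a Motif program the string returned by link_list_join() could then be handed to the text widget, for instance with XmTextSetString(). We assume the real update routine does something of this kind.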

    The myData() routine

    In order to use the parsing engine API, you must include the header file

    #include <SGML.h>

    Since we chose to override the default handler for data returned from the Web, X-Designer generated a stub giving access to the group, the mime-type of the data, the InputStream itself, and the length of the data. Note that the length is not always valid. Quite often the value returned from a web server will be -1.
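
Because the length may be -1, code that buffers the data itself cannot allocate the buffer in one go. The sketch below shows one common way round this: read until end of input and grow the buffer as needed, using the reported length only as a starting hint. We do not show how to read from an InputStream in this document, so the sketch uses a standard FILE* as a stand-in, and the function name read_all is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read everything from in into a NUL-terminated buffer.
 * hint is the reported length, or -1 (or 0) if unknown; it only sets
 * the initial buffer size. The number of bytes read is stored in
 * *out_len. The caller frees the result. Returns NULL on failure. */
char *read_all(FILE *in, long hint, size_t *out_len)
{
    size_t cap = (hint > 0) ? (size_t) hint + 1 : 4096;
    size_t len = 0;
    char *buf;

    assert(in != NULL && out_len != NULL);
    buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        size_t n;

        if (len + 1 >= cap) {            /* keep room for the NUL */
            char *bigger;
            cap *= 2;
            bigger = realloc(buf, cap);
            if (bigger == NULL) {
                free(buf);
                return NULL;
            }
            buf = bigger;
        }
        n = fread(buf + len, 1, cap - len - 1, in);
        len += n;
        if (n == 0)                      /* end of input or error */
            break;
    }
    if (ferror(in)) {
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    *out_len = len;
    return buf;
}
```

In this example the parser engine consumes the stream for you, so no such buffering is needed. It only becomes relevant if you want to keep the raw data as well.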

    Here is the stub of myData() as generated:

    
    int
    myData ( sc_data_t * data )
    {
    	mygroup_t * group = (mygroup_t*)data->group;
    	char      * type  = data->content_type; /* mime type */
    	InputStream i     = (InputStream) data->data;
    	int         len   = data->content_length;
    
    	return 0;
    }
    
    
    

    Using the Parsing Engine

    There are three steps to using the parsing engine:
    Step 1 Get a handle for a parser object. This is just like getting a file descriptor, or a DIR*, eg
    
    SGML_t * sgm = scRegisterHTML( "text/html");
    
    
    Step 2 Register your callback, for example the routine getlinkinfo(), together with its client data, here the string "href". In a real application the client data would more likely be a pointer to your own data.
    
    (void) scAddAttrCallback( sgm,  "A", "HREF",   getlinkinfo, "href");
    
    
    Step 3 Invoke the parser engine, passing it your handle and the InputStream.
    
    (void) scProcessSGML( sgm, i);
    
    

    The final version of myData()

    Here is all the code necessary to parse the HTML stream, as used in the myData() routine of the example:
    
    #include <string.h>   /* for strcmp() */
    #include <SGML.h>
    
    int
    myData ( sc_data_t * data )
    {
    	extern int getlinkinfo();
    	extern int getanchor();   /* assumed to return int, like getlinkinfo() */
    
    	mygroup_t * group = (mygroup_t*)data->group;
    	char      * type  = data->content_type; /* mime type */
    	InputStream i     = (InputStream) data->data;
    	int         len   = data->content_length;
    
    /* the code you would write starts here*/
    
    	SGML_t * sgm;   /* an opaque handle used by the parser. */
    			/* like a FILE* or a DIR* */
    
    	if ( strcmp( type, "text/html") != 0) /* check the mime type of the */
    		return -1;                    /* data coming in */
    
    	sgm = scRegisterHTML( type);         /* get the parser object */
    					     /* for that mime type */
    
    	(void) scAddTagCallback(  sgm,  "A", ON_ENTRY, getanchor, "a-call");
    				/* getanchor() is just an example */
    
    	(void) scAddAttrCallback( sgm,  "A", "HREF",   getlinkinfo, "href");
    				/* getlinkinfo will gather information on links */
    				/* as they appear in the HTML */
    
    	(void) scProcessSGML( sgm, i);
    				/* now process the InputStream i */
    
    	showMyLinks( group);
    		/* now we have parsed the data, display the results in */
    		/* the group->data text widget */
    
    /* that's all */
    
    	return 0;
    }
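
One thing to watch in the MIME type check in myData() above: an exact strcmp() against "text/html" fails if the content type carries a parameter, as HTTP allows (for example "text/html; charset=ISO-8859-1"), or is written in a different case. Whether the content_type field ever arrives in that form is something we have not verified here, so treat the following as a defensive sketch. The helper name media_type_is is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return 1 if content_type names the media type wanted, ignoring
 * letter case, leading white space and any parameters after ';'.
 * Return 0 otherwise, including when content_type is NULL. */
int media_type_is(const char *content_type, const char *wanted)
{
    assert(wanted != NULL);
    if (content_type == NULL)
        return 0;

    while (isspace((unsigned char) *content_type))
        content_type++;

    /* Compare character by character, case-insensitively. */
    while (*wanted != '\0') {
        if (tolower((unsigned char) *content_type) !=
            tolower((unsigned char) *wanted))
            return 0;
        content_type++;
        wanted++;
    }

    /* The type must end here, or be followed by parameters. */
    while (*content_type == ' ' || *content_type == '\t')
        content_type++;
    return *content_type == '\0' || *content_type == ';';
}
```

With this helper the test in myData() could be written as if ( !media_type_is( type, "text/html")) return -1; and the value passed to scRegisterHTML() could stay as the literal "text/html".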
    
    
    

    Notes

  • Document Type Descriptions (DTDs)
  • Mime Types
  • Parser Engine Reference Page