Smart CODE
Your on-line guide to the generated code

Handling HTML in your callbacks

SYNOPSIS

#include <SGML.h>

SGML_t *
scRegisterSGMLMimeType( mimetype, dtd)
	char * mimetype;
	char * dtd;
SGML_t *
scRegisterHTML( mimetype)
	char * mimetype;
int
scRegisterSGMLErrorHandler( handler)
	void     (*handler)();
int
scAddTagCallback( sgm, tagname, type, callback, data)
	SGML_t * sgm;
	char   * tagname;
	int      type;
	void     (*callback)();
	void   * data;
int
scAddAttrCallback( sgm, tagname, attrname, callback, data)
	SGML_t * sgm;
	char   * tagname;
	char   * attrname;
	void     (*callback)();
	void   * data;
int
scProcessSGML( sgm, istream)
	SGML_t * sgm;
	InputStream istream;
environment variable DTDDIR
library -lsgml in the lib directory of your distribution

INTRODUCTION

95% of the data you will fetch from web servers will be in HTML. Much of the time you will either throw it straight at a browser widget or filter it to extract key information that you need.

The SGML library provides a standard, upgradeable mechanism for filtering the input data, so that you don't need to spend time parsing HTML yourself. It is not an ad-hoc parser: it is the reference parser provided by the SGML User Group, and it uses a standard HTML 3.2 DTD (HTML32). As the HTML standard moves on, you can just upgrade the DTD.

What might you use it for?

  • extracting all or part of the text to use in your interface.
  • extracting the essentials of a form, so you can provide a customized interface.
  • getting at link information, or image data.
  • etc

Four steps needed

  1. register the mime-type: scRegisterSGMLMimeType or the shortcut scRegisterHTML
  2. register an error handler [optional]: scRegisterSGMLErrorHandler
  3. register an interest in one or more features of the input stream (although you can just parse and get a traditional parse-tree as output): scAddTagCallback and scAddAttrCallback
  4. call the parser: scProcessSGML

The rest of this page contains reference documentation for this API as well as a worked example.


    DESCRIPTION

    The parsing process is driven by a data structure, the SGML object (SGML_t *), which you set up for a given Mime type. It is created when you register the Mime type.

    Registering the Mime Type

    
    SGML_t *
    scRegisterSGMLMimeType( mimetype, dtd)
    	char * mimetype;
    	char * dtd;
    
    

    This is used to associate a Mime type with an SGML DTD. The most common will be:

    
    SGML_t * sgm = scRegisterSGMLMimeType( "text/html", "HTML32.soc");
    
    

    An alternative that you can use is:

    
    SGML_t *
    scRegisterHTML( mimetype)
    	char * mimetype;
    
    
    which associates the given Mime type (normally text/html) with the HTML32 DTD.

    Registering an Error handler

    The default error handler writes error messages to the standard diagnostic output (stderr). You can override it by registering your own error handler, of the form:
    
    void
    errorhandler( s)
    	char * s;
    
    

    using:

    
    int
    scRegisterSGMLErrorHandler( handler)
    	void_f handler;
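
One reason to replace the default handler is to keep parser complaints out of the diagnostic output and look at them later. The sketch below is one way that might look: a handler of the form shown above that counts messages and keeps a copy of the latest one. The names collecterrors, sgml_error_count and sgml_last_error are made up for this illustration, and the registration call appears only in a comment so that the block compiles without the library.

```c
#include <assert.h>
#include <string.h>

#define LAST_ERROR_MAX 256

static int  sgml_error_count = 0;              /* how many errors so far */
static char sgml_last_error[LAST_ERROR_MAX];   /* copy of the latest message */

/* Same shape as the errorhandler() form above: one message string. */
void
collecterrors(char *s)
{
	assert(s != NULL);
	sgml_error_count++;
	/* Copy the text rather than keep the pointer, in case the
	 * parser reuses its message buffer (an assumption). */
	strncpy(sgml_last_error, s, LAST_ERROR_MAX - 1);
	sgml_last_error[LAST_ERROR_MAX - 1] = '\0';
}

/* With the library linked in, registration would be:
 *     (void) scRegisterSGMLErrorHandler( collecterrors);
 */
```

After scProcessSGML returns, your code can test sgml_error_count to decide whether to trust what the callbacks collected.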
    
    

    Getting at the parsed HTML

    The easiest way to access the data is to register an interest in a particular feature of the input, with a callback that is triggered when that feature occurs. This is much like the event-driven model of user interface programming with XtAddCallback().

    Callbacks on TAG elements

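    There are two sorts of callback you can register. The first is for a TAG. Here you specify one of:
    ON_ENTRY (at the start tag, e.g. <MENU>)
    ON_EXIT (at the end tag, e.g. </MENU>)
    ON_ATTR

    to say when you want your routine to be called. The form of the routine is shown first, then the registration call: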

    
    int
    mycallback( tag, attribute, type, call_data, client_data)
    	char * tag;
    	char * attribute;
    	int    type;
    	void * call_data;
    	void * client_data;
    
    
    int
    scAddTagCallback( sgm, tagname, type, callback, data)
    	SGML_t * sgm;            /* the parser handle from scRegisterSGMLMimeType */
    	char   * tagname;        /* eg "LI" "MENU" "A" */
    	int      type;           /* ON_ENTRY (for <MENU>) ON_EXIT (for </MENU>) and ON_ATTR */
    	void     (*callback)();  /* your routine */
    	void   * data;           /* data you want passed into your routine */
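
As an illustration of using ON_ENTRY and ON_EXIT together, here is a sketch of a callback that tracks how deeply a tag is nested. It is not part of the library: tracknesting and nesting_t are hypothetical names, the ON_ENTRY/ON_EXIT values are stand-ins so the block compiles without SGML.h, and it assumes the type argument passed to your routine is the event it was registered for. The meaning of the callback's return value is not documented here, so returning 0 is a guess.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in values so the sketch compiles on its own; the real
 * definitions come from <SGML.h> and may differ. */
#ifndef ON_ENTRY
#define ON_ENTRY 1
#define ON_EXIT  2
#endif

/* Hypothetical client data: current and deepest nesting seen so far. */
typedef struct {
	int depth;
	int maxdepth;
} nesting_t;

/* Matches the mycallback() form above. Assumes `type` carries the
 * ON_ENTRY/ON_EXIT value the callback was registered with. */
int
tracknesting(char *tag, char *attribute, int type,
             void *call_data, void *client_data)
{
	nesting_t *n = (nesting_t *) client_data;

	(void) tag; (void) attribute; (void) call_data;   /* unused here */
	assert(n != NULL);

	if (type == ON_ENTRY) {
		n->depth++;
		if (n->depth > n->maxdepth)
			n->maxdepth = n->depth;
	} else if (type == ON_EXIT && n->depth > 0) {
		n->depth--;
	}
	return 0;
}

/* With the library linked in, both events would be registered:
 *     nesting_t menus = { 0, 0 };
 *     (void) scAddTagCallback( sgm, "MENU", ON_ENTRY, tracknesting, &menus);
 *     (void) scAddTagCallback( sgm, "MENU", ON_EXIT,  tracknesting, &menus);
 */
```

Passing the same structure as data to both registrations is what lets one routine keep state across the two events.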
    
    
    

    Callbacks on ATTR elements

    The second sort is for an attribute of a TAG. Your routine, which has the same form as mycallback above, is called when the named attribute is seen on the named tag.

    
    int
    scAddAttrCallback( sgm, tagname, attrname, callback, data)
    	SGML_t * sgm;            /* the parser handle from scRegisterSGMLMimeType */
    	char   * tagname;        /* eg "A" or "MENU" */
    	char   * attrname;       /* eg "SRC" or "href" */
    	void     (*callback)();  /* your routine */
    	void   * data;           /* data you want passed into your routine */
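
An attribute callback is usually more useful if it stores what it sees rather than printing it. The following sketch collects attribute values into a fixed-size list handed in as the data argument. The names collectlink and linklist_t and the two size limits are invented for this example, and it assumes call_data is the attribute value as a C string, which is how the getlinkinfo() example on this page treats it.

```c
#include <assert.h>
#include <string.h>

#define MAX_LINKS 64
#define MAX_URL   256

/* Hypothetical client data: a fixed-size list of link targets. */
typedef struct {
	int  count;
	char url[MAX_LINKS][MAX_URL];
} linklist_t;

/* Assumes call_data is the attribute value as a C string. */
int
collectlink(char *tag, char *attribute, int type,
            void *call_data, void *client_data)
{
	linklist_t *links = (linklist_t *) client_data;
	char       *value = (char *) call_data;

	(void) tag; (void) attribute; (void) type;   /* unused here */
	assert(links != NULL);

	if (value == NULL || links->count >= MAX_LINKS)
		return 0;                  /* nothing to store, or list full */

	/* Copy the value: it may not outlive the callback (an assumption). */
	strncpy(links->url[links->count], value, MAX_URL - 1);
	links->url[links->count][MAX_URL - 1] = '\0';
	links->count++;
	return 0;
}

/* With the library linked in, registration would be:
 *     static linklist_t links;
 *     (void) scAddAttrCallback( sgm, "A", "HREF", collectlink, &links);
 */
```

Values longer than MAX_URL - 1 characters are truncated, and links beyond MAX_LINKS are silently dropped; a real program would probably grow the list instead.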
    
    

    Parsing the InputStream

    Finally, you will need to call
    
    int
    scProcessSGML( sgm, istream)
    	SGML_t * sgm;           /* the parser handle from scRegisterSGMLMimeType */
    	InputStream istream;    /* the input stream from the server */
    
    
    
    to parse the document.

    EXAMPLE

    As an example, here is the C stub code generated for a user-provided routine to handle the data returned from a web server:
    
    
    int
    processMyData ( sc_data_t * data )
    {
    	group0_t * group   = (group0_t*)data->group;
    	char      * type   = data->content_type; /* mime type */
    	InputStream i      = (InputStream) data->data;
    	int         len    = data->content_length;
    
    	return 0;
    }
    
    
    
    Here is the same, filled out so that it parses the input stream if it is HTML:
    
    
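    #include <string.h>             /* strcmp */
    #include <SGML.h>
    
    int getanchor(), getlinkinfo(); /* callbacks, defined below */
    
    int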
    processMyData ( sc_data_t * data )
    {
    	group0_t * group   = (group0_t*)data->group;
    	char      * type   = data->content_type; /* mime type */
    	InputStream i      = (InputStream) data->data;
    	int         len    = data->content_length;
    
    	SGML_t * sgm;
    
    	if ( strcmp( type, "text/html") != 0)
    		return -1;
    
    	sgm = scRegisterHTML( type);         /* the parser object */
    
    	(void) scAddTagCallback(  sgm,  "A", ON_ENTRY, getanchor, "a-call");
    	(void) scAddAttrCallback( sgm,  "A", "HREF",   getlinkinfo, "href");
    
    	(void) scProcessSGML( sgm, i);
    
    	return 0;
    }
    
    
    
    This will call your getanchor and getlinkinfo routines as the links are seen in the parsed input:
    
    
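    #include <stdio.h>
    
    int
    getanchor( tag, attr, type, call_data, client_data)
    	char * tag;
    	char * attr;
    	int    type;
    	void * call_data;
    	void * client_data;
    {
    	/* client_data is the "a-call" string given to scAddTagCallback */
    	printf("anchor-start(%s)\n", (char *) client_data);
    	return 0;
    }
    
    int
    getlinkinfo( tag, attr, type, call_data, client_data)
    	char * tag;
    	char * attr;
    	int    type;
    	void * call_data;
    	void * client_data;
    {
    	/* client_data is the "href" string given to scAddAttrCallback;
    	 * call_data is printed as the attribute value */
    	printf( "%s=%s\n", (char *) client_data, (char *) call_data);
    	return 0;
    }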
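
One thing to watch in processMyData is the exact strcmp against "text/html". A server's Content-Type header can differ in case or carry parameters, as in "text/html; charset=ISO-8859-1", and an exact comparison would then reject the document. Whether data->content_type has already been stripped of parameters is not stated on this page, so the helper below is only a sketch under the assumption that it has not. The name ishtmltype is made up for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if `type` names the text/html media type, ignoring case and
 * any parameters such as "; charset=...", else 0. */
int
ishtmltype(const char *type)
{
	static const char want[] = "text/html";
	size_t i;

	assert(type != NULL);
	for (i = 0; want[i] != '\0'; i++) {
		/* A shorter string fails here on its terminating NUL. */
		if (tolower((unsigned char) type[i]) != want[i])
			return 0;
	}
	/* The media type must end here: at the end of the string, at a
	 * parameter separator, or at whitespace before one. */
	return type[i] == '\0' || type[i] == ';' ||
	       type[i] == ' '  || type[i] == '\t';
}
```

In processMyData the test would then read `if ( !ishtmltype( type)) return -1;`.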
    
    

    Doing it the hard way

    If it makes more sense to break the input handling into a parsing and a processing phase, you can get the parser to return a parse tree, which you can then walk at your leisure.

    This is the tree data structure that is returned:

    
    typedef struct snode_s {
    	char         * s_tag;
    	sattribute_t * s_attributes;
    	int           numchildren;
    	union {
    		struct snode_s * children;
    		sdata_t        * data;
    	} body;
    	struct snode_s * s_stackprev;
    	struct snode_s * s_next;
    	stag_t *         s_ref;
    } DOCtree_t;
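
To give an idea of what walking the tree might look like, here is a sketch that counts the nodes below a given node. It rests on assumptions this page does not confirm: that when numchildren is greater than 0, body.children points at the first child and the remaining children are chained through s_next, and that a node with numchildren of 0 is a leaf holding data. Check the library headers before relying on it. The function countnodes is hypothetical, and the three opaque typedefs are stand-ins so the block compiles without the library.

```c
#include <assert.h>
#include <stddef.h>

/* Opaque stand-ins: the real definitions live in the library headers. */
typedef struct sattribute_s sattribute_t;
typedef struct sdata_s      sdata_t;
typedef struct stag_s       stag_t;

typedef struct snode_s {            /* copied from the declaration above */
	char         * s_tag;
	sattribute_t * s_attributes;
	int           numchildren;
	union {
		struct snode_s * children;
		sdata_t        * data;
	} body;
	struct snode_s * s_stackprev;
	struct snode_s * s_next;
	stag_t *         s_ref;
} DOCtree_t;

/* Counts the nodes in the tree rooted at `node`.
 * ASSUMPTION: children form a list starting at body.children and
 * linked through s_next; numchildren == 0 marks a leaf. */
int
countnodes(DOCtree_t *node)
{
	DOCtree_t *child;
	int        total = 1;           /* this node */
	int        seen  = 0;

	assert(node != NULL);
	if (node->numchildren <= 0)
		return total;

	for (child = node->body.children;
	     child != NULL && seen < node->numchildren;
	     child = child->s_next, seen++)
		total += countnodes(child);

	return total;
}
```

The same recursion is the skeleton for anything else you want to do per node, such as printing s_tag with indentation.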
    
    
    

    ACKNOWLEDGEMENTS

    The engine for the parser is the SGML reference parser written by James Clark as part of the sgmls program. We took the final C version, from before the software was rewritten in C++, as our starting point and adapted it to add InputStream and event handling.

    Here is an extract of the license for this software, from the SGML User Group. The full text is included with the sources.

          Standard Generalized Markup Language Users' Group (SGMLUG)
    			 SGML Parser Materials
    
    			      1. License
    
    SGMLUG hereby grants to any user: (1) an irrevocable royalty-free,
    worldwide, non-exclusive license to use, execute, reproduce, display,
    perform and distribute copies of, and to prepare derivative works
    based upon these materials; and (2) the right to authorize others to
    do any of the foregoing.
    
    [...]
    
    (d) SGMLUG has no knowledge of any conditions that would impair its right
    to license the SGML Parser Materials.  Notwithstanding the foregoing,
    SGMLUG does not make any warranties or representations that the
    SGML Parser Materials are free of claims by third parties of patent,
    copyright infringement or the like, nor does SGMLUG assume any
    liability in respect of any such infringement of rights of third
    parties due to USER's operation under this license.