Smart CODE
Your on-line guide to the generated code
Using Web Content. HTML and beyond. A worked example
This example takes the fetch application one stage further. We have
already designed an application that will retrieve the raw HTML for any URL
on the World Wide Web. Now we want to use the information in that document
for our own purposes.
HTML was used as the format for Web data for one very simple reason: content is marked up in terms of what it is, not how it should look. The original intention of the Web was a vast hypertext information repository. A web page would contain structured information (marked up according to the HTML DTD), and the Web's architects did not want people to have to target their pages at a particular viewing mechanism. The point of the Web is that it is a vast hypertext information repository, not a set of ready-typeset pages viewable only by a particular patented browser.
For a long time this has only been of real interest to developers of web browsers, although one advantage of marking up content in terms of what it is, rather than how it should look, is that it is far easier to write a page of HTML by hand than to produce a similar page marked up in TeX, *roff, or other display-oriented notations.
With X-Designer 7, any application can use resources on the Internet as easily as resources on your local machine. But it is one thing to download a URL as an HTML document; it is quite another to process that document and use the information to meet your own needs, rather than simply displaying it the way a web browser would.
Getting at the information quickly
The first thing anyone using Internet resources will need is a way of
parsing HTML.
We have taken the reference SGML parser materials from the SGML User Group, which have very generous license provisions, and adapted them to produce a general purpose SGML parser engine. It is driven directly from the HTML DTD, so you can update the parsing by updating the DTD; no recompilation is necessary. You can also use it with other standard and in-house DTDs that may be in use on your Intranet.
Parsing is difficult, and even taking the parse tree output of a parser and analysing it is not the work of a few minutes. We have simplified the process by treating the data input stream coming from the Web in the same way as the user input stream. Your application handles the user's input by expressing interest in certain key events caused by the user; each time one of these events happens, your callback is invoked. You can access the HTML data coming from the Internet in exactly the same way.
Picking interesting stuff out of the HTML InputStream
All you have to do is decide what features of the HTML interest you, and register
your own callback that will be invoked when the particular tag or
attribute shows up in the input stream.
This example picks out all the links in a document and displays them in a text widget. To do this it expresses an interest in the HREF attribute of an <A> anchor. It doesn't care about any other part of the input, so it doesn't waste time on any of it. The link references are plucked out of the stream as they are seen.
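Using the calls from the parser engine API that appear in the full listing at the end of this page, expressing that interest boils down to a single registration call. The fragment below is a minimal sketch lifted from that listing; getlinkinfo() is the example's own callback routine, described later on this page.

    SGML_t * sgm = scRegisterHTML( "text/html");      /* a parser object for HTML data      */

    (void) scAddAttrCallback( sgm, "A", "HREF", getlinkinfo, "href");
                                                      /* invoke getlinkinfo() each time an  */
                                                      /* HREF attribute of an <A> anchor    */
                                                      /* appears in the input stream        */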
Environment Variables etc
This example is the only one that uses code that has been precompiled. The sources
are available in $XDROOT/src/sgml, and the license provisions mean that you can
do anything you want with them. However, we recommend that you use our
precompiled engine when you are starting out.
There are some files and directories you should know about. As always, you should make sure that $XDROOT/bin is in your path. Here is a summary of the variables you will need to set up before working through the example:
XDROOT |
LM_LICENSE_FILE |
DTDDIR | should be set to $XDROOT/src/sgml/dtds
LD_LIBRARY_PATH | should include $XDROOT/lib
Step 1 | Create a new directory called, for example, htmlfilter
Step 2 | Change directory to htmlfilter
Step 3 | Run xdesigner using the xdreplay script $XDROOT/src/examples/sc/hfetch.vcr
This will create the example application, generate the code, copy standard versions of the callbacks you would have written into the appropriate place, and invoke make to build the program.
First, check that you have set up the environment variables listed above. If there is a
problem, check that they are set correctly.
Run the example application from the command line. It behaves exactly like the previous HTML fetch program, except that all that is displayed in the text widget is the filtered set of links from the URL.
If you are behind a firewall, set the proxy field to the name of your proxy and the port field to the port number, just as you would to configure a proxy for your web browser.
To fetch and process a page, type the full URL into the URL field and press the Fetch button.
The processed information will be placed into the text widget.
All of the explanations in the original HTML
fetch example apply here too.
The interface structure, the groups, and the callbacks are the same.
The only difference is that in the customiser the "needs SGML" toggle has been set. This makes sure that the include directories and the library settings in the Makefile are adjusted to build and link the application with HTML parser support.
Processing the incoming HTML
This is the only significant change from the previous example, where we simply
read the input into a buffer and displayed it in the data text widget.
The replay/build script copies a fixed, working, and fully commented version of this file into the callouts_c directory.
There are two parts to the file callouts_c/myData.c.
The first is the myData() function, which checks the Mime type of the incoming data, sets up a callback on the HREF attribute of an <A> anchor, calls the parser engine, and then calls a routine to update the data area. This is the most important part of the example, as it shows how you can process a data input stream as easily as you would set up a callback in X/Motif.
The second part of the example is the callback itself, which adds the link data to a list, and the update routine that takes the list and displays it in the data text widget.
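The callback and the update routine are not reproduced in the listing below, so here is a minimal sketch of what that second part might look like. The callback prototype, the link list, and the use of XmTextSetString()/XmTextInsert() on group->data are assumptions made for illustration (and the sketch relies on the mygroup_t type from the generated code); the fully commented file copied into callouts_c is the definitive version.

    #include <Xm/Text.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct link_s {                           /* a simple list node holding one     */
        char *          href;                         /* HREF value                         */
        struct link_s * next;
    } link_t;

    static link_t * links      = NULL;                /* links gathered during the parse,   */
    static link_t * links_tail = NULL;                /* kept in document order             */

    int
    getlinkinfo ( char * value, void * client_data )  /* assumed prototype: the engine is   */
    {                                                 /* presumed to pass the attribute     */
        link_t * l;                                   /* value and the registered client    */
                                                      /* data                               */
        l = (link_t *) malloc( sizeof( link_t));
        l->href = strdup( value);                     /* remember this link                 */
        l->next = NULL;
        if ( links_tail != NULL)
            links_tail->next = l;
        else
            links = l;
        links_tail = l;
        return 0;
    }

    void
    showMyLinks ( mygroup_t * group )                 /* mygroup_t comes from the           */
    {                                                 /* generated code                     */
        link_t * l;

        XmTextSetString( group->data, "");            /* clear the data text widget         */
        for ( l = links; l != NULL; l = l->next) {    /* add each link on its own line      */
            XmTextInsert( group->data, XmTextGetLastPosition( group->data), l->href);
            XmTextInsert( group->data, XmTextGetLastPosition( group->data), "\n");
        }
    }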
The myData() routine
In order to use the parsing engine API, you must include the header file SGML.h.
Since we chose to override the default handler for data returned from the Web, X-Designer generated a stub giving access to the group, the mime type of the data, the InputStream itself, and the length of the data. Note that the length is not always valid; quite often the value returned by a web server will be -1.
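The generated stub itself is not reproduced here, but its shape can be read off the final listing later on this page; everything above the "code you would write" comment is generated for you, so the stub presumably looked roughly like this:

    int
    myData ( sc_data_t * data )
    {
        mygroup_t * group = (mygroup_t*)data->group;  /* the group structure                */
        char *      type  = data->content_type;       /* mime type of the incoming data     */
        InputStream i     = (InputStream) data->data; /* the InputStream itself             */
        int         len   = data->content_length;     /* may well be -1, as noted above     */

        /* the code you would write starts here */

        return 0;
    }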
Using the Parsing Engine
There are three steps to using the parsing engine, pulled together in the skeleton after this list:
Step 1 | Get a handle for a parser object. This is just like getting a file descriptor or a DIR*.
Step 2 | Register your callback, for example the routine getlinkinfo(), together with the client data "href". In a real application the client data would more likely be a pointer to your own data.
Step 3 | Invoke the parser engine, passing it your handle and the InputStream.
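Leaving out the mime-type check and the surrounding detail, the three steps reduce to the following sketch; the complete routine in the next section shows them in context:

    SGML_t * sgm;

    sgm = scRegisterHTML( type);                      /* Step 1: get a handle for a         */
                                                      /* parser object                      */
    (void) scAddAttrCallback( sgm, "A", "HREF", getlinkinfo, "href");
                                                      /* Step 2: register the callback      */
                                                      /* with its client data               */
    (void) scProcessSGML( sgm, i);                    /* Step 3: run the engine over the    */
                                                      /* InputStream                        */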
The final version of myData()
Here is all the code necessary to parse the HTML stream, as used in the myData() routine of the example:
#include <string.h>                              /* for strcmp()                      */
#include <SGML.h>

int
myData ( sc_data_t * data )
{
    extern int getlinkinfo();                    /* gathers link information          */
    extern int getanchor();                      /* shown only as an example of a     */
                                                 /* tag callback                      */

    mygroup_t * group = (mygroup_t*)data->group;
    char *      type  = data->content_type;      /* mime type                         */
    InputStream i     = (InputStream) data->data;
    int         len   = data->content_length;

    /* the code you would write starts here */

    SGML_t * sgm;                                /* an opaque handle used by the      */
                                                 /* parser, like a FILE* or a DIR*    */

    if ( strcmp( type, "text/html") != 0)        /* check the mime type of the        */
        return -1;                               /* data coming in                    */

    sgm = scRegisterHTML( type);                 /* get the parser object for that    */
                                                 /* mime type                         */

    (void) scAddTagCallback( sgm, "A", ON_ENTRY, getanchor, "a-call");
                                                 /* getanchor() is just an example    */
                                                 /* of a tag callback                 */

    (void) scAddAttrCallback( sgm, "A", "HREF", getlinkinfo, "href");
                                                 /* getlinkinfo() will gather         */
                                                 /* information on links as they      */
                                                 /* appear in the HTML                */

    (void) scProcessSGML( sgm, i);               /* now process the InputStream i     */

    showMyLinks( group);                         /* now we have parsed the data,      */
                                                 /* display the results in the        */
                                                 /* group->data text widget           */

    /* that's all */
    return 0;
}