LibreOffice HTML tidy-up app
Clive Semmens (2335) 3276 posts |
Having to load the html file into StrongED, and save it again from there, is two extra operations. I’ve got Moonfish running so the Mac can save a great heap of html files into a directory on the Pi’s hard drive, then I can drag and drop them onto XP1LO2web, and the Mac can see the results immediately. I actually do most of the editing of html files in Atom on the Mac, but doing bulk tidying up like this is far better on the Pi. Another thing I do on the Pi is vector graphics, using !Draw and a number of BASIC programs I’ve written. I like the Draw app, and I particularly like the Draw file format. One of my BASIC programs translates Draw files into SVG for use on the Mac and on the Web. |
Steve Pampling (1551) 8170 posts |
It was the absence of a Well, being fair, there doesn’t seem to a |
Clive Semmens (2335) 3276 posts |
I don’t think I’ve ever seen an actual html file without both and tags before, but I could easily make it not look for the latter in the absence of the former if it’s really an issue.
|
Steve Drain (222) 1620 posts |
Likewise, and with the same apologies to Clive, here is the complete !RunImage for an application using BASIC with Basalt:

*BasaltInit
REGM @Message_DataLoad FNLoad
REGT @SaveAs_SaveToFile FNSave
INITIALISE "Resources"
POLL
END

REM event handlers ------------------------------
DEFFNLoad
IF BLOCK!(40)=&FAF THEN:REM filetype html
  (HTML$)=LOAD$(BLOCK$(44))
  PROCReplace(HTML$,"<head>.*</head>","")
  PROCReplace(HTML$,"<body[^>]*","<body")
  PROCReplace(HTML$,"ICON=""\a*""","")
  PROCReplace(HTML$,"<p[^>]*","<p")
  PROCReplace(HTML$,"<h(\d)[^>]*","<h\1")
  PROCReplace(HTML$,"<!--.*--->","")
  'SaveAs.ShowObject(4,Iconbar,-1)
ENDIF
=0

DEFFNSave
SAVE (HTML$),BLOCK$(16) OF &FAF:REM save as html
'SaveAs.HideObject()
=0

REM support routines ----------------------------
DEFPROCReplace(RETURN html$,search$,replace$)
SEARCH (html$),search$
WHILE GROUPS
  (html$)=REPLACE$(replace$)
  SEARCH (html$),search$,GROUP(0,1):REM continue after found
ENDWHILE
ENDPROC

This does require |
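For readers without Basalt, the same chain of substitutions can be sketched in Python. This is only an illustration, not a port: the patterns are assumed Python equivalents of the ones in the post (the Regex module's syntax may differ), `tidy_html` and `RULES` are names made up here, and two details are deliberately changed, namely `\b` after `<p` so that `<pre>` is left alone, and non-greedy `.*?` so that a match stops at the first closing marker.

```python
import re

# Assumed Python equivalents of the patterns in the post above.
# Each entry is (pattern, replacement); they are applied in order.
RULES = [
    (r"<head>.*?</head>", ""),     # drop the whole head section
    (r"<body[^>]*", "<body"),      # strip attributes from <body>
    (r'ICON="[^"]*"', ""),         # remove ICON="..." attributes
    (r"<p\b[^>]*", "<p"),          # strip attributes from <p>, but not <pre>
    (r"<h(\d)[^>]*", r"<h\1"),     # strip attributes from <h1>..<h6>
    (r"<!--.*?-->", ""),           # remove comments
]


def tidy_html(html: str) -> str:
    """Apply every rule in turn to the whole document."""
    for pattern, replacement in RULES:
        # DOTALL lets '.' span line breaks; IGNORECASE copes with <BODY> etc.
        html = re.sub(pattern, replacement, html,
                      flags=re.DOTALL | re.IGNORECASE)
    return html
```

Keeping the rules in a table means a new clean-up step is one extra line rather than a new routine.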
Rick Murray (539) 13840 posts |
<spits tea across the room> Kind of highlights what the other thread is talking about with the development environment being….lacking. Those three lines cover a wodge of normal code, and indeed your program itself is almost alien compared to standard BASIC (that’s RegEx, right?). I’m just trying to imagine how much boring boilerplate would be necessary to do that as a standard BASIC program, and without the pattern matching niceness, how big the replace function would need to be to do what those few lines do. Why isn’t this part of RISC OS?! |
Clive Semmens (2335) 3276 posts |
You can see all that in my app’s !RunImage – which has all that “boilerplate” and the big replace function… What does puzzle me is how Basalt provides for all the complicated clever stuff you can do by adjusting the details of the “boilerplate.” This particular application doesn’t do anything clever with the WIMP, but there’s lots you can do – and which I have done in other apps. (But most of my apps are just drag a file, it’s processed, and delivers another for you, without options.) |
Steve Drain (222) 1620 posts |
The Regex module has been around for a couple of decades and it is used by Basalt. The keywords make it relatively simple to access, but you could equally write BASIC routines to do much the same with the SWIs. The |
Steve Pampling (1551) 8170 posts |
“The” regex module? I thought there was more than one ported version. Be that as it may, integrating a version with a permissive licence [1] really ought to be on someone’s porting list so the facility is in the OS for every user. Wonderful stuff regex, there’s a script running at work using an MS version. Been running several times a day for about 15-16 years. I’d hate to have had to write an extended version that did what the regex call does.
[1] Mr Bellons port was a GPL item as I recall |
Steve Drain (222) 1620 posts |
I only know of one, which was written by Neil Bird and updated by Steffan.
I think the Regex Library is GNU to the core. ;-) |
Steve Pampling (1551) 8170 posts |
I think that regex may be GPL to the core. However, since the regex concept and implementations existed before GPL, any GPL claims might fall flat no matter what Stallman might say. Like I said “wonderful stuff”. |
Steve Drain (222) 1620 posts |
I have looked back at the StrongHelp version of the Regex Library manual that I made. It does support GNU, but also supports POSIX and UNIX versions, so I was overstating things. I also made a SH manual for the Regex module itself. |
Clive Semmens (2335) 3276 posts |
Steve Drain: your nice Basalt program does some things my app doesn’t do (remove comments, in particular, I notice – mine doesn’t remove comments, because LibreOffice doesn’t create any). But there are some complicated search and eliminate operations that my app does, and while I’m sure Basalt could do them more concisely than I’ve done it in BASIC, it’s not immediately obvious how.

You’re selling Basalt very well – but I’m not sure who your target market is. Having a huge pile of boilerplate in my app is of no consequence to me: even a RISC PC wouldn’t have cared about a few kb of boilerplate, and the Pi certainly doesn’t; and I know how to edit it if I want to do something a bit different, whereas if I wanted to do something a bit different in Basalt I’ve got a learning curve to tackle. Maybe it would be a small learning curve, but what for? Similar comments about the actual guts of the program: saving a few lines, or even a few dozen lines, in a program is irrelevant.

What I would be interested in would be making my app more widely useful, if anyone actually wants to use it. If others want to go off and write their own in the language of their choice, fine! To make it more widely useful, samples of html generated by other word processors (I only have LibreOffice) are what I need.

A few examples (not all) of the kind of slightly more complicated operations my app does:
1) It looks for span tags that specify GB English or US English, and removes them, and the corresponding end tags – remembering that there could be intervening spans, so it’s not necessarily the next </span> you have to remove; also making sure that you don’t remove span tags specifying other languages;
2) It looks for font tags that specify Times face or Black colour, and removes them and the corresponding end tags, without removing other font tags or their end tags.
3) It looks for footnotes, checks whether the link characters (from the text to the footnote and vice versa) are present or not, and inserts them if they’re missing. A LibreOffice quirk…sometimes it puts those link characters in, sometimes it mysteriously omits them!

I’m sure all these things could be done perfectly well in Basalt, but I know how to do them in BASIC. It does these particular operations because these are the particular bits of crap produced by LibreOffice that obfuscate its HTML output, or are too prescriptive about how the material should appear on a web page. Samples of HTML produced by Microsoft Word might enable me to clear out crap from them, too, if anyone wants that. It’s irrelevant to me personally, since I don’t use Microsoft Word. |
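The first of those operations, removing a span and its own end tag while leaving nested spans alone, is the sort of thing a single search-and-replace pattern handles badly. A minimal Python sketch of one way to do it with a stack follows. It is not Clive's BASIC code: the function name is invented here, and it assumes the unwanted spans are marked with `lang="en-GB"` or `lang="en-US"`, which may not be exactly what LibreOffice writes.

```python
import re

# Matches any opening or closing span tag; group 1 is "/" for an end tag.
TAG = re.compile(r"<(/?)span\b([^>]*)>", re.IGNORECASE)
# Assumption: the spans to remove carry lang="en-GB" or lang="en-US".
UNWANTED = re.compile(r'lang="en-(GB|US)"', re.IGNORECASE)


def strip_english_spans(html: str) -> str:
    out = []     # pieces of the output document
    stack = []   # one flag per open span: True if that span is being removed
    pos = 0
    for m in TAG.finditer(html):
        out.append(html[pos:m.start()])   # text between tags is always kept
        pos = m.end()
        if not m.group(1):                # opening tag
            drop = bool(UNWANTED.search(m.group(2)))
            stack.append(drop)
        else:                             # end tag: pairs with the latest open span
            drop = stack.pop() if stack else False
        if not drop:
            out.append(m.group(0))
    out.append(html[pos:])
    return "".join(out)
```

Because each end tag is paired with the most recently opened span, an intervening `<span class="x">…</span>` keeps both of its tags, and spans for other languages are never touched. The same stack would serve for the font tags in item 2, with a different test on the attributes.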
Grahame Parish (436) 481 posts |
Word definitely does add huge swathes of redundant crap in its HTML output – it was the first thing I thought of when I started reading this thread. I don’t make use of Word for HTML production myself, but I’ve seen a lot of it… |
Steve Drain (222) 1620 posts |
The regex patterns are identical to the ones Gavin put in his Lua program. The main point of posting the program was to show how short a simple drag-and-save application could be. It is worth recalling that Acorn had FrontEnd to provide just this facility for running utilities on the desktop, and there have been other similar applications. Basalt itself is a bit of a dead end, because I tied the assembly to 26-bit RISC OS and the re-factoring for 32-bit has proved to be more than I can manage. There is a lesson here for a parallel topic. On the other hand, it does illustrate how BASIC could look at some unfathomable time in the future.;-) For myself, I now use BASIC libraries to support the Toolbox. These are even better at abbreviating a program than Basalt. With the current speed of processors there is no real advantage in machine code apart from doing things that cannot quite be achieved in interpreted BASIC without huge convolutions. |
Steve Drain (222) 1620 posts |
I think Gavin would agree, it is not Basalt that does the work, but Regex. Lua has it fully integrated, Basalt uses the module, but a BASIC program can also do it. To counter that, regex is a hill of its own to climb. ;-) |
Clive Semmens (2335) 3276 posts |
I realized that. A bit more elaborate than what I did to get rid of the ICON= garbage in NetScape Bookmarks files for Steve Pampling, but a good deal less elaborate than what I originally did to tidy up LibreOffice’s HTML output – or what I might do to tidy up Microsoft Word’s. |
John WILLIAMS (8368) 493 posts |
Two observations: ConvText from Paul Sprangers is a very useful application to have in one’s armoury, and Easi/TechWriter is not without sin with its HTML output last time I looked! |
Clive Semmens (2335) 3276 posts |
If anyone using Easi/TechWriter would like my app to handle its wicked HTML, feel free to send a sample 8~). Do tell what ConvText does, and where to get it! |
John WILLIAMS (8368) 493 posts |
It lets a script (a named part of a larger script file) apply deletions and substitutions to a file, which need not be a text file. I’ve used it to clean up downloaded PHP scripts, which invariably have hard spaces and other undesirables, as well as bank statements, DTP to use in e-mails, etcetera. Latterly I’ve used it for converting the two-character accent representations in Zoom chat files back to proper 8-bit characters. Its web page (URL:http:www.riscos.sprie.nl/) seems not to have been renewed and has fallen off the web. Unfortunately it is compiled with ABC, so if it is lost, it’s really lost! Paul has posted here, last in 2013 if the search is correct, and then, unfortunately, about the ABC compiler! |
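For anyone who cannot get hold of ConvText, the general idea of a table of deletions and substitutions applied to raw bytes is easy to sketch in Python. Nothing here reflects ConvText's real script format: `RULES`, `apply_rules` and `utf8_to_latin1` are hypothetical names, and the accent conversion assumes the "two character" representations are UTF-8 sequences being turned back into Latin-1.

```python
# Hypothetical rule table: (bytes to find, bytes to substitute).
RULES = [
    (b"\xc2\xa0", b" "),   # UTF-8 encoded no-break (hard) space -> plain space
    (b"\r\n", b"\n"),      # CRLF line endings -> LF
]


def apply_rules(data: bytes, rules=RULES) -> bytes:
    """Apply each substitution in order; works on any file, text or not."""
    for find, replace in rules:
        data = data.replace(find, replace)
    return data


def utf8_to_latin1(data: bytes) -> bytes:
    """Assumption: two-byte accents are UTF-8; re-encode them as 8-bit Latin-1.

    Characters with no Latin-1 equivalent become '?'.
    """
    return data.decode("utf-8").encode("latin-1", errors="replace")
```

Working on bytes rather than strings keeps the rules usable on files that are not strictly text.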
Chris Johnson (125) 825 posts |
I have used ConvText for many years as a major tool in converting the DDF output from Publisher to HTML, when producing the HTML version of Archive magazine for the late Jim Nagel, and previously for a short time for Paul Beverley. It works very well. It is easy to modify the scripts (necessary in my case because the Ed of Archive had a habit of introducing new styles with every issue!). |
Steve Pampling (1551) 8170 posts |
I think it’s a mindset. |
John WILLIAMS (8368) 493 posts |
As opposed to a minefield! |
Clive Semmens (2335) 3276 posts |
My email’s on my web page, John – http://clive.semmens.org.uk/index.html Sounds as though ConvText was quite similar to the filter app I wrote for the Physiology journals – it made HTML from DDF output from Publisher too 8~) – my filter is gone now too, though, until I update the assembler core from 26-bit…which I’m unlikely to bother to do, really. |
GavinWraith (26) 1563 posts |
Easi/TechWriter’s conversion to HTML unfortunately treats every embedded graphic as distinct, so that one tends to end up with lots of identical GIF files. In 2002 I wrote an application !Minim to clear up the mess and remove repeated files. It used AWK. Then in 2006 I wrote an updated application, which I renamed !FewerGif, which used Lua. There is a reason why AWK and Lua are more suitable for this job than BASIC or C: in AWK and Lua (and in most functional languages) string-comparison is optimized over string-updating – indeed you cannot update strings at all in AWK or Lua. !Minim and !FewerGif are of specialized applicability; if you want a copy let me know. When people say regex what they usually mean is pattern-matching or data-capture. The term regex abbreviates the phrase regular expression, which is actually not technically correct, as not all of the patterns which are matchable by PERL REGular EXpressions are regular in the sense of the word as originally defined by Kleene (i.e. recognizable by a finite-state machine). Unfortunately Kleene’s notions were unrealistically mathematical; I mean that his definitions were based on sets (notoriously a problem for computing) rather than on ordered sets. PEGs, or Parsing Expression Grammars, recognize a wider class of patterns than regular ones, and can be implemented more efficiently. Yes, it’s a minefield. |
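The duplicate-graphic clean-up described above can be sketched in Python by comparing file contents through a hash. This is not how !Minim or !FewerGif work internally, which the post does not describe; the function names are invented here, and the sketch assumes the HTML refers to each image by its bare file name in double quotes.

```python
import hashlib
from pathlib import Path


def find_duplicates(directory: str, pattern: str = "*.gif") -> dict:
    """Map each duplicate file name to the first file with identical content."""
    first_seen = {}   # content digest -> name of first file with that content
    duplicates = {}   # duplicate file name -> name of the file to keep
    for path in sorted(Path(directory).glob(pattern)):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in first_seen:
            duplicates[path.name] = first_seen[digest]
        else:
            first_seen[digest] = path.name
    return duplicates


def rewrite_references(html: str, duplicates: dict) -> str:
    """Point quoted references to duplicate files at the copy being kept."""
    for dup, keep in duplicates.items():
        html = html.replace(f'"{dup}"', f'"{keep}"')
    return html
```

Once the references have been rewritten, the files named as keys in the returned dictionary are no longer needed and can be deleted.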
Steve Fryatt (216) 2105 posts |
There’s also ProcText, which is used in the production of The WROCC each month. Since it was a direct competitor to ConvText I never bothered to polish it up and release it, but these days it should be a reasonable DIY proposition for anyone with access to the GCCSDK. |