Alpha projects series

There are next projects:

  • HTML extracting by VB/VBA with IE and MS Excel
  • PDF2JPEG+PPT2JPEG conversion tool (VBA)
  • AllConferences downloader and extracter (Perl)

Using vbscript and vba to combine web pages in excel

http://www.getafreelancer.com/projects/Visual-Basic-Excel/using-vbscript-vba-combine-web.html

This is first project from this series. I called those project ~100$ as low-budget. But sometimes those projects are interested for me. And after I made this project I received more projects from the same customer, and some of them also were interested.

So, this project consist of three parts: downloader.vbs, checker.xls, and finalize.xls. Latest 2 units were made in MS VBA in Excel.

downloader.vbs

First script – downloader.vbs is a VB-script working in console mode. It use Internet Explorer in active mode. Using COM-interface it open the IE, then go to URL, then it send to it commands “Ctrl+A” (select all) and “Ctrl+C” (copy into clipboard).
The it created the Excel Spreadsheet and insert data into it by command “Ctrl+V” (paste).

So this script use a whole computer resources. So it's needed to use a standalone PC running Windows XP to run this script while it'll not ended.

Ok, as result it create a big MS Excel workbook consist of one or more spreadsheets.

checker.vba

This script work inside the MS Excel in background mode. So the resources of computer are free when the script work. It do the check between the old data and the new data downloaded by downloader.vbs. The repeating information is destroyed, and new – not.

Checker work inside VBA virtual machine. So it do it slowly. It can be several hours on core2duo processor.

Note: I prefer to use Perl for Win32 for similar operations. But in this case – I agreed with my customer to use very slow VBA machine under MS Excel. The advantage of VBA – here is a good debugger. And Perl have no debugger. Though it works in 5-6 times faster.

finalize.vba

This is the final script in this set. It take all rows made by checker.vba and added them into the big table made from previous sessions. After adding it sort the rows. The end.

Usage

Three scripts used to create the spreadsheet from the file “links.ini”.

“Links.ini”: one row = 1 URL

Spreadsheet “AllPages” have the next format (each column contains):

  • URL of the row in the table, usually there are many rows with the same url
  • Date of downloading in format YYYYMMDD (the date when 3 scripts were running)
  • Link# – the number of link in the spreadsheet “Links”
  • Local link – is the URL of the downloaded page on the local HDD
  • Here is columns “E:X” – the data from HTML page, as it was Copy-Pasted by the program from IE. So the original format of HTML-page saved and URL-s as well.
  • Column “Y” – the control sum of all symbols from “E:X” interval of cells. It used to optimize the work of the script “checker.vba” – if rows have a different control sum – they are different.

Spreadsheet “Links” have the next format:

  • Link# – the autoincrement column, it updated each time when the new one link from “Links.ini” is added. So “Links.ini” could be today one, tomorrow – another. The program calculate the URLs itself.
  • The URL
  • Validity – always “yes” (this is a reserved parameter)
  • Download status – “new” means it was downloaded in latest session, “same” means it's the old URL

Other spreadsheets: due of Excel limit on 65535 rows the script finalize.vba could to split the big spreadsheet into small like “allpages”, “allpages011”, etc.

This thing is used here: if the maximum rows is set the checker.vba try to found “AllPages_Link#” when it work with link#, for example when it work with link “12” it try to found this link first in “AllPages”, then in “AllPages012”, and so on.

The “finalize.vbs” — create new spreadsheets after the last spreadsheet have the rows number more then the limit.

So this package can work with unlimited set of links. I'm not sure how the MS Excel itself will work with 2-5 spreadsheets with 60000 rows in each. I think very slow even on Quad Core Pentium :-).

PDF2JPEG+PPT2JPEG conversion tool (VBA)

There is one VBA-script running in MS Powerpoint program.
PPT+PDF 2 IMG convertor. Written by YetiDi at December 2008. Readme.txt:

Usage

  1. Open pdf2img_unit.ppt
  2. Alt+F11
  3. Press F5 to Run an application

It seems this way is most quick and simple. It's because of Powerpoint presentation work in construct mode by default. So the button I made to run the UserForm – is inactive.

In the program

  1. Select appropriate resolution in Dpi
  2. Click “Open”
  3. Select “*.ppt” or “*.pdf” file
  4. The program will convert a whole document into the subdirectory “FileName_img”.
    Each Slide will have a name “Slide#.jpg”.

Note: before to use it for PDF 2 IMG convertion you should to install Freeware Aladdin Ghostscript driver. It allow to convert any PDF or EPS/PS file to any image with any resolution. The GhostScript is a cross-platform application used mostly in UNIX-based applications.

Ghostscript installation

  1. Run it to install
  2. Place the full path to 4th row of ppt2img.ini, for example “C:\Program Files\gs\gs8.63\bin\gswin32.exe”.

About MagicDpi value

MS Powerpoint use constant value of dpi for all documents. For example, A4 in MS Powerpoint 2002 (XP) have size 540×720 (there is 72 dpi). This moment is described here – http://www.pptfaq.com/FAQ00052.htm

AllConferences downloader and extracter

This project was made in Perl for Win32. So it work on the PC running Windows XP. It consist of two parts: the wget.bat (WGet downloader running console mode is used) and update.bat (the Perl-script also running in console mode).

The “wget.bat” make the mirror of the website “AllConferences.com” on the local hard disk drive (hdd). This work can proceed many hours based on WAN connection speed of the computer where it used.

The “update.bat” make the full syntax processing of all downloaded pages and create the big table with the data extracted from each conference.

Full project is packed into the self-installing package by InnoSetup. So user could use it in a set of mouse-clicks.

The confs project Readme.txt

Confs_project made by YetiDi at 2009-03-01 (info@yetidi.net)

I. Installation

Install the package – run confs_setup. InnoSetup for the confs project

All needed files will be unpacked into the directory: c:\Confs.

Perl – the Perl 5.0 for Win32 with the script
db – the web-spider wget + the 68Mb-sized part of the site allconferences.com
bat – the bat-files used for this work.

II. Make the mirror of the website

How to make a mirror of conferences website – run wget.bat and wait while it'll completed (this can be about 1-2 hours on the 1Mbit/s Internet connection).

If you want you could use another program to do it for example Teleport Ultra. It's not freeware and works quicker then freeware wget.

Anyway the mirrored website should be in the path: C:\Confs\db\all.conferences.com\

III. Make update

There are two variants:
Create_full.bat – all YYYY.htm(txt) files recreates from zero.
Update.bat – it use file “db\keyids.txt” with the keys catching from the last full/or not update. if any new keys will be found – the files upd_YYYY.htm(txt) will be generated.

IV. Quick start

  1. Install the package
  2. Run wget.bat (it's can be long)
  3. Run create_full.bat (about 10-15 minutes on the Intel Pentium Dual-Core 1.8GHz)

V. Output files

all.txt – full list of conferences in one big text file with tabulates
all1.htm – the same but HTML
2000.txt, 2000.htm – the list for 2000 year only

2015.txt, 2015.htm – the list for 2015 year only (deleted if zero)
keyids.txt – the list of keys made from latest update
keyids_YYYY-MM-DD.txt – the list of keys made at YYYY-MM-DD
upd_YYYY.htm, upd_YYYY.txt – the list made after update.bat was running

Format of each row in HTML or Text lists:
Columns 0..18 – the information from conference description file
Column 19, 20 – an extra e-mails and phones (pattern: ###-###-#### or +#..)
Column 21 – the date when the update was running
Column 22 – the key of conference with the URL to its original webpage

Also the HTML-page contains links for categories and places.

Note: the links to places/cities are made from conference source fields and sometimes there are wrong links. Any crossover control could decrease the speed of script. The current speed to update the full website with ~27900 conferences is 12 minutes on the Pentium Core-Duo T2370.

alpha.txt · Last modified: 2011/09/30 11:21 (external edit)
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 3.0 Unported
Recent changes RSS feed Donate Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki