There are next projects:
http://www.getafreelancer.com/projects/Visual-Basic-Excel/using-vbscript-vba-combine-web.html
This is first project from this series. I called those project ~100$ as low-budget. But sometimes those projects are interested for me. And after I made this project I received more projects from the same customer, and some of them also were interested.
So, this project consist of three parts: downloader.vbs, checker.xls, and finalize.xls. Latest 2 units were made in MS VBA in Excel.
First script – downloader.vbs is a VB-script working in console mode. It use Internet Explorer in active mode. Using COM-interface it open the IE, then go to URL, then it send to it commands “Ctrl+A” (select all) and “Ctrl+C” (copy into clipboard).
The it created the Excel Spreadsheet and insert data into it by command “Ctrl+V” (paste).
So this script use a whole computer resources. So it's needed to use a standalone PC running Windows XP to run this script while it'll not ended.
Ok, as result it create a big MS Excel workbook consist of one or more spreadsheets.
This script work inside the MS Excel in background mode. So the resources of computer are free when the script work. It do the check between the old data and the new data downloaded by downloader.vbs. The repeating information is destroyed, and new – not.
Checker work inside VBA virtual machine. So it do it slowly. It can be several hours on core2duo processor.
Note: I prefer to use Perl for Win32 for similar operations. But in this case – I agreed with my customer to use very slow VBA machine under MS Excel. The advantage of VBA – here is a good debugger. And Perl have no debugger. Though it works in 5-6 times faster.
This is the final script in this set. It take all rows made by checker.vba and added them into the big table made from previous sessions. After adding it sort the rows. The end.
Three scripts used to create the spreadsheet from the file “links.ini”.
“Links.ini”: one row = 1 URL
Spreadsheet “AllPages” have the next format (each column contains):
Spreadsheet “Links” have the next format:
Other spreadsheets: due of Excel limit on 65535 rows the script finalize.vba could to split the big spreadsheet into small like “allpages”, “allpages011”, etc.
This thing is used here: if the maximum rows is set the checker.vba try to found “AllPages_Link#” when it work with link#, for example when it work with link “12” it try to found this link first in “AllPages”, then in “AllPages012”, and so on.
The “finalize.vbs” — create new spreadsheets after the last spreadsheet have the rows number more then the limit.
So this package can work with unlimited set of links. I'm not sure how the MS Excel itself will work with 2-5 spreadsheets with 60000 rows in each. I think very slow even on Quad Core Pentium .
There is one VBA-script running in MS Powerpoint program.
PPT+PDF 2 IMG convertor. Written by YetiDi at December 2008. Readme.txt:
It seems this way is most quick and simple. It's because of Powerpoint presentation work in construct mode by default. So the button I made to run the UserForm – is inactive.
Note: before to use it for PDF 2 IMG convertion you should to install Freeware Aladdin Ghostscript driver. It allow to convert any PDF or EPS/PS file to any image with any resolution. The GhostScript is a cross-platform application used mostly in UNIX-based applications.
MS Powerpoint use constant value of dpi for all documents. For example, A4 in MS Powerpoint 2002 (XP) have size 540×720 (there is 72 dpi). This moment is described here – http://www.pptfaq.com/FAQ00052.htm
This project was made in Perl for Win32. So it work on the PC running Windows XP. It consist of two parts: the wget.bat (WGet downloader running console mode is used) and update.bat (the Perl-script also running in console mode).
The “wget.bat” make the mirror of the website “AllConferences.com” on the local hard disk drive (hdd). This work can proceed many hours based on WAN connection speed of the computer where it used.
The “update.bat” make the full syntax processing of all downloaded pages and create the big table with the data extracted from each conference.
Full project is packed into the self-installing package by InnoSetup. So user could use it in a set of mouse-clicks.
Confs_project made by YetiDi at 2009-03-01 (info@yetidi.net)
Install the package – run confs_setup.
All needed files will be unpacked into the directory: c:\Confs.
Perl – the Perl 5.0 for Win32 with the script
db – the web-spider wget + the 68Mb-sized part of the site allconferences.com
bat – the bat-files used for this work.
How to make a mirror of conferences website – run wget.bat and wait while it'll completed (this can be about 1-2 hours on the 1Mbit/s Internet connection).
If you want you could use another program to do it for example Teleport Ultra. It's not freeware and works quicker then freeware wget.
Anyway the mirrored website should be in the path: C:\Confs\db\all.conferences.com\
There are two variants:
Create_full.bat – all YYYY.htm(txt) files recreates from zero.
Update.bat – it use file “db\keyids.txt” with the keys catching from the last full/or not update.
if any new keys will be found – the files upd_YYYY.htm(txt) will be generated.
all.txt – full list of conferences in one big text file with tabulates
all1.htm – the same but HTML
2000.txt, 2000.htm – the list for 2000 year only
…
2015.txt, 2015.htm – the list for 2015 year only (deleted if zero)
keyids.txt – the list of keys made from latest update
keyids_YYYY-MM-DD.txt – the list of keys made at YYYY-MM-DD
upd_YYYY.htm, upd_YYYY.txt – the list made after update.bat was running
Format of each row in HTML or Text lists:
Columns 0..18 – the information from conference description file
Column 19, 20 – an extra e-mails and phones (pattern: ###-###-#### or +#..)
Column 21 – the date when the update was running
Column 22 – the key of conference with the URL to its original webpage
Also the HTML-page contains links for categories and places.
Note: the links to places/cities are made from conference source fields and sometimes there are wrong links. Any crossover control could decrease the speed of script. The current speed to update the full website with ~27900 conferences is 12 minutes on the Pentium Core-Duo T2370.