I’ve finished reviewing the results of our latest MTurk batch. I’m happy with the quality, but before we run another batch I’d like to run what we have all the way through our data process, that is, take these transcriptions all the way through to completed AI model outputs of summary data about them (preliminary AI models). This is a test of my proposed process: if I find that something is not what I expect, that might change how I want the Turkers to do future transcriptions. Once the process is dialed in, bringing in additional ads will be straightforward and fast.
In the diagram, ovals represent processes and squares represent SQL tables.
-Scraper pulls ads from the FOTM website and stores them in the run_ad table. (100% done)
-Ads with no transcription are run through the Google Vision API to generate a draft transcript. (100% done)
-The draft transcript and image are run through MTurk for transcription edits, which are stored in run_ad_processing. (the process is 100% done, but only ~10% of ads have been run through MTurk)
-There are multiple transcriptions per ad, so the Best Transcription Selector uses AI to pick the best one and records the winner in run_assignment. (20% done with development)
-The best transcription per ad is processed through event generation, which groups ads into events. For example, one ad may run for multiple days, and sometimes in multiple newspapers; all of these instances must be grouped into a single event to avoid over-counting. (20% done with development)
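To make the event-generation idea concrete, here is a minimal sketch of one possible grouping approach: treat two ads as the same event when their best transcriptions are nearly identical, using fuzzy string similarity to tolerate small OCR/transcription differences. The function names, dict keys, and the 0.9 similarity threshold are my own illustrative assumptions, not the actual implementation.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.9):
    """Fuzzy match: the same ad text may vary slightly across
    papers/days due to OCR or transcription differences."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def group_into_events(ads, threshold=0.9):
    """Greedy grouping: each ad joins the first event whose
    representative transcription is similar enough, otherwise
    it starts a new event. Returns a list of events (lists of ads)."""
    events = []
    for ad in ads:
        for event in events:
            if similar(ad["transcription"], event[0]["transcription"], threshold):
                event.append(ad)
                break
        else:
            events.append([ad])
    return events

# Hypothetical usage: two near-duplicate runs of one ad plus one
# unrelated ad collapse into two events instead of three ads.
ads = [
    {"id": 1, "transcription": "Lost: one brown mare, liberal reward offered."},
    {"id": 2, "transcription": "Lost: one brown mare, liberal reward offered"},
    {"id": 3, "transcription": "For sale, a two-story brick house on Main Street."},
]
events = group_into_events(ads)
```

A real implementation would likely also use metadata (newspaper, date range) rather than text alone, since distinct ads can share boilerplate wording; this sketch is only meant to show the over-counting problem the step solves.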
I can send you updates as I develop and refine the process. Once it’s all done, we have 10,768 additional ads we can run through.
Let me know your thoughts.