I really love search. It is part of my life: I use it every day as a way to gain knowledge. And I am not talking about Google here.
The trickiest kind is corporate search. It varies, it is inconsistent, and everything depends on the types of resources we need to index. Typically, these resources are files, data, and applications. Search engines transform the resources into indexes.
That is almost it. But the variety of resource types requires different approaches. Some files or data records need additional processing before indexing. Others mutate and jump from one location or scope to another, making search indexes obsolete and out of date. Real-time changes do not trigger a refresh of the search indexes.
So how can we survive in the wild of corporate search? As always, by adding another level of abstraction. The process is divided into phases: preparation, collecting, and presenting.
Let’s assume our resource type is shared folders with files. In the Windows world, shared files are addressed by a UNC path, like \\company\shared\folder\ . As a rule, corporate shared folders are backed by the Distributed File System (DFS), which hides the physical implementation and presents file resources as mounted roots. But that is not essential here; let’s concentrate on the UNC path. The UNC root is the entry point for file search: the search engine runs through a collection of UNC roots and harvests data for indexing.
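The harvesting step can be sketched in a few lines. This is a minimal illustration in Python (the scripts I actually use are PowerShell, as mentioned later), and the function name is my own; a UNC root like \\company\shared\folder works the same way as any local directory here:

```python
import os

def harvest_files(root):
    """Walk a search root (a UNC path or a local folder) and collect
    every file path for later content extraction and indexing."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return found
```

In a real setup this function would run over the whole collection of UNC roots, feeding the extraction stage described next.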
The next question is: how does the search engine know how to extract file content for indexing? The answer: by file extension (or type). Text-based files (.txt, .csv, .rtf) are the easy ones. Others require more effort. For example, .pdf files can contain either text or scanned images, and some search filters (like Solr's post.jar) cannot process scanned PDF files. Other files require conversion before their content can be read (legacy MS Office .doc, .ppt, .xls need to be converted to .docx, .pptx, .xlsx). Many .jpg files carry plenty of attributes in EXIF format. And last but not least, content needs to be cleaned and compressed before indexing.
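The dispatch-by-extension idea can be shown with a small lookup table. This is a sketch, not the actual implementation: the table entries below just name the kind of pre-processing each type needs, and both the table and the function are hypothetical:

```python
import os

# Hypothetical dispatch table: each extension maps to the kind of
# pre-processing the file needs before its text can be indexed.
EXTRACTORS = {
    ".txt": "read as plain text",
    ".csv": "read as plain text",
    ".rtf": "read as plain text",
    ".pdf": "extract text layer; OCR if pages are scanned images",
    ".doc": "convert to .docx first, then extract",
    ".jpg": "read EXIF attributes only",
}

def extraction_strategy(path):
    """Pick an extraction strategy by file extension; unknown types are skipped."""
    ext = os.path.splitext(path)[1].lower()
    return EXTRACTORS.get(ext, "skip: no extractor for this type")
```

In practice each table entry would be a function, but the shape of the solution is the same: the extension decides the pipeline.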
So, raw data (the original file) should be transformed into search-ready data in a staging format. A .json file is the best candidate for that: it is the de-facto standard and lets us deliver content to any number of search systems. We generate the json file in a hidden sub-folder. That is a key point: when the parent folder is renamed, the content for indexing is not lost. If a file changes, its .json copy is rebuilt based on the file attributes. That gives a kind of decoupling between the source data and the destination index. I use separate PowerShell scripts for collecting json files and for indexing. This avoids many-to-many (M x N) interactions in favor of many-to-one (M + N): I can deliver a json file to many search engines without any direct interaction between the search engine and the original resource. In practice, the whole (M + N) search process runs 5-10 times faster than (M x N).
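A sketch of the staging step, again in Python for illustration. The sub-folder name `.search-staging` is my placeholder (the post does not name it, and a leading dot alone does not hide a folder on Windows); the rebuild check uses the file's last-modified time as the attribute that signals a change:

```python
import json
import os

STAGING_DIR = ".search-staging"  # placeholder name for the hidden sub-folder

def staging_path(source_file):
    """Map a source file to its .json staging copy in the hidden sub-folder."""
    folder, name = os.path.split(source_file)
    return os.path.join(folder, STAGING_DIR, name + ".json")

def refresh_staging(source_file, content):
    """Rebuild the .json staging copy only when the source file is newer,
    decoupling source data from any particular search index.
    Returns True if the copy was (re)built, False if it was up to date."""
    target = staging_path(source_file)
    src_mtime = os.path.getmtime(source_file)
    if os.path.exists(target) and os.path.getmtime(target) >= src_mtime:
        return False  # staging copy is up to date
    os.makedirs(os.path.dirname(target), exist_ok=True)
    record = {
        "path": source_file,
        "modified": src_mtime,
        "content": content,  # cleaned/compressed text would go here
    }
    with open(target, "w", encoding="utf-8") as f:
        json.dump(record, f)
    return True
```

Because each of the M resources is staged once and each of the N search engines reads only the staged json, the total work is M + N deliveries instead of M x N direct resource reads.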
The full picture sitting in my head is a little bit frustrating. But it works and looks quite representative:
I am going to present this concept in detail to our local SQL Server User Group soon.