Any document type that the publishing-API knows about can be added to our internal search. By default, all document types in internal search also get included in the GOV.UK sitemap, which tells external search engines about our content.

The app responsible for search is Rummager. Rummager listens to RabbitMQ messages about published documents to know when to index documents. For the new document type to be indexed, you need to add it to a whitelist.

Rummager has its own concept of document type, which represents the schema used to store documents in Elasticsearch (the search engine).

Normally, you’ll map your document type to an existing rummager document type. If in doubt, use “edition” - this is used for most documents.

Then, modify mapped_document_types.yml with the mapping from the publishing API document type.
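As an illustration only - the document type below is made up, and the exact file location within rummager may differ - each mapping is a one-line YAML entry from the publishing API document type to the rummager one:

```yaml
# mapped_document_types.yml (location assumed): publishing API
# document type on the left, rummager document type on the right.
my_new_document_type: edition
```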

If you want search to be able to use metadata that isn’t defined in any rummager document type, then you’ll need to add new fields to rummager.

Rummager knows how to handle most of the core fields from the publishing platform, like title, description, and public_updated_at. It looks at the body or parts fields to work out what text to make searchable. If your schema uses different fields to render the text of the page, update the IndexableContentPresenter as well.

The part of rummager that translates between publishing API fields and search fields is elasticsearch_presenter.rb. Modify this if there is anything special you want search to do with your documents (for example: appending additional information to the title).

2. Add the document type to migrated_formats.yaml

Add the document_type name to the migrated list in rummager.

3. Reindex

Reindex the govuk index following the instructions in Reindex an Elasticsearch index.

4. Republish all the documents

Republish all the documents. If they have been published already, you can republish them with the publishing-api represent_downstream rake task:

rake represent_downstream:document_type[new_document_type]

You can test that the documents appear in search through the API using a query such as:
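The original query example is not shown here, so the following is only a sketch: it assumes the public Search API endpoint at www.gov.uk/api/search.json and a made-up document type, and the filter parameter name is an assumption - check the Search API documentation for the real one.

```python
# Build (and optionally fetch) a Search API query for the new document type.
# Endpoint and filter parameter name are assumptions, not confirmed here.
import json
from urllib import request
from urllib.parse import urlencode

def search_url(document_type, count=10):
    """Return a Search API URL filtered to a single document type."""
    params = urlencode({
        "filter_content_store_document_type": document_type,
        "count": count,
    })
    return "https://www.gov.uk/api/search.json?" + params

url = search_url("new_document_type")
# results = json.load(request.urlopen(url))["results"]  # needs network access
print(url)
```

If the republished documents were indexed, the response's results list should contain them.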

Source: This article was published at docs.publishing.service.gov.uk


MANHATTAN BEACH, Calif., March 7, 2017 /PRNewswire-iReach/ -- Measured Search Inc, provider of the leading open source search-as-a-service solutions, today announced the availability of their new Elasticsearch Service offering. Elasticsearch, the popular open source search engine has been growing impressively over the last few years in both developer adoption and product features. Measured Search's Elasticsearch Service offers key features focused on enterprise customers:

Pick a Cloud...Any Cloud

Measured Search's Elasticsearch Service is Cloud agnostic. Host, deploy, manage and scale Elasticsearch applications in any cloud (including Amazon Web Services, Microsoft Azure, or Google Cloud). Enterprises choose cloud vendors based on different criteria. Those criteria can change over time - as they change, so too can the cloud provider.

Managed Services with SLA-backed Guarantees

Get 24x7x365 comprehensive and SLA-backed Managed Services from Elasticsearch experts. They are only a call or an email away – literally, anytime. Enjoy peace of mind with fully managed Elasticsearch Service.

Actionable Insights

Detailed query-level metrics allow you to gain insights into what your users are searching for and how you can optimize search relevance. Get search conversion analytics, query-level details and session-level analytics to discover and track areas of improvement that can lead to increased click-throughs and revenue.

Custom Plugin Support

Want to add the latest plugin or add a custom plugin to your Elasticsearch cluster? Elasticsearch Service by Measured Search supports custom plugins through their developer support.

"We've seen some of our customers struggle with their hosted Elasticsearch applications and felt there was a strong need for a fully managed Elasticsearch Service solution. They want to be able to focus on the interesting aspects of their job: application development and relevance tuning. And they want to leave the maintenance, support, care and feeding of the search infrastructure and tooling to someone else. Over the last year, we've really grown and learned from our Solr-as-a-Service customers and we're applying these lessons learned to our customers who are utilizing Elasticsearch."

-Sameer Maggon, CEO Measured Search

About Measured Search

Measured Search® enables companies to elevate the experience of Apache Solr or Elasticsearch based search applications faster and with more confidence. SearchStax® by Measured Search is a leading cloud orchestration, management and analytics platform for Open Source Search. Delivering cloud agnostic search as a managed service, Measured Search offers software and services that automate Solr or Elasticsearch management and administration in the cloud, improves stability and performance, provides comprehensive end-user search analytics, and on demand Search expertise.

Media Contact: Bing Gin, Measured Search, Inc., 844-973-2724

News distributed by PR Newswire iReach: https://ireach.prnewswire.com

Source: http://finance.yahoo.com/news/measured-search-launches-fully-managed-160000573.html




Since last week, ransomware attacks on Elasticsearch have quadrupled. Just like the MongoDB ransomware assaults of several weeks ago, Elasticsearch incursions are accelerating at a rapid rate.

"The vast majority of vulnerable Elasticsearch servers are open on Amazon Web Services." - John Matherly

There are an estimated 35,000 Elasticsearch clusters open to attack. Of these, Niall Merrigan, a solution architect who has been reporting on the attack numbers on Twitter, states that over 4,600 of them have been compromised.

If your Elasticsearch server is hacked, you'll find your data indices gone and replaced with a single index containing a ransom warning.


In return for the 0.2 Bitcoin (not quite $175), you might get your data back.

Elasticsearch is a popular, open-source distributed RESTful search engine. When used with the Lucene search-engine library, it's used by major websites such as Pandora, SoundCloud, and Wikipedia for search functionality. When used by amateurs without any security skills, it's simple to crack.

These wide-open-to-attack instances are typically deployed on Amazon Web Services (AWS) clouds without much in the way of security. Perhaps the people deploying them are under the illusion that AWS is protecting them. Wrong.

AWS does tell you how to protect your AWS Elasticsearch instances, but you still have to do the work. In short, RTFM.

The worst thing about this? Just like the MongoDB attacks, none of this would have happened if the administrators had protected their instances with basic, well-known security measures.

For starters, as Elasticsearch consultant Itamar Syn-Hershko explained in a blog on how to protect yourself against Elasticsearch attacks: "Whatever you do, never expose your cluster nodes to the web. This sounds obvious, but evidently this isn't done by all. Your cluster should never-ever be exposed to the public web."

In a word, "duh!"

Elasticsearch was never meant to be wide open to internet users. Elastic, the company behind Elasticsearch, explained all this in 2013. That post is filled with such red-letter warnings as "Elasticsearch has no concept of a user." Essentially, anyone who can send arbitrary requests to your cluster is a "super user."

Does this sound like a system you should leave wide-open on the internet for any Tom, Dick, or Harry to play with? I don't think so!

So, what can you do? First, if you're using Elasticsearch for business, bite the bullet and get the commercial version of Elasticsearch. Then, add X-Pack Security to your setup and implement its security features.

By itself, Elasticsearch has no security. You must add it on.

If you're committed to doing it on your own, practice basic security. At a bare minimum this includes:

  • Don't run on internet-accessible servers.
  • If you make your Elasticsearch cluster internet accessible, restrict access to it via firewall, virtual private network (VPN), or a reverse proxy.
  • Perform backups of your data to a secure location and consider using Curator snapshots.
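To make the reverse-proxy option concrete, here is a minimal nginx sketch - the hostname, certificate paths and htpasswd file are all placeholders - that fronts a cluster bound to 127.0.0.1 with TLS and HTTP Basic auth:

```nginx
# Minimal reverse proxy in front of Elasticsearch listening on localhost only.
server {
    listen 443 ssl;
    server_name search.example.com;             # placeholder hostname
    ssl_certificate     /etc/nginx/certs/search.crt;
    ssl_certificate_key /etc/nginx/certs/search.key;

    auth_basic           "Elasticsearch";
    auth_basic_user_file /etc/nginx/.htpasswd;  # created with htpasswd

    location / {
        proxy_pass http://127.0.0.1:9200;       # cluster not exposed directly
    }
}
```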

In short, practice security 101, and don't be the fool who lets anyone invade their servers. After all, you could very well end up paying a lot more than just some petty cash if a truly malicious hacker came by to raid your servers.

Author: Steven J. Vaughan-Nichols
Source: http://www.zdnet.com/article/elasticsearch-ransomware-attacks-now-number-in-the-thousands


Elasticsearch 5.0 has been updated with new indexing, improved searching, and read-your-write support.


Elasticsearch 5.0 was released last week as part of a wider release of the Elastic Stack, which lines up the version numbers of all the stack products. Kibana, Logstash, Beats, Elasticsearch - all are version 5.0 now. This release is quite a large one, and includes thousands of change items. I personally find this release exciting.

It's quite easy to get lost in the details due to the sheer number of changes in this release. In this post I will summarize the items I see as important, with some of my own commentary and advice. Hopefully, it will shed some light on where Elastic is standing and where they are headed.

This first post focuses on search-related topics. Future posts will focus on indices and cluster management, data ingestion capabilities, new debugging tools, ad-hoc batch processing, and more.

Full-text Search

One fundamental feature of Elasticsearch is scoring - or results ranking by relevance. The part that handles it is a Lucene component called Similarity. ES 5.0 now makes Okapi BM25 the default similarity, and that's quite an important change. The default has long been tf/idf, which is simpler to understand but easier to fool with rogue results. BM25 is a probabilistic approach to ranking that almost always gives better results than the more vanilla tf/idf. I've been recommending that customers use BM25 over tf/idf for a long time now, and we also rely on it at Forter for quite a lot of interesting stuff. Overall, a good move by ES, and I can finally retire a years-long piece of advice. Britta Weber has a great talk explaining the difference, and BM25 in particular - definitely a recommended watch.
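If you need the old behavior for comparison, tf/idf scoring is still available per field under the name "classic" - a sketch only, with a made-up field name and 5.x mapping syntax assumed:

```json
{"properties": {"title": {"type": "text", "similarity": "classic"}}}
```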

Another good change is simplifying access to analyzed/not-analyzed fields. Oftentimes you need to avoid tokenizing string fields because you want to be able to look for them as-is, or need to use them in aggregations or to sort by them - even if they include spaces or weird characters. Instead of calling both "string fields", they are now text (analyzed) and keyword (not-analyzed). This should improve the readability of mappings and the accessibility of that feature. The only remaining item in my opinion is the not-tokenized-but-lowercased case - it is common enough but will still require some rigorous configuration. It probably makes sense now to allow specifying "token-filters" to execute on "keyword" fields directly in that field's mapping; luckily, work on that already seems to be underway.

While on this topic, one piece of advice - if you need to lowercase keyword-type fields, you probably want to asciifold them as well.
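A sketch of what the new field types look like in a 5.x index mapping (the index, type and field names are made up): title is analyzed full text, tag is an exact-match keyword, and tag_folded approximates the not-tokenized-but-lowercased case with a keyword tokenizer plus lowercase and asciifolding filters:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "keyword_folded": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "article": {
      "properties": {
        "title":      {"type": "text"},
        "tag":        {"type": "keyword"},
        "tag_folded": {"type": "text", "analyzer": "keyword_folded"}
      }
    }
  }
}
```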

Better Search Due to Low-Level Indexing Tweaks

Historically, Elasticsearch was a text-search engine. When search for numeric values and ranges was added, it still used string-matching-based search, translating the numerics into something that was also searchable in ranges. The same goes for geo-spatial search - Elasticsearch (rather, the underlying Lucene engine) required a translation from whatever it was into a string to make it searchable.

Starting in ES 5.0, every index now also has a k-d tree, and that data structure is where search is performed for all non-string fields, instead of the string-based inverted index. This means numbers, geo-spatial points and shapes, and now even IPv6 (IPv4 was already supported before) are indexed natively, and searches on them - including ranges - are multiple times faster than before.
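For instance, a numeric range query such as the following (field name made up) is now answered from the k-d tree rather than the inverted index:

```json
{"query": {"range": {"price": {"gte": 10, "lte": 100}}}}
```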


You should expect to see more sophisticated geo-spatial queries, aggregations and other operations, thanks also to Lucene's LatLonPoint, which highly optimizes memory and disk footprints as well as search and indexing speeds. Think WKT support, searches on 2D shapes, 3D and even 4D+ shape search, adding dimensions from other sources (geo-spatial plus some other metric collected from another datasource, for example), interesting applications of nearest-neighbor searches, and more. The underlying libraries already support many of them, and I've been hearing quite a lot of requests for such capabilities. With this significant performance boost, I reckon they will finally be exposed.

Lastly, since every value type which can be encoded as an ordered byte[] of fixed length can be searchable via k-d trees, we will probably start seeing some new types of data being indexed into Elasticsearch.

Read-your-write Support

Anyone who ever wrote a CRUD-type application with eventually-consistent databases is familiar with the common gotcha of posting a form and then not seeing the new piece of data on the listing page, being confused for a moment, and then refreshing the page a second later to see it appear. This is annoying in back-end applications used internally, but can be a terrible user experience if your end users run into it.

Elasticsearch indexes are eventually-consistent. The search is officially defined as "near-real-time", or in other words - don't expect to immediately see the document you just added in search results. It can appear within one second (the index refresh rate), or a bit longer if you happen to query a replica.

Until now there wasn't a good way to know when to display the listing page after a successful form post. Adding a synthetic wait is just not deterministic enough and, to be frank, is quite a code smell, and forcing a refresh on write isn't recommended for many reasons.

ES 5.0 adds the ability to wait for a refresh on a write. If you specify ?refresh=wait_for on any index, update, or delete request, the request will block until a refresh has happened and the change is visible to search. If too many requests are queued up, it will force a refresh to clear out the queue. The refresh is awaited cluster-wide - primaries and replicas.
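A sketch of what that looks like from a client - the host, index, type and document names are placeholders - where the write blocks until the document is visible to search:

```python
# Index a document and block until it is searchable, using Elasticsearch 5.x's
# ?refresh=wait_for. Host, index and type names here are placeholders.
import json
from urllib import request

def index_url(host, index, doc_type, doc_id, wait=True):
    """Build an index-request URL; wait=True appends ?refresh=wait_for."""
    url = "http://%s/%s/%s/%s" % (host, index, doc_type, doc_id)
    return (url + "?refresh=wait_for") if wait else url

url = index_url("localhost:9200", "articles", "article", "1")
# body = json.dumps({"title": "Hello"}).encode()
# request.urlopen(request.Request(url, data=body, method="PUT"))  # live cluster
print(url)
```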

Next Up

I will be publishing more posts about Elastic 5.0, focusing on more interesting capabilities, like new debugging enablers, batch processing support, index management improvements, data ingestion architectures and more. Stay tuned!

Furthermore, check out my Elasticsearch courses — currently running in London and Israel via BigData Boutique, for developers and operations.

Source: DZone



Key takeaways

  • It’s important to determine what “relevant” actually means in each search use case.
  • Engage in a virtuous cycle of improving and reevaluating relevance against user expectations.
  • Focus on search evaluation over search solutions.
  • The future of design is Information Retrieval.
  • Improvements in Elasticsearch and Solr will make developing recommendations a far less daunting prospect.


In their book, Relevant Search, Doug Turnbull and John Berryman show how to tackle the challenges of search engine relevance tuning in a fun and approachable manner.

The book focuses on Elasticsearch and Solr and how relevance engineers can use those tools to build search applications that appropriately suit the needs of the business and the customer. In a good search engine, results are ranked based not just on factual criteria but on relevance to the end user. Success and failure largely depend on improving the ranking number. The book puts it a little differently: "The majority of this book is about the best way to take mastery over this one number!"

Unfortunately, there's no one right answer.

Using comfortable examples in a way that Star Trek fans everywhere will enjoy, the book lays out the difficulty in determining what the user actually wants to search for. This book isn't really meant for absolute beginners. Those with some experience will get more out of the examples and theory.

InfoQ spoke with Turnbull to discuss what it means to be a relevance engineer.

InfoQ: What is a typical day like for a relevance engineer?

Doug Turnbull: When I'm focused on a relevance project, a couple of tasks tend to dominate my time.

First and foremost, I'm trying to figure out what "relevant" means in this application. Every search use case is different. Is this application or use case focused on research where users will review every search result in-depth? Or is this more of a single item lookup, where the top result must be the right answer? Are the users experts in this domain, willing to use complicated search syntax? Or are they the average public, expecting Google-like magic? Sometimes this means digging into user analytics. How often are users abandoning searches? Do they achieve goals/conversions? Other times this means collaborating with domain experts to understand what users are doing.

Second, before I get to improving search, I need to ensure the impact of my changes can be measured in real time. This means turning the intelligence I've gathered from the last paragraph into some kind of test-driven workbench such as Quepid. Using such a tool I can play with a broad range of solutions, from simple relevance tweaks to advanced NLP or machine learning, and get an instant sense of whether they help across my most important searches.

Third, there's the actual hands-on relevance work. There's no "one size fits all" here. Some problems can be fixed by just tweaking how the search engine is queried. Other problems require more complex work. Perhaps it's important to understand parts of speech? Or perhaps you're searching Twitter and hashtags need their own special treatment? Perhaps that magic solution from a vendor is just the ticket -- perhaps not!

Fourth, there are issues around user experience, outside the strict ranking of search results, that help people find what they need. One big mistake people make is focusing on actual relevance -- that is, just the ranking of results -- and ignoring perceived relevance. By perceived relevance I mean helping the user understand why a result is relevant. This involves ensuring your content has descriptive titles and that you are highlighting the matched keywords in context. Other features like faceting, autocomplete, and spell checking also help users find their way to what they need without needing 100% perfect actual relevance.

Finally, and this may surprise people, there's ensuring that changes can be released quickly. Is search deployed in a way that relevance changes can be rolled out incrementally? These are ops concerns, but they matter quite a lot to relevance. You need to get your changes out quickly and then reevaluate whether the changes you suspected would have a positive impact actually did.


InfoQ: How does the work of a relevance engineer change over time? Is it ever possible to "set it and forget it"? 

Turnbull: In some cases, you might "set it and forget it." For example, you get to a good enough point for a search against the corpus of William Shakespeare. The documents don't change. The queries rarely do.

More importantly, there's only so much business incentive to have a tremendous search of the works of William Shakespeare.

More typically, things change. Everything from user expectations to the design of your application. The kids start using new lingo. There are different products in your online store. Old products are no longer for sale.

We like to talk about engaging in a virtuous cycle of constantly improving and reevaluating relevance against your user expectations. That's the more typical case: the case where search matters and you're constantly adjusting.

Most importantly, you want to adjust incrementally and quickly, not slowly/whole hog. Ship changes in small batches and reevaluate your analytics. Be OK with rolling back if that change didn't work out. In other words prefer being "agile" over a "big bang" relevance strategy. You pretty much have to be: releasing relevance tweaks based on last summer's catalog/users won't really help if released this winter.

InfoQ: What are some of the most common mistakes? 

Turnbull: To me the biggest mistake is focusing on solutions over search evaluation.

In my work, the hardest part of relevance has been measuring whether search is relevant. Answering questions like: Are these the right results? Are users happy with them? Are these the results users expect from this search application? In this context? Are we making progress with our tweaks?

The relevance engineer isn't really equipped to know on their own whether or not search results are relevant. Instead they need to work with non-technical colleagues and domain experts to analyze user data and evaluate search correctness. It's really, really hard, and even the best analytics available can be misleading in the wrong context. Sometimes the application is so specialized that analytics are entirely useless!

Instead of putting evaluation first, unfortunately, many organizations go straight to silver-bullet solutions or exciting new technologies. They don't invest the time to set up a practice of evaluation. For example, one popular technology that might apply to some forms of search is word2vec, applied to your search documents. Word2vec is a machine-learning approach to understanding the semantic relationships between words: understanding that "prince" and "king" are closely related, or that "Anakin Skywalker" and "Darth Vader" are the same person. It perks our ears as engineers to think "oh, if I search for 'Darth Vader' then I'll match 'Anakin Skywalker'". But it may either be "just the thing" that solves this particular search problem, or it may be completely irrelevant to the job at hand.

Organizations that take search really seriously can never outsource search evaluation. In the book, we write about ways to interpret analytics and techniques like test-driven relevance that can help address these problems. And when you're really good at measuring relevance, then you can begin to use advanced techniques like learning to rank.


InfoQ: Is the problem of search too hard for generalists to handle? 

Turnbull: Another way to think about this question is through the evolution of Web design. In the early Web, very few held the title "Web Designer." I was a generalist programmer, for example, and I made an HTML page look nice by trying to organize img tags within table elements in the early 2000s. I didn't stop to think to hire a designer. Why? Those were still the early days of the Web. Web interaction was new. Over time, consumers developed a taste for what a good user interaction with a Website looked like. So today any reasonable Website invests as much in design as it does in "generalist programming."

I see search in a similar vein. Google has given us high expectations for good search interaction. Yet there's so much that's domain- or app-specific. Searching through medical articles to help doctors diagnose patients looks nothing like searching your e-commerce catalog for good deals. Which looks nothing like researching the news of the late 19th century. Some of the widgets look the same: the autocomplete, the search bar, the highlighted search results. But the small details behind every keystroke in the search bar can change a great deal. The way search results are ordered can change a great deal. This calls for a new kind of "designer", one that focuses on relevance and conversational interactivity behind search.

And this is the tip of the iceberg. A lot of the machine learning craze today is really about problems of Information Retrieval: some kind of process that returns results ranked by relevance personalized for a user. Many startups drive chat bots through really smart search. Recommendation systems are increasingly being built with Elasticsearch.

These are the interaction paradigms of tomorrow. The "future" of design is Information Retrieval. Solr and Elasticsearch have all the tools in their extensive toolbelt to lead the next generation of design and interactivity.

InfoQ: Right now, Elasticsearch and Solr seem to have vast feature sets. What's on the horizon? How is the technology changing over time?

Turnbull: Elasticsearch and Solr are both making strides as simpler frameworks for building recommendation systems. I think this will make developing recommendations a far less daunting prospect for many medium-sized businesses. In the same way Elasticsearch and Solr made it easier to implement search, I think that being able to take a single open source tool off the shelf that already knows how to rank results based on relevance means avoiding complex and expensive machine-learning solutions. My coauthor, John Berryman, for instance, actually does this: he's built a recommendation system using Elasticsearch for Eventbrite. But there's more being done to help. This includes Trey Grainger's knowledge graph for Solr and Elastic's graph product.

InfoQ: Other than reading the book, what resource should budding relevance engineers seek out?


Turnbull: Well first, don't hesitate to seek me out. I'm a consultant, and really enjoy speaking -- so I seek to be sought out! In particular, I do free one hour knowledge sharing/lunch and learn events in case you're having trouble staffing your company's events.

For great search blogs, I write a lot on my company's blog. Sujit Pal writes a ton of interesting material on his blog that you'd also find interesting. We should all get on my coauthor's case and get him to write more, because when we worked together he contributed tremendous content to our blog.

For relevance/information retrieval books, you'll definitely want to read Introduction to Information Retrieval. Taming Text is another great read that blends search and Natural Language Processing. I also have another book that builds on Relevant Search to teach a business-level flow to implementing relevance.

For search engine specific books, I'd recommend my colleague's Apache Solr Enterprise Search Server, as well as Solr in Action and Elasticsearch in Action.

Source: https://www.infoq.com

