Chapter 11

Big Data Discovery

Adam C. Losey

11.1 Introduction

In this chapter, you will learn

  • About the duty to preserve electronic evidence and how it has an impact on Big Data;
  • How to prevent common mistakes in preserving electronic evidence;
  • Common litigation hold triggers and how to spot and address Big Data preservation triggers that you might otherwise overlook;
  • The value of weaving automated preservation processes into Big Data analytics to ensure legal compliance, to protect corporate information, and to reduce risk;
  • How to effectively address, seek, and combat database discovery;
  • About clawback orders and how to use them to protect privilege; and
  • How to cost-effectively and efficiently review Big Data in the context of discovery requests by using computer-assisted review (CAR).

11.2 Big Data, Big Preservation Problems

The large data collections inherent in Big Data analysis have operational and strategic value. But the preservation, collection, search, and review of Big Data can create big problems in the discovery process in litigation, starting before a lawsuit is even filed.

As of 2014, 90% of all data in the world had been created in the past two years, per IBM Analytics.1 Amounts of information previously unimaginable in size can now be stored for a relatively low cost. As an example, for less than $100, you can purchase a 2-terabyte hard drive, which holds 2,048 gigabytes. Assuming that 1 gigabyte holds about 100,000 printed pages of text (a conservative estimate), this sub-$100 hard drive can easily hold 204,800,000 pages of text. At roughly 500 sheets to every 2 inches, that is about a 13-mile-tall stack of paper if printed and plopped into a pen-and-ink inbox (see Figure 1.1, “Visualizing Big Data”).

Assume a lawyer can review one digital page of text a minute, a brisk pace. At that rate, 204,800,000 pages is about 3.4 million hours of review, so it will take the lawyer more than 1,100 years of 8-hour days (including weekends) to review the data that can be stored on this 2-terabyte hard drive. Even at extremely low hourly rates, using fixed fees or an offshore service, the review cost would be staggering. To put it in perspective, the amount of data that can be stored on a hard drive for about a nickel could cost hundreds of thousands of dollars for human eyes-on review.
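The arithmetic behind this comparison can be checked with a short calculation. The sketch below uses the chapter's own assumptions (2,048 gigabytes per drive, 100,000 pages per gigabyte, one page per minute, 8-hour days, a drive price of about $100); the hourly review rate of $100 is purely an illustrative assumption and not a figure from the text.

```python
# Back-of-the-envelope check of the storage-versus-review comparison.
# HOURLY_RATE_USD is an illustrative assumption; the other inputs come
# from the chapter's stated example.
GB_PER_DRIVE = 2 * 1024        # a 2-terabyte drive, in gigabytes
PAGES_PER_GB = 100_000         # chapter's estimate of pages per gigabyte
PAGES_PER_MINUTE = 1           # reviewer speed assumed in the text
HOURS_PER_DAY = 8              # reviewing every day, weekends included
DRIVE_PRICE_USD = 100          # upper bound on the drive price
HOURLY_RATE_USD = 100          # assumption for illustration only

pages = GB_PER_DRIVE * PAGES_PER_GB
review_hours = pages / (PAGES_PER_MINUTE * 60)
review_years = review_hours / HOURS_PER_DAY / 365

# Cost to store one gigabyte versus cost to have a person read it.
storage_cost_per_gb = DRIVE_PRICE_USD / GB_PER_DRIVE
review_cost_per_gb = PAGES_PER_GB / (PAGES_PER_MINUTE * 60) * HOURLY_RATE_USD

print(f"pages on drive:        {pages:,}")
print(f"review time:           {review_years:,.0f} years")
print(f"storage cost per GB:   ${storage_cost_per_gb:.3f}")
print(f"review cost per GB:    ${review_cost_per_gb:,.0f}")
```

Under these assumptions, a gigabyte costs about five cents to store and well over $100,000 to review by hand, which is the ratio the following discussion turns on. Changing the hourly rate scales the review cost linearly.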

This ratio of storage cost to review cost is the core of the electronic discovery problem facing modern litigants. The volume problem is made worse when coupled with the fact that most modern businesses have all sorts of different types of electronically stored information (from texts, to voice mails, to Internet Protocol logs) residing in hundreds or thousands of separate locations. Frequently, no one person at a company will truly know where all the relevant data resides. An attorney cannot simply ask a client to search for all files on a matter and meet applicable legal requirements for discovery due diligence. From preservation, to collection, search, and review, twentieth-century discovery methodologies must be adapted to handle twenty-first-century discovery in this era of Big Data.

11.3 Big Data Preservation

11.3.1 The Duty to Preserve: A Time-Tested Legal Doctrine Meets Big Data

In our legal system, once a person or business reasonably anticipates litigation, they generally have a duty to preserve information relevant to the reasonably anticipated litigation. This includes incriminating (and privileged) evidence that is harmful to the preserving party, as well as information that might be helpful. The preservation duty, to some extent, runs contrary to human nature and puts the litigant in the awkward position of being charged to take care to maintain information and data both harmful and helpful to the litigant’s cause. This has been the common law for hundreds of years, and spoliation (the destruction of evidence that a party had a duty to preserve) has been an issue all the way back to eighteenth-century England (see Armory v. Delamirie, 93 Eng. Rep. 664 (K.B. 1722)).2

Requiring a person to keep something harmful to their own cause runs contrary to a human tendency to want to prevent incriminating evidence, whether electronic or otherwise, from seeing the light of day. As a result, unfortunately, spoliation has been and remains a common issue in litigation. Because of the ability to conduct forensic analysis on hard drives and other sources of electronically stored information (as well as the proliferation of electronically stored information), it is now easier to detect and prove spoliation.

The duty to preserve requires corporate and outside counsel to educate clients on the duty to preserve and to shepherd clients through the preservation process. This process runs the gamut from deciding when the duty to preserve is triggered to ensuring that data is actually preserved. Given the exponential increase in amount of data and the diffuse and distributed nature of most network infrastructures, this is no easy feat. The stakes are high. The failure to preserve data once the duty to preserve is triggered has serious consequences. Spoliation sanctions, those levied based on the destruction of evidence, range widely and can result in a party losing litigation by default. Spoliation sanctions can come about simply by allowing critical data to be lost by operation of automated processes after the preservation duty has triggered.

A recent example involving automatic deletion of data leading to sanctions involved an employment claim.3 The plaintiff, Pillay, alleged that Millard, the refrigeration company he worked for, fired an employee named Ramirez because Millard believed Ramirez to be disabled, and then that Pillay was fired for complaining to Millard about Ramirez’s alleged wrongful termination.

Millard used a labor management system called “LMS[,]” to track its employees’ productivity and performance using performance analytics. In this case, Millard relied on LMS data to justify the termination of Ramirez and Pillay. Pillay argued that Millard regularly manipulated the LMS data and propounded discovery for the underlying LMS data to attempt to prove discrimination and manipulation of the LMS data. Pillay then learned that the LMS data was gone. How did this happen?

Pillay was fired in August 2008. In December 2008, Pillay’s attorney advised Millard “to preserve evidence and documents,” including electronic communications or data related to his client’s employment, specifically citing “[a]ll communications, documents, emails, or anything relating to Mr. Ramirez’s productivity and work evaluations.” In July 2010, Millard notified Pillay that the data used to calculate Ramirez’s LMS numbers had been deleted. Millard explained that “the discrete data [that formulates LMS numbers] is automatically deleted after one year” to keep its system operating at an optimal level. Apparently, no one at Millard flipped the “off” switch for the automated purge process, and no one bothered to archive the salient data for Ramirez and Pillay.

11.3.2 Avoiding Preservation Pitfalls

Pillay successfully moved for a spoliation inference based on the charge that Millard had destroyed critical evidence—a classic case of a company facing consequences from failing to take steps to avoid automated system purge processes. This is but one of many oft-seen preservation blunders. The following is a list of the top five common preservation pitfalls and how you can avoid them.

11.3.2.1 Failure to Flip the Off Switch

Most companies use some automatic deletion or overwrite policies or protocols to manage various types of data. Things like Internet Protocol (IP) logs, which detail access to computer systems, and Big Data troves (such as the LMS in Pillay) are routinely overwritten after set retention periods as a matter of course. The same is true with emails and electronic files that have no business or regulatory compliance value. This automated deletion as part of a document retention and system optimization policy is generally fine and even protected under certain federal and state laws. For example, Federal Rule of Civil Procedure 37(e) provides that “[a]bsent exceptional circumstances, a court may not impose sanctions under these rules on a party for failing to provide electronically stored information lost as a result of the routine, good-faith operation of an electronic information system.” But these protections do not allow deletion of data once the preservation duty triggers.

How do you prevent automatic deletion blunders? Loop information technology personnel into the electronic evidence preservation process and take the time to educate them about the legal requirements. Ensure your counsel understands the nature of various sources of electronically stored information and where and how data is stored. Know what needs to be done to flip the off switch on any automatic deletion protocols for relevant information. Document that preservation steps have been taken regarding any evidence in question. If you choose not to stop the automated processes, alternate means to effectively preserve the evidence should be used; for example, you can typically export and archive the salient data. In the Pillay case, Millard could have at least archived Ramirez’s LMS data and kept the automatic deletion policy in place.

This holds true even beyond the automated deletion example. For example, reissuing laptops from former employees to new employees and wiping the old information (which is the norm in many companies) can also lead to sanctions. “Once a ‘litigation hold’ has been established, a party cannot continue a routine procedure that effectively ensures that potentially relevant and readily available information is no longer ‘reasonably accessible.’” 4 Do not rely on information technology staff to understand what is or is not relevant to a claim; that is a legal judgment that should be left to lawyers. Conversely, corporate or outside counsel may not have the experience to know how corporate information systems function. Consequently, effective preservation requires open and frequent communication between information technology personnel, the business personnel, and counsel.
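One way to picture the "off switch" is a retention purge that checks for active holds before deleting anything. The following is a minimal sketch under stated assumptions: the `Record` type, the `split_for_purge` function, the custodian-level hold list, and the one-year retention period are all hypothetical names and values chosen for illustration, not features of any particular system.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Record:
    record_id: str
    custodian: str
    created: date


def split_for_purge(records, held_custodians, today, retention_days=365):
    """Return (to_purge, to_keep). Records of held custodians are never purged."""
    cutoff = today - timedelta(days=retention_days)
    to_purge, to_keep = [], []
    for record in records:
        expired = record.created < cutoff
        if expired and record.custodian not in held_custodians:
            to_purge.append(record)
        else:
            # Either still inside the retention period or subject to a hold.
            to_keep.append(record)
    return to_purge, to_keep


# Hypothetical data: custodian_a is on hold, custodian_b is not.
records = [
    Record("r1", "custodian_a", date(2008, 6, 1)),
    Record("r2", "custodian_b", date(2008, 6, 1)),
    Record("r3", "custodian_b", date(2009, 6, 1)),
]
to_purge, to_keep = split_for_purge(records, {"custodian_a"}, today=date(2009, 8, 1))
print([r.record_id for r in to_purge])  # only the expired, unheld record
```

The design point is that the hold check lives inside the purge routine itself, so routine deletion can continue for everything else while held data is left alone. Which custodians or data sets belong on the hold list remains a legal judgment for counsel.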

11.3.2.2 The Spreadsheet Error

Typically, companies send out litigation hold notices to individual custodians and information technology personnel letting them know to preserve information relevant to a claim and containing specific instructions particular to a claim. Tracking these litigation holds can be a challenge for large companies and in large litigations. When a company has a dozen or more litigations, each with potentially dozens of custodians and sources of information, keeping track of whose information was preserved, who received a hold, and where preserved information resides presents opportunities for error. The same can be true with a single large litigation; for example, it is not uncommon to have tens of thousands of custodians in a Fair Labor Standards Act class action and to have dozens of nonparties in control of data relevant to claims (e.g., payroll companies, cloud computing services, reimbursement vendors). In particular, when Excel spreadsheets are used to track litigation holds by manual data input, mistakes frequently happen. Through human error, spoliation can occur because the wrong data is entered into the wrong column or row (e.g., “yes” is entered on the wrong custodian for a field “sent litigation hold[,]” and as a result the custodian is never sent a hold and the data is lost). The solution? Buy or build a scalable litigation hold tracking solution that automates as much of the process as possible, including:

  • Who received hold notices, what they contained, and when they were issued;
  • Calendaring follow-ups to custodians who do not respond to holds;
  • Monitoring and documenting preservation efforts, including what was preserved, who preserved it, where the data resides, and when it was preserved; and
  • Other steps taken to alert individuals to their preservation responsibilities, including automated system modifications.

Multiple providers offer low-cost off-the-shelf litigation hold tracking solutions. Custom solutions are also an option, but these are typically more expensive. These systems should ideally be put in place before litigation occurs, because the period when a preservation obligation is already in place is typically not an optimal time to vet various products and options for company-wide use.
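To make the tracking requirements above concrete, here is a minimal sketch of the kind of record such a system keeps. It is not modeled on any commercial product; `HoldNotice`, `HoldTracker`, the method names, and the seven-day follow-up interval are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class HoldNotice:
    matter: str
    custodian: str
    sent_on: date
    acknowledged_on: Optional[date] = None
    preserved_location: Optional[str] = None


class HoldTracker:
    """Hypothetical in-memory tracker keyed by (matter, custodian)."""

    def __init__(self, follow_up_after_days=7):
        self.follow_up_after = timedelta(days=follow_up_after_days)
        self.notices = {}

    def send(self, matter, custodian, sent_on):
        self.notices[(matter, custodian)] = HoldNotice(matter, custodian, sent_on)

    def acknowledge(self, matter, custodian, on):
        self.notices[(matter, custodian)].acknowledged_on = on

    def record_preservation(self, matter, custodian, location):
        self.notices[(matter, custodian)].preserved_location = location

    def never_notified(self, matter, custodians):
        """Custodians on the matter who have no hold notice on record."""
        return [c for c in custodians if (matter, c) not in self.notices]

    def follow_ups_due(self, today):
        """Custodians who have not acknowledged within the follow-up interval."""
        return sorted(
            n.custodian
            for n in self.notices.values()
            if n.acknowledged_on is None and today - n.sent_on >= self.follow_up_after
        )


tracker = HoldTracker(follow_up_after_days=7)
tracker.send("matter-001", "custodian_a", date(2014, 1, 2))
tracker.send("matter-001", "custodian_b", date(2014, 1, 2))
tracker.acknowledge("matter-001", "custodian_a", date(2014, 1, 3))
unsent = tracker.never_notified("matter-001", ["custodian_a", "custodian_b", "custodian_c"])
overdue = tracker.follow_ups_due(date(2014, 1, 10))
print(unsent, overdue)
```

Because the list of custodians is compared against the notices actually sent, a custodian who was skipped shows up in `never_notified` rather than hiding behind a mistyped spreadsheet cell.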

11.3.2.3 The Never-Ending Hold

The duty to preserve does not extend in perpetuity. It ends once the reasonable anticipation of litigation ends, which can occur after settlement, at the conclusion of the litigation and after the time period to file a notice of appeal has passed, or at the end of a statute of limitations for a claim. Yet, more often than not, litigation hold notices are forgotten about and never lifted even after the duty to preserve has ended. This is wasteful as often this litigation data has no business or compliance value. Waste aside, litigation holds can be headaches for individual employees—keeping everything related to a topic requires time and attention and can drain productivity that can otherwise be spent fulfilling business functions. Thus, you do not want to have a hold in place any longer than necessary. The solution is to make sure your litigation hold process includes a mechanism to lift holds when appropriate. This process should not be reliant on an individual, if possible, as individuals tend to come and go from employment because of typical attrition or the engagement of different counsel; typically, litigation hold tracking solutions include features to help ensure holds are eventually lifted.
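A release mechanism that does not depend on one person's memory can be as simple as a scheduled check of matter status. The sketch below is an assumption-laden illustration: the function name, the matter identifiers, and the 30-day appeal window are hypothetical, and the actual period and the decision to lift a hold are for counsel to determine.

```python
from datetime import date, timedelta


def holds_ready_for_release_review(matters, today, appeal_window_days=30):
    """Return matters whose closing date plus the appeal window has passed.

    `matters` maps a matter name to its closing date, or None if still active.
    The result is a list for counsel to review, not an automatic release.
    """
    ready = []
    for name, closed_on in matters.items():
        if closed_on is not None and today > closed_on + timedelta(days=appeal_window_days):
            ready.append(name)
    return sorted(ready)


matters = {
    "matter-001": date(2014, 1, 15),  # closed, window has run
    "matter-002": None,               # still active
    "matter-003": date(2014, 3, 20),  # closed, window still open
}
ready = holds_ready_for_release_review(matters, today=date(2014, 4, 1))
print(ready)
```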

11.3.2.4 The Fire and Forget

Many litigation holds are completely ignored by the recipients. Too often, companies and counsel see sending the litigation hold as a “fire-and-forget” task to be checked off a list, forgetting the ultimate point of a litigation hold: to ensure the recipients actually preserve the information in a timely manner. Simply sending the letter often does not accomplish this task: “[i]t is not sufficient . . . for a company merely to tell employees to ‘save relevant documents.’ . . . This sort of token effort will hardly ever suffice.”5 To ensure that holds are followed, make sure to

  • Send timely, routine follow-ups and contact key personnel to ensure that they have read and will comply with the hold;
  • Engage directly with information technology personnel to ensure that necessary information is actually collected;
  • Set deadlines for preservation and collection tasks for information technology personnel and ensure those deadlines are met;
  • Ask information technology personnel to send or preserve data to a specific location—and then verify the data is accessible and readable from that location;
  • Engage supervisory personnel or management in the litigation hold process; typically, employees pay close attention to correspondence when it comes from someone higher up in the organization, and having holds come from key personnel (or even cc’ing key personnel) can be helpful in encouraging compliance; and
  • Use telephone and in-person follow-ups with recipients, where feasible, which go a long way toward encouraging compliance. Picking up the phone at the outset to explain the process and the urgency of compliance is one of the best methods of encouraging employees to take the required steps.

11.3.2.5 Deputizing Custodians as Information Technology Personnel

Frequently, individual custodians are relied on to collect and preserve their own electronically stored information. Employees are thus deputized as information technology personnel and lawyers and charged with manually selecting and harvesting data relevant to a claim. This practice has been criticized both as a “fox guarding the henhouse,” whereby custodial self-bias would lead to withholding of incriminating information, and as requiring rank-and-file employees to jump in over their heads concerning the technical requirements of collection. In many cases, employees would not even have access to the tools that system administrators would have for preservation. For example, on a Microsoft Exchange Server, typically only an administrator can flag an account on “litigation hold” and prevent the deletion of email, which could also preserve emails that were deleted previously by the custodian. Finally, custodial self-collection can be distracting and detrimental to a business; if you do not know what you are doing, it can be time consuming and frustrating to try to collect and preserve electronically stored information. Although custodial self-collection is not prohibited by law in most circumstances, it is typically not the best practice for these reasons. The solution is simply to loop in information technology personnel to accomplish collection and to use information technology employees or experienced vendors to collect data. For small companies, many forensic companies offer relatively low-cost “plug-and-collect” devices so even technologically unsophisticated employees or managers can simply plug in a device and allow the device to automatically conduct forensically sound data collection.

11.3.3 Pulling the Litigation Hold Trigger

Although the Pillay case involved a clear trigger date of duty to preserve (i.e., Millard was on notice that the LMS data was relevant when it deleted the data), the reasonable anticipation of litigation standard is inherently ambiguous. The judicial determination of when a company reasonably anticipates litigation necessarily involves a subjective after-the-fact analysis by a judge or jury.

In the absence of an obvious bright-line litigation hold triggering event, such as the filing or service of a complaint, a court will consider a variety of variables to determine when this preservation duty arises. From a judicial standpoint, this variability is desirable, despite the potential for inconsistency. Attempts to produce clarity and uniformity through the imposition of a forced bright-line test could cause unnecessary rigidity in the preservation standard.

For example, a uniform rule that a party need only preserve data after it was sued or filed suit would create an opportunity for mischief (i.e., presuit “housecleaning” of incriminating data).6 However, for organizations, individuals, and their lawyers, this malleability makes it difficult to determine whether a litigation threat or an event is serious enough to trigger information preservation obligations.

The preservation duty is certainly triggered by the service of a lawsuit on a party; if you are sued and served, you obviously know litigation is happening. The old paradigm typically involves an event-specific analysis of the duty to preserve, built around events that most lawyers would recognize as potential litigation triggers. For example, an employee allegedly physically confronting a supervisor,7 a fatal accident,8 a notice that food products were contaminated,9 the service or filing of a complaint, or a plaintiff’s retention of counsel to sue are all recognizable and chronologically specific events that have triggered the duty to preserve.

Big Data brings the potential for a less-obvious and new paradigm on the reasonable anticipation of litigation standard. If a company has the resources and can use people and technology to quickly analyze large amounts of data of different types from a variety of sources to produce a stream of actionable knowledge, there is a question of how this heightened insight affects the preservation duty.

11.3.4 Big Data Preservation Triggers

The LMS system in the Pillay case provides an excellent example of the potential for a new Big Data preservation paradigm. The system measured productivity of employees, and this type of analytic tool can have an impact on the determination of when litigation is anticipated and a data hold is required. Predictive analytics of employee timekeeping records, productivity records, email accounts, and even Internet search history can also show when employees spend a large percentage of their time on personal or nonwork matters (e.g., sending a high volume of personal emails on the clock, shopping online, bantering about fantasy football).

Companies pay employees to work, not to dither online. Analytics that detect personal activity on company time to measure productivity are simultaneously identifying grounds for employee discipline or termination. Spending frequent time on the clock while browsing personal websites or shopping for personal items online can be a ground for termination. Employee productivity data analytics can then give rise to preservation obligations, and preservation of data that provides grounds for termination in an employment context is doubly important, as Pillay illustrates. The records showing an employee’s on-the-clock activity need to be kept in a manner that they can be used as evidence later and, where appropriate, need to be routed to counsel so they can make the legal determination of whether the analytic data triggers the legal duty to preserve.

Employee behavior analytics can also detect when employees are likely sending trade secret or confidential information outside the company. Large numbers of emails with multiple attachments sent in rapid succession, frequently without any information in the subject line or the text, are often a sign of a document grab for personal or competitive use. The behavior is typical: most employees keep documents on a company system, computer, or hard drive, and employees often attempt to download or transmit this information en masse before leaving employment or when seeking to sell the information to a competitor. More technologically sophisticated employees typically attempt to cover their tracks by using flash drives or attempting to delete sent items to avoid leaving easily trackable electronic trails, but even sophisticated employees who try to cover their tracks typically leave a followable trail of electronic evidence. In an exemplar case involving data theft, the day before an employee’s employment with the company was terminated, he “forwarded [confidential and trade secret information from his workplace] email to his personal email account, and that he used the information to recruit additional employees and agents on behalf of [a competitor].”10 This is, unfortunately, typical employee behavior.
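The signal described here (many attachment-bearing emails to an outside address in rapid succession, often with a blank subject line) lends itself to a simple sliding-window check. The sketch below is only an illustrative heuristic: the `SentEmail` fields, the ten-minute window, and the five-email threshold are assumptions, and a flag should be treated as a prompt for human and legal review, not as proof of wrongdoing.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SentEmail:
    sender: str
    recipient: str
    sent_at: datetime
    attachment_count: int
    subject: str


def flag_possible_document_grab(emails, company_domain,
                                window=timedelta(minutes=10), min_emails=5,
                                require_blank_subject=True):
    """Return senders with at least `min_emails` suspicious emails inside `window`."""
    internal_suffix = "@" + company_domain.lower()
    times_by_sender = {}
    for e in sorted(emails, key=lambda e: e.sent_at):
        external = not e.recipient.lower().endswith(internal_suffix)
        blank_ok = (not e.subject.strip()) or not require_blank_subject
        if e.attachment_count > 0 and external and blank_ok:
            times_by_sender.setdefault(e.sender, []).append(e.sent_at)

    flagged = set()
    for sender, times in times_by_sender.items():
        start = 0
        for end in range(len(times)):
            # Shrink the window from the left until it spans at most `window`.
            while times[end] - times[start] > window:
                start += 1
            if end - start + 1 >= min_emails:
                flagged.add(sender)
                break
    return flagged


base = datetime(2014, 5, 1, 17, 0)
emails = [SentEmail("alice@corp.example", "alice@personal.example",
                    base + timedelta(minutes=i), 3, "") for i in range(5)]
emails += [SentEmail("bob@corp.example", "client@partner.example",
                     base + timedelta(hours=i), 1, "") for i in range(5)]
flagged = flag_possible_document_grab(emails, "corp.example")
print(flagged)
```

In this hypothetical data, the first sender's five emails fall within four minutes and are flagged, while the second sender's emails are spread over hours and are not.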

Certainly large-scale document grabs for personal or competitive use not only are a ground for termination but also may require immediate attention by counsel to preserve and retrieve the confidential information. Once trade secret or confidential information escapes, it is difficult to mitigate the harm and to prevent the data from being used. Preventing the use or sale of proprietary and valuable information such as customer lists or trade secret data requires fast action. A preliminary injunction or other quick provisional remedy may be needed to prevent the use of the information and contain the damage.

As another example, the casualty insurance industry often provides insurance for individual or organizational negligent acts or omissions. As you may imagine, litigation costs and the potential for litigation are major factors in handling casualty claims. Because of this, some casualty insurers use Big Data techniques in applying litigation prediction applications to their claims system. This predictive know-how can help control claim costs, but it can also trigger the duty to preserve at a much earlier date.

For example, say an insurer received a claim request. This in itself would not necessarily trigger the duty to preserve because the receipt of an insurance claim without more may not mean that litigation is on the horizon. Many insurance claims resolve short of litigation. However, assume this same insurer received a claim request and using predictive analytics determined that there is a 20% chance that the claim will result in litigation. This data-driven knowledge could trigger the duty to preserve.

A litigation threat rising to the level of “reasonable anticipation” requires more than the mere possibility that litigation might occur.11 Litigation must actually be, to some extent, likely. The reasonable anticipation standard is also applied by lawyers, and lawyers are typically quicker than nonlawyers to see litigation lurking around every corner. A lawyer or judge may see an empty playground with no fencing around the playground, think “attractive nuisance[,]” and reason that the owner of the playground should reasonably anticipate litigation. A layman would more likely just see a swing set and would not equate an unfenced playground to a likely litigation risk.

At the 20% probability level, it is entirely possible that a court would hold that this percentage chance rose to the level of reasonable anticipation of litigation. At a 50% or greater level, it is likely that a court would find that the knowledge gleaned from the Big Data analytics created a reasonable anticipation of litigation on behalf of the casualty insurer. Although there is no set percentage threshold uniformly applied across the country as the minimum threshold for reasonable anticipation of litigation, whenever Big Data analytics are used, you should do the following:

  • Loop in the Lawyers. If your Big Data analysis involves employee performance, loop in a labor and employment lawyer to discuss what obligations you may have in conjunction with preservation. For insurance analytics, loop in an insurance lawyer for the same reason. Consult a legal subject matter expert in the early phases regarding whatever it is your analytic tool measures to determine what the potential compliance and legal obligations surrounding the analytic tool may be and implement measures to preserve data or offer input on the analytic process that are recommended by counsel.
  • Be Sure You Want and Need to Know. As far back as 2009, two seniors at the Massachusetts Institute of Technology (MIT) conducted a study whereby they were able to create an effective “method of classifying sexual orientation of individuals on Facebook, regardless of whether they chose to disclose that information. Facebook users who did not disclose their sexual orientation in their profiles would presumably consider the present research an invasion of privacy. Yet this research uses nothing more than information already publicly provided on Facebook; no interaction with subjects was required.”12 While an interesting study in data analytics, no corporate human resources department would condone this type of tool being used in the hiring process. Certainly, a plaintiff would argue that there would be no possible legitimate use of this tool in the workplace, and that any rationale given would be a pretext for discrimination. Thus, make sure you want to know—and have a need to know—whatever it is you are seeking to extract from Big Data.
  • Automate Preservation. Assuming you consult with counsel and determine your Big Data analytics could trigger preservation obligations, automate preservation to the extent possible. For example, if an employee’s productivity level were to drop below whatever a company deemed the lowest threshold, that employee’s email account could be automatically flagged on litigation hold, the employee’s Internet history saved permanently (to check to see if he or she was browsing the web on company time), and phone records and other documentation saved. Automating these functions takes out the lag time associated with human beings, as well as eliminates the room for error associated with human involvement in manual input tasks.
  • Be Quick. Particularly with data protection analytics targeted to employees, counsel needs to be alerted quickly to take the necessary triage actions, as well as preserve data relevant to the employee’s actions. Access logs and other data showing access to systems typically are overwritten on a regular basis, and preserving this information that has a short shelf life can be crucial in claims involving theft of confidential or trade secret information. Assuming automated processes are used that preserve information immediately on analytics hitting on a risk, immediate response time is readily achievable. If human beings are used instead, steps need to be taken to ensure an extremely fast triage response so that you have adequate time to act to stop the release or mitigate its impact. If predictive analytics key on an employee theft of trade secret information or a data breach but aggressive steps, including seeking an injunction, are not taken immediately, that delay can be the difference between obtaining a court order to seize the data from the employee or quickly notifying a bank of stolen credit card numbers, on the one hand, and facing a Snowden-like situation in which the company information has been made public and the harm from disclosure is exponentially increased, on the other.
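The "Automate Preservation" and "Be Quick" recommendations can be sketched as a rule that turns an analytic result directly into a list of preservation steps. Everything named below is hypothetical: the `evaluate_triggers` function, the productivity floor of 0.5, and the 20% litigation-probability threshold (borrowed from the insurer example above) are illustrative assumptions. The right thresholds, and whether a given score actually triggers the duty to preserve, are questions for counsel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreservationAction:
    subject: str
    reason: str
    steps: tuple


def evaluate_triggers(subject, productivity=None, litigation_probability=None,
                      productivity_floor=0.5, probability_threshold=0.2):
    """Map analytic outputs to preservation steps. Thresholds are assumptions."""
    actions = []
    if productivity is not None and productivity < productivity_floor:
        actions.append(PreservationAction(
            subject,
            "productivity below company floor",
            ("place mailbox on litigation hold",
             "retain internet history",
             "retain phone records",
             "notify counsel"),
        ))
    if litigation_probability is not None and litigation_probability >= probability_threshold:
        actions.append(PreservationAction(
            subject,
            "predicted litigation risk at or above threshold",
            ("preserve claim file", "notify counsel"),
        ))
    return actions


employee_actions = evaluate_triggers("employee-17", productivity=0.3)
claim_actions = evaluate_triggers("claim-204", litigation_probability=0.2)
no_actions = evaluate_triggers("employee-18", productivity=0.9, litigation_probability=0.05)
print(employee_actions[0].steps)
```

Each rule ends with a notification to counsel, so the automated step buys time without replacing the legal determination.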

11.4 Big Database Discovery

Big database discovery can also present problems. Big Data analytics can be based on structured data, frequently contained in databases. Large structured databases present a number of challenges in discovery. Working with structured data (such as that contained in a traditional database) is different from working with unstructured data (such as a series of Word documents in a folder).

11.4.1 The Database Difference

Databases contain discrete categories of information, divided into individual fields. These structured categories of information are kept together collectively. The individual database fields in structured databases differ from typical unstructured data because, unlike unstructured data, structured data is typically not presented in the exact form that it was created. Rather, structured databases are composites of fields that only make sense through their interrelationship. Structured databases also typically have tightly defined parameters regarding how data is input, kept, and retrieved. As an example, a structured database may break a location into multiple elements that do not make sense when broken apart. The coordinates 34°59′20″ N, 106°36′52″ W refer to a specific location, but a structured database would store each number as a discrete element (34, 59, 20, 106, 36, and 52), with each element stored in a separate data field. Unlike a Word document or an Excel spreadsheet, each of these separate elements must refer to the others to make sense; only collectively are they coordinates as opposed to a numerical jumble.
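The coordinate example can be shown with a small in-memory database. This is a sketch using Python's standard `sqlite3` module; the table name, column names, and the separate hemisphere columns are assumptions added for illustration.

```python
import sqlite3

# Each component of the coordinate lives in its own field.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE site_location (
        site_id INTEGER PRIMARY KEY,
        lat_deg INTEGER, lat_min INTEGER, lat_sec INTEGER, lat_hem TEXT,
        lon_deg INTEGER, lon_min INTEGER, lon_sec INTEGER, lon_hem TEXT
    )
""")
conn.execute(
    "INSERT INTO site_location VALUES (1, 34, 59, 20, 'N', 106, 36, 52, 'W')"
)


def format_coordinates(row):
    """Recombine the discrete fields into a human-readable coordinate."""
    _, lat_d, lat_m, lat_s, lat_h, lon_d, lon_m, lon_s, lon_h = row
    return f"{lat_d}°{lat_m}′{lat_s}″ {lat_h}, {lon_d}°{lon_m}′{lon_s}″ {lon_h}"


# A single field on its own is just a number.
lat_minutes_only = conn.execute(
    "SELECT lat_min FROM site_location WHERE site_id = 1"
).fetchone()

row = conn.execute("SELECT * FROM site_location WHERE site_id = 1").fetchone()
coordinates = format_coordinates(row)
print(lat_minutes_only, coordinates)
```

The value 59 pulled from one field means nothing in isolation; the location only exists once all of the fields are read together, which is why producing isolated fields from a database can be misleading.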

The preservation of structured databases is challenging because most databases are active composites of various information. To put a “hold” on the use or modification of the database itself could be crippling to a business in which the database is used constantly to fulfill critical business functions. The structure of the database and its contents typically guide the methods used to ensure that relevant data is preserved. But, in the context of Big Data, limitation on the scope of preservation is particularly important. Simply because certain fields in a structured database are relevant does not mean the entire database is within the scope of the duty to preserve.

11.4.2 Databases in Litigation

Case law generally supports that a litigant will not gain access to another’s database simply because some of the data within a database is relevant to litigation, and typically parties will confer regarding targeted queries as opposed to wholesale production. Many databases contain personally identifiable information subject to data protection laws, as well as confidential or trade secret information.

Under the applicable Federal Rules of Civil Procedure, a litigant does not automatically receive unrestricted, direct access to a party’s database compilations. Instead, a requesting party can inspect and copy data relevant to the lawsuit.13

It is, however, possible to gain access to a database directly in the correct circumstances. A recent example of database litigation in a trademark infringement and fraud case resulted in just this type of access.14 The plaintiff in this case obtained a temporary restraining order against the defendants and sought expedited discovery in the form of a copy of the defendants’ OS Commerce database. The defendants alleged that this database contained individual fields corresponding to sales information about products ordered and sold, as well as allegedly sensitive information (such as listings of the defendants’ customers).

The defendants objected to allowing access to the database, claiming that the request asked for confidential and sensitive information from its “most important asset” that would give the plaintiff a competitive advantage and that the request amounted to “an obvious fishing expedition.” Reasoning that the information on the database was highly relevant, the court held that “although [the plaintiff’s] request for [the defendant’s] entire OS Commerce database appears facially intrusive, the benefits of allowing [plaintiff] such direct access, under the circumstances of this case, outweigh the burden of producing it, particularly since a protective order is in place. . . . [Access to the database] is more than a mere fishing expedition.” As the Advanced Tactical case showed, database discovery parameters are determined by the facts of the individual case—what is impermissible in one case may be perfectly permissible in the next. So, how do you handle database preservation, collection, and search?

11.4.3 Cooperate Where You Can

Many database discovery disputes escalate unnecessarily because of a lack of technological understanding by counsel on both sides of the aisle.15 Frequently, expensive discovery disputes can be resolved when counsel consult with individuals in information technology about the capabilities for database exports and understand the unique issues associated with database production. When responding to discovery, rather than fight over wholesale database production, explain why wholesale production does not make sense and talk about the various database fields from which you can export to come to a reasonable solution for an export and production. When seeking discovery, ask about the fields in the database and determine what you need. Do not ask for more than you need, and if you want direct access to the database, agree to confidentiality safeguards and reasonable measures in exchange for supervised direct access.
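To make the idea of a field-level export concrete, here is a minimal Python sketch using the standard sqlite3 and csv modules. The table, its columns, the sample rows, and the list of agreed fields are all hypothetical; a real export would be designed with the information technology staff who know the system.

```python
import csv
import io
import sqlite3

# Hypothetical schema and rows, standing in for a live business database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER, order_date TEXT, product TEXT, "
    "quantity INTEGER, customer_name TEXT, card_number TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "2012-11-02", "widget", 3, "A. Smith", "4111-xxxx"),
        (2, "2013-01-15", "gadget", 1, "B. Jones", "5500-xxxx"),
        (3, "2013-02-20", "widget", 7, "C. Brown", "3400-xxxx"),
    ],
)

# Fields assumed to be agreed as relevant; customer and payment fields stay out.
AGREED_FIELDS = ("order_id", "order_date", "product", "quantity")


def export_agreed_fields(conn, start_date, end_date):
    """Return CSV text holding only the agreed fields for the agreed date range."""
    # Column names come from the fixed tuple above, never from user input.
    query = (
        f"SELECT {', '.join(AGREED_FIELDS)} FROM orders "
        "WHERE order_date BETWEEN ? AND ? ORDER BY order_id"
    )
    rows = conn.execute(query, (start_date, end_date)).fetchall()
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(AGREED_FIELDS)
    writer.writerows(rows)
    return buffer.getvalue()


export_text = export_agreed_fields(conn, "2013-01-01", "2013-12-31")
print(export_text)
```

The query reads from the database without modifying or freezing it, and the output omits the customer and payment columns entirely, which is the kind of targeted production counsel can negotiate once both sides understand which fields exist.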

11.4.4 Object to Unreasonable Demands

State and federal courts across the country are attuned to the idea of proportionality in eDiscovery: A $100,000 claim does not typically justify $100,000 in eDiscovery expense. The Federal Rules of Civil Procedure provide for proportionality, and although there is no magic percentage number universally blessed by the courts (e.g., a rule that 1% of the claim amount is reasonable for discovery), high-dollar database preservation or production requests that impose an undue burden should be resisted. Try to cooperate and reach a reasonable middle ground, but refuse to comply with, and lodge the appropriate objections to, unreasonable and unduly expensive preservation or production demands. Get before a judge if possible. Failure to object can have disastrous consequences, such as being forced to pay approximately $6 million to comply with a nonparty subpoena.16

11.4.5 Be Specific

The Advanced Tactical defendants were criticized for submitting information to the court about their database in “rather general terms[,]” as it was difficult for the court to make a reasoned decision in a fact-specific analysis with only general information about the information systems involved. This is an extremely common failing in eDiscovery litigation, in that counsel will typically speak in generalities (“the request is too burdensome”) as opposed to specifics (“the request requires the review of 1,234,522 email documents, 1,234 Word documents, and 2,234 voice mail messages, which would cost approximately X dollars and Y work hours, which is overly burdensome”). Specific information justifying why data should or should not be produced needs to be presented to the court to win eDiscovery disputes. The Advanced Tactical plaintiff, for example, provided a specific reason why it allegedly needed direct database access, arguing that “once it has the database, it can determine whether, as it has reason to suspect, [defendant] is using hidden ‘metatags’ referencing [the plaintiff’s] trademark ‘pepperball’ to drive higher search engine results for that term.” The plaintiff prevailed in Advanced Tactical, which illustrates the point: specific and salient facts must be presented to enable the court to reach a reasoned decision.
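The “X dollars and Y work hours” in a specific burden statement can be estimated with simple arithmetic. The sketch below uses the document counts from the example above; the review speeds and the hourly rate are assumptions chosen purely for illustration and should be replaced with figures measured on the actual matter.

```python
# Document counts taken from the example in the text.
COUNTS = {"email": 1_234_522, "word": 1_234, "voicemail": 2_234}

# Assumed figures for illustration only; substitute measured rates.
DOCS_PER_HOUR = {"email": 60, "word": 30, "voicemail": 12}
HOURLY_RATE_USD = 50


def estimate_burden(counts, docs_per_hour, hourly_rate):
    """Return (total review hours, total cost) for a linear human review."""
    hours = sum(counts[kind] / docs_per_hour[kind] for kind in counts)
    return hours, hours * hourly_rate


hours, cost = estimate_burden(COUNTS, DOCS_PER_HOUR, HOURLY_RATE_USD)
print(f"{hours:,.0f} review hours, about ${cost:,.0f}")
```

Stating the inputs alongside the result lets the court and opposing counsel test each assumption rather than argue over an unexplained total.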

11.4.6 Talk about Database Discovery Early in the Process

In the federal system, parties are required to meet and confer on eDiscovery in their Rule 26(f) conference, and many states have equivalent mandatory meet and confer requirements. Talking about database discovery early on is the best way to address and resolve all the various issues, as frequently counsel shy away from even admitting to having databases that could be subject to search. Hiding from issues or attempting to hide the ball on sources of information is a bad solution that typically ends in increased litigation expense for all parties; do not be afraid to bring up the issue at the outset of discovery. If there is a good-faith disagreement on scope of preservation or production, the court can become involved and resolve the dispute early in the process, before discovery is conducted, to prevent discovery redos or slipups.

11.5 Big Data Digging

Much as Big Data involves the use of predictive analytics to derive insights from large datasets, CAR enables lawyers to use active machine learning algorithms to review large document sets in litigation. Active machine learning is a type of artificial intelligence. When used in legal search, these artificial intelligence algorithms can significantly improve the efficiency and accuracy in the search, review, and classification of electronically stored information.

11.5.1 Driving the CAR Process

Using CAR, an attorney or group of attorneys trains a computer to find documents identified by the attorney or group as a target. The target is typically relevant to a particular lawsuit or legal issue, or some other legal classification, such as privilege. The CAR system acts as a force multiplier for senior attorney judgment, allowing (in the correct case) better recall and precision in the search while reducing overall costs. CAR works well on text-searchable datasets of discrete information, particularly so on email. The classic example of a CAR-amenable dataset is a large number of emails, accompanied by various loose unstructured data collections, such as Word documents, Adobe portable document format (PDF) files, and the like.

First, a subject matter expert (or experts) on the case performs manual reviews of search samples from the dataset. The samples are selected by the attorney’s judgment, and are not random samples. The selections are made with the help of various software search features, including keyword searches and concept searches. Then, statistically random sampling is used to establish a baseline for quality control purposes.

Next, the CAR software’s calculations begin. This is also known as seed set training. Here, the predictive coding software analyzes all of the categorizations made by subject matter experts in the prior steps as long as the documents were designated by them as training documents. Based on this input, coding runs begin by which the software scans all of the data uploaded onto a review platform (the corpus) and assigns a probability value from 0 to 100 to each document in the corpus. A value of 100 represents the highest probability (100%) that the document matches the category trained, such as relevant, or highly relevant; a value of 0 means no likelihood of matching, whereas 50 represents an equal likelihood of matching or not matching. The software predictions about a document’s categorization are often wrong, sometimes wildly so, depending on the kind of search and data involved. This is why spot-checking and further training are needed for CAR to work properly: It is an iterative process, not a one-step automated review.
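The 0-to-100 scoring can be illustrated with a toy classifier. The sketch below uses a simple naive Bayes model written in plain Python; this is a stand-in chosen for brevity, not the algorithm used by any commercial CAR product, and the training documents are invented.

```python
import math
from collections import Counter


def train(training_docs):
    """Build word counts from (text, is_relevant) pairs coded by a reviewer.

    Assumes at least one relevant and one irrelevant training document.
    """
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in training_docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def score(text, model):
    """Return a 0-100 score: the estimated probability of relevance."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_score = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        log_p = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training carry no signal
                # Add-one smoothing avoids a zero probability for rare words.
                numerator = word_counts[label][word] + 1
                log_p += math.log(numerator / (total_words + len(vocab)))
        log_score[label] = log_p
    p_relevant = 1 / (1 + math.exp(log_score[False] - log_score[True]))
    return round(100 * p_relevant)


# Invented seed set: two documents coded relevant, two coded irrelevant.
model = train([
    ("pepperball launcher trademark sales", True),
    ("pepperball projectile order invoice", True),
    ("lunch menu friday", False),
    ("holiday party schedule friday", False),
])

for doc in ("pepperball invoice", "friday lunch", "quarterly newsletter"):
    print(doc, "->", score(doc, model))
```

A document containing only words the model has never seen scores 50 here, since the model has no basis for leaning either way. Scores like these are predictions to be checked and corrected by reviewers, not final classifications.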

After the initial categorization is completed, prediction error corrections are made. Lawyers and paralegals find and correct the computer errors by a variety of methods. The CAR software learns from the corrections. This iterative process is repeated in a virtuous feedback loop that continues until the computer predictions are accurate enough to satisfy discovery standards.

Next, the reasonability of the decision to stop the training is evaluated by an objective quality control test. The test is based on a random sample of all documents to be excluded from the final review for possible production. The exclusion can be based on both category prediction (i.e., probably irrelevant) and probability ranking of documents with proportionate cutoffs. The focus is on a search for any false negatives (i.e., relevant documents incorrectly predicted to be irrelevant), especially any of particular significance.
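A quality control test of this kind, sometimes called an elusion test, can be sketched in a few lines of Python. The function below draws a random sample from the excluded documents, counts the false negatives a reviewer finds, and projects that rate onto the whole excluded set. The document identifiers, the sample size, and the `is_relevant` callback (which stands in for a human reviewer's judgment) are hypothetical.

```python
import random


def elusion_test(excluded_doc_ids, is_relevant, sample_size, seed=0):
    """Sample the excluded set and estimate how many relevant documents it hides.

    is_relevant stands in for a human reviewer's call on each sampled document.
    """
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    sample = rng.sample(excluded_doc_ids, sample_size)
    false_negatives = [doc_id for doc_id in sample if is_relevant(doc_id)]
    rate = len(false_negatives) / sample_size
    estimated_missed = round(rate * len(excluded_doc_ids))
    return false_negatives, rate, estimated_missed


# Invented data: 10,000 excluded documents, 50 of which are actually relevant.
excluded = list(range(10_000))
truly_relevant = set(range(0, 10_000, 200))

found, rate, missed = elusion_test(
    excluded, lambda doc_id: doc_id in truly_relevant, sample_size=1_500
)
print(f"false negatives in sample: {len(found)}")
print(f"estimated rate: {rate:.2%}, estimated missed documents: {missed}")
```

The sample gives only an estimate, so the false negatives it turns up deserve individual attention: a single important document among them may matter more than the overall rate.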

The decision is then made on the number of documents to be reviewed by humans for possible production. Typically, a litigant will use CAR processes to winnow out irrelevant documents and will then have humans review the documents identified by the CAR process as relevant. But, this is not always the case, as sometimes a litigant will produce the documents without full human review, using keyword searches or other methods to spot-check samples of the produced documents.

Finally, after all the documents are reviewed, they are typically spot-checked and produced. The final work includes preparation of a privilege log, which is typically delayed until after production. Also, large-scale productions are frequently done in rolling stages as review is completed.

11.5.2 The Clawback

When using CAR methods, because of the volume of data involved, litigants should use clawback orders as a matter of course to help protect from privilege waiver in large-scale productions. According to District Court Judge Browning, “[t]he train on th[e] concept [of clawback orders] has already left the station, and clawback orders are staples of modern complex commercial litigation.”17

What is a clawback order, you might ask, and why did Judge Browning drop this locomotive metaphor? In the S2 Automation opinion, Judge Browning quoted Professor James Moore’s concise rundown of a clawback order:

Federal courts may enter confidentiality orders providing that disclosure of privileged or protected material in a litigation pending before the court does not constitute waiver in other state or federal proceedings. In suggesting this provision, the Advisory Committee acknowledged that the utility of a confidentiality order in reducing discovery costs is substantially diminished if it provides no protection outside the particular litigation in which the order is entered. Entry of a confidentiality order will prevent nonparties to the litigation from obtaining privileged material produced pursuant to such a confidentiality order. The rule also encompasses situations in which the parties are ordered to provide documents under a “claw-back” or “quick peek” arrangement. These types of arrangements allow the parties to produce documents for review and return without engaging in a privilege review, but without waiver of privilege or work product protection, as a way to avoid the excessive costs of full privilege review and disclosure when large numbers of documents are involved. The rule provides the parties with predictable protection from waiver when responding to a court order for production of documents pursuant to such an arrangement.

A clawback order is essentially a privilege waiver prophylactic. Federal Rule of Evidence 502(d) gives a federal court the power to enter this kind of clawback order, and as the advisory committee notes to 502(d) indicate, the parties do not even have to agree on the clawback order for the court to enter it. The advisory committee also correctly pointed out that such orders “are becoming increasingly important in limiting the costs of privilege review and retention, especially in cases involving electronic discovery[.]” Judge Browning’s case involved a situation where the plaintiff just “[did] not like the overall concept” (for whatever reason), and the court held that “[t]he train on that concept has already left the station, and clawback orders are staples of modern complex commercial litigation.” The lesson? Do not try to stop the clawback train. Instead, get on board, and sleep a little easier about privilege waiver. These types of clawback arrangements can also allow the parties to produce documents for review and return without engaging in a full-scale privilege review, but without waiver of privilege or work product protection, as a way to avoid the excessive costs of full privilege review and disclosure when large numbers of documents are involved.

Although typically lawyers rightly cringe at the thought of cursory privilege review, bottom-line considerations can trump the legal best practice; disclosing privileged documents (without waiving privilege) can be a more attractive option than spending additional sums on privilege review to ensure withholding of privileged documentation. Even with full-scale privilege review, when millions of emails are in play, accidental production of privileged material is statistically likely, and the clawback rule provides the parties with predictable protection from waiver when responding to a court order for production of documents pursuant to such an arrangement.

11.6 Judicial Acceptance of CAR Methods

When CAR began to see more widespread use a few years ago, litigants occasionally sparred over the use of the technology in lieu of straight eyes-on-every-document review. Then, the first judicial opinion endorsing the use of CAR came in Da Silva Moore v. Publicis Groupe, 287 F.R.D. 102 (S.D.N.Y. 2012). Da Silva was a lengthy and hotly contested case in which the court dug deeply into the inner workings of the CAR process.

Since Da Silva, there have been a number of concise opinions or excerpts of state court judges accepting the concept of CAR as the norm. A Virginia state court endorsed the use of CAR, over strenuous objection, in a partially handwritten order in Global Aero. Inc. v. Landow Aviation, L.P., 2012 Va. Cir. LEXIS 50 (Va. Cir. Ct., Apr. 23, 2012).

In the Southern District of New York, Judges Kaplan and Treece have both cited the availability of CAR as part of their analysis in rejecting undue burden objections to discovery requests. As another recent example of CAR acceptance from a December 2012 hearing at which the use of CAR was challenged, Judge Andrews in Delaware stated:

Why isn’t that something—you know, you answered their discovery however you answered it—why isn’t it something where they answer your discovery however they choose to answer it, complying with their professional obligations? How do you get to be involved in the seed batch?18

Thus, the evolving attitude seems to be that CAR is presumptively reasonable—a presumption that the human eye and brain (perhaps undeservedly) currently enjoy. The judiciary has proven aware, at least conceptually, of CAR and its potential application in litigation. The defensibility of the concept of CAR is morphing into a footnote point. Although this does not mean CAR cannot be challenged (or that it should not be challenged in action), challenges only to the general concept of CAR now tend to die quickly on the vine when raised.

11.7 Conclusion

Big Data leaves room for big electronic discovery mistakes. In preservation, make sure to flip the “off” switch for automatic deletion protocols when appropriate; to properly implement, track, and lift litigation holds; and to ensure that collection is handled in a forensically sound and defensible manner. Loop in the lawyers in implementing Big Data analytics to ensure you have considered the legal ramifications (and propriety) of the analytics, as well as automating, where possible, data preservation. Resist overly broad Big Database discovery and be specific in seeking to obtain or block discovery requests. When drowning in Big Data search, use CAR in the right cases to do a better job for a lower cost—and make sure to have a clawback in place in federal litigations involving high-volume exchanges of electronically stored information.

Notes

1. IBM. Apply New Analytics Tools to Reveal New Opportunities. n.d. http://www.ibm.com/smarterplanet/us/en/business_analytics/article/it_business_intelligence.html.

2. Armory is a story about a chimney sweep’s boy and jeweler. Armory was a chimney sweep’s boy who happened on a ring containing a jewel. Not knowing the jewel’s value, Armory took the ring to a jeweler, Delamirie. Delamirie’s assistant removed the gem from the ring, telling Armory he wished to weigh the jewel to determine its worth. The assistant brought back the ring—with the jewel missing from the socket—and told Armory the ring was only worth three half-pence. Armory asked for the ring and jewel back; Delamirie’s assistant apparently “lost” the ring. Delamirie (and his assistant, as his agent) had a duty to preserve the jewel, and as they failed to produce the jewel for inspection, the chief justice instructed the jury that “unless the defendant did produce the jewel, and shew it not to be of the finest water, they should presume the strongest case against him, and make the value of the best jewels the measure of their damages: which they accordingly did.” This sanction is known as an “adverse inference[,]” and spoliation is not a new concept at law.

3. See Pillay v. Millard Refrigerated Services, Inc., 2013 U.S. Dist. LEXIS 72350 (N.D. Ill. 2013).

4. Cache La Poudre Feeds, LLC v. Land O’Lakes Farmland Feed, LLC, 244 F.R.D. 614, 629 (D. Colo. 2007), citing In re Cheyenne Software, Inc., 1997 U.S. Dist. LEXIS 2414 (E.D. N.Y. 1997) (awarding monetary sanctions based on defendants’ destruction of documents stored on computer hard drives; noted that information on those hard drives could have been copied to other relatively inexpensive storage media).

5. Samsung Electronics Co., Ltd. v. Rambus, Inc., 439 F. Supp. 2d 524, 565 (E.D. Va. 2006). See also Zubulake v. UBS Warburg LLC, 2004 WL 1620866 at *8 (S.D.N.Y. 2004) (“It is not sufficient to notify all employees of a legal hold and expect that the party will then retain and produce all relevant information. Counsel must take affirmative steps to monitor compliance so that all sources of discoverable information are identified and searched.”).

6. See Bayoil, S.A. v. Polembros Shipping Ltd., 196 F.R.D. 479, 483 (S.D. Tex. 2000) (“Notice does not have to be of actual litigation, but can concern ‘potential’ litigation. . . . Otherwise, any person could shred documents to their heart’s content before suit is brought without fear of sanction.”).

7. See EEOC v. Dillon Cos., Inc., 839 F. Supp. 2d 1141, 1143 (D. Colo. 2011).

8. See Ashton v. Knight Transp., Inc., 772 F. Supp. 2d 772, 775 (N.D. Tex. 2011).

9. See Kraft Reinsurance Ir., Ltd., v. Pallets Acquisitions LLC, 843 F. Supp. 2d 1318, 1320 (N.D. Ga. 2011).

10. Combined Ins. Co. of America v. Wiest, 578 F. Supp. 2d 822, 826 (W.D. Va. 2008).

11. See Hynix Semiconductor Inc. v. Rambus, Inc., 591 F. Supp. 2d 1038, 1061 (N.D. Cal. 2006) (noting “Litigation ‘is an everpresent possibility in American life’” and that reasonable anticipation requires “more than a possibility” of litigation).

12. Carter Jernigan and Behram F.T. Mistree. Gaydar: Facebook Friendships Expose Sexual Orientation. First Monday, 14(10) (October 5, 2009). http://firstmonday.org/article/view/2611/2302.

13. In re Ford Motor Co., 345 F.3d 1315, 1316-17 (11th Cir. 2003).

14. Advanced Tactical Ordnance Sys., LLC v. Real Action Paintball, Inc., 2013 U.S. Dist. LEXIS 25022 (N.D. Ind. 2013).

15. See, for example, Mills v. Billington, 2013 U.S. Dist. LEXIS 118284 (D.D.C. 2013) (noting in addressing database discovery issues that “electronic discovery issues in this case have been unnecessarily complicated, the Plaintiffs identified what they sought but failed to do so with precision, and the Defendant expressed an inability to understand Plaintiffs’ request and failed to inform the Court or the Plaintiffs when the data was no longer preserved in its possession.”).

16. See In Re Fannie Mae Securities Litigation, 2009 U.S. App. LEXIS 9 (D.C. Cir. 2009).

17. S2 Automation LLC v. Micron Technology, Inc., 2012 U.S. Dist. LEXIS 120097 (D.N.M. 2012).

18. Robocast, Inc. v. Apple, Inc., No. 11-235 (D. Del.) December 5, 2012, transcript at 16:4–8.
