Amazon Textract Is Able To Use Machine Learning To Extract Text And Data From Most Documents

By Amit Chowdhry ● May 31, 2019
  • Amazon announced that its Textract platform is able to automatically extract text and data from tables and forms in virtually any document using machine learning
  • Amazon Textract is able to accurately process millions of document pages in just a few hours

Amazon has announced that Textract, its machine learning service for automatically extracting text and data (including from tables and forms) from virtually any document, is now generally available. Amazon customers and partners using Textract include The Globe and Mail, Met Office, PwC, Healthfirst, UiPath, TeraDact, Ripcord, Kablamo, Vidado, Blue Prism, and Alfresco.

Amazon Textract goes beyond simple optical character recognition (OCR) to identify the contents of fields in forms, information stored in tables, and the context in which the information is presented, such as names or social security numbers from a tax form, or product SKUs and quantities in a warehouse from an inventory report. The extracted text and data can be used to build smart searches on large archives of documents, or loaded into a database for use by applications like accounting, auditing, and compliance software.

Amazon Textract’s API accepts multiple input types, such as scans, PDFs, and photos. Customers can use it with database and analytics services such as Amazon Elasticsearch Service, Amazon DynamoDB, and Amazon Athena, and with other machine learning services such as Amazon Comprehend, Amazon Comprehend Medical, Amazon Translate, and Amazon SageMaker, to derive deeper meaning from the extracted text and data.

While many companies extract text and data from files like contracts, expense reports, mortgage guarantees, fund prospectuses, tax documents, hospital claims, and patient forms through manual data entry or simple OCR software, the process is often time-consuming and inaccurate, and its output requires extensive post-processing before it can be put in a format usable by other applications. This is because simple OCR technologies cannot recognize common layouts like forms and tables, and generate only a lengthy and often inaccurate text dump.

Amazon Textract also makes it easy for customers to accurately process millions of document pages in just a few hours. This significantly lowers document processing costs and allows customers to focus on deriving business value from their text and data rather than wasting time and effort on post-processing. Plus the results are delivered via an API that can be easily accessed and used without requiring any machine learning experience.

“The power of Amazon Textract is that it accurately extracts text and structured data from virtually any document with no machine learning experience required. Subsequently, developers can analyze and query the extracted text and data using our database and analytics services like Amazon Elasticsearch Service, Amazon DynamoDB, and Amazon Athena and integrate with other machine learning services like Amazon Comprehend, Amazon Comprehend Medical, Amazon Translate, and Amazon SageMaker to help customers derive deeper meaning from the extracted text and data,” said Amazon Machine Learning VP Swami Sivasubramanian. “In addition to the integration with other AWS services, the rich partner community developing around Amazon Textract makes it possible for customers to gain real meaning from their file collections, operate more efficiently, improve security compliance, automate data entry, and facilitate faster business decisions.”

Amazon Textract takes scanned files stored in an Amazon S3 bucket, reads them, and returns data in the form of JSON text annotated with the page number, section, form labels, and data types. This data can then be used for a range of applications such as generating smart search indexes, redacting text in a massive collection of forms, creating automated loan approval workflows, supporting regulatory compliance, and flagging fraud risk for insurance claims.
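To make the shape of that JSON more concrete, here is a minimal Python sketch of how form labels and their values might be read out of a response. It assumes the block layout described in the Textract API documentation (a `Blocks` list whose entries carry `BlockType`, `EntityTypes`, and `Relationships`). The sample response is hand-written for illustration and is not real service output, and the helper names `get_text` and `extract_form_fields` are hypothetical.

```python
from typing import Dict, List

# Hand-written, illustrative response in the documented Textract shape.
# With AWS credentials, a real response could come from boto3, e.g.
#   boto3.client("textract").analyze_document(
#       Document={"S3Object": {"Bucket": "my-bucket", "Name": "form.png"}},
#       FeatureTypes=["FORMS"])
# The bucket and file names above are placeholders.
SAMPLE_RESPONSE = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Page": 1,
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Page": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Name:", "Page": 1},
        {"Id": "w2", "BlockType": "WORD", "Text": "Jane", "Page": 1},
        {"Id": "w3", "BlockType": "WORD", "Text": "Doe", "Page": 1},
    ]
}


def get_text(block: dict, blocks_by_id: Dict[str, dict]) -> str:
    """Join the text of all WORD blocks that are CHILD-related to `block`."""
    words: List[str] = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def extract_form_fields(response: dict) -> Dict[str, str]:
    """Map each form label (KEY block) to the text of its VALUE block(s)."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    fields: Dict[str, str] = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_parts: List[str] = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_parts.append(
                        get_text(blocks_by_id[value_id], blocks_by_id))
        fields[get_text(block, blocks_by_id)] = " ".join(value_parts)
    return fields


if __name__ == "__main__":
    print(extract_form_fields(SAMPLE_RESPONSE))
```

A dictionary like this is the kind of structure that could then feed a search index or a redaction step, though a production pipeline would also want to check each block's confidence score.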

Customers can also load the data into business software like spreadsheets, databases, and payroll systems, or analyze and query the data using Amazon Elasticsearch Service, Amazon DynamoDB, Amazon Redshift, or Amazon Athena. Amazon Textract is available now in US East (Ohio), US East (N. Virginia), US West (Oregon), and EU (Ireland), and will expand to additional regions in the coming year.
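As an illustration of the spreadsheet path, the following Python sketch turns table data from a Textract-style response into CSV text. It assumes the documented block layout, in which a `TABLE` block points to `CELL` blocks carrying 1-based `RowIndex` and `ColumnIndex` values. The sample data is hand-written rather than real output, and the function names `cell_text` and `tables_to_csv` are hypothetical.

```python
import csv
import io
from typing import Dict, List

# Hand-written, illustrative response: one 2x2 table from an inventory report.
SAMPLE_TABLE_RESPONSE = {
    "Blocks": [
        {"Id": "t1", "BlockType": "TABLE",
         "Relationships": [{"Type": "CHILD",
                            "Ids": ["c1", "c2", "c3", "c4"]}]},
        {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
        {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "SKU"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Quantity"},
        {"Id": "w3", "BlockType": "WORD", "Text": "A-100"},
        {"Id": "w4", "BlockType": "WORD", "Text": "25"},
    ]
}


def cell_text(cell: dict, blocks_by_id: Dict[str, dict]) -> str:
    """Join the WORD children of a CELL block into one string."""
    words: List[str] = []
    for rel in cell.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for word_id in rel["Ids"]:
                word = blocks_by_id[word_id]
                if word["BlockType"] == "WORD":
                    words.append(word["Text"])
    return " ".join(words)


def tables_to_csv(response: dict) -> List[str]:
    """Return one CSV string per TABLE block found in the response."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    outputs: List[str] = []
    for table in response["Blocks"]:
        if table["BlockType"] != "TABLE":
            continue
        cells = [blocks_by_id[cell_id]
                 for rel in table.get("Relationships", [])
                 if rel["Type"] == "CHILD"
                 for cell_id in rel["Ids"]
                 if blocks_by_id[cell_id]["BlockType"] == "CELL"]
        if not cells:
            continue
        n_rows = max(c["RowIndex"] for c in cells)
        n_cols = max(c["ColumnIndex"] for c in cells)
        # Pre-fill the grid so cells missing from the response stay empty.
        grid = [[""] * n_cols for _ in range(n_rows)]
        for c in cells:
            grid[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = cell_text(
                c, blocks_by_id)
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(grid)
        outputs.append(buffer.getvalue())
    return outputs


if __name__ == "__main__":
    for csv_text in tables_to_csv(SAMPLE_TABLE_RESPONSE):
        print(csv_text)
```

The resulting CSV text can be opened in a spreadsheet or bulk-loaded into a database. Merged cells and multi-page tables are not handled in this sketch.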

“As a news media company, we rely on many PDF or scanned-source documents such as FOIs (freedom of information requests) that have important information contained in tables that we previously couldn’t access,” added The Globe and Mail’s Managing Director of Digital and Data Science Mike O’Neill. “These documents have been under-utilized because journalists were not able to access them easily or didn’t know they existed. Using Amazon Textract, we are able to extract information from tables in PDFs and easily output that data to CSV and offer easy access to these documents by making them available for search queries by our journalists. This increases efficient access to information for our journalists by tenfold.”

The Met Office, the UK’s national weather service, is planning to use Amazon Textract for digitizing weather data. “We hope to use Amazon Textract to digitize millions of historical weather observations from document archives,” added Met Office’s Climate Scientist Philip Brohan. “Making these observations available to science will improve our understanding of climate variability and change.”

“At PwC, we work to provide our customers with intelligent automation tools that help transform previously manual processes. We’ve integrated Amazon Textract into our solution for the pharmaceutical industry to automate document processing for various FDA forms like MedWatch and CIOMS,” explained Siddhartha Bhattacharya of PwC. “Previously, people would manually review, edit, and process these forms, each one taking hours. Amazon Textract has proven to be the most efficient and accurate OCR solution available for these forms, extracting all of the relevant information for review and processing, and reducing time spent from hours down to minutes.”

Not-for-profit managed care organization Healthfirst is one of the fastest growing health plans in New York with more than 1.4 million diverse members and a network of over 35,000 providers and 4,500 employees.

“At Healthfirst, we are building data pipelines to turn scanned medical charts into useful clinical information to improve care coordination, drive quality outcomes, and ensure appropriate reimbursement for members under our coverage,” noted Healthfirst’s Chief Analytics Officer Steve Prewitt. “We use Amazon Textract and Amazon Comprehend Medical to glean real value from unstructured data sources in an efficient way, resulting in revenue savings 10-20 times more than our usual downstream operation. By scaling up to analyze over 50,000 charts, we can find undocumented diagnoses and refer around 5,000 members for the care management they need.”

Informed is known for automating how financial institutions originate loans and open bank accounts.

“We have already used Amazon Textract to analyze tens of thousands of loan documents on behalf of financial institutions, and our own software-as-a-service offering has been enhanced by the service, enabling us to identify 95% of the defects in loan application packages and help banks reduce their manual data entry,” Informed founder and CEO Justin Wickett commented. “Using Amazon Textract, our software gives financial institutions real-time visibility into an applicant’s income based off of their pay stubs, bank statements, tax returns, and other financial documents. We plan to expand the types of documents we analyze using Amazon Textract in order to enable financial institutions to take advantage of our machine learning models and bring real-time decision-making efficiency to today’s slow and manual process.”

Candor, a company that is disrupting the time-consuming processes that impact the mortgage industry, has been known for using OCR for extracting data from lender-required documents to verify information.

“We use OCR to extract data from a wide variety of lender-required documents to verify income, assets, property value, and more. Until now, the best OCR solution read one page at the rate of 38.4 seconds, but Amazon Textract achieves this in a fraction of that time,” said Candor CEO and founder Tom Showalter. “We’ve been able to use Textract to accurately read complex, diverse documents such as bank statements, pay stubs, and tax documents without additional training or machine learning expertise, allowing our clients to underwrite and close a loan in days, as opposed to weeks.”

Robotic Process Automation vendor UiPath helps organizations efficiently automate business processes. Here is what UiPath’s chief product officer said about Amazon Textract:

“Amazon Textract will further differentiate UiPath’s robotic process automation platform by enhancing UiPath’s document understanding capabilities, enabling our customers to unlock critical business data from documents, transform that data into actionable business insights, and deliver those insights into line-of-business and operational systems.”

TeraDact, a company that allows customers to transform stored images and paper documents into privacy-compliant, usable digital formats at scale, has found Amazon Textract to be useful in its business processes.

“Amazon Textract’s smart docs platform feeds TeraDact’s patented redaction services to automatically remove and secure sensitive data. TeraDact customers can permanently remove this data so that it can never be recovered or opt to replace sensitive data with patented tokens which can be recovered by individuals with the appropriate permissions. This is particularly useful in complying with government mandates surrounding individual data privacy such as GDPR,” revealed TeraDact COO Tom Trobridge.

Ripcord, which digitizes and extracts knowledge from paper documents using vision-guided robotics, machine learning, and advanced AI, has been tapping into Amazon Textract for a number of its services.

“We’ve had tremendous success utilizing Amazon Textract to augment our advanced entity extraction to benefit many industries and uncover $4 billion in new pay. We look forward to expanding our use of Amazon Textract across financial and government services, healthcare and legal,” mentioned Ripcord founder and iCEO Alex Fielding.

Blue Prism — a company that builds Robotic Process Automation software for providing businesses and organizations with agile virtual workforces — is now using Amazon Textract for analyzing data from various document types.

“Blue Prism’s connected-RPA can automate and perform mission-critical processes, allowing customers the freedom to focus on more creative, meaningful work. By using Amazon Textract, we’ve given our digital workforce another powerful tool for automation. Amazon Textract accurately analyzes data from various document types using machine learning, which enhances the digital transformation journey for our customers. Using additional AWS AI services like Amazon Comprehend and Amazon Rekognition, we can tackle challenges from added secure customer authentication processes to fraud detection capabilities. The intelligence and flexibility of Amazon Textract’s form data extraction can elevate OCR to new levels in industries like financial services, retail, manufacturing and transportation to name a few,” stated Blue Prism CTO and co-founder David Moss.
