classify_document
- Comprehend.Client.classify_document(**kwargs)
Creates a new document classification request to analyze a single document in real-time, using a previously created and trained custom model and an endpoint.

You can input plain text or you can upload a single-page input document (text, PDF, Word, or image).

If the system detects errors while processing a page in the input document, the API response includes an entry in Errors that describes the errors.

If the system detects a document-level error in your input document, the API returns an InvalidRequestException error response. For details about this exception, see Errors in semi-structured documents in the Comprehend Developer Guide.

See also: AWS API Documentation

Request Syntax

    response = client.classify_document(
        Text='string',
        EndpointArn='string',
        Bytes=b'bytes',
        DocumentReaderConfig={
            'DocumentReadAction': 'TEXTRACT_DETECT_DOCUMENT_TEXT'|'TEXTRACT_ANALYZE_DOCUMENT',
            'DocumentReadMode': 'SERVICE_DEFAULT'|'FORCE_DOCUMENT_READ_ACTION',
            'FeatureTypes': [
                'TABLES'|'FORMS',
            ]
        }
    )

- Parameters:
- Text (string) – The document text to be analyzed. If you enter text using this parameter, do not use the Bytes parameter.
- EndpointArn (string) – [REQUIRED] The Amazon Resource Name (ARN) of the endpoint. For information about endpoints, see Managing endpoints.
- Bytes (bytes) – Use the Bytes parameter to input a text, PDF, Word or image file. You can also use the Bytes parameter to input an Amazon Textract DetectDocumentText or AnalyzeDocument output file.

  Provide the input document as a sequence of base64-encoded bytes. If your code uses an Amazon Web Services SDK to classify documents, the SDK may encode the document file bytes for you.

  The maximum length of this field depends on the input document type. For details, see Inputs for real-time custom analysis in the Comprehend Developer Guide.

  If you use the Bytes parameter, do not use the Text parameter.
- DocumentReaderConfig (dict) – Provides configuration parameters to override the default actions for extracting text from PDF documents and image files.
  - DocumentReadAction (string) – [REQUIRED] This field defines the Amazon Textract API operation that Amazon Comprehend uses to extract text from PDF files and image files. Enter one of the following values:
    - TEXTRACT_DETECT_DOCUMENT_TEXT - The Amazon Comprehend service uses the DetectDocumentText API operation.
    - TEXTRACT_ANALYZE_DOCUMENT - The Amazon Comprehend service uses the AnalyzeDocument API operation.
  - DocumentReadMode (string) – Determines the text extraction actions for PDF files. Enter one of the following values:
    - SERVICE_DEFAULT - Amazon Comprehend uses the service defaults for PDF files.
    - FORCE_DOCUMENT_READ_ACTION - Amazon Comprehend uses the Textract API specified by DocumentReadAction for all PDF files, including digital PDF files.
  - FeatureTypes (list) – Specifies the type of Amazon Textract features to apply. If you chose TEXTRACT_ANALYZE_DOCUMENT as the read action, you must specify one or both of the following values:
    - TABLES - Returns additional information about any tables that are detected in the input document.
    - FORMS - Returns additional information and the data from any forms that are detected in the input document.
    - (string) – One Amazon Textract feature type, either TABLES or FORMS, with the meaning given for FeatureTypes.
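The parameter rules above (Text and Bytes are alternatives, DocumentReadAction is required inside DocumentReaderConfig, and TEXTRACT_ANALYZE_DOCUMENT needs at least one feature type) can be checked before the request is sent. The following is a minimal sketch, not part of the SDK: `build_classify_request` and its argument names are hypothetical, and leaving out FeatureTypes for other read actions is a choice made for this example only.

```python
from typing import Any, Dict, List, Optional


def build_classify_request(
    endpoint_arn: str,
    text: Optional[str] = None,
    document_bytes: Optional[bytes] = None,
    read_action: Optional[str] = None,
    read_mode: Optional[str] = None,
    feature_types: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for classify_document (hypothetical helper)."""
    # Text and Bytes are alternatives: supply exactly one of them.
    if (text is None) == (document_bytes is None):
        raise ValueError("Provide exactly one of text or document_bytes")

    request: Dict[str, Any] = {"EndpointArn": endpoint_arn}
    if text is not None:
        request["Text"] = text
    else:
        request["Bytes"] = document_bytes

    if read_action is not None:
        # DocumentReadAction is required whenever DocumentReaderConfig is sent.
        config: Dict[str, Any] = {"DocumentReadAction": read_action}
        if read_mode is not None:
            config["DocumentReadMode"] = read_mode
        if read_action == "TEXTRACT_ANALYZE_DOCUMENT":
            # This read action needs TABLES, FORMS, or both.
            if not feature_types:
                raise ValueError(
                    "feature_types is required for TEXTRACT_ANALYZE_DOCUMENT"
                )
            config["FeatureTypes"] = list(feature_types)
        # For other read actions this sketch leaves FeatureTypes out.
        request["DocumentReaderConfig"] = config
    return request
```

With a client created through `boto3.client("comprehend")`, the resulting dictionary can be passed as `client.classify_document(**request)`. When reading a file for `document_bytes`, open it in binary mode; as the Bytes description notes, the SDK may take care of the base64 encoding.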
 
 
 
 
- Return type:
- dict 
- Returns:
Response Syntax

    {
        'Classes': [
            {
                'Name': 'string',
                'Score': ...,
                'Page': 123
            },
        ],
        'Labels': [
            {
                'Name': 'string',
                'Score': ...,
                'Page': 123
            },
        ],
        'DocumentMetadata': {
            'Pages': 123,
            'ExtractedCharacters': [
                {
                    'Page': 123,
                    'Count': 123
                },
            ]
        },
        'DocumentType': [
            {
                'Page': 123,
                'Type': 'NATIVE_PDF'|'SCANNED_PDF'|'MS_WORD'|'IMAGE'|'PLAIN_TEXT'|'TEXTRACT_DETECT_DOCUMENT_TEXT_JSON'|'TEXTRACT_ANALYZE_DOCUMENT_JSON'
            },
        ],
        'Errors': [
            {
                'Page': 123,
                'ErrorCode': 'TEXTRACT_BAD_PAGE'|'TEXTRACT_PROVISIONED_THROUGHPUT_EXCEEDED'|'PAGE_CHARACTERS_EXCEEDED'|'PAGE_SIZE_EXCEEDED'|'INTERNAL_SERVER_ERROR',
                'ErrorMessage': 'string'
            },
        ]
    }

Response Structure

- (dict) –
  - Classes (list) – The classes used by the document being analyzed. These are used for multi-class trained models. Individual classes are mutually exclusive and each document is expected to have only a single class assigned to it. For example, an animal can be a dog or a cat, but not both at the same time.
    - (dict) – Specifies the class that categorizes the document being analyzed.
      - Name (string) – The name of the class.
      - Score (float) – Amazon Comprehend's confidence that this class is correctly attributed.
      - Page (integer) – Page number in the input document. This field is present in the response only if your request includes the Bytes parameter.
 
 
  - Labels (list) – The labels used by the document being analyzed. These are used for multi-label trained models. Individual labels represent different categories that are related in some manner and are not mutually exclusive. For example, a movie can be just an action movie, or it can be an action movie, a science fiction movie, and a comedy, all at the same time.
    - (dict) – Specifies one of the labels that categorize the document being analyzed.
      - Name (string) – The name of the label.
      - Score (float) – Amazon Comprehend's confidence that this label is correctly attributed.
      - Page (integer) – Page number where the label occurs. This field is present in the response only if your request includes the Bytes parameter.
 
 
  - DocumentMetadata (dict) – Extraction information about the document. This field is present in the response only if your request includes the Bytes parameter.
    - Pages (integer) – Number of pages in the document.
    - ExtractedCharacters (list) – List of pages in the document, with the number of characters extracted from each page.
      - (dict) – The number of characters extracted from one page.
        - Page (integer) – Page number.
        - Count (integer) – Number of characters extracted from the page.
 
 
 
  - DocumentType (list) – The document type for each page in the input document. This field is present in the response only if your request includes the Bytes parameter.
    - (dict) – Document type for one page in the document.
      - Page (integer) – Page number.
      - Type (string) – Document type.
 
 
  - Errors (list) – Page-level errors that the system detected while processing the input document. The field is empty if the system encountered no errors.
    - (dict) – Text extraction encountered one or more page-level errors in the input document.

      The ErrorCode contains one of the following values:
        - TEXTRACT_BAD_PAGE - Amazon Textract cannot read the page. For more information about page limits in Amazon Textract, see Page Quotas in Amazon Textract.
        - TEXTRACT_PROVISIONED_THROUGHPUT_EXCEEDED - The number of requests exceeded your throughput limit. For more information about throughput quotas in Amazon Textract, see Default quotas in Amazon Textract.
        - PAGE_CHARACTERS_EXCEEDED - Too many text characters on the page (10,000 characters maximum).
        - PAGE_SIZE_EXCEEDED - The maximum page size is 10 MB.
        - INTERNAL_SERVER_ERROR - The request encountered a service issue. Try the API request again.

      - Page (integer) – Page number where the error occurred.
      - ErrorCode (string) – Error code for the cause of the error.
      - ErrorMessage (string) – Text message explaining the reason for the error.
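The response structure above can be reduced to the few values most callers need. The sketch below works on the response as a plain dictionary; the helper names (`top_class`, `labels_above`, `page_errors`) and the 0.5 threshold are assumptions made for illustration, not part of the API.

```python
from typing import Any, Dict, List, Optional, Tuple


def top_class(response: Dict[str, Any]) -> Optional[Tuple[str, float]]:
    """Return (name, score) of the highest-scoring class, or None if absent."""
    classes = response.get("Classes") or []
    if not classes:
        return None
    best = max(classes, key=lambda item: item["Score"])
    return best["Name"], best["Score"]


def labels_above(response: Dict[str, Any], threshold: float = 0.5) -> List[str]:
    """Return label names whose score meets the (assumed) threshold."""
    labels = response.get("Labels") or []
    return [item["Name"] for item in labels if item["Score"] >= threshold]


def page_errors(response: Dict[str, Any]) -> Dict[Optional[int], List[str]]:
    """Group page-level error codes by page number."""
    grouped: Dict[Optional[int], List[str]] = {}
    for error in response.get("Errors") or []:
        grouped.setdefault(error.get("Page"), []).append(error["ErrorCode"])
    return grouped
```

A multi-class model is expected to populate Classes and a multi-label model Labels, so a caller would typically use only one of the first two helpers. Checking `page_errors` first is a reasonable habit, since a page that failed text extraction may make the classification less meaningful.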
 
 
 
 
Exceptions

- Comprehend.Client.exceptions.InvalidRequestException
- Comprehend.Client.exceptions.ResourceUnavailableException
- Comprehend.Client.exceptions.TextSizeLimitExceededException
- Comprehend.Client.exceptions.InternalServerException
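One possible way to handle these exceptions is sketched below. It treats ResourceUnavailableException and InternalServerException as worth retrying with exponential backoff, and lets InvalidRequestException and TextSizeLimitExceededException propagate, since resending the same input is unlikely to help. That split is an assumption of this example rather than documented guidance, and `classify_with_retry` with its parameters is a hypothetical helper.

```python
import time
from typing import Any, Callable, Dict


def classify_with_retry(
    client: Any,
    request: Dict[str, Any],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call classify_document, retrying errors assumed to be transient."""
    # Modeled exceptions are exposed on the client object itself.
    retryable = (
        client.exceptions.ResourceUnavailableException,
        client.exceptions.InternalServerException,
    )
    for attempt in range(1, max_attempts + 1):
        try:
            return client.classify_document(**request)
        except retryable:
            if attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` argument exists only so the delay can be replaced in tests. Page-level problems are not raised as exceptions; they appear in the Errors list of a successful response.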